cs.CV - 2023-08-18

Revisiting Skin Tone Fairness in Dermatological Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.09640
  • repo_url: https://github.com/tkalbl/revisitingskintonefairness
  • paper_authors: Thorsten Kalb, Kaisar Kushibar, Celia Cintas, Karim Lekadir, Oliver Diaz, Richard Osuala
  • for: Assess the fairness of skin lesion classification algorithms, since skin diseases can manifest differently across skin tones.
  • methods: Uses the Individual Typology Angle (ITA) to estimate skin tone and analyzes the fairness of skin lesion classification algorithms.
  • results: Compares four ITA-based skin tone classification approaches on the ISIC18 dataset and finds high disagreement among them, demonstrating the risks of ITA-based skin tone estimation; it also finds that the lack of diversity in ISIC18 limits its use as a testbed for fairness analysis.
    Abstract Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we recommend further research on robust ITA estimation and diverse dataset acquisition with skin tone annotation to facilitate conclusive fairness assessments of artificial intelligence tools in dermatology. Our code is available at https://github.com/tkalbl/RevisitingSkinToneFairness.
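
A minimal sketch of the ITA computation the abstract describes, on CIE-LAB pixel values: ITA = arctan((L* - 50) / b*) expressed in degrees, followed by a banding step. The four reviewed approaches differ precisely in how skin pixels are selected and how angles are binned, so the median summary and the Chardon-style thresholds below are illustrative assumptions, not the paper's specific choices.

```python
import numpy as np
from skimage import color

def estimate_ita(rgb_image, mask=None):
    """Estimate the Individual Typology Angle (ITA) from an RGB skin image.

    rgb_image: HxWx3 float array in [0, 1]; mask: optional boolean HxW array
    selecting healthy (non-lesion) skin pixels -- how this mask is built is
    exactly where the reviewed approaches disagree.
    """
    lab = color.rgb2lab(rgb_image)              # CIE-LAB: L* lightness, b* yellow-blue
    L, b = lab[..., 0], lab[..., 2]
    if mask is not None:
        L, b = L[mask], b[mask]
    ita = np.degrees(np.arctan2(L - 50.0, b))   # ITA = atan((L* - 50) / b*), in degrees
    return np.median(ita)                       # robust per-image summary (an assumption)

def ita_to_tone(ita):
    """Map an ITA value to a skin-tone band (one commonly cited scheme)."""
    bands = [(55, "very light"), (41, "light"), (28, "intermediate"),
             (10, "tan"), (-30, "brown")]
    for threshold, label in bands:
        if ita > threshold:
            return label
    return "dark"
```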

GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement

  • paper_url: http://arxiv.org/abs/2308.09624
  • repo_url: None
  • paper_authors: Xiaohan Zhang, Xingyu Li, Waqas Sultani, Chen Chen, Safwan Wshah
  • for: Propose a cross-view geo-localization method with better geometric layout extraction and contrastive hard sample generation, improving accuracy and robustness.
  • methods: Uses an enhanced Geometric Layout Extractor (GLE) module and Contrastive Hard Samples Generation (CHSG) to strengthen model training.
  • results: Achieves new state-of-the-art cross-area performance on CVUSA, CVACT, and VIGOR by a large margin ($16.44\%$, $22.71\%$, and $17.02\%$ without polar transformation), while keeping same-area performance comparable to existing SOTA.
    Abstract Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database. Recent works achieve outstanding progress on CVGL benchmarks. However, existing methods still suffer from poor performance in cross-area evaluation, in which the training and testing data are captured from completely distinct areas. We attribute this deficiency to the lack of ability to extract the geometric layout of visual features and models' overfitting to low-level details. Our preliminary work introduced a Geometric Layout Extractor (GLE) to capture the geometric layout from input features. However, the previous GLE does not fully exploit information in the input feature. In this work, we propose GeoDTR+ with an enhanced GLE module that better models the correlations among visual features. To fully explore the LS techniques from our preliminary work, we further propose Contrastive Hard Samples Generation (CHSG) to facilitate model training. Extensive experiments show that GeoDTR+ achieves state-of-the-art (SOTA) results in cross-area evaluation on CVUSA, CVACT, and VIGOR by a large margin ($16.44\%$, $22.71\%$, and $17.02\%$ without polar transformation) while keeping the same-area performance comparable to existing SOTA. Moreover, we provide detailed analyses of GeoDTR+.

Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse

  • paper_url: http://arxiv.org/abs/2308.09622
  • repo_url: None
  • paper_authors: Ozge Mercanoglu Sincan, Necati Cihan Camgoz, Richard Bowden
  • For: The paper aims to improve the accuracy of sign language translation (SLT) by incorporating contextual information into the translation process.
  • Methods: The proposed method uses a multi-modal transformer architecture that combines low-level video features, recognized sign glosses, and contextual information from previous sign sequences to generate spoken language translations.
  • Results: The proposed approach significantly improves SLT performance compared to baseline methods, nearly doubling their BLEU-4 scores; results are reported on two datasets, BOBSL and SRF.
    Abstract Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, both of which have different grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way of training translation models is to use sign language phrase-spoken language sentence pairs. However, human interpreters heavily rely on the context to understand the conveyed information, especially for sign language interpretation, where the vocabulary size may be significantly smaller than their spoken language equivalent. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would. We use the context from previous sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) A Video Encoder, that captures the low-level video features at the frame-level, (2) A Spotting Encoder, that models the recognized sign glosses in the video, and (3) A Context Encoder, which captures the context of the preceding sign sequences. We combine the information coming from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ~1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
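
A rough sketch of the three-encoder / one-decoder wiring the abstract describes: separate encoders for video features, spotted glosses, and preceding context, whose outputs the decoder cross-attends to. The layer counts, dimensions, plain memory concatenation, and omission of causal masking are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ContextAwareSLT(nn.Module):
    """Illustrative wiring of the Video / Spotting / Context encoders feeding a
    single decoder. Sizes and the simple memory concatenation are assumptions."""

    def __init__(self, d_model=512, nhead=8, vocab_size=30000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.spotting_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.context_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, gloss_feats, context_feats, target_tokens):
        # Encode each stream separately, then let the decoder cross-attend to
        # the concatenated memories (causal target masking omitted for brevity).
        memory = torch.cat([
            self.video_encoder(video_feats),
            self.spotting_encoder(gloss_feats),
            self.context_encoder(context_feats),
        ], dim=1)
        tgt = self.embed(target_tokens)
        return self.proj(self.decoder(tgt, memory))
```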

LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark

  • paper_url: http://arxiv.org/abs/2308.09618
  • repo_url: None
  • paper_authors: Lojze Žust, Janez Perš, Matej Kristan
  • for: Provide a diverse maritime obstacle detection dataset to further advance progress in maritime obstacle detection.
  • methods: Introduces a new maritime panoptic obstacle detection benchmark, LaRS, containing scenes from lakes, rivers, and seas.
  • results: Evaluates 27 semantic and panoptic segmentation methods, reporting several performance insights and future research directions, and provides an online evaluation server for objective comparison of maritime obstacle detection methods.
    Abstract The progress in maritime obstacle detection is hindered by the lack of a diverse dataset that adequately captures the complexity of general maritime environments. We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas. Our major contribution is the new dataset, which boasts the largest diversity in recording locations, scene types, obstacle classes, and acquisition conditions among the related datasets. LaRS is composed of over 4000 per-pixel labeled key frames with nine preceding frames to allow utilization of the temporal texture, amounting to over 40k frames. Each key frame is annotated with 8 thing, 3 stuff classes and 19 global scene attributes. We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions. To enable objective evaluation, we have implemented an online evaluation server. The LaRS dataset, evaluation toolkit and benchmark are publicly available at: https://lojzezust.github.io/lars-dataset

Far3D: Expanding the Horizon for Surround-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09616
  • repo_url: None
  • paper_authors: Xiaohui Jiang, Shuailin Li, Yingfei Liu, Shihao Wang, Fan Jia, Tiancai Wang, Lijin Han, Xiangyu Zhang
  • for: Extend the range of surround-view 3D object detection, particularly long-range detection, at low deployment cost.
  • methods: Proposes a novel sparse query-based framework, Far3D, which generates 3D adaptive queries from high-quality 2D object priors, introduces a perspective-aware aggregation module to capture discriminative features across different views and scales, and uses a range-modulated 3D denoising approach to address query error propagation and convergence issues in long-range tasks.
  • results: Achieves state-of-the-art performance on the Argoverse 2 dataset, covering a wide range of 150 meters and surpassing several LiDAR-based methods; Far3D also outperforms previous methods on nuScenes. Code will be released soon.
    Abstract Recently 3D object detection from surround-view images has made notable advancements with its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches. Meanwhile, Far3D exhibits superior performance compared to previous methods on the nuScenes dataset. The code will be available soon.

Language-guided Human Motion Synthesis with Atomic Actions

  • paper_url: http://arxiv.org/abs/2308.09611
  • repo_url: https://github.com/yhzhai/atom
  • paper_authors: Yuanhao Zhai, Mingzhen Huang, Tianyu Luan, Lu Dong, Ifeoma Nwogu, Siwei Lyu, David Doermann, Junsong Yuan
  • for: Propose a language-guided human motion synthesis technique that addresses the difficulty of synthesis caused by the inherent complexity and diversity of human behavior.
  • methods: Decomposes human motion into atomic actions and adopts a curriculum learning strategy to learn their composition: complex motions are first disentangled into a set of atomic actions, which are then assembled into novel actions, improving adaptability to new actions. A curriculum training strategy based on masked motion modeling with a gradually increasing mask ratio further facilitates atomic action assembly.
  • results: Extensive experiments on text-to-motion and action-to-motion synthesis tasks demonstrate the effectiveness of ATOM; in particular, it synthesizes plausible and coherent text-guided human motion sequences and adapts better to novel actions.
    Abstract Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while enforcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.
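
The curriculum in the abstract grows the mask ratio of masked motion modeling during training. A minimal sketch of that idea is below; the linear schedule, frame-level masking, and reconstruction loss restricted to masked frames are illustrative assumptions rather than ATOM's exact recipe.

```python
import torch

def mask_ratio_schedule(epoch, total_epochs, start=0.1, end=0.7):
    """Linearly grow the mask ratio over training (illustrative schedule)."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + t * (end - start)

def mask_motion(motion, mask_ratio):
    """Randomly mask whole frames of a motion sequence.

    motion: (batch, frames, joint_dims) tensor; masked frames are zeroed and the
    boolean mask is returned so the loss can be computed on masked frames only.
    """
    b, t, _ = motion.shape
    mask = torch.rand(b, t, device=motion.device) < mask_ratio
    return motion.masked_fill(mask.unsqueeze(-1), 0.0), mask

# Hypothetical training step: reconstruct only the masked frames, with the
# ratio increasing as training progresses.
#   ratio = mask_ratio_schedule(epoch, num_epochs)
#   masked, mask = mask_motion(gt_motion, ratio)
#   loss = ((model(masked, text) - gt_motion) ** 2)[mask].mean()
```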

On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.09610
  • repo_url: https://github.com/tdemin16/continual-layernorm-tuning
  • paper_authors: Thomas De Min, Massimiliano Mancini, Karteek Alahari, Xavier Alameda-Pineda, Elisa Ricci
  • for: Reduce the computational cost of rehearsal-free continual learning methods while maintaining competitive performance.
  • methods: Tunes the scale and bias parameters of LayerNorm for each continual learning task, and selects the parameters at inference time based on the similarity between task-specific keys and the output of the pre-trained model.
  • results: Experiments on ImageNet-R and CIFAR-100 show results that match or exceed the state of the art at a lower computational cost.
    Abstract State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts, drastically reducing catastrophic forgetting. However, there is a tradeoff between the number of learned parameters and the performance, making such models computationally expensive. In this work, we aim to reduce this cost while maintaining competitive performance. We achieve this by revisiting and extending a simple transfer learning idea: learning task-specific normalization layers. Specifically, we tune the scale and bias parameters of LayerNorm for each continual learning task, selecting them at inference time based on the similarity between task-specific keys and the output of the pre-trained model. To make the classifier robust to incorrect selection of parameters during inference, we introduce a two-stage training procedure, where we first optimize the task-specific parameters and then train the classifier with the same selection procedure of the inference time. Experiments on ImageNet-R and CIFAR-100 show that our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
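
A minimal sketch of the core mechanism described in the abstract: per-task scale and bias parameters for LayerNorm on top of a frozen backbone, selected at inference by comparing the pre-trained model's output against task-specific keys. The key construction, cosine similarity, and tensor shapes are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class TaskLayerNorm(nn.Module):
    """Per-task scale/bias for a frozen LayerNorm, selected by key similarity.
    A minimal sketch; the paper's key construction and selection may differ."""

    def __init__(self, dim, num_tasks):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)   # no learnable affine here
        self.scales = nn.Parameter(torch.ones(num_tasks, dim))
        self.biases = nn.Parameter(torch.zeros(num_tasks, dim))
        # One key per task, e.g. the mean pre-trained feature of that task's data (assumed).
        self.register_buffer("keys", torch.zeros(num_tasks, dim))

    def select_task(self, query):
        """Pick the task whose key is most similar to the pre-trained model's
        output for this sample (cosine similarity as an assumed choice)."""
        sims = torch.nn.functional.cosine_similarity(
            query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)   # (batch, num_tasks)
        return sims.argmax(dim=-1)

    def forward(self, x, task_ids):
        # x: (batch, tokens, dim); task_ids: (batch,) indices from select_task().
        scale = self.scales[task_ids].unsqueeze(1)
        bias = self.biases[task_ids].unsqueeze(1)
        return self.norm(x) * scale + bias
```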

Language-Guided Diffusion Model for Visual Grounding

  • paper_url: http://arxiv.org/abs/2308.09599
  • repo_url: None
  • paper_authors: Sijia Chen, Baochun Li
  • for: solves the problem of visual grounding, a cross-modal alignment task, in a generative way.
  • methods: uses a language-guided diffusion framework called LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide.
  • results: achieves superior performance on five widely used datasets, validating the effectiveness of the proposed framework.
    Abstract Visual grounding (VG) tasks involve explicit cross-modal alignment, as semantically corresponding image regions are to be located for the language phrases provided. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance causes high demands on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of refining boxes continuously to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process by denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes to noisy ones and reverses this process step by step, conditional on query semantics. Extensive experiments for our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source codes are available at \url{https://github.com/iQua/vgbase/tree/DiffusionVG}.
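
The framework in the abstract perturbs query-aligned ground-truth boxes into noisy ones and learns to reverse the process conditioned on the query. Below is a minimal sketch of such a forward (noising) step in the usual diffusion parameterization; the linear beta schedule and the (cx, cy, w, h) box encoding are assumptions, not LG-DVG's exact design.

```python
import torch

def make_noise_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; the cumulative alpha-bar enables closed-form noising."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def noise_boxes(gt_boxes, t, alpha_bar):
    """Perturb ground-truth boxes (cx, cy, w, h in [0, 1]) to their noisy version
    at step t: q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    noise = torch.randn_like(gt_boxes)
    abar = alpha_bar[t].view(-1, *([1] * (gt_boxes.dim() - 1)))
    noisy = abar.sqrt() * gt_boxes + (1.0 - abar).sqrt() * noise
    return noisy, noise

# Training pairs a set of noisy boxes with the language query; the model is
# asked to denoise the boxes step by step, conditioned on the query semantics.
```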

Investigation of Architectures and Receptive Fields for Appearance-based Gaze Estimation

  • paper_url: http://arxiv.org/abs/2308.09593
  • repo_url: https://github.com/yunhanwang1105/GazeTech
  • paper_authors: Yunhan Wang, Xiangwei Shi, Shalini De Mello, Hyung Jin Chang, Xucong Zhang
  • for: Revisits the basic architecture of appearance-based gaze estimation, a topic that has attracted great attention from the computer vision and human-computer interaction communities during the past decade of rapid deep learning development.
  • methods: Reviews mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning; most existing methods take a single face or multiple regions as input, whereas this paper focuses on the underlying architecture itself.
  • results: Shows that tuning a few simple parameters of a ResNet architecture can outperform most existing state-of-the-art gaze estimation methods on three popular datasets. Extensive experiments identify the stride number, input image resolution, and multi-region architecture as critical for performance, with their effectiveness depending on the quality of the input face image; with a ResNet-50 backbone, the method achieves state-of-the-art gaze errors of 3.64 degrees on ETH-XGaze, 4.50 on MPIIFaceGaze, and 9.13 on Gaze360.
    Abstract With the rapid development of deep learning technology in the past decade, appearance-based gaze estimation has attracted great attention from both computer vision and human-computer interaction research communities. Fascinating methods were proposed with variant mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take the single-face or multi-region as input, yet the basic architecture of gaze estimation has not been fully explored. In this paper, we reveal the fact that tuning a few simple parameters of a ResNet architecture can outperform most of the existing state-of-the-art methods for the gaze estimation task on three popular datasets. With our extensive experiments, we conclude that the stride number, input image resolution, and multi-region architecture are critical for the gaze estimation performance while their effectiveness dependent on the quality of the input face image. We obtain the state-of-the-art performances on three datasets with 3.64 on ETH-XGaze, 4.50 on MPIIFaceGaze, and 9.13 on Gaze360 degrees gaze estimation error by taking ResNet-50 as the backbone.

StableVideo: Text-driven Consistency-aware Diffusion Video Editing

  • paper_url: http://arxiv.org/abs/2308.09592
  • repo_url: https://github.com/rese1f/stablevideo
  • paper_authors: Wenhao Chai, Xun Guo, Gaoang Wang, Yan Lu
  • for: Address the difficulty diffusion-based methods have with video editing: editing existing objects in a video while preserving their appearance over time.
  • methods: Introduces temporal dependency into text-driven diffusion models to keep the appearance of edited objects consistent; specifically, a novel inter-frame propagation mechanism based on layered representations propagates appearance information from one frame to the next.
  • results: Experiments show higher editing quality and consistency than state-of-the-art video editing methods, both qualitatively and quantitatively.
    Abstract Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. Our code is available at \href{https://github.com/rese1f/StableVideo}{this https URL}.

O^2-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09591
  • repo_url: None
  • paper_authors: Yubin Hu, Sheng Ye, Wang Zhao, Matthieu Lin, Yuze He, Yu-Hui Wen, Ying He, Yong-Jin Liu
  • for: Propose a new method for complete 3D reconstruction of objects, addressing occlusion in RGB-D videos.
  • methods: Uses a pre-trained diffusion model to in-paint the hidden parts of 2D images, then uses the in-painted images to optimize a neural implicit surface representation for the 3D reconstruction of each instance.
  • results: Experiments on ScanNet scenes show state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos.
    Abstract Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a neural implicit surface representation for each instance for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure recovering these invisible areas, we develop a cascaded network architecture for predicting signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos.
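
Two standard ingredients mentioned in the abstract, frequency-band positional encoding for the signed-distance network and an Eikonal-style regularizer, can be sketched as follows; the number of frequency bands is an assumption, and the paper's cascaded architecture is not reproduced.

```python
import torch

def positional_encoding(x, num_bands=6):
    """NeRF-style positional encoding: sin/cos at increasing frequencies.
    x: (..., 3) points; returns (..., 3 * 2 * num_bands) features."""
    freqs = 2.0 ** torch.arange(num_bands, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs                 # (..., 3, num_bands)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)

def eikonal_loss(sdf_fn, points):
    """Encourage |grad f| = 1 so the network behaves like a signed distance field."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_fn(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```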

Deep Equilibrium Object Detection

  • paper_url: http://arxiv.org/abs/2308.09564
  • repo_url: https://github.com/MCG-NJU/DEQDet
  • paper_authors: Shuai Wang, Yao Teng, Limin Wang
  • for: Present a new query-based object detector (DEQDet) that performs detection with a deep equilibrium decoder.
  • methods: The deep equilibrium (DEQ) decoder models query-vector refinement as solving for the fixed point of an implicit layer, producing stable, meaningful query representations that simple FFN heads then use to directly predict object locations and categories.
  • results: Compared with its baseline (AdaMixer), DEQDet reaches $49.5$ mAP and $33.0$ AP$_s$ on MS COCO while converging faster and using less memory.
    Abstract Query-based object detectors directly decode image features into object instances with a set of learnable queries. These query vectors are progressively refined to stable meaningful representations through a sequence of decoder layers, and then used to directly predict object locations and categories with simple FFN heads. In this paper, we present a new query-based object detector (DEQDet) by designing a deep equilibrium decoder. Our DEQ decoder models the query vector refinement as the fixed point solving of an implicit layer and is equivalent to applying infinite steps of refinement. To be more specific to object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement. Accordingly, we are able to incorporate refinement awareness into the DEQ training with the inexact gradient back-propagation (RAG). In addition, to stabilize the training of our DEQDet and improve its generalization ability, we devise the deep supervision scheme on the optimization path of DEQ with refinement-aware perturbation (RAP). Our experiments demonstrate DEQDet converges faster, consumes less memory, and achieves better results than the baseline counterpart (AdaMixer). In particular, our DEQDet with ResNet50 backbone and 300 queries achieves the $49.5$ mAP and $33.0$ AP$_s$ on the MS COCO benchmark under $2\times$ training scheme (24 epochs).
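
A deep equilibrium decoder treats query refinement as solving z* = f(z*, image features) rather than stacking a fixed number of decoder layers. The sketch below uses a generic fixed-point (Picard) iteration over a single decoder layer; the paper's two-step unrolled equation, inexact-gradient training (RAG), and refinement-aware perturbation are not reproduced.

```python
import torch
import torch.nn as nn

class DEQRefiner(nn.Module):
    """Refine detection queries toward a fixed point z* = f(z*, image_feats).
    A generic solver for illustration only; backpropagating through the loop
    is not how DEQ models are trained in practice."""

    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.layer = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)

    def refinement_step(self, queries, image_feats):
        return self.layer(queries, image_feats)

    def forward(self, queries, image_feats, max_iters=20, tol=1e-3):
        z = queries
        for _ in range(max_iters):                       # simple Picard iteration
            z_next = self.refinement_step(z, image_feats)
            converged = (z_next - z).norm() < tol * z.norm()
            z = z_next
            if converged:
                break
        return z   # the refined queries feed FFN heads for box / class prediction
```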

Decoupled conditional contrastive learning with variable metadata for prostate lesion detection

  • paper_url: http://arxiv.org/abs/2308.09542
  • repo_url: https://github.com/camilleruppli/decoupled_ccl
  • paper_authors: Camille Ruppli, Pietro Gori, Roberto Ardon, Isabelle Bloch
  • For: Early detection of prostate cancer is crucial for efficient treatment.
  • Methods: Uses multi-parametric Magnetic Resonance Images (mp-MRI) for lesion detection and leverages the Prostate Imaging Reporting and Data System (PI-RADS), which standardizes the interpretation of prostate MRI, as weak metadata; metadata confidence is defined from multiple annotators per sample.
  • Results: The proposed contrastive loss, which combines metadata of varying confidence with unannotated data, improves lesion detection by 3% AUC on the public PI-CAI challenge dataset.
    Abstract Early diagnosis of prostate cancer is crucial for efficient treatment. Multi-parametric Magnetic Resonance Images (mp-MRI) are widely used for lesion detection. The Prostate Imaging Reporting and Data System (PI-RADS) has standardized interpretation of prostate MRI by defining a score for lesion malignancy. PI-RADS data is readily available from radiology reports but is subject to high inter-reports variability. We propose a new contrastive loss function that leverages weak metadata with multiple annotators per sample and takes advantage of inter-reports variability by defining metadata confidence. By combining metadata of varying confidence with unannotated data into a single conditional contrastive loss function, we report a 3% AUC increase on lesion detection on the public PI-CAI challenge dataset. Code is available at: https://github.com/camilleruppli/decoupled_ccl
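
The abstract describes a conditional contrastive loss in which positives are defined by weak metadata and each sample's contribution is modulated by a metadata-confidence term derived from inter-report variability. The sketch below is one generic way to weight a supervised InfoNCE-style loss by per-sample confidence; it is not the paper's exact decoupled formulation.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_contrastive(embeddings, labels, confidence, temperature=0.1):
    """InfoNCE-style loss where positive pairs share a metadata label and each
    anchor is weighted by its metadata confidence in [0, 1] (generic sketch).

    embeddings: (N, D); labels: (N,) integer metadata labels (e.g. a PI-RADS score);
    confidence: (N,) agreement-derived weights."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                                  # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # log-probability of each pair, excluding self-similarity from the denominator
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(1) / pos_counts
    return (confidence * loss_per_anchor).sum() / confidence.sum().clamp(min=1e-8)
```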

Uncertainty-based quality assurance of carotid artery wall segmentation in black-blood MRI

  • paper_url: http://arxiv.org/abs/2308.09538
  • repo_url: https://github.com/miagrouput/carotid-segmentation
  • paper_authors: Elina Thibeau-Sutre, Dieuwertje Alblas, Sophie Buurman, Christoph Brune, Jelmer M. Wolterink
  • for: Apply deep learning models to large-scale data sets with automatic quality assurance.
  • methods: Fully automatic segmentation of the carotid artery wall in black-blood MRI, with the uncertainty of the model predictions estimated using Monte Carlo dropout or test-time data augmentation.
  • results: Including uncertainty measurements did not degrade segmentation quality; the uncertainty metrics provide a good proxy for segmentation quality and can detect low-quality segmentations at the participant level. This automatic quality assurance tool might enable applying the model to large-scale data sets.
    Abstract The application of deep learning models to large-scale data sets requires means for automatic quality assurance. We have previously developed a fully automatic algorithm for carotid artery wall segmentation in black-blood MRI that we aim to apply to large-scale data sets. This method identifies nested artery walls in 3D patches centered on the carotid artery. In this study, we investigate to what extent the uncertainty in the model predictions for the contour location can serve as a surrogate for error detection and, consequently, automatic quality assurance. We express the quality of automatic segmentations using the Dice similarity coefficient. The uncertainty in the model's prediction is estimated using either Monte Carlo dropout or test-time data augmentation. We found that (1) including uncertainty measurements did not degrade the quality of the segmentations, (2) uncertainty metrics provide a good proxy of the quality of our contours if the center found during the first step is enclosed in the lumen of the carotid artery and (3) they could be used to detect low-quality segmentations at the participant level. This automatic quality assurance tool might enable the application of our model in large-scale data sets.
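
Two components named in the abstract, the Dice similarity coefficient used to express segmentation quality and Monte Carlo dropout used to estimate predictive uncertainty, can be sketched as follows; the number of stochastic passes and the standard-deviation summary are assumptions.

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = pred.float().flatten(), target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def mc_dropout_uncertainty(model, image, num_passes=20):
    """Monte Carlo dropout: keep dropout active at test time, run several
    stochastic forward passes, and summarize the spread of the predictions."""
    model.eval()
    for m in model.modules():                      # enable only the dropout layers
        if m.__class__.__name__.startswith("Dropout"):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(image)) for _ in range(num_passes)])
    mean_prob = probs.mean(dim=0)                  # averaged segmentation probability map
    uncertainty = probs.std(dim=0)                 # per-voxel spread as an uncertainty proxy
    return mean_prob, uncertainty
```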

Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning

  • paper_url: http://arxiv.org/abs/2308.09534
  • repo_url: https://github.com/shaunyuan22/cfinet
  • paper_authors: Xiang Yuan, Gong Cheng, Kebing Yan, Qinghua Zeng, Junwei Han
  • for: Improve small object detection accuracy.
  • methods: Uses a Coarse-to-fine RPN (CRPN) and Feature Imitation (FI) learning.
  • results: Achieves state-of-the-art performance on the large-scale small object detection benchmarks SODA-D and SODA-A.
    Abstract The past few years have witnessed the immense success of object detection, while current excellent detectors struggle on tackling size-limited instances. Concretely, the well-known challenge of low overlaps between the priors and object regions leads to a constrained sample pool for optimization, and the paucity of discriminative information further aggravates the recognition. To alleviate the aforementioned issues, we propose CFINet, a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning. Firstly, we introduce Coarse-to-fine RPN (CRPN) to ensure sufficient and high-quality proposals for small objects through the dynamic anchor selection strategy and cascade regression. Then, we equip the conventional detection head with a Feature Imitation (FI) branch to facilitate the region representations of size-limited instances that perplex the model in an imitation manner. Moreover, an auxiliary imitation loss following supervised contrastive learning paradigm is devised to optimize this branch. When integrated with Faster RCNN, CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A, underscoring its superiority over baseline detector and other mainstream detection approaches.

Improving 3D Pose Estimation for Sign Language

  • paper_url: http://arxiv.org/abs/2308.09525
  • repo_url: None
  • paper_authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
  • for: Propose a 3D human pose reconstruction method combining forward kinematics (FK) with neural networks for fast and valid pose prediction.
  • methods: Given 2D keypoint detections in the image, neural networks predict joint rotations and bone lengths; these predictions are combined with skeletal constraints through an FK layer implemented as a network layer in PyTorch to obtain an accurate 3D pose.
  • results: Quantitative and qualitative evaluation shows significantly lower per-joint positional error and better visual quality than MediaPipe, with generalization across datasets; the PyTorch implementation runs at 100-200 milliseconds per image (including CNN detection) on CPU only.
    Abstract This work addresses 3D human pose reconstruction in single images. We present a method that combines Forward Kinematics (FK) with neural networks to ensure a fast and valid prediction of 3D pose. Pose is represented as a hierarchical tree/graph with nodes corresponding to human joints that model their physical limits. Given a 2D detection of keypoints in the image, we lift the skeleton to 3D using neural networks to predict both the joint rotations and bone lengths. These predictions are then combined with skeletal constraints using an FK layer implemented as a network layer in PyTorch. The result is a fast and accurate approach to the estimation of 3D skeletal pose. Through quantitative and qualitative evaluation, we demonstrate the method is significantly more accurate than MediaPipe in terms of both per joint positional error and visual appearance. Furthermore, we demonstrate generalization over different datasets. The implementation in PyTorch runs at between 100-200 milliseconds per image (including CNN detection) using CPU only.
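
Forward kinematics composes per-joint rotations down the skeletal tree and scales rest-pose bone directions by the predicted bone lengths to recover 3D joint positions, which is what makes an FK layer differentiable and usable inside a network. The sketch below assumes rotation matrices, a parent list ordered root-first, and known rest-pose bone directions; it is not the paper's implementation.

```python
import torch

def forward_kinematics(rotations, bone_lengths, parents, bone_dirs):
    """Compute 3D joint positions from per-joint rotations and bone lengths.

    rotations:    (J, 3, 3) rotation of each joint relative to its parent.
    bone_lengths: (J,) length of the bone connecting each joint to its parent.
    parents:      list of parent indices (-1 for the root), parents before children.
    bone_dirs:    (J, 3) rest-pose unit direction of each bone in its parent frame.
    """
    num_joints = rotations.shape[0]
    global_rot = [None] * num_joints
    positions = [None] * num_joints
    for j in range(num_joints):
        p = parents[j]
        if p == -1:                                    # root joint at the origin
            global_rot[j] = rotations[j]
            positions[j] = torch.zeros(3, device=rotations.device)
            continue
        global_rot[j] = global_rot[p] @ rotations[j]   # accumulate rotations down the tree
        offset = bone_lengths[j] * bone_dirs[j]        # bone vector in the parent frame
        positions[j] = positions[p] + global_rot[p] @ offset
    return torch.stack(positions)                      # (J, 3), differentiable w.r.t. inputs
```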

Denoising Diffusion for 3D Hand Pose Estimation from Images

  • paper_url: http://arxiv.org/abs/2308.09523
  • repo_url: None
  • paper_authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
  • for: Address 3D hand pose estimation from a single image or an image sequence.
  • methods: Proposes a novel end-to-end framework that uses diffusion models to capture the data distribution and enforces kinematic constraints, via an explicit forward kinematic layer, to ensure the generated poses are valid.
  • results: The method provides state-of-the-art performance when lifting a 2D single-hand image to 3D; when sequence data is available, a Transformer module over a temporal window of consecutive frames further refines the results and increases accuracy.
    Abstract Hand pose estimation from a single image has many applications. However, approaches to full 3D body pose estimation are typically trained on day-to-day activities or actions. As such, detailed hand-to-hand interactions are poorly represented, especially during motion. We see this in the failure cases of techniques such as OpenPose or MediaPipe. However, accurate hand pose estimation is crucial for many applications where the global body motion is less important than accurate hand pose estimation. This paper addresses the problem of 3D hand pose estimation from monocular images or sequences. We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes. Moreover, we enforce kinematic constraints to ensure realistic poses are generated by incorporating an explicit forward kinematic layer as part of the network. The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D. However, when sequence data is available, we add a Transformer module over a temporal window of consecutive frames to refine the results, overcoming jittering and further increasing accuracy. The method is quantitatively and qualitatively evaluated showing state-of-the-art robustness, generalization, and accuracy on several different datasets.

Leveraging Intrinsic Properties for Non-Rigid Garment Alignment

  • paper_url: http://arxiv.org/abs/2308.09519
  • repo_url: None
  • paper_authors: Siyou Lin, Boyao Zhou, Zerong Zheng, Hongwen Zhang, Yebin Liu
  • for: Targets the problem of aligning real-world 3D garment data, which benefits applications such as texture learning, physical parameter estimation, and generative modeling of garments.
  • methods: Leverages intrinsic manifold properties and uses two neural deformation fields, one in 3D space and another in intrinsic space, to achieve coarse-to-fine alignment: the coarse stage performs a 3D fitting, and the refinement stage uses a second neural deformation field for higher accuracy.
  • results: Achieves accurate wrinkle-level and texture-level alignment and works well for difficult garment types such as long coats. The project page with more information and results is available at https://jsnln.github.io/iccv2023_intrinsic/index.html.
    Abstract We address the problem of aligning real-world 3D data of garments, which benefits many applications such as texture learning, physical parameter estimation, generative modeling of garments, etc. Existing extrinsic methods typically perform non-rigid iterative closest point and struggle to align details due to incorrect closest matches and rigidity constraints. While intrinsic methods based on functional maps can produce high-quality correspondences, they work under isometric assumptions and become unreliable for garment deformations which are highly non-isometric. To achieve wrinkle-level as well as texture-level alignment, we present a novel coarse-to-fine two-stage method that leverages intrinsic manifold properties with two neural deformation fields, in the 3D space and the intrinsic space, respectively. The coarse stage performs a 3D fitting, where we leverage intrinsic manifold properties to define a manifold deformation field. The coarse fitting then induces a functional map that produces an alignment of intrinsic embeddings. We further refine the intrinsic alignment with a second neural deformation field for higher accuracy. We evaluate our method with our captured garment dataset, GarmCap. The method achieves accurate wrinkle-level and texture-level alignment and works for difficult garment types such as long coats. Our project page is https://jsnln.github.io/iccv2023_intrinsic/index.html.

Learnt Contrastive Concept Embeddings for Sign Recognition

  • paper_url: http://arxiv.org/abs/2308.09515
  • repo_url: None
  • paper_authors: Ryan Wong, Necati Cihan Camgoz, Richard Bowden
  • for: Provide a weakly supervised, contrastive approach to learning sign embeddings that bridge sign language and spoken language.
  • methods: A learning framework derives LCC (Learnt Contrastive Concept) embeddings for sign language, training a vocabulary of embeddings from the linguistic labels of sign videos; a conceptual similarity loss uses word embeddings from NLP methods to create sign embeddings with better sign-to-spoken-language correspondence.
  • results: Achieves state-of-the-art keypoint-based sign recognition performance on the WLASL and BOBSL datasets.
    Abstract In natural language processing (NLP) of spoken languages, word embeddings have been shown to be a useful method to encode the meaning of words. Sign languages are visual languages, which require sign embeddings to capture the visual and linguistic semantics of sign. Unlike many common approaches to Sign Recognition, we focus on explicitly creating sign embeddings that bridge the gap between sign language and spoken language. We propose a learning framework to derive LCC (Learnt Contrastive Concept) embeddings for sign language, a weakly supervised contrastive approach to learning sign embeddings. We train a vocabulary of embeddings that are based on the linguistic labels for sign video. Additionally, we develop a conceptual similarity loss which is able to utilise word embeddings from NLP methods to create sign embeddings that have better sign language to spoken language correspondence. These learnt representations allow the model to automatically localise the sign in time. Our approach achieves state-of-the-art keypoint-based sign recognition performance on the WLASL and BOBSL datasets.

ResQ: Residual Quantization for Video Perception

  • paper_url: http://arxiv.org/abs/2308.09511
  • repo_url: None
  • paper_authors: Davide Abati, Haitam Ben Yahia, Markus Nagel, Amirhossein Habibian
  • for: Accelerate video perception tasks such as semantic segmentation and human pose estimation by exploiting cross-frame redundancies.
  • methods: Exploits low-bit quantization: residuals, the differences in network activations between neighboring frames, are highly quantizable, motivating the proposed Residual Quantization scheme, which also dynamically adjusts the bit-width in proportion to the amount of change in the video.
  • results: Superior accuracy-versus-bit-width trade-offs compared with standard quantization and existing efficient video perception models.
    Abstract This paper accelerates video perception, such as semantic segmentation and human pose estimation, by levering cross-frame redundancies. Unlike the existing approaches, which avoid redundant computations by warping the past features using optical-flow or by performing sparse convolutions on frame differences, we approach the problem from a new perspective: low-bit quantization. We observe that residuals, as the difference in network activations between two neighboring frames, exhibit properties that make them highly quantizable. Based on this observation, we propose a novel quantization scheme for video networks coined as Residual Quantization. ResQ extends the standard, frame-by-frame, quantization scheme by incorporating temporal dependencies that lead to better performance in terms of accuracy vs. bit-width. Furthermore, we extend our model to dynamically adjust the bit-width proportional to the amount of changes in the video. We demonstrate the superiority of our model, against the standard quantization and existing efficient video perception models, using various architectures on semantic segmentation and human pose estimation benchmarks.
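
The core idea from the abstract is to quantize the frame-to-frame residual of network activations at a low bit-width and accumulate it onto a running reconstruction, since residuals have a much smaller dynamic range than the activations themselves. The sketch below uses plain symmetric uniform quantization; the per-layer scale handling and the dynamic bit-width policy are assumptions.

```python
import torch

def quantize_uniform(x, num_bits):
    """Symmetric uniform quantization of a tensor to num_bits (returns dequantized values)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

class ResidualQuantizer:
    """Quantize activation residuals between consecutive frames instead of the
    full activations, exploiting the small dynamic range of frame differences."""

    def __init__(self, num_bits=4):
        self.num_bits = num_bits
        self.prev = None                 # running reconstruction of the activations

    def __call__(self, activations):
        if self.prev is None:            # first frame: quantize the activations directly
            self.prev = quantize_uniform(activations, self.num_bits)
            return self.prev
        residual = activations - self.prev
        self.prev = self.prev + quantize_uniform(residual, self.num_bits)
        return self.prev
```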

Video-Instrument Synergistic Network for Referring Video Instrument Segmentation in Robotic Surgery

  • paper_url: http://arxiv.org/abs/2308.09475
  • repo_url: None
  • paper_authors: Hongqiu Wang, Lei Zhu, Guang Yang, Yike Guo, Shichen Zhang, Bo Xu, Yueming Jin
  • for: Improve the quality of robot-assisted surgery through instrument segmentation, supporting surgical robot navigation and surgical education.
  • methods: Introduces a new task, Referring Surgical Video Instrument Segmentation (RSVIS), which identifies and segments the surgical instrument corresponding to a given language expression. A Video-Instrument Synergistic Network (VIS-Net) learns both video-level and instrument-level knowledge, and a Graph-based Relation-aware Module (GRM) models the correlation between multi-modal information (textual description and video frames) to extract instrument-level information.
  • results: Validated on two newly produced RSVIS datasets; experiments show that VIS-Net significantly outperforms existing referring segmentation methods.
    Abstract Robot-assisted surgery has made significant progress, with instrument segmentation being a critical factor in surgical intervention quality. It serves as the building block to facilitate surgical robot navigation and surgical education for the next generation of operating intelligence. Although existing methods have achieved accurate instrument segmentation results, they simultaneously generate segmentation masks for all instruments, without the capability to specify a target object and allow an interactive experience. This work explores a new task of Referring Surgical Video Instrument Segmentation (RSVIS), which aims to automatically identify and segment the corresponding surgical instruments based on the given language expression. To achieve this, we devise a novel Video-Instrument Synergistic Network (VIS-Net) to learn both video-level and instrument-level knowledge to boost performance, while previous work only used video-level information. Meanwhile, we design a Graph-based Relation-aware Module (GRM) to model the correlation between multi-modal information (i.e., textual description and video frame) to facilitate the extraction of instrument-level information. We are also the first to produce two RSVIS datasets to promote related research. Our method is verified on these datasets, and experimental results exhibit that the VIS-Net can significantly outperform existing state-of-the-art referring segmentation methods. Our code and our datasets will be released upon the publication of this work.

Quantitative Susceptibility Mapping through Model-based Deep Image Prior (MoDIP)

  • paper_url: http://arxiv.org/abs/2308.09467
  • repo_url: None
  • paper_authors: Zhuang Xiong, Yang Gao, Yin Liu, Amir Fazlollahi, Peter Nestor, Feng Liu, Hongfu Sun
  • for: Solve dipole inversion in Quantitative Susceptibility Mapping (QSM) under scan parameters that vary across objects.
  • methods: Proposes a novel training-free, model-based unsupervised method, MoDIP (Model-based Deep Image Prior), combining a small untrained network, which acts as an implicit prior for image regularization, with a Data Fidelity Optimization (DFO) module that enforces the physical model of QSM dipole inversion.
  • results: MoDIP generalizes well to QSM dipole inversion across different scan parameters and is robust on pathological brain QSM, improving accuracy by over 32% compared with supervised deep learning and traditional iterative methods; it is also 33% more computationally efficient and runs 4 times faster than conventional DIP-based approaches, enabling 3D high-resolution reconstruction in under 4.5 minutes.
    Abstract The data-driven approach of supervised learning methods has limited applicability in solving dipole inversion in Quantitative Susceptibility Mapping (QSM) with varying scan parameters across different objects. To address this generalization issue in supervised QSM methods, we propose a novel training-free model-based unsupervised method called MoDIP (Model-based Deep Image Prior). MoDIP comprises a small, untrained network and a Data Fidelity Optimization (DFO) module. The network converges to an interim state, acting as an implicit prior for image regularization, while the optimization process enforces the physical model of QSM dipole inversion. Experimental results demonstrate MoDIP's excellent generalizability in solving QSM dipole inversion across different scan parameters. It exhibits robustness against pathological brain QSM, achieving over 32% accuracy improvement than supervised deep learning and traditional iterative methods. It is also 33% more computationally efficient and runs 4 times faster than conventional DIP-based approaches, enabling 3D high-resolution image reconstruction in under 4.5 minutes.
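
Deep-image-prior style reconstruction fits an untrained network to a single measurement through a physics-based data-fidelity term; for QSM the standard forward model multiplies the susceptibility spectrum by the unit dipole kernel D(k) = 1/3 - k_z^2 / |k|^2. The sketch below shows that generic loop; the network architecture, optimizer settings, and the absence of MoDIP's specific DFO scheme are assumptions.

```python
import torch

def dipole_kernel(shape, voxel_size=(1.0, 1.0, 1.0)):
    """Unit dipole kernel in k-space: D(k) = 1/3 - k_z^2 / |k|^2 (B0 along z)."""
    ks = [torch.fft.fftfreq(n, d=v) for n, v in zip(shape, voxel_size)]
    kx, ky, kz = torch.meshgrid(*ks, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    d = 1.0 / 3.0 - kz**2 / k2
    d[k2 == 0] = 0.0                                   # avoid the singularity at k = 0
    return d

def dip_style_reconstruction(phase, net, steps=500, lr=1e-2):
    """Fit an untrained network to one measurement through the QSM forward model.
    `net` maps a fixed random input to a susceptibility volume (architecture assumed)."""
    d = dipole_kernel(phase.shape)
    z = torch.randn(1, 1, *phase.shape)                # fixed network input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        chi = net(z).squeeze()                         # current susceptibility estimate
        pred_phase = torch.fft.ifftn(d * torch.fft.fftn(chi)).real
        loss = (pred_phase - phase).pow(2).mean()      # data fidelity to the measured field
        opt.zero_grad(); loss.backward(); opt.step()
    return net(z).squeeze().detach()
```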

Data augmentation and explainability for bias discovery and mitigation in deep learning

  • paper_url: http://arxiv.org/abs/2308.09464
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła
  • for: Studies the impact of bias in deep neural networks and proposes methods for reducing its influence on model performance.
  • methods: Uses several approaches to reduce bias, including Explainable AI, Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback.
  • results: The study shows that these methods reduce the influence of bias on model performance and improve model accuracy.
    Abstract This dissertation explores the impact of bias in deep neural networks and presents methods for reducing its influence on model performance. The first part begins by categorizing and describing potential sources of bias and errors in data and models, with a particular focus on bias in machine learning pipelines. The next chapter outlines a taxonomy and methods of Explainable AI as a way to justify predictions and control and improve the model. Then, as an example of a laborious manual data inspection and bias discovery process, a skin lesion dataset is manually examined. A Global Explanation for the Bias Identification method is proposed as an alternative semi-automatic approach to manual data exploration for discovering potential biases in data. Relevant numerical methods and metrics are discussed for assessing the effects of the identified biases on the model. Whereas identifying errors and bias is critical, improving the model and reducing the number of flaws in the future is an absolute priority. Hence, the second part of the thesis focuses on mitigating the influence of bias on ML models. Three approaches are proposed and discussed: Style Transfer Data Augmentation, Targeted Data Augmentations, and Attribution Feedback. Style Transfer Data Augmentation aims to address shape and texture bias by merging a style of a malignant lesion with a conflicting shape of a benign one. Targeted Data Augmentations randomly insert possible biases into all images in the dataset during the training, as a way to make the process random and, thus, destroy spurious correlations. Lastly, Attribution Feedback is used to fine-tune the model to improve its accuracy by eliminating obvious mistakes and teaching it to ignore insignificant input parts via an attribution loss. The goal of these approaches is to reduce the influence of bias on machine learning models, rather than eliminate it entirely.
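The Targeted Data Augmentation idea described above (randomly inserting possible bias artifacts during training so they stop correlating with any class) can be sketched as follows. The artifact type (a black square standing in for, e.g., a ruler or frame), its size and the insertion probability are illustrative assumptions.

```python
# Sketch of a targeted data augmentation that pastes a "bias artifact" into
# training images at random, independently of the label, to break spurious
# correlations. Artifact shape, size and probability are assumptions.
import numpy as np

def targeted_bias_augmentation(image, p=0.5, size=16, rng=None):
    """image: HxWxC uint8 array; returns a copy with a random artifact pasted."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    if rng.random() < p:
        h, w = image.shape[:2]
        y = rng.integers(0, max(1, h - size))
        x = rng.integers(0, max(1, w - size))
        out[y:y + size, x:x + size] = 0      # paste the artifact at a random location
    return out

# usage: apply inside the training data loader, regardless of the class label
img = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
augmented = targeted_bias_augmentation(img)
```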

Accelerated Bayesian imaging by relaxed proximal-point Langevin sampling

  • paper_url: http://arxiv.org/abs/2308.09460
  • repo_url: None
  • paper_authors: Teresa Klatzer, Paul Dobson, Yoann Altmann, Marcelo Pereyra, Jesús María Sanz-Serna, Konstantinos C. Zygalakis
  • for: 这篇论文提出了一种新的加速近端马尔可夫链蒙特卡洛(proximal MCMC)方法,用于在具有凸几何结构的成像反问题中进行贝叶斯推断。
  • methods: 该方法采用随机松弛近端点(Stochastic Relaxed Proximal-Point)迭代,并具有两种互补的解释:对于光滑或经 Moreau-Yosida 平滑正则化的模型,该算法等价于针对目标后验分布的过阻尼朗之万扩散的隐式中点离散;对于非光滑模型,该算法等价于针对后验分布的 Moreau-Yosida 近似的朗之万扩散的 Leimkuhler-Matthews 离散。
  • results: 论文给出了非渐近收敛分析:该离散对高斯目标渐近无偏,并对任意 $\kappa$-强对数凹目标以加速方式收敛(约需 $\sqrt{\kappa}$ 量级的迭代),同时比基于 Euler-Maruyama 离散的常规未调整朗之万方法具有显著更低的偏差;分析还给出了最大化收敛速度的最优步长。在高斯与泊松噪声下的图像去卷积实验中,该方法均取得了更快的收敛速度和更好的精度。
    Abstract This paper presents a new accelerated proximal Markov chain Monte Carlo methodology to perform Bayesian inference in imaging inverse problems with an underlying convex geometry. The proposed strategy takes the form of a stochastic relaxed proximal-point iteration that admits two complementary interpretations. For models that are smooth or regularised by Moreau-Yosida smoothing, the algorithm is equivalent to an implicit midpoint discretisation of an overdamped Langevin diffusion targeting the posterior distribution of interest. This discretisation is asymptotically unbiased for Gaussian targets and shown to converge in an accelerated manner for any target that is $\kappa$-strongly log-concave (i.e., requiring in the order of $\sqrt{\kappa}$ iterations to converge, similarly to accelerated optimisation schemes), comparing favorably to [M. Pereyra, L. Vargas Mieles, K.C. Zygalakis, SIAM J. Imaging Sciences, 13, 2 (2020), pp. 905-935] which is only provably accelerated for Gaussian targets and has bias. For models that are not smooth, the algorithm is equivalent to a Leimkuhler-Matthews discretisation of a Langevin diffusion targeting a Moreau-Yosida approximation of the posterior distribution of interest, and hence achieves a significantly lower bias than conventional unadjusted Langevin strategies based on the Euler-Maruyama discretisation. For targets that are $\kappa$-strongly log-concave, the provided non-asymptotic convergence analysis also identifies the optimal time step which maximizes the convergence speed. The proposed methodology is demonstrated through a range of experiments related to image deconvolution with Gaussian and Poisson noise, with assumption-driven and data-driven convex priors.
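For the non-smooth case, the abstract refers to the Leimkuhler-Matthews discretisation of a Langevin diffusion. The toy sketch below shows that discretisation on a 2D Gaussian target; the paper's relaxed proximal-point formulation, Moreau-Yosida smoothing and optimal step-size analysis are not reproduced, and the target, step size and iteration count are assumptions.

```python
# Toy numpy sketch of the Leimkuhler-Matthews discretisation of an overdamped
# Langevin diffusion targeting a 2D Gaussian. Consecutive iterations share one
# noise draw: x_{k+1} = x_k - d*grad U(x_k) + sqrt(2d)*(xi_k + xi_{k+1})/2.
import numpy as np

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])      # target covariance (assumed)
Sigma_inv = np.linalg.inv(Sigma)

def grad_U(x):
    """Gradient of the negative log-density of N(0, Sigma)."""
    return Sigma_inv @ x

rng = np.random.default_rng(0)
delta = 0.1                                     # step size (assumed)
x = np.zeros(2)
xi_prev = rng.standard_normal(2)
samples = []
for k in range(20000):
    xi_next = rng.standard_normal(2)
    x = x - delta * grad_U(x) + np.sqrt(2.0 * delta) * 0.5 * (xi_prev + xi_next)
    xi_prev = xi_next
    samples.append(x.copy())

print(np.cov(np.array(samples[2000:]).T))       # should approximate Sigma
```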

Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images

  • paper_url: http://arxiv.org/abs/2308.09436
  • repo_url: None
  • paper_authors: Nikolas Ebert, Didier Stricker, Oliver Wasenmüller
  • for: 该论文主要用于提高医疗或制药过程中的不间断清洁监测。
  • methods: 该论文使用了一种新型的变体自注意机制,称为高效全局自注意机制,以解决当前自动化检测技术面临的主要挑战。
  • results: 在公开的 AGAR 数据集上,该网络的准确率超过了当前最先进方法。此外,通过在 COCO 和 LIVECell 数据集上进行的进一步实验,我们证明了该方法与具体任务无关的性能。
    Abstract Many medical or pharmaceutical processes have strict guidelines regarding continuous hygiene monitoring. This often involves the labor-intensive task of manually counting microorganisms in Petri dishes by trained personnel. Automation attempts often struggle due to major challenges: significant scaling differences, low separation, low contrast, etc. To address these challenges, we introduce AttnPAFPN, a high-resolution detection pipeline that leverages a novel transformer variation, the efficient-global self-attention mechanism. Our streamlined approach can be easily integrated in almost any multi-scale object detection pipeline. In a comprehensive evaluation on the publicly available AGAR dataset, we demonstrate the superior accuracy of our network over the current state-of-the-art. In order to demonstrate the task-independent performance of our approach, we perform further experiments on COCO and LIVECell datasets.
    摘要 许多医疗或制药过程对持续卫生监测有严格的规定,这通常需要由经过培训的人员人工清点培养皿中的微生物,劳动强度很大。自动化尝试往往面临显著的尺度差异、低分离度、低对比度等主要挑战。为解决这些挑战,我们提出了 AttnPAFPN,一种高分辨率检测管线,利用一种新的 transformer 变体,即高效全局自注意力机制。我们的精简方法几乎可以集成到任何多尺度目标检测管线中。在公开的 AGAR 数据集上的全面评估中,我们的网络准确率超过了当前最先进方法。为证明该方法与具体任务无关,我们还在 COCO 和 LIVECell 数据集上进行了进一步实验。
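The abstract names an "efficient-global self-attention" mechanism without specifying it. The sketch below shows one common way of making global attention affordable on high-resolution feature maps, spatial reduction of keys and values before attention; this is a stand-in, and the paper's exact mechanism may differ.

```python
# Sketch of global self-attention made cheaper for high-resolution feature maps
# by spatially reducing keys/values (a common pattern; not the paper's exact module).
import torch
import torch.nn as nn

class EfficientGlobalAttention(nn.Module):
    def __init__(self, dim, heads=4, reduction=4):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, kernel_size=reduction, stride=reduction)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)             # queries: one per pixel
        k, v = self.kv(x).chunk(2, dim=1)                     # downsampled keys/values
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)                           # global context per pixel
        return x + self.proj(out.transpose(1, 2).reshape(b, c, h, w))

feat = torch.randn(2, 64, 32, 32)
print(EfficientGlobalAttention(64)(feat).shape)               # torch.Size([2, 64, 32, 32])
```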

Can ultrasound confidence maps predict sonographers’ labeling variability?

  • paper_url: http://arxiv.org/abs/2308.09433
  • repo_url: None
  • paper_authors: Vanessa Gonzalez Duque, Leonhard Zirus, Yordanka Velikova, Nassir Navab, Diana Mateus
  • for: 这个论文的目的是提出一种新的深度学习 segmentation 方法,以便考虑医生对 ultrasound 图像的注意力和不确定性。
  • methods: 该方法使用了 ultrasound 图像中的信任度映射 (CM),以帮助深度学习 segmentation 网络生成更加准确和多样化的预测结果。
  • results: 研究结果表明,使用超声置信度图可以提高 Dice 分数、改善 Hausdorff 距离和平均表面距离,并减少孤立像素预测。此外,研究还发现,超声置信度图能加强对真值数据中不确定区域的惩罚,从而改善有问题的插值。
    Abstract Measuring cross-sectional areas in ultrasound images is a standard tool to evaluate disease progress or treatment response. Often addressed today with supervised deep-learning segmentation approaches, existing solutions highly depend upon the quality of experts' annotations. However, the annotation quality in ultrasound is anisotropic and position-variant due to the inherent physical imaging principles, including attenuation, shadows, and missing boundaries, commonly exacerbated with depth. This work proposes a novel approach that guides ultrasound segmentation networks to account for sonographers' uncertainties and generate predictions with variability similar to the experts. We claim that realistic variability can reduce overconfident predictions and improve physicians' acceptance of deep-learning cross-sectional segmentation solutions. Our method provides CM's certainty for each pixel for minimal computational overhead as it can be precalculated directly from the image. We show that there is a correlation between low values in the confidence maps and expert's label uncertainty. Therefore, we propose to give the confidence maps as additional information to the networks. We study the effect of the proposed use of ultrasound CMs in combination with four state-of-the-art neural networks and in two configurations: as a second input channel and as part of the loss. We evaluate our method on 3D ultrasound datasets of the thyroid and lower limb muscles. Our results show ultrasound CMs increase the Dice score, improve the Hausdorff and Average Surface Distances, and decrease the number of isolated pixel predictions. Furthermore, our findings suggest that ultrasound CMs improve the penalization of uncertain areas in the ground truth data, thereby improving problematic interpolations. Our code and example data will be made public at https://github.com/IFL-CAMP/Confidence-segmentation.
    摘要 在超声图像中测量横截面面积是评估疾病进展或治疗反应的常用工具。目前这一问题通常用有监督深度学习分割方法来解决,而现有方案高度依赖专家标注的质量。然而,由于超声成像固有的物理原理(包括衰减、声影和边界缺失,且通常随深度加剧),其标注质量具有各向异性并随位置变化。本文提出一种新方法,引导超声分割网络考虑超声医师的不确定性,生成与专家标注变化相似的预测结果。我们认为这种真实的变化可以减少过度自信的预测,并提高医生对深度学习横截面分割方案的接受度。我们的方法为每个像素提供置信度图(CM)的确定性,且计算开销极小,因为它可直接由图像预先计算得到。我们发现置信度图中的低值与专家标注的不确定性相关,因此提议将置信度图作为附加信息提供给网络。我们结合四种最先进的神经网络、在两种配置(作为第二输入通道、作为损失的一部分)下研究了超声置信度图的作用,并在甲状腺和下肢肌肉的3D超声数据集上进行评估。结果显示,超声置信度图可提高 Dice 分数、改善 Hausdorff 距离和平均表面距离,并减少孤立像素预测。此外,超声置信度图能加强对真值数据中不确定区域的惩罚,从而改善有问题的插值。我们的代码和示例数据将在 https://github.com/IFL-CAMP/Confidence-segmentation 公开。
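The two configurations described in the abstract (confidence map as a second input channel, and inside the loss) can be illustrated with the minimal sketch below; the specific weighting scheme is an assumption, not the paper's exact formulation.

```python
# Minimal sketch of using an ultrasound confidence map (CM) in the two ways the
# abstract describes: as an extra input channel and as a per-pixel loss weight.
import torch
import torch.nn.functional as F

def with_cm_channel(image, cm):
    """Stack the CM as an extra input channel: (B,1,H,W)+(B,1,H,W) -> (B,2,H,W)."""
    return torch.cat([image, cm], dim=1)

def cm_weighted_bce(logits, target, cm, eps=1e-6):
    """Down-weight pixels where the CM says the image (and hence the label) is unreliable."""
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    weights = cm.clamp(min=eps)
    return (weights * per_pixel).sum() / weights.sum()

image = torch.rand(4, 1, 128, 128)
cm = torch.rand(4, 1, 128, 128)       # precomputed from the image, values in [0, 1]
target = (torch.rand(4, 1, 128, 128) > 0.5).float()
logits = torch.randn(4, 1, 128, 128)
x_in = with_cm_channel(image, cm)     # feed x_in to any 2-channel segmentation network
loss = cm_weighted_bce(logits, target, cm)
```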

Self-Supervised Single-Image Deconvolution with Siamese Neural Networks

  • paper_url: http://arxiv.org/abs/2308.09426
  • repo_url: None
  • paper_authors: Mikhail Papkov, Kaupo Palo, Leopold Parts
  • for: 这篇论文旨在提出一种自监督的单幅图像去卷积方法,以提高图像重建的质量。
  • methods: 该方法在端到端训练中引入已知的点扩散函数,采用自监督盲点(blind-spot)神经网络直接从数据中学习噪声特性,并使用快速傅里叶变换卷积加速3D训练,同时引入孪生(Siamese)不变性损失。
  • results: 实验结果表明,改进后的框架在3D显微图像去卷积任务中优于此前已知点扩散函数下的最先进去卷积方法。
    Abstract Inverse problems in image reconstruction are fundamentally complicated by unknown noise properties. Classical iterative deconvolution approaches amplify noise and require careful parameter selection for an optimal trade-off between sharpness and grain. Deep learning methods allow for flexible parametrization of the noise and learning its properties directly from the data. Recently, self-supervised blind-spot neural networks were successfully adopted for image deconvolution by including a known point-spread function in the end-to-end training. However, their practical application has been limited to 2D images in the biomedical domain because it implies large kernels that are poorly optimized. We tackle this problem with Fast Fourier Transform convolutions that provide training speed-up in 3D microscopy deconvolution tasks. Further, we propose to adopt a Siamese invariance loss for deconvolution and empirically identify its optimal position in the neural network between blind-spot and full image branches. The experimental results show that our improved framework outperforms the previous state-of-the-art deconvolution methods with a known point spread function.
    摘要 图像重建中的逆问题因噪声特性未知而变得复杂。经典的迭代去卷积方法会放大噪声,需要仔细选择参数以在锐度与颗粒感之间取得最佳平衡。深度学习方法可以灵活地参数化噪声,并直接从数据中学习其特性。最近,自监督盲点神经网络通过在端到端训练中引入已知的点扩散函数被成功用于图像去卷积;但其实际应用仅限于生物医学领域的2D图像,因为这意味着难以优化的大卷积核。我们使用快速傅里叶变换卷积来解决这一问题,从而加速3D显微图像去卷积任务的训练。此外,我们提出在去卷积中采用孪生不变性损失,并通过实验确定其在盲点分支与完整图像分支之间的最佳位置。实验结果表明,我们改进的框架优于此前已知点扩散函数下的最先进去卷积方法。
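The speed-up attributed to Fourier-domain convolutions can be illustrated with the short sketch below: a large known PSF is applied to a 3D volume as a pointwise product in frequency space. Circular boundary handling and the PSF centering are simplifying assumptions.

```python
# Sketch of convolving a 3D volume with a large known PSF via the FFT, which is
# the speed-up idea the abstract mentions. Circular boundaries are assumed.
import torch

def fft_convolve3d(volume, psf):
    """volume, psf: real 3D tensors of the same shape; returns circular convolution."""
    V = torch.fft.rfftn(volume)
    P = torch.fft.rfftn(torch.fft.ifftshift(psf))  # move the PSF center to the origin
    return torch.fft.irfftn(V * P, s=volume.shape)

vol = torch.rand(64, 64, 64)
psf = torch.zeros(64, 64, 64)
psf[28:36, 28:36, 28:36] = 1.0
psf /= psf.sum()                                   # normalized blur kernel
blurred = fft_convolve3d(vol, psf)                 # differentiable, so usable in training
```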

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09421
  • repo_url: https://github.com/cskkxjk/mononerd
  • paper_authors: Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, Deng Cai
  • for: 提高单目3D检测器的性能,使其能够更好地检测远离和受阻物体。
  • methods: 使用Signed Distance Functions (SDF)模型场景,然后将这些表示转化为Neural Radiance Fields (NeRF),并使用卷积渲染来恢复RGB图像和深度图。
  • results: 在KITTI-3D benchmark和Waymo Open Dataset上进行了广泛的实验,并得到了效果报告。
    Abstract In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD.
    摘要 在单目3D探测领域,广泛采用场景几何准确来提高探测器的性能。然而,许多现有工作直接使用这些准确,如计算深度图并将其投影到3D空间中。这种直接方法会导致3D表示的稀疏性,特别是对于远距离和遮挡的对象。为了解决这个问题,我们提出了MonoNeRD检测框架,可以对场景进行精度的3D几何和占用率的推测。具体来说,我们使用signed distance functions(SDF)来建模场景,并将这些表示转换为神经辐射场(NeRF)。然后,我们使用卷积渲染来恢复RGB图像和深度图。我们知道,这是第一个在M3D中应用卷积渲染的工作,并证明了隐式重建的潜在优势。我们在KITTI-3D标准测试集和Waymo开放数据集进行了广泛的实验,并证明了MonoNeRD的有效性。代码可以在https://github.com/cskkxjk/MonoNeRD中找到。
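The compositing step of the volume rendering the abstract relies on is standard and can be written compactly as below; the SDF-to-density conversion and MonoNeRD's detection head are not shown, and the sample densities and colors here are placeholders.

```python
# Standard NeRF-style volume rendering along a single ray: composite per-sample
# colors with weights derived from densities (only the compositing step).
import torch

def volume_render(densities, colors, deltas):
    """densities: (N,), colors: (N, 3), deltas: (N,) distances between samples."""
    alpha = 1.0 - torch.exp(-densities * deltas)              # opacity per sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)         # transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])            # T_i depends on samples j < i
    weights = alpha * trans
    rgb = (weights[:, None] * colors).sum(dim=0)              # rendered pixel color
    depth = (weights * deltas.cumsum(dim=0)).sum()            # approximate expected depth
    return rgb, depth

sigma = torch.rand(64)                # densities along one ray (placeholder)
rgb_samples = torch.rand(64, 3)
deltas = torch.full((64,), 0.05)
pixel_rgb, pixel_depth = volume_render(sigma, rgb_samples, deltas)
```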

Metadata Improves Segmentation Through Multitasking Elicitation

  • paper_url: http://arxiv.org/abs/2308.09411
  • repo_url: None
  • paper_authors: Iaroslav Plutenko, Mikhail Papkov, Kaupo Palo, Leopold Parts, Dmytro Fishman
  • for: 这个论文主要是为了提高深度学习方法中的 semantic segmentation 性能。
  • methods: 这个论文使用了通道调制机制,将metadata作为 convolutional network 的输入,以提高 segmentation 结果。
  • results: 论文表明,通过使用metadata,可以提高 segmentation 结果,而且这种方法具有低成本和可以与现有模型结合使用的优点。
    Abstract Metainformation is a common companion to biomedical images. However, this potentially powerful additional source of signal from image acquisition has had limited use in deep learning methods, for semantic segmentation in particular. Here, we incorporate metadata by employing a channel modulation mechanism in convolutional networks and study its effect on semantic segmentation tasks. We demonstrate that metadata as additional input to a convolutional network can improve segmentation results while being inexpensive in implementation as a nimble add-on to popular models. We hypothesize that this benefit of metadata can be attributed to facilitating multitask switching. This aspect of metadata-driven systems is explored and discussed in detail.
    摘要 这里的metadata是对生物医学影像的常见伴侣。然而,这个潜在强大的额外讯号仍然受到深度学习方法中的有限使用,尤其是在 semantic segmentation 方面。在这里,我们通过将metadata作为构成元件加入卷积网络中,研究 metadata 在 semantic segmentation 任务中的影响。我们证明了 metadata 作为卷积网络的额外输入,可以提高分类结果,而且实现起来相对便宜。我们推测这个metadata的好处可以归因于促进多任务转换。这个metadata-驱动的系统的内部运作和讨论在详细的文章中进行了详细的探讨。
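The channel modulation mechanism described above can be sketched as a FiLM-style per-channel scale and shift predicted from the metadata; the paper's exact module may differ, and the metadata encoding here is an illustrative assumption.

```python
# Minimal sketch of conditioning convolutional features on metadata via
# per-channel scale and shift (FiLM-style modulation); a stand-in, not the paper's code.
import torch
import torch.nn as nn

class MetadataModulation(nn.Module):
    def __init__(self, meta_dim, channels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, 64), nn.ReLU(),
            nn.Linear(64, 2 * channels),          # predicts gamma and beta
        )

    def forward(self, features, metadata):        # features: (B, C, H, W), metadata: (B, D)
        gamma, beta = self.mlp(metadata).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return features * (1.0 + gamma) + beta    # modulate channels per sample

feats = torch.randn(8, 32, 64, 64)
meta = torch.randn(8, 5)                           # e.g. encoded acquisition settings
mod = MetadataModulation(meta_dim=5, channels=32)
print(mod(feats, meta).shape)                      # torch.Size([8, 32, 64, 64])
```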

Generalizable Decision Boundaries: Dualistic Meta-Learning for Open Set Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.09391
  • repo_url: https://github.com/zzwdx/medic
  • paper_authors: Xiran Wang, Jian Zhang, Lei Qi, Yinghuan Shi
  • for: 本研究旨在解决频繁出现在目标领域中的领域转换问题,特别是当源领域和目标领域有不同的类型时。
  • methods: 本研究使用多个一对一分类器来定义决策边界,并将外围样本拒绝为未知样本。然而,在许多情况下,正样本和负样本的类别数量不均衡,导致决策边界偏向正样本,从而导致已知样本在目标领域中的误分类。本研究提出了一种基于元学习的框架,即双重MEta-learning with joint DomaIn-Class matching(MEDIC),它同时考虑了多个领域和类别的梯度匹配,以找到一个总体可靠的、Balanced的决策边界。
  • results: 实验结果表明,MEDIC不仅在开集enario下超过了先前的方法,同时也保持了相对较好的闭集泛化能力。 Code可以在https://github.com/zzwdx/MEDIC中找到。
    Abstract Domain generalization (DG) is proposed to deal with the issue of domain shift, which occurs when statistical differences exist between source and target domains. However, most current methods do not account for a common realistic scenario where the source and target domains have different classes. To overcome this deficiency, open set domain generalization (OSDG) then emerges as a more practical setting to recognize unseen classes in unseen domains. An intuitive approach is to use multiple one-vs-all classifiers to define decision boundaries for each class and reject the outliers as unknown. However, the significant class imbalance between positive and negative samples often causes the boundaries biased towards positive ones, resulting in misclassification for known samples in the unseen target domain. In this paper, we propose a novel meta-learning-based framework called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers gradient matching towards inter-domain and inter-class splits simultaneously to find a generalizable boundary balanced for all tasks. Experimental results demonstrate that MEDIC not only outperforms previous methods in open set scenarios, but also maintains competitive close set generalization ability at the same time. Our code is available at https://github.com/zzwdx/MEDIC.
    摘要 域外泛化(DG)是为了解决域shift问题而提出的,域shift问题是指源领域和目标领域存在统计上的差异。然而,现有的大多数方法不能考虑到预测领域和目标领域中的类不同的常见情况。为了解决这个不足,开放集领域泛化(OSDG)作为更实际的设定,用于识别未经见过的类。一种直观的方法是使用多个一对一分类器来定义决策边界,并将异常值作为未知样本拒绝。然而,难以平衡的类划分常导致决策边界偏向正样本,从而导致在未经见过的目标领域中误分类已知样本。在这篇论文中,我们提出了一种基于元学习的框架,即双向MEta-learning with joint DomaIn-Class matching(MEDIC)。该框架同时考虑到域间和类间的梯度匹配,以找到一个泛化性能良好的、对所有任务均匀的决策边界。实验结果表明,MEDIC不仅在开放集领域中表现出优于前方法,还能够保持与闭set泛化能力相当的竞争力。我们的代码可以在https://github.com/zzwdx/MEDIC上找到。

Diffusion Models for Image Restoration and Enhancement – A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2308.09388
  • repo_url: https://github.com/lixinustc/awesome-diffusion-model-for-image-processing
  • paper_authors: Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, Zhibo Chen
  • for: 本文旨在为图像修复(IR)领域提供一份总结,尤其是在使用扩散模型(Diffusion Model)进行图像修复方面。
  • methods: 本文总结了最新的扩散模型基于的图像修复方法,包括学习框架、条件策略、模型设计等方面,并对现有的方法进行评价。
  • results: 本文对现有的扩散模型基于的图像修复方法进行了全面的总结,并提出了五个可能的未来研究方向,包括样本效率、模型压缩、质量评价等方面。
    Abstract Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, "whether diffusion model can boost image restoration". To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly-used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
    摘要 图像修复(IR)是低层视觉领域中不可或缺且具有挑战性的任务,旨在提升被各种退化破坏的图像的主观质量。近来,扩散模型在AIGC视觉生成方面取得了显著进展,从而引出一个直观的问题:"扩散模型能否助力图像修复"?为回答这一问题,一些开创性研究尝试将扩散模型引入图像修复任务,取得了优于以往基于GAN方法的性能。尽管如此,目前仍缺乏一份全面而有启发性的基于扩散模型的图像修复综述。在本文中,我们首次对近期基于扩散模型的图像修复方法进行了全面回顾,涵盖学习范式、条件策略、框架设计、建模策略和评估方式。具体而言,我们先简要介绍扩散模型的背景,然后给出在图像修复中利用扩散模型的两种常见工作流程;随后对面向一般图像修复以及盲/真实世界图像修复的扩散模型创新设计进行分类与强调,以启发后续发展。为全面评估现有方法,我们总结了常用数据集、实现细节和评价指标,并对开源方法在图像超分辨率、去模糊和图像修补三个任务上进行了客观比较。最后,基于现有工作的局限性,我们提出了基于扩散模型的图像修复未来研究的五个具有挑战性的潜在方向:采样效率、模型压缩、退化模拟与估计、退化不变学习以及框架设计。

DReg-NeRF: Deep Registration for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.09386
  • repo_url: https://github.com/aibluefisher/dreg-nerf
  • paper_authors: Yu Chen, Gim Hee Lee
  • for: 本研究的目的是解决基于物体中心的场景下多个NeRF的注册问题,而不需要人工标注关键点。
  • methods: 我们提出了一种基于transformer架构的DReg-NeRF方法,首先提取NeRF中的占据网格特征,然后使用自我关注和相关关注层来学习对应NeRF块之间的关系。与现有的点云注册方法不同,我们的方法不需要任何人工标注,并且使用表面场所supervise协调对应点云。
  • results: 我们在测试集上评估了我们的方法,与现有的点云注册方法相比,我们的方法在mean $\text{RPE}$和mean $\text{RTE}$上都有大幅度的提升,分别为9.67度和0.038。
    Abstract Although Neural Radiance Fields (NeRF) is popular in the computer vision community recently, registering multiple NeRFs has yet to gain much attention. Unlike the existing work, NeRF2NeRF, which is based on traditional optimization methods and needs human annotated keypoints, we propose DReg-NeRF to solve the NeRF registration problem on object-centric scenes without human intervention. After training NeRF models, our DReg-NeRF first extracts features from the occupancy grid in NeRF. Subsequently, our DReg-NeRF utilizes a transformer architecture with self-attention and cross-attention layers to learn the relations between pairwise NeRF blocks. In contrast to state-of-the-art (SOTA) point cloud registration methods, the decoupled correspondences are supervised by surface fields without any ground truth overlapping labels. We construct a novel view synthesis dataset with 1,700+ 3D objects obtained from Objaverse to train our network. When evaluated on the test set, our proposed method beats the SOTA point cloud registration methods by a large margin, with a mean $\text{RPE}=9.67^{\circ}$ and a mean $\text{RTE}=0.038$. Our code is available at https://github.com/AIBluefisher/DReg-NeRF.
    摘要 尽管神经辐射场(NeRF)近来在计算机视觉领域十分流行,多个NeRF之间的配准问题却尚未得到充分关注。与基于传统优化方法且需要人工标注关键点的已有工作NeRF2NeRF不同,我们提出DReg-NeRF,在无需人工干预的情况下解决以物体为中心场景的NeRF配准问题。在训练好NeRF模型后,DReg-NeRF首先从NeRF的占据网格中提取特征,随后利用带有自注意力和交叉注意力层的transformer架构学习成对NeRF块之间的关系。与最先进的点云配准方法不同,解耦的对应关系由表面场监督,无需任何真实重叠标签。我们基于Objaverse中的1700多个3D物体构建了一个新的视图合成数据集来训练网络。在测试集上评估时,我们的方法大幅超越最先进的点云配准方法,平均RPE为9.67度,平均RTE为0.038。代码可在 https://github.com/AIBluefisher/DReg-NeRF 获取。

Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events

  • paper_url: http://arxiv.org/abs/2308.09383
  • repo_url: None
  • paper_authors: Hoonhee Cho, Hyeonseong Kim, Yujeong Chae, Kuk-Jin Yoon
  • for: 本研究旨在实现没有类别标签和对应图像的事件基于对象识别。
  • methods: 我们提出了一种结合对象识别和图像重建的共同形式,首先将事件重建为图像,然后通过对比语言-图像预训练(CLIP)进行对象识别,并使用类别导向吸引损失和类别无关排斥损失将文本特征与重建图像的视觉特征相吸引。我们还提出了一种可靠的数据采样策略和本地-全局重建一致性来促进两个任务的共同学习。
  • results: 我们的方法在预测和重建质量方面具有明显的优势,并且可以进行零样本对象识别。我们的项目代码可以在 \url{https://github.com/Chohoonhee/Ev-LaFOR} 上下载。
    Abstract Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstructs images from events and performs object recognition through Contrastive Language-Image Pre-training (CLIP), enabling better recognition through a rich context of images. Since the category information is essential in reconstructing images, we propose category-guided attraction loss and category-agnostic repulsion loss to bridge the textual features of predicted categories and the visual features of reconstructed images using CLIP. Moreover, we introduce a reliable data sampling strategy and local-global reconstruction consistency to boost joint learning of two tasks. To enhance the accuracy of prediction and quality of reconstruction, we also propose a prototype-based approach using unpaired images. Extensive experiments demonstrate the superiority of our method and its extensibility for zero-shot object recognition. Our project code is available at \url{https://github.com/Chohoonhee/Ev-LaFOR}.
    摘要 当配对图像和类别标签都不存在时,从稀疏且带噪的事件中识别物体变得极为困难。本文研究无标签的基于事件的物体识别,其中类别标签和配对图像均不可用。为此,我们以互补的方式提出了物体识别与图像重建的联合建模:方法首先从事件重建图像,再通过对比语言-图像预训练(CLIP)进行物体识别,从而借助丰富的图像上下文提升识别效果。由于类别信息对图像重建至关重要,我们提出类别引导的吸引损失和与类别无关的排斥损失,利用CLIP将预测类别的文本特征与重建图像的视觉特征联系起来。此外,我们引入可靠的数据采样策略和局部-全局重建一致性,以促进两个任务的联合学习。为进一步提高预测精度和重建质量,我们还提出了一种利用未配对图像的原型方法。大量实验证明了该方法的优越性及其在零样本物体识别上的可扩展性。项目代码见 \url{https://github.com/Chohoonhee/Ev-LaFOR}。

Image Processing and Machine Learning for Hyperspectral Unmixing: An Overview and the HySUPP Python Package

  • paper_url: http://arxiv.org/abs/2308.09375
  • repo_url: https://github.com/behnoodrasti/hysupp
  • paper_authors: Behnood Rasti, Alexandre Zouaoui, Julien Mairal, Jocelyn Chanussot
  • for: 本文提供了一个概述先进和传统混合分析方法的评 comparison,以及三种不同类型的 linear unmixing 方法的性能对比。
  • methods: 本文使用了先进的 Image processing 和机器学习技术,包括超vised、semi-supervised 和 blind linear unmixing 方法。
  • results: 实验结果表明,不同类型的 unmixing 方法在不同的混合分析场景下有不同的优势,以及一个开源的 Python 基于包可以在 https://github.com/BehnoodRasti/HySUPP 上下载。
    Abstract Spectral pixels are often a mixture of the pure spectra of the materials, called endmembers, due to the low spatial resolution of hyperspectral sensors, double scattering, and intimate mixtures of materials in the scenes. Unmixing estimates the fractional abundances of the endmembers within the pixel. Depending on the prior knowledge of endmembers, linear unmixing can be divided into three main groups: supervised, semi-supervised, and unsupervised (blind) linear unmixing. Advances in Image processing and machine learning substantially affected unmixing. This paper provides an overview of advanced and conventional unmixing approaches. Additionally, we draw a critical comparison between advanced and conventional techniques from the three categories. We compare the performance of the unmixing techniques on three simulated and two real datasets. The experimental results reveal the advantages of different unmixing categories for different unmixing scenarios. Moreover, we provide an open-source Python-based package available at https://github.com/BehnoodRasti/HySUPP to reproduce the results.
    摘要 由于高光谱传感器空间分辨率较低、存在二次散射以及场景中材料的紧密混合,光谱像素往往是若干纯净材料光谱(称为端元)的混合。解混的目标是估计像素内各端元的丰度比例。根据对端元先验知识的不同,线性解混可分为三大类:有监督、半监督和无监督(盲)线性解混。图像处理与机器学习的进展对解混产生了重要影响。本文综述了先进与传统的解混方法,并对来自上述三类的先进与传统技术进行了批判性比较。我们在三个模拟数据集和两个真实数据集上比较了各类解混技术的性能,实验结果揭示了不同类别方法在不同解混场景下的优势。此外,我们提供了一个开源的 Python 软件包,可在 https://github.com/BehnoodRasti/HySUPP 获取以复现结果。
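For the supervised case with known endmembers, linear unmixing reduces to a constrained least-squares problem per pixel. The sketch below uses non-negative least squares followed by renormalization as a simple stand-in; fully constrained least squares and the solvers shipped with HySUPP handle the sum-to-one constraint more rigorously.

```python
# Minimal sketch of supervised linear unmixing with a known endmember matrix:
# per-pixel non-negative least squares, renormalized so abundances sum to one.
import numpy as np
from scipy.optimize import nnls

def unmix_nnls(pixels, endmembers):
    """pixels: (N, B) spectra; endmembers: (R, B); returns (N, R) abundances."""
    E = endmembers.T                                   # (B, R)
    abundances = np.zeros((pixels.shape[0], endmembers.shape[0]))
    for i, y in enumerate(pixels):
        a, _ = nnls(E, y)                              # non-negativity constraint
        s = a.sum()
        abundances[i] = a / s if s > 0 else a          # approximate sum-to-one
    return abundances

rng = np.random.default_rng(0)
E = rng.random((3, 50))                                # 3 endmembers, 50 bands (toy)
true_a = rng.dirichlet(np.ones(3), size=100)           # ground-truth abundances
Y = true_a @ E + 0.01 * rng.standard_normal((100, 50)) # mixed pixels with noise
est_a = unmix_nnls(Y, E)
print(np.abs(est_a - true_a).mean())                   # small estimation error
```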

Single Frame Semantic Segmentation Using Multi-Modal Spherical Images

  • paper_url: http://arxiv.org/abs/2308.09369
  • repo_url: https://github.com/sguttikon/SFSS-MMSI
  • paper_authors: Suresh Guttikonda, Jason Rambach
  • for: 这篇论文旨在探讨多modal的拼接和扩展对全景场景理解的可能性。
  • methods: 该论文提出了一种基于transformer的跨Modal融合架构,以AddressExtreme object deformation和全景图像扭曲。
  • results: 该论文在三个indoor全景视图数据集上进行了广泛的测试,并达到了当前领域最佳的mIoU性能:60.60%在Stanford2D3DS(RGB-HHA)、71.97%在Structured3D(RGB-D-N)以及35.92%在Matterport3D(RGB-D)。
    Abstract In recent years, the research community has shown a lot of interest to panoramic images that offer a 360-degree directional perspective. Multiple data modalities can be fed, and complimentary characteristics can be utilized for more robust and rich scene interpretation based on semantic segmentation, to fully realize the potential. Existing research, however, mostly concentrated on pinhole RGB-X semantic segmentation. In this study, we propose a transformer-based cross-modal fusion architecture to bridge the gap between multi-modal fusion and omnidirectional scene perception. We employ distortion-aware modules to address extreme object deformations and panorama distortions that result from equirectangular representation. Additionally, we conduct cross-modal interactions for feature rectification and information exchange before merging the features in order to communicate long-range contexts for bi-modal and tri-modal feature streams. In thorough tests using combinations of four different modality types in three indoor panoramic-view datasets, our technique achieved state-of-the-art mIoU performance: 60.60% on Stanford2D3DS (RGB-HHA), 71.97% Structured3D (RGB-D-N), and 35.92% Matterport3D (RGB-D). We plan to release all codes and trained models soon.
    摘要 近年来,研究社区对全景图像(panoramic image)表现出了很大的兴趣,这些图像可以提供360度方向的视角。多种数据模式可以被 fed,并可以利用不同特征进行更加robust和 ricn scene解释,以实现潜在的潜力。然而,现有的研究主要集中在pinhole RGB-X semantic segmentation。在本研究中,我们提议一种基于 transformer 混合模型来bridging the gap between multi-modal fusion和全景场景理解。我们使用扭曲感知模块来处理极端对象扭曲和全景图像扭曲,并进行交叉模式交互以便Feature rectification和信息交换,以便在bi-modal和tri-modal流水线中传递长距离上下文。在使用四种不同模式类型的三个indoor panoramic-view数据集进行了经验详细测试后,我们的技术实现了状态的最佳mIoU性能:60.60%在Stanford2D3DS(RGB-HHA)、71.97%在Structured3D(RGB-D-N)以及35.92%在Matterport3D(RGB-D)。我们计划在不久将所有代码和训练模型公开。

Overlap Bias Matching is Necessary for Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2308.09364
  • repo_url: None
  • paper_authors: Pengcheng Shi, Jie Zhang, Haozhe Cheng, Junyang Wang, Yiyang Zhou, Chenlin Zhao, Jihua Zhu
  • for: 本研究旨在提出一种基于无监督学习的点云注册方法,以解决实际中点云注册问题中的受限 overlap 问题。
  • methods: 本方法基于一个名为 Overlap Bias Matching Network (OBMNet),它包括两个Integral component: overlap sampling module和 bias prediction module。这两个组件可以捕捉点云重叠区域的分布,并预测点云共同结构的偏置系数。然后,我们将OBMM与邻居地图匹配模块结合,以精确地挖掘对应关系。
  • results: 实验结果表明,我们的方法在各种数据集上达到了与当前注册方法相比较显著的改进。
    Abstract Point cloud registration is a fundamental problem in many domains. Practically, the overlap between point clouds to be registered may be relatively small. Most unsupervised methods lack effective initial evaluation of overlap, leading to suboptimal registration accuracy. To address this issue, we propose an unsupervised network Overlap Bias Matching Network (OBMNet) for partial point cloud registration. Specifically, we propose a plug-and-play Overlap Bias Matching Module (OBMM) comprising two integral components, overlap sampling module and bias prediction module. These two components are utilized to capture the distribution of overlapping regions and predict bias coefficients of point cloud common structures, respectively. Then, we integrate OBMM with the neighbor map matching module to robustly identify correspondences by precisely merging matching scores of points within the neighborhood, which addresses the ambiguities in single-point features. OBMNet can maintain efficacy even in pair-wise registration scenarios with low overlap ratios. Experimental results on extensive datasets demonstrate that our approach's performance achieves a significant improvement compared to the state-of-the-art registration approach.
    摘要 点云注册是许多领域的基本问题。实际上,注册点云的重叠部分可能很小。大多数无监督方法缺乏有效的初始评估重叠,导致注册精度下降。为解决这个问题,我们提议一种无监督网络Overlap Bias Matching Network(OBMNet),用于部分点云注册。具体来说,我们提出了一个插件式的Overlap Bias Matching Module(OBMM),包括两个基本组成部分:重叠采样模块和偏好预测模块。这两个组成部分用于捕捉重叠区域的分布和预测点云共同结构的偏好系数,分别。然后,我们将OBMM与邻居地图匹配模块集成,以robustly确定对应关系,解决单点特征之间的混淆。OBMNet可以在对点云注册方式进行对比时保持效率。实验结果表明,我们的方法在评估中达到了与当前注册方法相比较显著的提升。

Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

  • paper_url: http://arxiv.org/abs/2308.09363
  • repo_url: https://github.com/mlvlab/ovqa
  • paper_authors: Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J. Kim
  • for: 这个论文的目的是提出一个新的benchmark,以评估视频问答模型的泛化能力。
  • methods: 这个论文使用了一种新的GNN-based soft verbalizer来提高模型对不常见答案的预测能力,同时也对现有的开放端VIDEOQA模型进行了修改,以便进一步考虑罕见和未看过的答案。
  • results: 该论文的实验结果表明,使用GNN-based soft verbalizer可以进一步提高模型的泛化能力,特别是对罕见和未看过的答案。此外,修改了现有的开放端VIDEOQA模型也可以提高其表现。
    Abstract Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA which aims to predict the answer given several options, the goal of open-ended VideoQA is to answer questions without restricting candidate answers. However, the majority of previous VideoQA models formulate open-ended VideoQA as a classification task to classify the video-question pairs into a fixed answer set, i.e., closed-vocabulary, which contains only frequent answers (e.g., top-1000 answers). This leads the model to be biased toward only frequent answers and fail to generalize on out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, in order to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers by aggregating the information from their similar words. For evaluation, we introduce new baselines by modifying the existing (closed-vocabulary) open-ended VideoQA models and improve their performances by further taking into account rare and unseen answers. Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance, especially on rare and unseen answers. We hope that our benchmark OVQA can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA.
    摘要 视频问答(VideoQA)是一项复杂的多模态逻辑任务。相比多选视频问答,开放式视频问答的目标是Answer questions without restricting candidate answers。然而,大多数前一代视频问答模型将开放式视频问答视为一种分类任务,即将视频-问题对 clasified into a fixed answer set,即关闭词汇,这会导致模型偏向常见答案,而忽略非常见答案。我们因此提出了一个新的标准测试集,开放词汇视频问答(OVQA),以衡量视频问答模型的通用性。此外,为提高模型的通用性能力,我们引入了一种新的图神经网络(GNN)基于的软化词语izer,该算法可以通过聚合相似词语的信息来提高对不常见答案的预测。我们的基线和精度分析表明,我们的GNN基于的软化词语izer可以进一步提高模型的性能,特别是对于不常见答案。我们希望,我们的标准测试集OVQA可以为评估视频问答模型的通用性提供指南,并且能启发未来的研究。代码可以在https://github.com/mlvlab/OVQA 中找到。
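The abstract describes a GNN-based soft verbalizer that lets rare or unseen answers borrow information from similar words. The sketch below illustrates that idea with a similarity graph over answer embeddings and one round of normalized neighbor averaging; the paper's actual GNN layers, graph construction and mixing weight are not specified here, so all of those are assumptions.

```python
# Minimal sketch of a graph-based soft verbalizer: build a similarity graph over
# answer-word embeddings and aggregate neighbor information so rare answers are
# enriched by similar frequent ones. One averaging round stands in for GNN layers.
import torch
import torch.nn.functional as F

def soft_verbalizer(answer_emb, k=5, mix=0.5):
    """answer_emb: (A, D) embeddings of all candidate answers."""
    emb = F.normalize(answer_emb, dim=1)
    sim = emb @ emb.t()                                   # cosine similarity graph
    topk = sim.topk(k + 1, dim=1)                         # self + k nearest answers
    adj = torch.zeros_like(sim).scatter_(1, topk.indices, topk.values.clamp(min=0))
    adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalized adjacency
    aggregated = adj @ answer_emb                         # neighbor aggregation
    return (1 - mix) * answer_emb + mix * aggregated      # enriched answer embeddings

emb = torch.randn(1000, 512)                              # placeholder answer text features
enriched = soft_verbalizer(emb)
print(enriched.shape)                                     # torch.Size([1000, 512])
```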

Multi-scale Target-Aware Framework for Constrained Image Splicing Detection and Localization

  • paper_url: http://arxiv.org/abs/2308.09357
  • repo_url: None
  • paper_authors: Yuxuan Tan, Yuanman Li, Limin Zeng, Jiaxiong Ye, Wei wang, Xia Li
  • for: 批处 multimedia 图像中的剪辑检测和定位问题,即检测两个受控图像之间的剪辑操作并定位剪辑区域在两个图像上。
  • methods: 我们提出了一种多尺度目标意识框架,将特征提取和相关比较作为一个独立的管道进行联合处理,并设计了一种目标意识注意机制,使模型可以同时学习特征和相关比较。
  • results: 我们的方法在多个标准数据集上进行测试,与现有方法相比,具有更高的检测精度和更好的灵活性,并且可以有效地处理缩放变换。
    Abstract Constrained image splicing detection and localization (CISDL) is a fundamental task of multimedia forensics, which detects splicing operation between two suspected images and localizes the spliced region on both images. Recent works regard it as a deep matching problem and have made significant progress. However, existing frameworks typically perform feature extraction and correlation matching as separate processes, which may hinder the model's ability to learn discriminative features for matching and can be susceptible to interference from ambiguous background pixels. In this work, we propose a multi-scale target-aware framework to couple feature extraction and correlation matching in a unified pipeline. In contrast to previous methods, we design a target-aware attention mechanism that jointly learns features and performs correlation matching between the probe and donor images. Our approach can effectively promote the collaborative learning of related patches, and perform mutual promotion of feature learning and correlation matching. Additionally, in order to handle scale transformations, we introduce a multi-scale projection method, which can be readily integrated into our target-aware framework that enables the attention process to be conducted between tokens containing information of varying scales. Our experiments demonstrate that our model, which uses a unified pipeline, outperforms state-of-the-art methods on several benchmark datasets and is robust against scale transformations.
    摘要 《受限制的图像拼接检测和地点Localization(CISDL)是 multimedia 的基本任务,检测两个可疑图像之间的拼接操作并将拼接区域分别显示在两个图像上。现有的方法通常将特征提取和相关匹配视为两个独立的过程,这可能会阻碍模型学习特征特异的匹配特征,同时也可能受到杂质背景像素的干扰。在这种情况下,我们提出了一种多尺度目标意识框架,将特征提取和相关匹配作为一个统一的管道进行处理。与先前的方法不同,我们设计了一种目标意识的注意机制,同时学习特征和相关匹配。我们的方法可以有效地促进相关块之间的协同学习,并且可以同时促进特征学习和相关匹配的协同发展。此外,为了处理缩放变换,我们引入了多尺度投影方法,可以轻松地 интеGRATE到我们的目标意识框架中,使注意过程可以在不同的尺度上进行。我们的实验表明,使用统一管道的我们模型,在多个标准数据集上表现出色,并且对缩放变换具有鲁棒性。》

Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching

  • paper_url: http://arxiv.org/abs/2308.09346
  • repo_url: https://github.com/jiazheng-xing/gghm
  • paper_authors: Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, Yong Liu
  • for: 本文提出了一种新的几何匹配框架(GgHM),用于解决几何捕捉异常分类问题。
  • methods: 本文使用图 neural network 引导构建任务特有的特征,并对这些特征进行内部和间部特征相关性优化。然后,本文提出了一种混合匹配策略,将帧级匹配和元组级匹配相结合,以便将视频匹配到多种样式。最后,本文还提出了一种可学习的细致时间模型,以增强视频特征的时间表示,为匹配过程建立更加坚实的基础。
  • results: GgHM 在多个几何捕捉数据集上实现了一致性好于其他挑战性基准点,证明了我们的方法的有效性。
    Abstract Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM.
    摘要 类原型构建与匹配是少样本动作识别的核心。以往的方法主要集中在设计时空关系建模模块或复杂的时间对齐算法。尽管取得了可喜的结果,但它们忽略了类原型构建和匹配的价值,导致在每个任务中对相似类别的识别效果不佳。在这篇论文中,我们提出了GgHM,一个具有图引导混合匹配的新框架。具体来说,我们在类原型构建过程中,通过图神经网络的引导学习面向任务的特征,显式地优化类内和类间特征相关性。接着,我们设计了混合匹配策略,将帧级匹配和元组级匹配结合起来,以便对多样化风格的视频进行分类。此外,我们还提出了一个可学习的稠密时间建模模块,以增强视频特征的时间表示,为匹配过程建立更坚实的基础。GgHM在多个少样本数据集上相比其他有挑战性的基线方法显示了稳定的改进,证明了我们方法的有效性。代码将在https://github.com/jiazheng-xing/GgHM上公开。

Denoising diffusion-based MR to CT image translation enables whole spine vertebral segmentation in 2D and 3D without manual annotations

  • paper_url: http://arxiv.org/abs/2308.09345
  • repo_url: https://github.com/robert-graf/readable-conditional-denoising-diffusion
  • paper_authors: Robert Graf, Joachim Schmitt, Sarah Schlaeger, Hendrik Kristian Möller, Vasiliki Sideri-Lampretsa, Anjany Sekuboyina, Sandro Manuel Krieg, Benedikt Wiestler, Bjoern Menze, Daniel Rueckert, Jan Stefan Kirschke
  • for: This paper aims to develop and evaluate methods for translating spinal MR images to CT images, with a focus on accurately delineating posterior spine structures.
  • methods: The study uses a combination of landmark-based registration and image-to-image translation techniques, including paired and unpaired methods such as Pix2Pix, DDIM, and SynDiff. The authors evaluate the performance of these methods using PSNR and Dice scores.
  • results: The study finds that paired methods and SynDiff exhibit similar translation performance and Dice scores on paired data, while DDIM image mode achieves the highest image quality. The 3D translation methods outperform the 2D approach, providing anatomically accurate segmentations with improved Dice scores and avoiding underprediction of small structures like the spinous process.
    Abstract Background: Automated segmentation of spinal MR images plays a vital role both scientifically and clinically. However, accurately delineating posterior spine structures presents challenges. Methods: This retrospective study, approved by the ethical committee, involved translating T1w and T2w MR image series into CT images in a total of n=263 pairs of CT/MR series. Landmark-based registration was performed to align image pairs. We compared 2D paired (Pix2Pix, denoising diffusion implicit models (DDIM) image mode, DDIM noise mode) and unpaired (contrastive unpaired translation, SynDiff) image-to-image translation using "peak signal to noise ratio" (PSNR) as quality measure. A publicly available segmentation network segmented the synthesized CT datasets, and Dice scores were evaluated on in-house test sets and the "MRSpineSeg Challenge" volumes. The 2D findings were extended to 3D Pix2Pix and DDIM. Results: 2D paired methods and SynDiff exhibited similar translation performance and Dice scores on paired data. DDIM image mode achieved the highest image quality. SynDiff, Pix2Pix, and DDIM image mode demonstrated similar Dice scores (0.77). For craniocaudal axis rotations, at least two landmarks per vertebra were required for registration. The 3D translation outperformed the 2D approach, resulting in improved Dice scores (0.80) and anatomically accurate segmentations in a higher resolution than the original MR image. Conclusion: Two landmarks per vertebra registration enabled paired image-to-image translation from MR to CT and outperformed all unpaired approaches. The 3D techniques provided anatomically correct segmentations, avoiding underprediction of small structures like the spinous process.
    摘要 背景:脊柱MR图像的自动分割在科研和临床上都具有重要作用,但准确勾画脊柱后部结构仍具挑战。方法:这项经伦理委员会批准的回顾性研究将T1w和T2w MR图像序列转换为CT图像,共计n=263对CT/MR序列,并采用基于标志点的配准来对齐图像对。我们以峰值信噪比(PSNR)作为质量指标,比较了2D配对方法(Pix2Pix、去噪扩散隐式模型DDIM的图像模式与噪声模式)和非配对方法(对比非配对翻译、SynDiff)的图像到图像翻译效果;随后使用公开的分割网络对合成CT数据进行分割,并在内部测试集和"MRSpineSeg Challenge"体数据上评估Dice分数,2D结果进一步扩展到3D的Pix2Pix和DDIM。结果:2D配对方法与SynDiff在配对数据上表现出相近的翻译性能和Dice分数,DDIM图像模式获得了最高的图像质量;SynDiff、Pix2Pix和DDIM图像模式的Dice分数相近(0.77)。对于头尾轴方向的旋转,每个椎体至少需要两个标志点才能完成配准。3D翻译优于2D方法,Dice分数提高至0.80,并以高于原始MR图像的分辨率得到解剖学上准确的分割。结论:每个椎体两个标志点的配准使MR到CT的配对图像翻译成为可能,并优于所有非配对方法;3D技术提供了解剖学上正确的分割,避免了对棘突等小结构的欠预测。
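The study reports PSNR for translation quality and Dice for segmentation overlap. The snippet below gives simple reference implementations of both metrics; the intensity range and the binary masks are simplifying assumptions.

```python
# Reference implementations of the two evaluation metrics used in the study:
# PSNR for MR-to-CT translation quality and Dice for the vertebra segmentations.
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range**2 / mse)

def dice(mask_a, mask_b, eps=1e-8):
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

ct = np.random.rand(64, 64, 64)                       # placeholder "real" CT volume
synth = ct + 0.05 * np.random.randn(64, 64, 64)       # placeholder synthesized CT
seg_gt = np.zeros((64, 64, 64), bool); seg_gt[20:40, 20:40, 20:40] = True
seg_pred = np.zeros((64, 64, 64), bool); seg_pred[22:42, 20:40, 20:40] = True
print(psnr(ct, synth), dice(seg_gt, seg_pred))
```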

LSCD: A Large-Scale Screen Content Dataset for Video Compression

  • paper_url: http://arxiv.org/abs/2308.09332
  • repo_url: None
  • paper_authors: Yuhao Cheng, Siru Zhang, Yiqiang Yan, Rong Chen, Yun Zhang
  • for: 提供一个大规模屏幕内容数据集(LSCD),用于促进屏幕内容视频压缩领域的研究。
  • methods: 使用人工智能和视频压缩技术,对屏幕内容视频进行学习型压缩。
  • results: 提供了一个大规模的屏幕内容视频压缩数据集(LSCD),并对数据集进行分析,以帮助研究人员更好地理解屏幕内容视频的特点,并提高学习型压缩算法的开发。
    Abstract Multimedia compression allows us to watch videos, see pictures and hear sounds within a limited bandwidth, which helps the flourish of the internet. During the past decades, multimedia compression has achieved great success using hand-craft features and systems. With the development of artificial intelligence and video compression, there emerges a lot of research work related to using the neural network on the video compression task to get rid of the complicated system. Not only producing the advanced algorithms, but researchers also spread the compression to different content, such as User Generated Content(UGC). With the rapid development of mobile devices, screen content videos become an important part of multimedia data. In contrast, we find community lacks a large-scale dataset for screen content video compression, which impedes the fast development of the corresponding learning-based algorithms. In order to fulfill this blank and accelerate the research of this special type of videos, we propose the Large-scale Screen Content Dataset(LSCD), which contains 714 source sequences. Meanwhile, we provide the analysis of the proposed dataset to show some features of screen content videos, which will help researchers have a better understanding of how to explore new algorithms. Besides collecting and post-processing the data to organize the dataset, we also provide a benchmark containing the performance of both traditional codec and learning-based methods.
    摘要 Multimedia compression allow us to watch videos, see pictures and hear sounds within a limited bandwidth, which helps the flourish of the internet. During the past decades, multimedia compression has achieved great success using hand-craft features and systems. With the development of artificial intelligence and video compression, there emerges a lot of research work related to using the neural network on the video compression task to get rid of the complicated system. Not only producing the advanced algorithms, but researchers also spread the compression to different content, such as User Generated Content(UGC). With the rapid development of mobile devices, screen content videos become an important part of multimedia data. In contrast, we find community lacks a large-scale dataset for screen content video compression, which impedes the fast development of the corresponding learning-based algorithms. In order to fulfill this blank and accelerate the research of this special type of videos, we propose the Large-scale Screen Content Dataset(LSCD), which contains 714 source sequences. Meanwhile, we provide the analysis of the proposed dataset to show some features of screen content videos, which will help researchers have a better understanding of how to explore new algorithms. Besides collecting and post-processing the data to organize the dataset, we also provide a benchmark containing the performance of both traditional codec and learning-based methods.

SAMedOCT: Adapting Segment Anything Model (SAM) for Retinal OCT

  • paper_url: http://arxiv.org/abs/2308.09331
  • repo_url: None
  • paper_authors: Botond Fazekas, José Morano, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
    for: 这篇论文主要是为了评估Segment Anything Model(SAM)在 RETOUCH 挑战中的大规模公共数据集上的应用。methods: 这篇论文使用了SAM和其修改版本进行了Retinal OCT影像分割的评估,并与当前领导的Retinal fluid segmentation方法进行了比较。results: 研究发现, adapted SAM在Retinal OCT影像分割中表现出了优异的能力,但在一些情况下仍落后于当前领导的方法。这些结果表明SAM在Retinal OCT图像分析中具有适应性和稳定性,并且可以作为Retinal OCT图像分析中的一种有价值工具。
    Abstract The Segment Anything Model (SAM) has gained significant attention in the field of image segmentation due to its impressive capabilities and prompt-based interface. While SAM has already been extensively evaluated in various domains, its adaptation to retinal OCT scans remains unexplored. To bridge this research gap, we conduct a comprehensive evaluation of SAM and its adaptations on a large-scale public dataset of OCTs from RETOUCH challenge. Our evaluation covers diverse retinal diseases, fluid compartments, and device vendors, comparing SAM against state-of-the-art retinal fluid segmentation methods. Through our analysis, we showcase adapted SAM's efficacy as a powerful segmentation model in retinal OCT scans, although still lagging behind established methods in some circumstances. The findings highlight SAM's adaptability and robustness, showcasing its utility as a valuable tool in retinal OCT image analysis and paving the way for further advancements in this domain.
    摘要 segmen anything model (SAM) 在图像分割领域备受关注,因其出色的能力和提示式界面。 although SAM 在不同领域得到了广泛的评估,它在Retinal OCT 图像中的适应仍然未经探索。 To bridge this research gap, we conducted a comprehensive evaluation of SAM and its adaptations on a large-scale public dataset of OCTs from RETOUCH challenge. 我们的评估覆盖了多种Retinal diseases, fluid compartments, and device vendors,比较SAM 与现有的Retinal fluid segmentation方法。 Through our analysis, we showcased adapted SAM's efficacy as a powerful segmentation model in retinal OCT scans, although still lagging behind established methods in some circumstances. The findings highlight SAM's adaptability and robustness, showcasing its utility as a valuable tool in retinal OCT image analysis and paving the way for further advancements in this domain.

Unlimited Knowledge Distillation for Action Recognition in the Dark

  • paper_url: http://arxiv.org/abs/2308.09327
  • repo_url: None
  • paper_authors: Ruibing Jin, Guosheng Lin, Min Wu, Jie Lin, Zhengguo Li, Xiaoli Li, Zhenghua Chen
  • for: 提高动作识别网络学习的知识。
  • methods: 提出无限知识填充(UKD)技术,不需大量GPU内存,可以有效地融合不同的知识。
  • results: 对ARID数据集进行了广泛的实验,单流网络通过UKD的填充得到了优于两流网络的表现。
    Abstract Dark videos often lose essential information, which causes the knowledge learned by networks is not enough to accurately recognize actions. Existing knowledge assembling methods require massive GPU memory to distill the knowledge from multiple teacher models into a student model. In action recognition, this drawback becomes serious due to much computation required by video process. Constrained by limited computation source, these approaches are infeasible. To address this issue, we propose an unlimited knowledge distillation (UKD) in this paper. Compared with existing knowledge assembling methods, our UKD can effectively assemble different knowledge without introducing high GPU memory consumption. Thus, the number of teaching models for distillation is unlimited. With our UKD, the network's learned knowledge can be remarkably enriched. Our experiments show that the single stream network distilled with our UKD even surpasses a two-stream network. Extensive experiments are conducted on the ARID dataset.
    摘要 黑暗视频往往丢失关键信息,导致网络学到的知识不足以准确识别动作。现有的知识集成方法需要大量GPU显存,才能将多个教师模型的知识蒸馏到一个学生模型中;在动作识别中,由于视频处理本身计算量巨大,这一缺点尤为严重,在计算资源受限的情况下这些方法并不可行。为解决这一问题,本文提出了无限知识蒸馏(UKD)。与现有知识集成方法相比,UKD可以在不引入高GPU显存消耗的情况下有效地融合不同的知识,因此用于蒸馏的教师模型数量不受限制,网络学到的知识也能得到显著丰富。实验表明,经UKD蒸馏的单流网络甚至超过了双流网络。我们在ARID数据集上进行了广泛的实验。
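The abstract does not detail how UKD assembles knowledge without high GPU memory, so the sketch below only illustrates the standard multi-teacher distillation term such methods build on: a KL divergence between softened teacher and student logits, averaged over teachers (the averaging and temperature are assumptions).

```python
# Generic multi-teacher knowledge distillation loss: KL divergence between the
# student's softened logits and each teacher's, averaged over teachers.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, T=4.0):
    """student_logits: (B, C); teacher_logits_list: list of (B, C) tensors."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits.detach() / T, dim=1)
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return (T * T) * loss / len(teacher_logits_list)

student = torch.randn(8, 10, requires_grad=True)          # logits over action classes
teachers = [torch.randn(8, 10) for _ in range(3)]          # several teacher predictions
loss = distillation_loss(student, teachers)
loss.backward()
```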

Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.09314
  • repo_url: https://github.com/allenxiangx/retro-fpn
  • paper_authors: Peng Xiang, Xin Wen, Yu-Shen Liu, Hui Zhang, Yi Fang, Zhizhong Han
  • for: 提高点云Semantic segmentation的精度,解决过去方法中的信息损失和不明确区域特征问题。
  • methods: 提出Retro-FPN方法,将每个点的特征预测设计为一个显式和回顾的修复过程,通过所有层次结构来提取semantic features。其关键新特点是一个Retro-Transformer,用于从前一层层次结构中概括semantic context,并在当前阶段修复特征。
  • results: 与州际标准背景方法相比,Retro-FPN可以显著提高性能。经过广泛的实验证明,Retro-FPN可以在多种常用的 benchmark 上达到州际标准水平。源代码可以在https://github.com/AllenXiangX/Retro-FPN上获取。
    Abstract Learning per-point semantic features from the hierarchical feature pyramid is essential for point cloud semantic segmentation. However, most previous methods suffered from ambiguous region features or failed to refine per-point features effectively, which leads to information loss and ambiguous semantic identification. To resolve this, we propose Retro-FPN to model the per-point feature prediction as an explicit and retrospective refining process, which goes through all the pyramid layers to extract semantic features explicitly for each point. Its key novelty is a retro-transformer for summarizing semantic contexts from the previous layer and accordingly refining the features in the current stage. In this way, the categorization of each point is conditioned on its local semantic pattern. Specifically, the retro-transformer consists of a local cross-attention block and a semantic gate unit. The cross-attention serves to summarize the semantic pattern retrospectively from the previous layer. And the gate unit carefully incorporates the summarized contexts and refines the current semantic features. Retro-FPN is a pluggable neural network that applies to hierarchical decoders. By integrating Retro-FPN with three representative backbones, including both point-based and voxel-based methods, we show that Retro-FPN can significantly improve performance over state-of-the-art backbones. Comprehensive experiments on widely used benchmarks can justify the effectiveness of our design. The source is available at https://github.com/AllenXiangX/Retro-FPN
    摘要 从层次特征金字塔中学习每个点的语义特征对点云语义分割至关重要。然而,以往方法大多受困于模糊的区域特征,或无法有效细化逐点特征,从而导致信息损失和语义识别的歧义。为解决这一问题,我们提出了Retro-FPN,将逐点特征预测建模为一个显式的、回顾式的细化过程,贯穿所有金字塔层级,为每个点显式提取语义特征。其关键创新是一个retro-transformer,用于从上一层级总结语义上下文,并据此细化当前阶段的特征,使每个点的分类取决于其局部语义模式。具体来说,retro-transformer由一个局部交叉注意力块和一个语义门控单元组成:交叉注意力用于回顾式地总结上一层的语义模式,门控单元则谨慎地融合所总结的上下文并细化当前语义特征。Retro-FPN是一个可插拔的神经网络模块,适用于层次化解码器。通过将Retro-FPN与三种代表性骨干网络(包括基于点和基于体素的方法)结合,我们表明Retro-FPN能显著提升最先进骨干网络的性能。在常用基准上的大量实验证明了该设计的有效性。源代码可在 https://github.com/AllenXiangX/Retro-FPN 获取。

Rethinking Image Forgery Detection via Contrastive Learning and Unsupervised Clustering

  • paper_url: http://arxiv.org/abs/2308.09307
  • repo_url: https://github.com/highwaywu/focal
  • paper_authors: Haiwei Wu, Yiming Chen, Jiantao Zhou
  • for: 本研究旨在提高图像forge检测的精度和效果,并提出一种新的方法FOCAL(Forensic ContrAstive cLustering),这种方法基于对比学习和无监督划分,能够准确地检测图像中的forge区域。
  • methods: FOCAL方法包括三个主要部分:1)使用像素级对比学习来监督高级别侦验特征提取; 2)使用在线无监督划分算法来将学习到的特征分为forge和正常两类; 3)通过简单地Feature层 concatenation来进一步提高检测性能,无需重新训练。
  • results: 实验结果表明,FOCAL方法在六个公共测试数据集上达到了与state-of-the-art竞争算法之间的大幅提升:+24.3%的覆盖率、+18.6%的哥伦比亚、+17.5%的FF++, +14.2%的MISD、+13.5%的CASIA和+10.3%的NIST,以及IoU方面。
    Abstract Image forgery detection aims to detect and locate forged regions in an image. Most existing forgery detection algorithms formulate classification problems to classify pixels into forged or pristine. However, the definition of forged and pristine pixels is only relative within one single image, e.g., a forged region in image A is actually a pristine one in its source image B (splicing forgery). Such a relative definition has been severely overlooked by existing methods, which unnecessarily mix forged (pristine) regions across different images into the same category. To resolve this dilemma, we propose the FOrensic ContrAstive cLustering (FOCAL) method, a novel, simple yet very effective paradigm based on contrastive learning and unsupervised clustering for the image forgery detection. Specifically, FOCAL 1) utilizes pixel-level contrastive learning to supervise the high-level forensic feature extraction in an image-by-image manner, explicitly reflecting the above relative definition; 2) employs an on-the-fly unsupervised clustering algorithm (instead of a trained one) to cluster the learned features into forged/pristine categories, further suppressing the cross-image influence from training data; and 3) allows to further boost the detection performance via simple feature-level concatenation without the need of retraining. Extensive experimental results over six public testing datasets demonstrate that our proposed FOCAL significantly outperforms the state-of-the-art competing algorithms by big margins: +24.3% on Coverage, +18.6% on Columbia, +17.5% on FF++, +14.2% on MISD, +13.5% on CASIA and +10.3% on NIST in terms of IoU. The paradigm of FOCAL could bring fresh insights and serve as a novel benchmark for the image forgery detection task. The code is available at https://github.com/HighwayWu/FOCAL.
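The second stage described in the abstract clusters per-pixel forensic features of a single image into forged and pristine groups without supervision. The sketch below uses scikit-learn k-means as a stand-in for the paper's on-the-fly clustering algorithm; the feature extractor and the "smaller cluster is forged" heuristic are assumptions.

```python
# Sketch of the clustering stage: per-pixel forensic features of one image are
# grouped into two clusters (forged vs pristine) without labels. K-means is a
# stand-in for the on-the-fly clustering the abstract describes.
import numpy as np
from sklearn.cluster import KMeans

def cluster_forgery_map(features):
    """features: (H, W, D) per-pixel forensic features for one image."""
    h, w, d = features.shape
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        features.reshape(-1, d)
    )
    labels = labels.reshape(h, w)
    # Heuristic: call the smaller cluster "forged" (an assumption, not from the paper)
    forged_id = np.argmin(np.bincount(labels.ravel()))
    return (labels == forged_id).astype(np.uint8)

feat_map = np.random.rand(64, 64, 32)        # placeholder features from a forensic network
forgery_mask = cluster_forgery_map(feat_map)
print(forgery_mask.shape, forgery_mask.sum())
```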

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

  • paper_url: http://arxiv.org/abs/2308.09306
  • repo_url: None
  • paper_authors: Runhui Huang, Jianhua Han, Guansong Lu, Xiaodan Liang, Yihan Zeng, Wei Zhang, Hang Xu
  • for: 本文旨在探讨将生成和检测融合到一起的可能性,以提高图像生成和图像文本检测的性能。
  • methods: 本文提出了一种名为DiffDis的方法,将跨模态的生成式与判别式预训练(如CLIP、ALIGN和FILIP所代表的判别式预训练)统一到同一个扩散过程框架中,从而同时学习图像生成与图文对齐。
  • results: 实验结果表明,DiffDis在图像生成和图文判别任务上均优于单任务模型,例如在12个数据集上的零样本分类平均精度提升1.65%,零样本图像生成的FID改善2.42。
    Abstract Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65% improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis.
    摘要 近期,大规模扩散模型(如Stable Diffusion和DALL-E 2)在图像合成方面取得了令人瞩目的成果;另一方面,大规模跨模态预训练模型(如CLIP、ALIGN和FILIP)通过学习对齐视觉与语言嵌入,能够胜任多种下游任务。在本文中,我们探讨了联合建模生成与判别的可能性。具体而言,我们提出DiffDis,将跨模态的生成式与判别式预训练统一到扩散过程下的单一框架中。DiffDis首先将图文判别问题表述为以图像为条件、对文本编码器输出的文本嵌入进行生成式扩散的过程;随后,我们提出一种新颖的双流网络架构,将带噪的文本嵌入与来自不同尺度的潜在图像知识融合,用于图文判别学习。此外,生成与判别任务可以在多模态模型中高效共享图像分支的网络结构。得益于基于扩散的统一训练,DiffDis在同一架构中同时获得了更好的生成能力和跨模态语义对齐能力。实验结果表明,DiffDis在图像生成和图文判别任务上均优于单任务模型,例如在12个数据集上的零样本分类平均精度提升1.65%,零样本图像合成的FID改善2.42。
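As a rough illustration of the formulation described above — treating the text embedding as the variable of a diffusion process conditioned on the image — a DDPM-style training step might look like the sketch below. The noise-prediction network, timestep embedding, schedule, and feature dimensions are placeholder assumptions, not the paper's dual-stream architecture.

```python
# Illustrative sketch only: diffusing a text embedding conditioned on image
# features and training a network to predict the injected noise (DDPM-style).
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Toy stand-in for the dual-stream network: fuses noisy text emb + image feature."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 1024), nn.SiLU(), nn.Linear(1024, dim))
    def forward(self, noisy_text, image_feat, t):
        t_emb = (t.float() / T).unsqueeze(-1)                 # crude timestep embedding
        return self.net(torch.cat([noisy_text, image_feat, t_emb], dim=-1))

def diffdis_style_loss(model, text_emb, image_feat):
    """text_emb, image_feat: (B, dim). Returns the epsilon-prediction loss."""
    b = text_emb.size(0)
    t = torch.randint(0, T, (b,))
    a = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(text_emb)
    noisy = a.sqrt() * text_emb + (1 - a).sqrt() * noise      # q(x_t | x_0)
    return ((model(noisy, image_feat, t) - noise) ** 2).mean()

# usage sketch
model = NoisePredictor()
loss = diffdis_style_loss(model, torch.randn(4, 512), torch.randn(4, 512))
```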

Human Part-wise 3D Motion Context Learning for Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2308.09305
  • repo_url: None
  • paper_authors: Taeryung Lee, Yeonguk Oh, Kyoung Mu Lee
  • for: 提高手语识别的表现,特别是利用手部特征来提高表现。
  • methods: 提出了基于人体部位动作上下文学习的框架,交替使用部位编码Transformer(PET)与全身编码Transformer(WET)来学习部位级动作上下文,并对2D与3D姿态进行集成(ensemble)。
  • results: 在WLASL数据集上实现了比前一代方法更高的表现,具体来说是通过学习手部动作上下文来提高手语识别的精度。
    Abstract In this paper, we propose P3D, the human part-wise motion context learning framework for sign language recognition. Our main contributions lie in two dimensions: learning the part-wise motion context and employing the pose ensemble to utilize 2D and 3D pose jointly. First, our empirical observation implies that part-wise context encoding benefits the performance of sign language recognition. While previous methods of sign language recognition learned motion context from the sequence of the entire pose, we argue that such methods cannot exploit part-specific motion context. In order to utilize part-wise motion context, we propose the alternating combination of a part-wise encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET encodes the motion contexts from a part sequence, while WET merges them into a unified context. By learning part-wise motion context, our P3D achieves superior performance on WLASL compared to previous state-of-the-art methods. Second, our framework is the first to ensemble 2D and 3D poses for sign language recognition. Since the 3D pose holds rich motion context and depth information to distinguish the words, our P3D outperformed the previous state-of-the-art methods employing a pose ensemble.
    摘要 在这篇论文中,我们提出P3D,一种人体部位级动作上下文学习框架,用于手语识别。我们的主要贡献体现在两个方面:学习部位级动作上下文,以及利用姿态集成来联合使用2D和3D姿态。首先,我们的实验观察表明,部位级上下文编码有利于手语识别性能。以往的手语识别方法从整个姿态序列中学习动作上下文,我们认为这类方法无法利用部位特定的动作上下文。为了利用部位级动作上下文,我们提出部位编码Transformer(PET)与全身编码Transformer(WET)的交替组合:PET从部位序列中编码动作上下文,WET则将其合并为统一的上下文。通过学习部位级动作上下文,我们的P3D在WLASL上取得了优于以往最先进方法的性能。其次,我们的框架首次将2D和3D姿态集成用于手语识别。由于3D姿态蕴含丰富的动作上下文和用于区分词汇的深度信息,采用姿态集成的P3D优于以往的最先进方法。
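The alternating part-wise/whole-body encoding can be pictured with the following minimal PyTorch sketch; the number of parts, feature size, and use of stock TransformerEncoderLayer blocks are assumptions for illustration only.

```python
# Minimal sketch (assumed shapes/part grouping) of alternating part-wise and
# whole-body Transformer encoding over a pose sequence.
import torch
import torch.nn as nn

class PartWholeBlock(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.pet = nn.TransformerEncoderLayer(dim, heads, batch_first=True)   # part-wise
        self.wet = nn.TransformerEncoderLayer(dim, heads, batch_first=True)   # whole-body

    def forward(self, x):
        """x: (B, T, P, dim) — frame tokens grouped into P body parts (e.g. body / left / right hand)."""
        b, t, p, d = x.shape
        # PET: temporal attention inside each part independently
        x = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        x = self.pet(x).reshape(b, p, t, d).permute(0, 2, 1, 3)
        # WET: merge the part tokens of every frame and attend over the whole sequence
        x = self.wet(x.reshape(b, t * p, d)).reshape(b, t, p, d)
        return x

# usage: 2 videos, 64 frames, 3 parts, 128-d per-part pose embedding
tokens = torch.randn(2, 64, 3, 128)
out = PartWholeBlock()(tokens)   # same shape, now context-enriched
```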

Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning

  • paper_url: http://arxiv.org/abs/2308.09303
  • repo_url: https://github.com/moonjunyyy/si-blurry
  • paper_authors: Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, Gyeong-Moon Park
  • for: 本研究旨在应对真实场景中持续学习所面临的挑战,即输入数据量和任务数量以统计方式不断变化。
  • methods: 提出了一种新的随机增量模糊任务边界场景(Si-Blurry),以及一种名为Mask and Visual Prompt tuning(MVP)的方法来解决任务间与任务内遗忘问题和类别不平衡问题。MVP包括新颖的逐实例logit掩码与对比视觉提示调优损失,以及基于梯度相似度的focal损失和自适应特征缩放。
  • results: 大量实验表明,在具有挑战性的Si-Blurry场景中,所提出的MVP显著优于现有的最先进方法。
    Abstract Continual learning aims to learn a model from a continuous stream of data, but it mainly assumes a fixed number of data and tasks with clear task boundaries. However, in real-world scenarios, the number of input data and tasks is constantly changing in a statistical way, not a static way. Although recently introduced incremental learning scenarios having blurry task boundaries somewhat address the above issues, they still do not fully reflect the statistical properties of real-world situations because of the fixed ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real-world. We find that there are two major challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and (2) class imbalance problem. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting issues, we propose a novel instance-wise logit masking and contrastive visual prompt tuning loss. Both of them help our model discern the classes to be learned in the current batch. It results in consolidating the previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms the existing state-of-the-art methods in our challenging Si-Blurry scenario.
    摘要 持续学习旨在从连续的数据流中学习模型,但通常假设数据量和任务数量固定且任务边界清晰。然而在真实场景中,输入数据和任务的数量是以统计方式不断变化的,而非静态不变。虽然近期提出的具有模糊任务边界的增量学习场景在一定程度上缓解了上述问题,但由于不相交样本与模糊样本的比例固定,仍不能充分反映真实世界的统计特性。本文提出一种新的随机增量模糊任务边界场景(Si-Blurry),以反映真实世界的随机特性。我们发现Si-Blurry场景存在两大挑战:(1)任务间与任务内遗忘;(2)类别不平衡问题。为缓解这些问题,我们提出Mask and Visual Prompt tuning(MVP)。针对任务间与任务内遗忘问题,MVP提出了新颖的逐实例logit掩码和对比视觉提示调优损失,二者帮助模型辨别当前批次中需要学习的类别,从而巩固先前知识。此外,为缓解类别不平衡问题,我们引入基于梯度相似度的focal损失和自适应特征缩放,以减轻对多数类的过拟合和对少数类的欠拟合。大量实验表明,所提出的MVP在具有挑战性的Si-Blurry场景中显著优于现有的最先进方法。
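As a loose illustration of the logit-masking idea mentioned above, the sketch below applies a batch-level mask so that only classes relevant to the current batch are penalized by the cross-entropy; the actual MVP applies its mask instance-wise and combines it with contrastive visual prompt tuning, so this should be read as a simplified stand-in.

```python
# Hypothetical sketch of the logit-masking idea: restrict the cross-entropy to
# classes present in the current batch so that absent (old) classes are not
# penalised by data they never see. MVP's exact instance-wise rule may differ.
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, targets, seen_in_batch):
    """logits: (B, C); targets: (B,); seen_in_batch: bool (C,) marking classes in the batch."""
    mask = seen_in_batch.unsqueeze(0).expand_as(logits)
    masked_logits = logits.masked_fill(~mask, float('-inf'))  # drop logits of absent classes
    return F.cross_entropy(masked_logits, targets)

# usage sketch
logits = torch.randn(8, 20, requires_grad=True)
targets = torch.randint(0, 5, (8,))
seen = torch.zeros(20, dtype=torch.bool)
seen[torch.unique(targets)] = True
loss = masked_cross_entropy(logits, targets, seen)
loss.backward()
```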

Inferior Alveolar Nerve Segmentation in CBCT images using Connectivity-Based Selective Re-training

  • paper_url: http://arxiv.org/abs/2308.09298
  • repo_url: https://github.com/garynico517/ssl-ian-retraining
  • paper_authors: Yusheng Liu, Rui Xin, Tao Yang, Lisheng Wang
  • for: 这篇论文旨在提高CBCT图像中下牙槽神经(IAN)管的自动分割能力,以便在牙科和颌面外科手术中避免不可逆的神经损伤。
  • methods: 作者提出了一种基于 IAN 连接性的选择性重训练方法,以解决自动 segmentation 中 sparse labeling 的负面影响。
  • results: 作者的方法在 ToothFairy 验证集上进行了量化评估,达到了 dice similarity coefficient (DSC) 0.7956,和 95% hausdorff distance (HD95) 4.4905,并在竞赛中获得冠军。
    Abstract Inferior Alveolar Nerve (IAN) canal detection in CBCT is an important step in many dental and maxillofacial surgery applications to prevent irreversible damage to the nerve during the procedure.The ToothFairy2023 Challenge aims to establish a 3D maxillofacial dataset consisting of all sparse labels and partial dense labels, and improve the ability of automatic IAN segmentation. In this work, in order to avoid the negative impact brought by sparse labeling, we transform the mixed supervised problem into a semi-supervised problem. Inspired by self-training via pseudo labeling, we propose a selective re-training framework based on IAN connectivity. Our method is quantitatively evaluated on the ToothFairy verification cases, achieving the dice similarity coefficient (DSC) of 0.7956, and 95\% hausdorff distance (HD95) of 4.4905, and wining the champion in the competition. Code is available at https://github.com/GaryNico517/SSL-IAN-Retraining.
    摘要 下牙槽神经(IAN)管在CBCT中的检测是许多牙科和颌面外科手术应用中的重要一步,以避免在手术过程中对神经造成不可逆的损伤。ToothFairy2023 Challenge 旨在建立一个包含全部稀疏标注和部分稠密标注的3D颌面数据集,并提升自动IAN分割能力。在本工作中,为了避免稀疏标注带来的负面影响,我们将混合监督问题转化为半监督问题。受基于伪标签的自训练启发,我们提出一种基于IAN连通性的选择性重训练框架。我们的方法在ToothFairy验证集上进行了定量评估,达到了Dice相似系数(DSC)0.7956和95%豪斯多夫距离(HD95)4.4905,并在竞赛中获得冠军。代码可在 https://github.com/GaryNico517/SSL-IAN-Retraining 获取。
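A toy version of connectivity-based selection of pseudo-labels could look like the following; the connectivity criterion and threshold are invented for illustration and are not the authors' exact rule.

```python
# Illustrative sketch: keep a pseudo-labelled IAN mask for re-training only if its
# foreground is (close to) a single connected component, since the nerve canal
# should form one connected tubular structure. Thresholds are made-up examples.
import numpy as np
from scipy import ndimage

def passes_connectivity_check(pseudo_mask, min_main_fraction=0.95):
    """pseudo_mask: 3D binary array predicted for an unlabelled CBCT volume."""
    labeled, n = ndimage.label(pseudo_mask)
    if n == 0:
        return False
    sizes = ndimage.sum(pseudo_mask, labeled, index=range(1, n + 1))
    # accept if the largest component dominates the prediction
    return sizes.max() / sizes.sum() >= min_main_fraction

def select_for_retraining(volumes, pseudo_masks):
    """Return only the (volume, mask) pairs whose pseudo-labels look anatomically plausible."""
    return [(v, m) for v, m in zip(volumes, pseudo_masks) if passes_connectivity_check(m)]

# usage sketch
vol = np.zeros((32, 64, 64)); mask = np.zeros_like(vol, dtype=bool)
mask[10:20, 30:40, 30:40] = True
print(passes_connectivity_check(mask))   # True: a single connected blob
```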

NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization for Continual Learning

  • paper_url: http://arxiv.org/abs/2308.09297
  • repo_url: https://github.com/tamasham/napa-vq
  • paper_authors: Tamasha Malepathirana, Damith Senanayake, Saman Halgamuge
  • for: 在深度神经网络中避免灾难性遗忘问题,即在获得新知识时不丢失先前学到的知识。
  • methods: 基于非样例类增量学习(NECIL),不使用过去的样例来学习新类别,并通过对邻近类别的了解来增强分类器的决策边界。
  • results: 与现有最先进的NECIL方法相比,NAPA-VQ在CIFAR-100、TinyImageNet和ImageNet-Subset上分别获得了5%、2%和4%的精度提升以及10%、3%和9%的遗忘降低。
    Abstract Catastrophic forgetting; the loss of old knowledge upon acquiring new knowledge, is a pitfall faced by deep neural networks in real-world applications. Many prevailing solutions to this problem rely on storing exemplars (previously encountered data), which may not be feasible in applications with memory limitations or privacy constraints. Therefore, the recent focus has been on Non-Exemplar based Class Incremental Learning (NECIL) where a model incrementally learns about new classes without using any past exemplars. However, due to the lack of old data, NECIL methods struggle to discriminate between old and new classes causing their feature representations to overlap. We propose NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization, a framework that reduces this class overlap in NECIL. We draw inspiration from Neural Gas to learn the topological relationships in the feature space, identifying the neighboring classes that are most likely to get confused with each other. This neighborhood information is utilized to enforce strong separation between the neighboring classes as well as to generate old class representative prototypes that can better aid in obtaining a discriminative decision boundary between old and new classes. Our comprehensive experiments on CIFAR-100, TinyImageNet, and ImageNet-Subset demonstrate that NAPA-VQ outperforms the State-of-the-art NECIL methods by an average improvement of 5%, 2%, and 4% in accuracy and 10%, 3%, and 9% in forgetting respectively. Our code can be found in https://github.com/TamashaM/NAPA-VQ.git.
    摘要 深度神经网络在实际应用中面临的一个挑战是灾难性遗忘,即在获得新知识时丢失旧知识。许多现有解决方案依赖于存储样例(exemplars,即先前遇到过的数据),但在存在内存限制或隐私约束的应用中可能并不可行。因此,近期的关注点转向非样例类增量学习(NECIL),即在不使用任何过去样例的情况下逐步学习新类。然而,由于缺乏旧数据,NECIL方法难以区分旧类与新类,导致其特征表示相互重叠。我们提出NAPA-VQ:一种借鉴Neural Gas、学习特征空间中类间拓扑关系的框架,用于识别最容易相互混淆的邻近类。该邻域信息既用于在邻近类之间强制施加明确的分离,也用于生成旧类代表性原型,从而更好地帮助获得旧类与新类之间的判别性决策边界。我们在CIFAR-100、TinyImageNet和ImageNet-Subset上的全面实验表明,NAPA-VQ相比最先进的NECIL方法,准确率平均提升5%、2%和4%,遗忘平均降低10%、3%和9%。代码见 https://github.com/TamashaM/NAPA-VQ.git。

Self-Calibrated Cross Attention Network for Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2308.09294
  • repo_url: https://github.com/sam1224/sccan
  • paper_authors: Qianxiong Xu, Wenting Zhao, Guosheng Lin, Cheng Long
  • for: This paper focuses on improving few-shot segmentation (FSS) by effectively utilizing support samples.
  • methods: The proposed method uses a self-calibrated cross attention (SCCA) block, which splits the query and support features into patches, aligns each query patch with its most similar support patch, and fuses the query background features with matched background features from the query image itself.
  • results: The proposed method achieves state-of-the-art performance on PASCAL-5^i and COCO-20^i under the 5-shot setting, with an mIoU score 5.6% higher than previous state-of-the-art methods on COCO-20^i.
    Abstract The key to the success of few-shot segmentation (FSS) lies in how to effectively utilize support samples. Most solutions compress support foreground (FG) features into prototypes, but lose some spatial details. Instead, others use cross attention to fuse query features with uncompressed support FG. Query FG could be fused with support FG, however, query background (BG) cannot find matched BG features in support FG, yet inevitably integrates dissimilar features. Besides, as both query FG and BG are combined with support FG, they get entangled, thereby leading to ineffective segmentation. To cope with these issues, we design a self-calibrated cross attention (SCCA) block. For efficient patch-based attention, query and support features are firstly split into patches. Then, we design a patch alignment module to align each query patch with its most similar support patch for better cross attention. Specifically, SCCA takes a query patch as Q, and groups the patches from the same query image and the aligned patches from the support image as K&V. In this way, the query BG features are fused with matched BG features (from query patches), and thus the aforementioned issues will be mitigated. Moreover, when calculating SCCA, we design a scaled-cosine mechanism to better utilize the support features for similarity calculation. Extensive experiments conducted on PASCAL-5^i and COCO-20^i demonstrate the superiority of our model, e.g., the mIoU score under 5-shot setting on COCO-20^i is 5.6%+ better than previous state-of-the-arts. The code is available at https://github.com/Sam1224/SCCAN.
    摘要 少样本分割(FSS)成功的关键在于如何有效利用支持样本。大多数方法将支持前景(FG)特征压缩为原型,但会丢失部分空间细节;另一些方法则使用交叉注意力将查询特征与未压缩的支持前景融合。查询前景可以与支持前景融合,但查询背景(BG)无法在支持前景中找到匹配的背景特征,从而不可避免地融入不相似的特征。此外,由于查询前景和背景都与支持前景结合,二者相互纠缠,导致分割效果不佳。为了解决这些问题,我们设计了自校准交叉注意力(SCCA)模块。为实现高效的基于图像块的注意力,查询与支持特征首先被划分为图像块;随后我们设计了图像块对齐模块,将每个查询块与其最相似的支持块对齐,以获得更好的交叉注意力。具体而言,SCCA以查询块作为Q,并将来自同一查询图像的块与来自支持图像的对齐块共同作为K和V。这样,查询背景特征便可与匹配的背景特征(来自查询块)融合,从而缓解上述问题。此外,在计算SCCA时,我们设计了缩放余弦机制,以更好地利用支持特征进行相似度计算。在PASCAL-5^i和COCO-20^i上的大量实验证明了我们模型的优越性,例如在COCO-20^i的5-shot设置下,mIoU比此前最先进方法高出5.6%。代码见 https://github.com/Sam1224/SCCAN。
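The patch-alignment and K/V construction described above can be sketched as follows; the cosine-based matching and plain scaled dot-product attention are simplifying assumptions standing in for the paper's scaled-cosine mechanism.

```python
# Minimal sketch (shapes and similarity measure are assumptions): align each query
# patch with its most similar support patch, then attend with K/V built from the
# query's own patches plus the aligned support patches.
import torch
import torch.nn.functional as F

def patch_align(query_patches, support_patches):
    """query_patches: (Nq, C); support_patches: (Ns, C). Returns (Nq, C) aligned support."""
    q = F.normalize(query_patches, dim=-1)
    s = F.normalize(support_patches, dim=-1)
    idx = (q @ s.t()).argmax(dim=-1)          # most similar support patch per query patch
    return support_patches[idx]

def self_calibrated_cross_attention(query_patches, support_patches, scale=None):
    aligned = patch_align(query_patches, support_patches)
    q = query_patches                                     # (Nq, C)
    kv = torch.cat([query_patches, aligned], dim=0)       # query BG can match its own BG patches
    scale = scale or query_patches.size(-1) ** -0.5
    attn = torch.softmax(q @ kv.t() * scale, dim=-1)
    return attn @ kv

# usage sketch: 64 query patches and 64 support patches with 256-d features
out = self_calibrated_cross_attention(torch.randn(64, 256), torch.randn(64, 256))
```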

RFDforFin: Robust Deep Forgery Detection for GAN-generated Fingerprint Images

  • paper_url: http://arxiv.org/abs/2308.09285
  • repo_url: None
  • paper_authors: Hui Miao, Yuanfang Guo, Yunhong Wang
  • for: 防止GAN生成的指纹图像被恶意滥用,以保护公共安全。
  • methods: 结合指纹特有的脊线特征与GAN生成图像的生成伪影,进行深度伪造检测。
  • results: 提出了首个专门针对指纹图像的深度伪造检测方法,兼具低复杂度和良好效果。
    Abstract With the rapid development of the image generation technologies, the malicious abuses of the GAN-generated fingerprint images poses a significant threat to the public safety in certain circumstances. Although the existing universal deep forgery detection approach can be applied to detect the fake fingerprint images, they are easily attacked and have poor robustness. Meanwhile, there is no specifically designed deep forgery detection method for fingerprint images. In this paper, we propose the first deep forgery detection approach for fingerprint images, which combines unique ridge features of fingerprint and generation artifacts of the GAN-generated images, to the best of our knowledge. Specifically, we firstly construct a ridge stream, which exploits the grayscale variations along the ridges to extract unique fingerprint-specific features. Then, we construct a generation artifact stream, in which the FFT-based spectrums of the input fingerprint images are exploited, to extract more robust generation artifact features. At last, the unique ridge features and generation artifact features are fused for binary classification (\textit{i.e.}, real or fake). Comprehensive experiments demonstrate that our proposed approach is effective and robust with low complexities.
    摘要 随着图像生成技术的快速发展,GAN生成的指纹图像若被恶意滥用,在某些情况下会对公共安全构成重大威胁。尽管现有的通用深度伪造检测方法可以用于检测伪造指纹图像,但它们容易被攻击且鲁棒性较差;同时,目前尚无专门针对指纹图像设计的深度伪造检测方法。在本文中,我们提出了据我们所知首个面向指纹图像的深度伪造检测方法,它结合了指纹特有的脊线特征与GAN生成图像的生成伪影。具体而言,我们首先构建脊线流,利用沿脊线的灰度变化来提取指纹特有的特征;然后构建生成伪影流,利用输入指纹图像的基于FFT的频谱来提取更鲁棒的生成伪影特征;最后,将脊线特征与生成伪影特征融合,进行二分类(即真实或伪造)。全面的实验表明,我们提出的方法有效、鲁棒且复杂度低。
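A toy two-stream detector in the spirit of the description above is sketched below: one branch sees the grayscale fingerprint (ridge cues), the other its log-magnitude FFT spectrum (generation artifacts), and their features are fused for a real/fake decision. The tiny CNN backbones are placeholders, not the authors' architecture.

```python
# Toy sketch of the two-stream idea: ridge/grayscale stream + FFT spectrum stream,
# fused for binary real/fake classification.
import torch
import torch.nn as nn

def fft_spectrum(x):
    """x: (B, 1, H, W) grayscale fingerprints -> log-magnitude spectrum, same shape."""
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    return torch.log1p(spec.abs())

class TwoStreamDetector(nn.Module):
    def __init__(self):
        super().__init__()
        def small_cnn():
            return nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.ridge_stream = small_cnn()       # grayscale variations along ridges
        self.artifact_stream = small_cnn()    # GAN generation artifacts in the spectrum
        self.head = nn.Linear(64, 2)          # real vs. fake

    def forward(self, x):
        feats = torch.cat([self.ridge_stream(x), self.artifact_stream(fft_spectrum(x))], dim=1)
        return self.head(feats)

logits = TwoStreamDetector()(torch.rand(4, 1, 96, 96))   # (4, 2)
```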

Diverse Cotraining Makes Strong Semi-Supervised Segmentor

  • paper_url: http://arxiv.org/abs/2308.09281
  • repo_url: None
  • paper_authors: Yijiang Li, Xinjiang Wang, Lihe Yang, Litong Feng, Wayne Zhang, Ying Gao
  • for: 本研究旨在探讨深度共训(co-training)的工作机制,并指出当前紧耦合的共训模型违背了“多个相容且条件独立视图”这一核心假设。
  • methods: 从输入域、不同的数据增强方式以及网络架构等多个维度系统性地增加共训模型的多样性,以对抗网络同质化。
  • results: 与当前最先进方法相比,我们的多样化共训(Diverse Co-training)在不同评估协议下均取得明显提升,例如在Pascal上仅使用92、183和366张标注图像即分别达到76.2%、77.7%和80.2%的最佳mIoU。
    Abstract Deep co-training has been introduced to semi-supervised segmentation and achieves impressive results, yet few studies have explored the working mechanism behind it. In this work, we revisit the core assumption that supports co-training: multiple compatible and conditionally independent views. By theoretically deriving the generalization upper bound, we prove the prediction similarity between two models negatively impacts the model's generalization ability. However, most current co-training models are tightly coupled together and violate this assumption. Such coupling leads to the homogenization of networks and confirmation bias which consequently limits the performance. To this end, we explore different dimensions of co-training and systematically increase the diversity from the aspects of input domains, different augmentations and model architectures to counteract homogenization. Our Diverse Co-training outperforms the state-of-the-art (SOTA) methods by a large margin across different evaluation protocols on the Pascal and Cityscapes. For example. we achieve the best mIoU of 76.2%, 77.7% and 80.2% on Pascal with only 92, 183 and 366 labeled images, surpassing the previous best results by more than 5%.
    摘要
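For reference, a generic co-training step with cross pseudo-label supervision between two deliberately diverse models is sketched below; the confidence threshold, loss weighting, and toy classifiers are illustrative assumptions, and the paper's diversity additionally comes from different input domains, augmentations, and architectures.

```python
# Generic sketch of one co-training step: each model is supervised on unlabelled
# data by the other model's confident pseudo-labels, plus a shared supervised loss.
import torch
import torch.nn.functional as F

def cotraining_step(model_a, model_b, labeled, unlabeled, conf_thresh=0.9):
    x_l, y_l = labeled            # labelled batch
    x_u_a, x_u_b = unlabeled      # two differently augmented views of the unlabelled batch

    sup = F.cross_entropy(model_a(x_l), y_l) + F.cross_entropy(model_b(x_l), y_l)

    with torch.no_grad():         # pseudo-labels come from the *other* model
        prob_a = torch.softmax(model_a(x_u_a), dim=1)
        prob_b = torch.softmax(model_b(x_u_b), dim=1)
    conf_a, pl_a = prob_a.max(1)
    conf_b, pl_b = prob_b.max(1)

    unsup = (F.cross_entropy(model_b(x_u_b), pl_a, reduction='none') * (conf_a > conf_thresh)).mean() \
          + (F.cross_entropy(model_a(x_u_a), pl_b, reduction='none') * (conf_b > conf_thresh)).mean()
    return sup + unsup

# usage sketch with tiny linear "segmentors" over flattened inputs
m_a, m_b = torch.nn.Linear(12, 3), torch.nn.Linear(12, 3)
loss = cotraining_step(m_a, m_b, (torch.randn(4, 12), torch.randint(0, 3, (4,))),
                       (torch.randn(4, 12), torch.randn(4, 12)))
```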

DiffLLE: Diffusion-guided Domain Calibration for Unsupervised Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.09279
  • repo_url: None
  • paper_authors: Shuzhou Yang, Xuanyu Zhang, Yinhuai Wang, Jiwen Yu, Yuhan Wang, Jian Zhang
  • for: 这篇论文是为了提出一种robust和有效的无监督低光照图像改善方法,即Diffusion-based domain calibration(DiffLLE)。
  • methods: 该方法在一个简单的无监督增强算法之上,引入两个零样本即插即用模块:Diffusion-guided Degradation Calibration(DDC)模块和Fine-grained Target domain Distillation(FTD)模块。
  • results: 该方法在多种实验中表现出色,甚至超过了一些监督学习方法。
    Abstract Existing unsupervised low-light image enhancement methods lack enough effectiveness and generalization in practical applications. We suppose this is because of the absence of explicit supervision and the inherent gap between real-world scenarios and the training data domain. In this paper, we develop Diffusion-based domain calibration to realize more robust and effective unsupervised Low-Light Enhancement, called DiffLLE. Since the diffusion model performs impressive denoising capability and has been trained on massive clean images, we adopt it to bridge the gap between the real low-light domain and training degradation domain, while providing efficient priors of real-world content for unsupervised models. Specifically, we adopt a naive unsupervised enhancement algorithm to realize preliminary restoration and design two zero-shot plug-and-play modules based on diffusion model to improve generalization and effectiveness. The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve, which makes the enhancement model perform robustly even in sophisticated wild degradation. Due to the limited enhancement effect of the unsupervised model, we further develop the Fine-grained Target domain Distillation (FTD) module to find a more visual-friendly solution space. It exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrinks the preliminary restored results from a coarse normal-light domain to a finer high-quality clean field, addressing the lack of strong explicit supervision for unsupervised methods. Benefiting from these, our approach even outperforms some supervised methods by using only a simple unsupervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE.
    摘要 We adopt a diffusion model that has been trained on massive clean images, and use it to bridge the gap between the real low-light domain and the training degradation domain. This provides efficient priors of real-world content for unsupervised models. We use a naive unsupervised enhancement algorithm to realize preliminary restoration, and design two zero-shot plug-and-play modules based on the diffusion model to improve generalization and effectiveness.The Diffusion-guided Degradation Calibration (DDC) module narrows the gap between real-world and training low-light degradation through diffusion-based domain calibration and a lightness enhancement curve. This makes the enhancement model perform robustly even in sophisticated wild degradation. However, the limited enhancement effect of the unsupervised model leads to the development of the Fine-grained Target domain Distillation (FTD) module. This module exploits the priors of the pre-trained diffusion model to generate pseudo-references, which shrink the preliminary restored results from a coarse normal-light domain to a finer high-quality clean field. This addresses the lack of strong explicit supervision for unsupervised methods.Our approach outperforms some supervised methods using only a simple unsupervised baseline. Extensive experiments demonstrate the superior effectiveness of the proposed DiffLLE.

MATLABER: Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR

  • paper_url: http://arxiv.org/abs/2308.09278
  • repo_url: https://github.com/SheldonTsui/Matlaber
  • paper_authors: Xudong Xu, Zhaoyang Lyu, Xingang Pan, Bo Dai
  • for: 本研究旨在通过一种新的潜在BRDF自编码器(Latent BRDF auto-EncodeR,简称MATLABER)提高文本到3D生成中的材质质量。
  • methods: 使用大规模真实世界BRDF数据训练该自编码器,并保证其潜在空间的光滑性,使其隐式地成为材质的自然分布;在文本到3D生成的外观建模中,材质网络预测的是潜在BRDF嵌入而非BRDF参数。
  • results: 我们的方法在生成材质的真实感与一致性方面优于现有方法,且高质量的材质自然地支持重光照、材质编辑等多种下游任务。
    Abstract Based on powerful text-to-image diffusion models, text-to-3D generation has made significant progress in generating compelling geometry and appearance. However, existing methods still struggle to recover high-fidelity object materials, either only considering Lambertian reflectance, or failing to disentangle BRDF materials from the environment lights. In this work, we propose Material-Aware Text-to-3D via LAtent BRDF auto-EncodeR (\textbf{MATLABER}) that leverages a novel latent BRDF auto-encoder for material generation. We train this auto-encoder with large-scale real-world BRDF collections and ensure the smoothness of its latent space, which implicitly acts as a natural distribution of materials. During appearance modeling in text-to-3D generation, the latent BRDF embeddings, rather than BRDF parameters, are predicted via a material network. Through exhaustive experiments, our approach demonstrates the superiority over existing ones in generating realistic and coherent object materials. Moreover, high-quality materials naturally enable multiple downstream tasks such as relighting and material editing. Code and model will be publicly available at \url{https://sheldontsui.github.io/projects/Matlaber}.
    摘要
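The latent BRDF auto-encoder idea can be pictured with the minimal sketch below, where a material network would later predict latent codes z instead of raw BRDF parameters; the BRDF parameterization, latent size, and smoothness regularizer are assumptions for illustration.

```python
# Illustrative sketch (dimensions invented): an auto-encoder over BRDF parameter
# vectors whose smooth latent space is later predicted by the material network.
import torch
import torch.nn as nn

BRDF_DIM = 7      # e.g. albedo (3) + roughness (1) + specular (3); an assumption
LATENT_DIM = 4

class LatentBRDFAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(BRDF_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, BRDF_DIM))
    def forward(self, brdf):
        z = self.enc(brdf)
        return self.dec(z), z

ae = LatentBRDFAutoEncoder()
brdf_batch = torch.rand(256, BRDF_DIM)              # stand-in for real-world measured BRDFs
recon, z = ae(brdf_batch)
recon_loss = ((recon - brdf_batch) ** 2).mean()
smooth_loss = 1e-3 * z.pow(2).mean()                # crude latent regulariser for smoothness
(recon_loss + smooth_loss).backward()

# During text-to-3D, a material network would predict z per surface point and a
# frozen decoder would map it back to plausible BRDF parameters.
```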

Progression-Guided Temporal Action Detection in Videos

  • paper_url: http://arxiv.org/abs/2308.09268
  • repo_url: https://github.com/makecent/apn
  • paper_authors: Chongkai Lu, Man-Wai Mak, Ruimin Li, Zheru Chi, Hong Fu
  • for: 本文提出了一种名为Action Progression Network(APN)的新框架,用于视频中的时序动作检测(TAD)。
  • methods: APN使用完整的动作过程来编码动作的时间结构,并训练神经网络识别动作进度;框架通过在视频中检测完整的动作过程来确定动作边界。
  • results: APN取得了具有竞争力的性能,并在检测持续时间较长的动作方面显著优于同类方法,在THUMOS14数据集上(IoU阈值0.5)取得58.3%的mAP,在DFMAD70数据集上取得98.9%的mAP。
    Abstract We present a novel framework, Action Progression Network (APN), for temporal action detection (TAD) in videos. The framework locates actions in videos by detecting the action evolution process. To encode the action evolution, we quantify a complete action process into 101 ordered stages (0\%, 1\%, ..., 100\%), referred to as action progressions. We then train a neural network to recognize the action progressions. The framework detects action boundaries by detecting complete action processes in the videos, e.g., a video segment with detected action progressions closely follow the sequence 0\%, 1\%, ..., 100\%. The framework offers three major advantages: (1) Our neural networks are trained end-to-end, contrasting conventional methods that optimize modules separately; (2) The APN is trained using action frames exclusively, enabling models to be trained on action classification datasets and robust to videos with temporal background styles differing from those in training; (3) Our framework effectively avoids detecting incomplete actions and excels in detecting long-lasting actions due to the fine-grained and explicit encoding of the temporal structure of actions. Leveraging these advantages, the APN achieves competitive performance and significantly surpasses its counterparts in detecting long-lasting actions. With an IoU threshold of 0.5, the APN achieves a mean Average Precision (mAP) of 58.3\% on the THUMOS14 dataset and 98.9\% mAP on the DFMAD70 dataset.
    摘要 我们提出了一个新的框架——Action Progression Network(APN),用于视频中的时序动作检测(TAD)。该框架通过检测动作的演进过程来定位视频中的动作。为编码动作演进,我们将一个完整的动作过程量化为101个有序阶段(0%、1%、...、100%),称为动作进度,然后训练神经网络来识别这些动作进度。框架通过在视频中检测完整的动作过程来确定动作边界,例如某一视频片段中检测到的动作进度紧密遵循0%、1%、...、100%的序列。该框架具有三大优点:(1)我们的神经网络端到端训练,不同于传统方法分别优化各模块;(2)APN仅使用动作帧进行训练,因此模型可以在动作分类数据集上训练,并对时间背景风格与训练数据不同的视频具有鲁棒性;(3)由于对动作时间结构进行了细粒度且显式的编码,我们的框架能够有效避免检测不完整的动作,并在检测持续时间较长的动作方面表现出色。凭借这些优点,APN取得了具有竞争力的性能,并在长时动作检测上显著超越同类方法:在IoU阈值为0.5时,APN在THUMOS14数据集上取得58.3%的mAP,在DFMAD70数据集上取得98.9%的mAP。
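To illustrate the action-progression encoding, the sketch below shows how per-frame progression labels could be generated for a trimmed clip and how a complete 0→100 sweep could be located in an untrimmed video; the detection heuristic here is a simplistic stand-in for the paper's boundary-detection procedure.

```python
# Sketch of both sides of the idea: (i) turning a trimmed action clip into
# per-frame progression labels 0..100, and (ii) locating a complete action in an
# untrimmed video by finding a span whose predicted progressions sweep 0 -> 100.
import numpy as np

def progression_labels(num_frames, num_stages=101):
    """Frame t of a complete action gets its percentage of completion as the label."""
    return np.round(np.linspace(0, num_stages - 1, num_frames)).astype(int)

def detect_complete_actions(pred_progressions, min_len=16, tol=10):
    """pred_progressions: per-frame predictions (0..100) on an untrimmed video.
    Returns (start, end) spans that begin near 0 and end near 100."""
    spans, start = [], None
    for t, p in enumerate(pred_progressions):
        if p <= tol and start is None:
            start = t
        elif p >= 100 - tol and start is not None and t - start >= min_len:
            spans.append((start, t))
            start = None
    return spans

print(progression_labels(8))                       # [0 14 29 43 57 71 86 100]
fake_preds = np.concatenate([np.full(30, 50), np.linspace(0, 100, 40), np.full(20, 50)])
print(detect_complete_actions(fake_preds))         # one span covering the 0 -> 100 sweep
```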

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

  • paper_url: http://arxiv.org/abs/2308.09244
  • repo_url: https://github.com/mcg-nju/sparsebev
  • paper_authors: Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, Limin Wang
  • for: This paper focuses on developing a fully sparse 3D object detector, SparseBEV, to mitigate the performance gap between sparse and dense detectors in camera-based 3D object detection.
  • methods: SparseBEV uses a query-based paradigm without explicit dense BEV feature construction, and includes three key designs: scale-adaptive self attention, adaptive spatio-temporal sampling, and adaptive mixing.
  • results: On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS.
    Abstract Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other side, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than the dense counterparts. In this paper, we find that the key to mitigate this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.
    摘要 过去几年,基于鸟瞰图(BEV)空间的相机3D目标检测受到了广泛关注。密集检测器通常遵循两阶段流程:先构建密集的BEV特征,再在BEV空间中进行目标检测,这会带来复杂的视角变换和高计算成本。另一方面,稀疏检测器遵循基于查询的范式,无需显式构建密集BEV特征,但性能不及密集检测器。在本文中,我们发现缓解这一性能差距的关键在于检测器在BEV空间和图像空间中的自适应能力。为实现这一目标,我们提出SparseBEV,一个性能超越密集检测器的完全稀疏3D目标检测器。SparseBEV包含三个关键设计:(1)尺度自适应自注意力,在BEV空间中以自适应感受野聚合特征;(2)自适应时空采样,在查询的引导下生成采样位置;(3)自适应混合,利用由查询生成的动态权重对采样特征进行解码。在nuScenes测试集上,SparseBEV取得了67.5 NDS的最先进性能;在验证集上,SparseBEV在保持23.5 FPS实时推理速度的同时取得55.8 NDS。代码可在 https://github.com/MCG-NJU/SparseBEV 获取。
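One way to picture the scale-adaptive self attention among object queries is the hedged sketch below, where a per-query scale predicted from the query feature penalizes attention to spatially distant queries; the exact formulation in SparseBEV may differ, and the other two modules (adaptive sampling and mixing) are not shown.

```python
# Hedged sketch of scale-adaptive self attention among object queries: a per-query
# scale factor shrinks or widens the receptive field by down-weighting attention to
# queries whose centers are far away. The paper's exact formulation may differ.
import torch
import torch.nn as nn

class ScaleAdaptiveSelfAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.tau = nn.Linear(dim, 1)     # per-query receptive-field scale

    def forward(self, feats, centers):
        """feats: (B, N, dim) query features; centers: (B, N, 3) query centers in BEV/3D."""
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5          # (B, N, N)
        dist = torch.cdist(centers, centers)                          # pairwise distances
        logits = logits - torch.relu(self.tau(feats)) * dist          # adaptive locality prior
        return torch.softmax(logits, dim=-1) @ v

attn = ScaleAdaptiveSelfAttention()
out = attn(torch.randn(2, 50, 256), torch.rand(2, 50, 3) * 50)        # (2, 50, 256)
```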

ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation

  • paper_url: http://arxiv.org/abs/2308.09242
  • repo_url: https://github.com/isee-laboratory/asag
  • paper_authors: Shenghao Fu, Junkai Yan, Yipeng Gao, Xiaohua Xie, Wei-Shi Zheng
  • for: 提高目标检测的速度与准确率,弥合稀疏检测器与密集检测器之间的性能差距。
  • methods: 提出自适应稀疏锚点生成器(ASAG),以稀疏方式在图像块(而非网格)上动态预测锚点以缓解特征冲突问题,并使用简单有效的查询加权(Query Weighting)方法缓解训练不稳定。
  • results: 相比密集初始化及其他方法,实现了更好的速度-准确率权衡,并在实验中表现出色。
    Abstract Recent sparse detectors with multiple, e.g. six, decoder layers achieve promising performance but much inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. Thus we propose Adaptive Sparse Anchor Generator (ASAG) which predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. For each image, ASAG dynamically selects which feature maps and which locations to predict, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability from adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at \url{https://github.com/iSEE-Laboratory/ASAG}.
    摘要 近期具有多个(如六个)解码器层的稀疏检测器取得了可观的性能,但复杂的检测头导致推理耗时较长。先前的工作探索了以密集先验作为初始化来构建单解码器层检测器,虽然获得了显著加速,但其性能仍大幅落后于六层解码器的对应方法。本文旨在弥合这一性能差距,同时保持快速的推理速度。我们发现密集与稀疏检测器之间的架构差异会导致特征冲突,从而限制单解码器层检测器的性能。为此,我们提出自适应稀疏锚点生成器(ASAG),它以稀疏方式在图像块(而非网格)上预测动态锚点,从而缓解特征冲突问题。对于每张图像,ASAG动态选择在哪些特征图、哪些位置进行预测,形成完全自适应的图像特定锚点生成方式。此外,我们提出一种简单有效的查询加权(Query Weighting)方法,以缓解自适应性带来的训练不稳定。大量实验表明,我们的方法优于密集初始化的方法,并实现了更好的速度-精度权衡。代码可在 https://github.com/iSEE-Laboratory/ASAG 获取。

Deep Boosting Multi-Modal Ensemble Face Recognition with Sample-Level Weighting

  • paper_url: http://arxiv.org/abs/2308.09234
  • repo_url: None
  • paper_authors: Sahar Rahimi Malakshan, Mohammad Saeed Ebrahimi Saadabadi, Nima Najafzadeh, Nasser M. Nasrabadi
  • for: 提高人脸识别(FR)模型的泛化能力,解决现有训练数据的质量不均问题。
  • methods: 使用多模型增强技术,即AdaBoost的sample-level weighting方法,使得不同模型在不同样本困难程度上拥有专家性。
  • results: 在CFP-FP、LFW、CPLFW、CALFW、AgeDB、TinyFace、IJB-B和IJB-C评估 datasets上实现了比 estado-of-the-art 性能。
    Abstract Deep convolutional neural networks have achieved remarkable success in face recognition (FR), partly due to the abundant data availability. However, the current training benchmarks exhibit an imbalanced quality distribution; most images are of high quality. This poses issues for generalization on hard samples since they are underrepresented during training. In this work, we employ the multi-model boosting technique to deal with this issue. Inspired by the well-known AdaBoost, we propose a sample-level weighting approach to incorporate the importance of different samples into the FR loss. Individual models of the proposed framework are experts at distinct levels of sample hardness. Therefore, the combination of models leads to a robust feature extractor without losing the discriminability on the easy samples. Also, for incorporating the sample hardness into the training criterion, we analytically show the effect of sample mining on the important aspects of current angular margin loss functions, i.e., margin and scale. The proposed method shows superior performance in comparison with the state-of-the-art algorithms in extensive experiments on the CFP-FP, LFW, CPLFW, CALFW, AgeDB, TinyFace, IJB-B, and IJB-C evaluation datasets.
    摘要 深度卷积神经网络在人脸识别(FR)领域取得了很大成功,部分原因在于数据的充足。然而,当前的训练基准呈现出不均衡的质量分布:大多数图像质量较高。这导致困难样本在训练中代表性不足,从而影响模型在困难样本上的泛化能力。在本工作中,我们使用多模型增强(boosting)技术来解决这一问题。受经典AdaBoost启发,我们提出了一种样本级加权方法,将不同样本的重要性纳入FR损失函数。框架中的各个模型分别擅长处理不同困难程度的样本,因此模型的组合可以得到一个鲁棒的特征提取器,同时不损失在简单样本上的判别能力。此外,为了将样本困难程度纳入训练准则,我们从理论上分析了样本挖掘对当前角度间隔(angular margin)损失函数的两个关键方面——间隔(margin)与尺度(scale)——的影响。所提方法在CFP-FP、LFW、CPLFW、CALFW、AgeDB、TinyFace、IJB-B和IJB-C等评估数据集上的大量实验中表现优于最先进算法。

CCFace: Classification Consistency for Low-Resolution Face Recognition

  • paper_url: http://arxiv.org/abs/2308.09230
  • repo_url: None
  • paper_authors: Mohammad Saeed Ebrahimi Saadabadi, Sahar Rahimi Malakshan, Hossein Kashiani, Nasser M. Nasrabadi
  • for: 提高低分辨率人脸识别的性能
  • methods: 使用分类一致性知识蒸馏与自适应角度惩罚,以及非对称跨分辨率学习
  • results: 在TinyFace等低分辨率基准上提升约三个百分点,同时保持在高分辨率基准上的性能
    Abstract In recent years, deep face recognition methods have demonstrated impressive results on in-the-wild datasets. However, these methods have shown a significant decline in performance when applied to real-world low-resolution benchmarks like TinyFace or SCFace. To address this challenge, we propose a novel classification consistency knowledge distillation approach that transfers the learned classifier from a high-resolution model to a low-resolution network. This approach helps in finding discriminative representations for low-resolution instances. To further improve the performance, we designed a knowledge distillation loss using the adaptive angular penalty inspired by the success of the popular angular margin loss function. The adaptive penalty reduces overfitting on low-resolution samples and alleviates the convergence issue of the model integrated with data augmentation. Additionally, we utilize an asymmetric cross-resolution learning approach based on the state-of-the-art semi-supervised representation learning paradigm to improve discriminability on low-resolution instances and prevent them from forming a cluster. Our proposed method outperforms state-of-the-art approaches on low-resolution benchmarks, with a three percent improvement on TinyFace while maintaining performance on high-resolution benchmarks.
    摘要

Generalized Sum Pooling for Metric Learning

  • paper_url: http://arxiv.org/abs/2308.09228
  • repo_url: https://github.com/yetigurbuz/generalized-sum-pooling
  • paper_authors: Yeti Z. Gurbuz, Ozan Sener, A. Aydın Alatan
  • for: 该论文主要研究了深度度量学中的核心选择方法,即全局平均 pooling (GAP) 的扩展和改进。
  • methods: 该论文提出了一种名为泛化总和Pooling (GSP) 的新方法,它可以更好地选择Semantic entity,并学习每个 Semantic entity 的重要性。
  • results: 论文通过广泛的实验证明了 GSP 方法的效果,在四个Popular metric learning benchmark上表现出色,代替 GAP 方法可以更好地进行深度度量学。
    Abstract A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework
    摘要 深度度量学习中常见的架构选择是卷积神经网络后接全局平均池化(GAP)。GAP虽然简单,却是一种非常有效的信息聚合方式。一种可能的解释是,将每个特征向量视为代表不同的语义实体,而GAP是它们的凸组合。基于这一视角,我们对GAP进行推广,提出了可学习的广义求和池化方法(GSP)。GSP在两个方面改进了GAP:一是能够选择语义实体的子集,从而有效地学习忽略无关信息;二是能够学习对应各实体重要性的权重。形式上,我们提出了一个熵平滑的最优传输问题,并证明它是GAP的严格推广,即该问题的一个特定实现可以还原为GAP。我们证明该优化问题具有解析梯度,因此可以直接作为GAP的可学习替代。我们还提出了一种零样本损失(zero-shot loss)以简化GSP的学习。我们在4个流行的度量学习基准上进行了广泛评估,验证了方法的有效性。代码见:GSP-DML Framework。

DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

  • paper_url: http://arxiv.org/abs/2308.09223
  • repo_url: https://github.com/hexiaoxiao-cs/dmcvr
  • paper_authors: Xiaoxiao He, Chaowei Tan, Ligong Han, Bo Liu, Leon Axel, Kang Li, Dimitris N. Metaxas
  • for: 提高心脏疾病诊断和治疗规划的准确3D心脏重建
  • methods: 使用形态导航推 diffusion模型(DMCVR)Synthesize高解度2D图像和对应的3D重建体积
  • results: 比前方法高效,能生成高解度3D心脏MRI重建图像,提高心脏疾病诊断和治疗规划的准确性
    Abstract Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks, we propose a morphology-guided diffusion model for 3D cardiac volume reconstruction, DMCVR, that synthesizes high-resolution 2D images and corresponding 3D reconstructed volumes. Our method outperforms previous approaches by conditioning the cardiac morphology on the generative model, eliminating the time-consuming iterative optimization process of the latent code, and improving generation quality. The learned latent spaces provide global semantics, local cardiac morphology and details of each 2D cMRI slice with highly interpretable value to reconstruct 3D cardiac shape. Our experiments show that DMCVR is highly effective in several aspects, such as 2D generation and 3D reconstruction performance. With DMCVR, we can produce high-resolution 3D cardiac MRI reconstructions, surpassing current techniques. Our proposed framework has great potential for improving the accuracy of cardiac disease diagnosis and treatment planning. Code can be accessed at https://github.com/hexiaoxiao-cs/DMCVR.
    摘要 从电影磁共振成像(cine MRI)中进行精确的3D心脏重建,对于改善心血管疾病诊断和理解心脏运动至关重要。然而,目前临床使用的基于心脏MRI的重建技术是2D的,层间分辨率有限,导致重建的心脏体积质量较低。为了更好地从稀疏的2D图像堆栈中重建3D心脏体积,我们提出了一种形态引导的扩散模型DMCVR,用于合成高分辨率的2D图像及其对应的3D重建体积。我们的方法通过在生成模型中以心脏形态作为条件,省去了对潜在编码耗时的迭代优化过程,并提升了生成质量,从而优于先前方法。学习得到的潜在空间提供了全局语义、局部心脏形态以及每个2D cine MRI切片的细节信息,具有很强的可解释性,有助于重建3D心脏形状。实验表明,DMCVR在2D生成和3D重建性能等多个方面都非常有效。借助DMCVR,我们可以生成超越现有技术的高分辨率3D心脏MRI重建。所提出的框架在提高心脏疾病诊断和治疗规划准确性方面具有巨大潜力。代码可在 https://github.com/hexiaoxiao-cs/DMCVR 获取。

A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery

  • paper_url: http://arxiv.org/abs/2308.09221
  • repo_url: None
  • paper_authors: Sam Khallaghi, J. Ronald Eastman, Lyndon D. Estes
  • for: 本文旨在提供对遥感图像semantic segmentation(分类) tasks的技术因素的全面审查,帮助研究人员和实践者更好地理解这个领域中的 neural network 设计因素。
  • methods: 本文详细介绍了 CNNs、RNNs、GANs 和 transformer 模型,并讨论了这些 ANN 家族中的显著设计特征和其对 semantic segmentation 的影响。同时,也涵盖了常见的预处理技术,包括图像normalization和chipping,以及如何处理训练样本数据不均衡的问题,以及如何使用扩展学习、转移学习和领域适应来解决有限数据的问题。
  • results: 本文提供了一个全面和最新的理解,涵盖了遥感图像semantic segmentation tasks中 neural network 设计因素的技术和数据相关因素。
    Abstract Semantic segmentation (classification) of Earth Observation imagery is a crucial task in remote sensing. This paper presents a comprehensive review of technical factors to consider when designing neural networks for this purpose. The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and transformer models, discussing prominent design patterns for these ANN families and their implications for semantic segmentation. Common pre-processing techniques for ensuring optimal data preparation are also covered. These include methods for image normalization and chipping, as well as strategies for addressing data imbalance in training samples, and techniques for overcoming limited data, including augmentation techniques, transfer learning, and domain adaptation. By encompassing both the technical aspects of neural network design and the data-related considerations, this review provides researchers and practitioners with a comprehensive and up-to-date understanding of the factors involved in designing effective neural networks for semantic segmentation of Earth Observation imagery.
    摘要 对地观测(Earth Observation)影像的语义分割(分类)是遥感领域中的一项重要任务。本文全面综述了为此目的设计神经网络时需要考虑的技术因素,重点介绍卷积神经网络(CNN)、循环神经网络(RNN)、生成对抗网络(GAN)和Transformer模型,讨论这些网络家族的主要设计模式及其对语义分割的影响。文中还涵盖了确保数据准备质量的常用预处理技术,包括图像归一化与切片(chipping)方法、处理训练样本类别不均衡的策略,以及应对数据有限问题的技术,如数据增强、迁移学习和领域自适应。通过同时涵盖神经网络设计的技术层面和数据相关的考虑因素,本综述为研究者和实践者提供了关于地观测影像语义分割神经网络设计要素的全面且最新的认识。
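As a concrete illustration of two of the pre-processing steps surveyed above, the snippet below standardizes each band of a multi-band scene and chips it into fixed-size training tiles; band count, chip size, and stride are arbitrary example values.

```python
# Simple sketch of two pre-processing steps discussed in the review: per-band
# normalisation of a multi-band Earth Observation image and "chipping" it into
# fixed-size training tiles aligned with the label mask.
import numpy as np

def normalize_bands(image, eps=1e-6):
    """image: (bands, H, W). Standardise each band independently."""
    mean = image.mean(axis=(1, 2), keepdims=True)
    std = image.std(axis=(1, 2), keepdims=True)
    return (image - mean) / (std + eps)

def chip(image, mask, chip_size=256, stride=256):
    """Cut an image and its label mask into aligned chips; drops ragged borders."""
    _, h, w = image.shape
    chips = []
    for top in range(0, h - chip_size + 1, stride):
        for left in range(0, w - chip_size + 1, stride):
            chips.append((image[:, top:top + chip_size, left:left + chip_size],
                          mask[top:top + chip_size, left:left + chip_size]))
    return chips

scene = np.random.rand(4, 1024, 1024).astype(np.float32)   # e.g. 4-band imagery
labels = np.random.randint(0, 5, (1024, 1024))
pairs = chip(normalize_bands(scene), labels)
print(len(pairs), pairs[0][0].shape)                        # 16 (4, 256, 256)
```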

LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis

  • paper_url: http://arxiv.org/abs/2308.10713
  • repo_url: None
  • paper_authors: Di Chang, Yufeng Yin, Zongjian Li, Minh Tran, Mohammad Soleymani
  • for: 这个论文主要目的是提出一个开源的人工智能工具kit,用于实时和离线的表情分析。
  • methods: 这个工具kit使用了深度学习模型,包括表情动作单元(AU)检测、AU强度估计和表情识别。具体来说,我们使用了一个大规模预训练的网络、特征知识填充和任务特定的细化调整等技术,以便准确地分析人脸表情。
  • results: 在Action Unit(AU)强度估计方面,我们在DISFA上达到了0.63的佩森相关系数(PCC),比OpenFace 2.0的性能高7%,同时保持高效的推理,运行速度两倍于OpenFace 2.0。尽管占用空间小,我们的模型也能够与当前最佳表情分析方法在AffecNet、FFHQ和RAFDB等 datasets上达到竞争性表现。
    Abstract Facial expression analysis is an important tool for human-computer interaction. In this paper, we introduce LibreFace, an open-source toolkit for facial expression analysis. This open-source toolbox offers real-time and offline analysis of facial behavior through deep learning models, including facial action unit (AU) detection, AU intensity estimation, and facial expression recognition. To accomplish this, we employ several techniques, including the utilization of a large-scale pre-trained network, feature-wise knowledge distillation, and task-specific fine-tuning. These approaches are designed to effectively and accurately analyze facial expressions by leveraging visual information, thereby facilitating the implementation of real-time interactive applications. In terms of Action Unit (AU) intensity estimation, we achieve a Pearson Correlation Coefficient (PCC) of 0.63 on DISFA, which is 7% higher than the performance of OpenFace 2.0 while maintaining highly-efficient inference that runs two times faster than OpenFace 2.0. Despite being compact, our model also demonstrates competitive performance to state-of-the-art facial expression analysis methods on AffecNet, FFHQ, and RAFDB. Our code will be released at https://github.com/ihp-lab/LibreFace

TinyProp – Adaptive Sparse Backpropagation for Efficient TinyML On-device Learning

  • paper_url: http://arxiv.org/abs/2308.09201
  • repo_url: None
  • paper_authors: Marcus Rüb, Daniel Maier, Daniel Mueller-Gritschneder, Axel Sikora
  • for: 这篇论文旨在提出一种可以在低功耗微控制器单元(MCU)上进行设备端学习或微调深度神经网络的训练方法,以减少训练时间和计算负载。
  • methods: 论文提出了一种名为TinyProp的稀疏反向传播方法,可在MCU上进行设备端学习,并在每个训练步骤中动态调整反向传播比例,以兼顾训练效率与精度。
  • results: 结果表明,TinyProp比非稀疏训练更快且精度损失很小。具体而言,在MNIST、DCASE2020和CIFAR10三个数据集上,TinyProp相比非稀疏训练平均快5倍,平均精度损失约1%,且额外的计算开销很小;与现有的静态稀疏反向传播方法相比,TinyProp平均快2.9倍,精度损失平均减少约6%。
    Abstract Training deep neural networks using backpropagation is very memory and computationally intensive. This makes it difficult to run on-device learning or fine-tune neural networks on tiny, embedded devices such as low-power micro-controller units (MCUs). Sparse backpropagation algorithms try to reduce the computational load of on-device learning by training only a subset of the weights and biases. Existing approaches use a static number of weights to train. A poor choice of this so-called backpropagation ratio limits either the computational gain or can lead to severe accuracy losses. In this paper we present TinyProp, the first sparse backpropagation method that dynamically adapts the back-propagation ratio during on-device training for each training step. TinyProp induces a small calculation overhead to sort the elements of the gradient, which does not significantly impact the computational gains. TinyProp works particularly well on fine-tuning trained networks on MCUs, which is a typical use case for embedded applications. For typical datasets from three datasets MNIST, DCASE2020 and CIFAR10, we are 5 times faster compared to non-sparse training with an accuracy loss of on average 1%. On average, TinyProp is 2.9 times faster than existing, static sparse backpropagation algorithms and the accuracy loss is reduced on average by 6 % compared to a typical static setting of the back-propagation ratio.
    摘要 使用反向传播训练深度神经网络会占用大量内存和计算资源,这使得在低功耗微控制器单元(MCU)等微型嵌入式设备上进行设备端学习或微调神经网络变得困难。稀疏反向传播算法试图通过只训练部分权重和偏置来减少设备端学习的计算负担,但现有方法使用固定数量的权重进行训练:反向传播比例选择不当,要么限制计算收益,要么导致严重的精度损失。在本文中,我们提出TinyProp,这是第一种在设备端训练过程中针对每个训练步骤动态调整反向传播比例的稀疏反向传播方法。TinyProp仅引入少量用于对梯度元素排序的计算开销,不会显著影响计算收益。TinyProp尤其适用于在MCU上微调已训练好的网络,这是嵌入式应用的典型场景。对于MNIST、DCASE2020和CIFAR10三个典型数据集,我们相比非稀疏训练快5倍,平均精度损失约为1%。与现有的静态稀疏反向传播算法相比,TinyProp平均快2.9倍,且相比典型的静态反向传播比例设置,精度损失平均减少6%。
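The core mechanic — propagating only the largest-magnitude error components and letting the kept fraction vary per training step — can be illustrated with the toy NumPy MLP below. The adaptation rule shown (tying the ratio to the current loss) is a placeholder and not TinyProp's actual formula.

```python
# Toy NumPy sketch: keep only the top-k fraction of the backpropagated error and
# adapt that fraction over training steps (the adaptation rule is a placeholder).
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (16, 8)), rng.normal(0, 0.1, (8, 4))   # tiny 16-8-4 MLP
x, y = rng.normal(size=(32, 16)), np.eye(4)[rng.integers(0, 4, 32)]

def topk_mask(a, ratio):
    """Zero out all but the `ratio` fraction of largest-|a| entries per sample."""
    k = max(1, int(a.shape[1] * ratio))
    thresh = np.sort(np.abs(a), axis=1)[:, -k][:, None]
    return np.where(np.abs(a) >= thresh, a, 0.0)

lr = 0.1
for step in range(100):
    h = np.tanh(x @ W1)
    p = np.exp(h @ W2); p /= p.sum(1, keepdims=True)        # softmax
    loss = -np.mean(np.sum(y * np.log(p + 1e-9), axis=1))

    bp_ratio = min(1.0, 0.2 + loss)       # adaptive: propagate more when loss is high
    d2 = topk_mask((p - y) / len(x), bp_ratio)              # sparsified output error
    d1 = topk_mask((d2 @ W2.T) * (1 - h ** 2), bp_ratio)    # sparsified hidden error
    W2 -= lr * h.T @ d2
    W1 -= lr * x.T @ d1
print(round(loss, 3))
```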

FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.09160
  • repo_url: https://github.com/imguangyu/fedperfix
  • paper_authors: Guangyu Sun, Matias Mendieta, Jun Luo, Shandong Wu, Chen Chen
  • for: 提高个性化联邦学习(PFL)的效率和个性化能力
  • methods: 研究如何对Vision Transformer(ViT)模型进行部分个性化,并提出了一种基于插件的方法——FedPerfix
  • results: 在CIFAR-100、OrganAMNIST和Office-Home等数据集上进行了实验,证明该方法相比多种先进的个性化联邦学习方法能够提升模型性能
    Abstract Personalized Federated Learning (PFL) represents a promising solution for decentralized learning in heterogeneous data environments. Partial model personalization has been proposed to improve the efficiency of PFL by selectively updating local model parameters instead of aggregating all of them. However, previous work on partial model personalization has mainly focused on Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can be applied to other popular models such as Vision Transformers (ViTs). In this work, we investigate where and how to partially personalize a ViT model. Specifically, we empirically evaluate the sensitivity to data distribution of each type of layer. Based on the insights that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix, which leverages plugins to transfer information from the aggregated model to the local client as a personalization. Finally, we evaluate the proposed approach on CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness in improving the model's performance compared to several advanced PFL methods.
    摘要 个性化联邦学习(PFL)是在异构数据环境中进行去中心化学习的一种有前景的解决方案。部分模型个性化通过有选择地更新本地模型参数而非聚合全部参数来提高PFL的效率。然而,以往关于部分模型个性化的工作主要集中在卷积神经网络(CNN)上,对于如何将其应用于Vision Transformer(ViT)等其他流行模型仍缺乏研究。在本工作中,我们研究了应在ViT模型的哪些部分、以何种方式进行部分个性化。具体来说,我们通过实验评估了每类层对数据分布的敏感程度。基于自注意力层和分类头是ViT中最敏感部分这一发现,我们提出了一种名为FedPerfix的新方法,利用插件将聚合模型的信息传递给本地客户端,作为个性化手段。最后,我们在CIFAR-100、OrganAMNIST和Office-Home数据集上评估了所提方法,并证明其相比多种先进的PFL方法能够有效提升模型性能。
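
A small sketch of the "which parts stay local" decision described above: split a ViT's parameters into personalized (self-attention and classification head) versus server-aggregated tensors. Matching by substring over torchvision's ViT parameter names is a simplification and an assumption, not FedPerfix's actual plugin mechanism.

```python
from torchvision.models import vit_b_16

def split_personalized_params(model, keywords=("self_attention", "heads")):
    """Partition parameters into locally-kept (personalized) vs. server-aggregated (shared).

    The keyword list follows torchvision's ViT naming and is an assumption for illustration.
    """
    personalized, shared = {}, {}
    for name, param in model.named_parameters():
        if any(k in name for k in keywords):
            personalized[name] = param
        else:
            shared[name] = param
    return personalized, shared

# Toy usage: count how many tensors each client would keep private vs. send for aggregation.
model = vit_b_16(weights=None)
personal, shared = split_personalized_params(model)
print(f"personalized tensors: {len(personal)}, shared tensors: {len(shared)}")
```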

Semi-sparsity Priors for Image Structure Analysis and Extraction

  • paper_url: http://arxiv.org/abs/2308.09141
  • repo_url: None
  • paper_authors: Junqing Huang, Haihui Wang, Michael Ruzhansky
  • for: 图像结构-纹理分解与图像结构提取
  • methods: 提出一种广义的半稀疏正则化框架,将图像结构从复杂的纹理背景中解耦出来
  • results: 能够在保持图像结构的同时避免在多项式平滑曲面上引入阶梯状伪影,也适用于分解具有强振荡模式的图像纹理,并取得与先进方法相当或更优的分解结果
    Abstract Image structure-texture decomposition is a long-standing and fundamental problem in both image processing and computer vision fields. In this paper, we propose a generalized semi-sparse regularization framework for image structural analysis and extraction, which allows us to decouple the underlying image structures from complicated textural backgrounds. Combining with different textural analysis models, such a regularization receives favorable properties differing from many traditional methods. We demonstrate that it is not only capable of preserving image structures without introducing notorious staircase artifacts in polynomial-smoothing surfaces but is also applicable for decomposing image textures with strong oscillatory patterns. Moreover, we also introduce an efficient numerical solution based on an alternating direction method of multipliers (ADMM) algorithm, which gives rise to a simple and maneuverable way for image structure-texture decomposition. The versatility of the proposed method is finally verified by a series of experimental results with the capability of producing comparable or superior image decomposition results against cutting-edge methods.
    摘要 图像结构-纹理分解是图像处理和计算机视觉领域中一个长期存在的基础问题。在本文中,我们提出了一种广义的半稀疏正则化框架,用于图像结构分析与提取,使我们能够将底层图像结构从复杂的纹理背景中解耦出来。与不同的纹理分析模型相结合,这种正则化具有许多传统方法所不具备的优良性质。我们证明,它不仅能够在保持图像结构的同时避免在多项式平滑曲面上引入臭名昭著的阶梯状伪影,还适用于分解具有强振荡模式的图像纹理。此外,我们还给出了一种基于交替方向乘子法(ADMM)的高效数值求解方案,从而为图像结构-纹理分解提供了一种简单、易用的途径。最后,一系列实验结果验证了所提方法的通用性,其分解结果可与当前最先进的方法相当甚至更优。
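
Since the paper's solver is ADMM-based, the skeleton below shows the alternating-direction structure on a deliberately simple 1-D problem (first-order L1 smoothing with periodic boundaries). It is only an illustration of the splitting/shrinkage/dual-update pattern; the actual semi-sparsity prior involves higher-order terms and a different splitting.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def admm_l1_smoothing(f, lam=0.1, rho=1.0, iters=100):
    """Generic 1-D ADMM skeleton for min_u 0.5||u-f||^2 + lam*||Du||_1, D = periodic
    forward difference. Illustrates the alternating-direction structure only."""
    n = f.size
    u = f.copy()
    d = np.zeros_like(f)   # auxiliary variable standing in for Du
    b = np.zeros_like(f)   # scaled dual variable
    eig = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(n) / n)   # eigenvalues of D^T D
    for _ in range(iters):
        # u-update: (I + rho*D^T D) u = f + rho*D^T(d - b), solved with the FFT.
        rhs = f + rho * (np.roll(d - b, 1) - (d - b))
        u = np.real(np.fft.ifft(np.fft.fft(rhs) / (1.0 + rho * eig)))
        Du = np.roll(u, -1) - u
        d = soft_threshold(Du + b, lam / rho)   # d-update: elementwise shrinkage
        b = b + Du - d                          # dual update
    return u

# Toy usage: smooth a noisy step signal.
signal = np.concatenate([np.zeros(64), np.ones(64)]) + 0.1 * np.random.randn(128)
print(admm_l1_smoothing(signal).shape)
```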

The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.09139
  • repo_url: https://github.com/giaczara/dallv
  • paper_authors: Giacomo Zara, Alessandro Conti, Subhankar Roy, Stéphane Lathuilière, Paolo Rota, Elisa Ricci
  • for: 这篇论文旨在解决无源视频无监督领域自适应(SFVUDA)问题:在不访问源数据的前提下,将动作识别模型适配到无标签的目标数据集。
  • methods: 论文利用大型语言-视觉模型(LLVM)提供的"网络监督"来处理SFVUDA任务,并提出了名为"Domain Adaptation with Large Language-Vision models"(简称DALL-V)的方法,将LLVM蕴含的世界先验与互补的源模型信息蒸馏到一个面向目标域的学生网络中。
  • results: 尽管方法简单,DALL-V相比当前最先进的SFVUDA方法仍取得了显著提升。
    Abstract Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. The previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite the simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.
    摘要 无源视频无监督领域自适应(SFVUDA)任务旨在在不访问源数据的情况下,将基于有标签源数据集训练的动作识别模型适配到无标签的目标数据集。以往的方法尝试利用源自目标数据本身的自监督信号(例如强制时间一致性)来解决SFVUDA。在本工作中,我们采取一种正交的思路,利用大型语言-视觉模型(LLVM)提供的"网络监督",其依据是LLVM蕴含的丰富世界先验对领域偏移具有出人意料的鲁棒性。我们设计了一种直观且参数高效的方法,称为Domain Adaptation with Large Language-Vision models(DALL-V),将世界先验与互补的源模型信息蒸馏到一个面向目标域的学生网络中。尽管方法简单,DALL-V相比最先进的SFVUDA方法取得了显著提升。
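
As a rough sketch of distilling a language-vision prior plus a source model into a target-tailored student, the step below mixes the two teacher signals and minimizes a KL divergence on unlabelled target clips. The convex-combination mixing rule, `alpha`, and temperature are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dallv_style_distill_step(student, clip_logits, source_logits, clips, optimizer,
                             alpha=0.5, temperature=2.0):
    """One distillation step on unlabelled target clips: the teacher signal is a mix of
    zero-shot language-vision logits and source-model logits (both precomputed)."""
    with torch.no_grad():
        teacher = alpha * clip_logits + (1.0 - alpha) * source_logits
        teacher_probs = F.softmax(teacher / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(clips) / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```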

ICAR: Image-based Complementary Auto Reasoning

  • paper_url: http://arxiv.org/abs/2308.09119
  • repo_url: None
  • paper_authors: Xijun Wang, Anqi Liang, Junbang Liang, Ming Lin, Yu Lou, Shan Yang
  • for: addresses the challenging task of scene-aware complementary item retrieval (CIR), which requires generating a set of compatible items across domains.
  • methods: proposes a visual compatibility concept based on similarity and complementarity, and a category-aware Flexible Bidirectional Transformer (FBT) framework for visual “scene-based set compatibility reasoning” with cross-domain visual similarity input and auto-regressive complementary item generation.
  • results: achieves up to 5.3% and 9.6% in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively, compared with state-of-the-art methods.
    Abstract Scene-aware Complementary Item Retrieval (CIR) is a challenging task which requires to generate a set of compatible items across domains. Due to the subjectivity, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resembling in color, geometry, texture, and etc.) and complementarity (different items like table vs chair completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with the cross-domain visual similarity input and auto-regressive complementary item generation. We introduce a "Flexible Bidirectional Transformer (FBT)" consisting of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. And the inputs for FBT are cross-domain visual similarity invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves up to 5.3% and 9.6% in FITB score and 22.3% and 31.8% SFID improvement on fashion and furniture, respectively.
    摘要 场景感知互补商品检索(CIR)是一项具有挑战性的任务,需要跨领域生成一组相互兼容的商品。由于主观性,很难为数据采集和学习目标制定严格的标准。为了解决这一挑战,我们提出了视觉兼容性概念,由相似性(在颜色、几何形状、纹理等方面相近)和互补性(如桌子与椅子这类不同物品组成一个整体)构成。基于这一概念,我们提出了一种兼容性学习框架——类别感知的灵活双向Transformer(FBT),用于视觉"基于场景的集合兼容性推理",其输入为跨领域视觉相似性不变的嵌入,并以自回归方式生成互补商品。FBT由带灵活掩码的编码器、类别预测分支和自回归视觉嵌入预测分支组成;由于输入是跨领域视觉相似性不变的嵌入,该框架具有很强的通用性。此外,所提出的FBT模型以自监督方式从大量场景图像中学习物体间的兼容性。与最先进的方法相比,该方法在时尚和家具领域分别取得了最高5.3%和9.6%的FITB分数提升,以及22.3%和31.8%的SFID提升。

JPEG Quantized Coefficient Recovery via DCT Domain Spatial-Frequential Transformer

  • paper_url: http://arxiv.org/abs/2308.09110
  • repo_url: None
  • paper_authors: Mingyu Ouyang, Zhenzhong Chen
  • for: 提高JPEG压缩图像的恢复效果,并能处理各种压缩质量因子。
  • methods: 提出了DCT域的Transformer模型,采用双分支架构捕捉空间与频率相关性,并引入量化矩阵嵌入和亮度-色度对齐头。
  • results: 与当前最先进的JPEG伪影去除方法相比,取得了更好的恢复效果。
    Abstract JPEG compression adopts the quantization of Discrete Cosine Transform (DCT) coefficients for effective bit-rate reduction, whilst the quantization could lead to a significant loss of important image details. Recovering compressed JPEG images in the frequency domain has attracted more and more attention recently, in addition to numerous restoration approaches developed in the pixel domain. However, the current DCT domain methods typically suffer from limited effectiveness in handling a wide range of compression quality factors, or fall short in recovering sparse quantized coefficients and the components across different colorspace. To address these challenges, we propose a DCT domain spatial-frequential Transformer, named as DCTransformer. Specifically, a dual-branch architecture is designed to capture both spatial and frequential correlations within the collocated DCT coefficients. Moreover, we incorporate the operation of quantization matrix embedding, which effectively allows our single model to handle a wide range of quality factors, and a luminance-chrominance alignment head that produces a unified feature map to align different-sized luminance and chrominance components. Our proposed DCTransformer outperforms the current state-of-the-art JPEG artifact removal techniques, as demonstrated by our extensive experiments.
    摘要 JPEG压缩通过对离散余弦变换(DCT)系数进行量化来实现有效的码率压缩,但量化会导致重要图像细节的显著损失。除了大量在像素域开展的恢复方法之外,在频域恢复压缩JPEG图像近来受到越来越多的关注。然而,现有的DCT域方法通常难以有效处理大范围的压缩质量因子,或者难以恢复稀疏的量化系数以及不同色彩空间中的分量。为了解决这些挑战,我们提出了一种DCT域的空间-频率Transformer,称为DCTransformer。具体来说,我们设计了双分支结构,以捕捉同位置DCT系数中的空间与频率相关性。此外,我们引入了量化矩阵嵌入操作,使单一模型能够有效处理大范围的质量因子,并设计了亮度-色度对齐头,生成统一的特征图以对齐尺寸不同的亮度和色度分量。大量实验表明,所提出的DCTransformer优于当前最先进的JPEG伪影去除技术。
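
For orientation, the sketch below shows the dequantization step that any DCT-domain recovery model starts from, and a minimal stand-in for "quantization matrix embedding" (flattening the normalized 8x8 table so it can condition a network on compression quality). The normalization constant and the way the table is fed to the network are assumptions, not the paper's design.

```python
import numpy as np

def dequantize_block(quantized_block, q_matrix):
    """Coarse reconstruction of an 8x8 DCT block: quantized index times quantization step.
    This is the natural input that a recovery network would refine."""
    return quantized_block.astype(np.float32) * q_matrix.astype(np.float32)

def quantization_embedding(q_matrix, scale=100.0):
    """Normalize and flatten the 8x8 quantization table into a 64-d conditioning vector."""
    return (q_matrix.astype(np.float32) / scale).reshape(-1)

# Standard JPEG luminance quantization table (quality 50) as a concrete example.
q50 = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
                [12, 12, 14, 19, 26, 58, 60, 55],
                [14, 13, 16, 24, 40, 57, 69, 56],
                [14, 17, 22, 29, 51, 87, 80, 62],
                [18, 22, 37, 56, 68, 109, 103, 77],
                [24, 35, 55, 64, 81, 104, 113, 92],
                [49, 64, 78, 87, 103, 121, 120, 101],
                [72, 92, 95, 98, 112, 100, 103, 99]])
coeffs = np.zeros((8, 8), dtype=np.int32)   # hypothetical quantized block
print(dequantize_block(coeffs, q50).shape, quantization_embedding(q50).shape)
```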

Hyperbolic Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2308.09107
  • repo_url: None
  • paper_authors: Shuangpeng Han, Rizhao Cai, Yawen Cui, Zitong Yu, Yongjian Hu, Alex Kot
  • for: 本研究旨在提高人脸识别系统的安全性,通过学习可泛化的人脸活体检测(FAS)模型来检测训练数据中未出现过的攻击类型。
  • methods: 本研究提出了一种在双曲空间中学习FAS的方法:将特征嵌入投影到Poincaré球中,并级联双曲二元逻辑回归层进行分类;为进一步提升泛化能力,仅对真实人脸(bonafide)进行双曲对比学习,同时放宽对各种欺骗攻击的约束;此外,还提出了一种新的特征裁剪方法以缓解双曲空间中的梯度消失问题,并设计了多模态FAS框架。
  • results: 实验表明,所提方法在WMCA、PADISI-Face和SiW-M三个基准数据集上的未见攻击检测中,相比欧氏空间基线带来了显著改善;此外,该方法在MSU-MFSD、IDIAP REPLAY-ATTACK、CASIA-FASD和OULU-NPU四个基准数据集上也表现出良好的泛化能力。
    Abstract Learning generalized face anti-spoofing (FAS) models against presentation attacks is essential for the security of face recognition systems. Previous FAS methods usually encourage models to extract discriminative features, of which the distances within the same class (bonafide or attack) are pushed close while those between bonafide and attack are pulled away. However, these methods are designed based on Euclidean distance, which lacks generalization ability for unseen attack detection due to poor hierarchy embedding ability. According to the evidence that different spoofing attacks are intrinsically hierarchical, we propose to learn richer hierarchical and discriminative spoofing cues in hyperbolic space. Specifically, for unimodal FAS learning, the feature embeddings are projected into the Poincar\'e ball, and then the hyperbolic binary logistic regression layer is cascaded for classification. To further improve generalization, we conduct hyperbolic contrastive learning for the bonafide only while relaxing the constraints on diverse spoofing attacks. To alleviate the vanishing gradient problem in hyperbolic space, a new feature clipping method is proposed to enhance the training stability of hyperbolic models. Besides, we further design a multimodal FAS framework with Euclidean multimodal feature decomposition and hyperbolic multimodal feature fusion & classification. Extensive experiments on three benchmark datasets (i.e., WMCA, PADISI-Face, and SiW-M) with diverse attack types demonstrate that the proposed method can bring significant improvement compared to the Euclidean baselines on unseen attack detection. In addition, the proposed framework is also generalized well on four benchmark datasets (i.e., MSU-MFSD, IDIAP REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.
    摘要 学习能够抵御呈现攻击的可泛化人脸活体检测(FAS)模型,对于人脸识别系统的安全至关重要。以往的FAS方法通常鼓励模型提取判别性特征,使同类样本(真实或攻击)之间的距离被拉近,而真实样本与攻击样本之间的距离被推远。然而,这些方法基于欧氏距离设计,由于层次结构嵌入能力不足,对未见过的攻击缺乏泛化能力。鉴于不同的欺骗攻击在本质上具有层次结构,我们提出在双曲空间中学习更丰富的层次化、判别性欺骗线索。具体而言,对于单模态FAS学习,我们将特征嵌入投影到Poincaré球中,并级联双曲二元逻辑回归层进行分类。为进一步提升泛化能力,我们仅对真实人脸进行双曲对比学习,同时放宽对各种欺骗攻击的约束。为缓解双曲空间中的梯度消失问题,我们提出了一种新的特征裁剪方法,以增强双曲模型训练的稳定性。此外,我们还设计了多模态FAS框架,包含欧氏空间的多模态特征分解以及双曲空间的多模态特征融合与分类。在具有多种攻击类型的三个基准数据集(WMCA、PADISI-Face和SiW-M)上的大量实验表明,所提方法在未见攻击检测上相比欧氏基线带来了显著改善。此外,所提框架在攻击类型数量有限的四个基准数据集(MSU-MFSD、IDIAP REPLAY-ATTACK、CASIA-FASD和OULU-NPU)上也具有良好的泛化性。
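
The building blocks described above (projecting features into the Poincaré ball, measuring hyperbolic distance, and clipping features to avoid vanishing gradients near the boundary) can be sketched as follows. The simple norm-bound clipping rule and the threshold are assumptions; the exponential map and distance follow the standard Poincaré-ball formulas, not necessarily the paper's exact implementation.

```python
import torch

def clip_features(x, max_norm=1.0):
    """Feature clipping: bound the Euclidean norm before mapping to the Poincaré ball,
    keeping points away from the boundary where gradients vanish (threshold is assumed)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return x * torch.clamp(max_norm / norm, max=1.0)

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincaré ball with curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-6):
    """Geodesic distance in the Poincaré ball, usable as the score for classification
    or contrastive objectives."""
    diff = (x - y).pow(2).sum(dim=-1)
    denom = (1 - c * x.pow(2).sum(dim=-1)) * (1 - c * y.pow(2).sum(dim=-1))
    arg = 1 + 2 * c * diff / denom.clamp_min(eps)
    return torch.acosh(arg.clamp_min(1 + eps)) / (c ** 0.5)

# Toy usage with hypothetical 128-d backbone features.
feats = torch.randn(4, 128)
ball_pts = expmap0(clip_features(feats, 2.0))
print(poincare_distance(ball_pts[:1], ball_pts).shape)
```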

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

  • paper_url: http://arxiv.org/abs/2308.09105
  • repo_url: None
  • paper_authors: Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yi-Xiong Wang, Liang-Yan Gui
  • for: 在资源受限的视觉系统中提升目标检测与实例分割的精度,利用知识蒸馏技术提升轻量级视觉模型的性能。
  • methods: 提出了一种简单而有效的渐进式蒸馏方法,通过构造一系列教师模型,将多个教师的知识逐步传递给一个轻量级学生模型。
  • results: 成功地将基于Transformer的教师检测器的知识蒸馏到基于卷积的学生检测器上,在MS COCO数据集上将基于ResNet-50的RetinaNet的AP从36.5%提升到42.0%,并将Mask R-CNN的AP从38.2%提升到42.5%。
    Abstract Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated task, due to the variability in outputs and complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
    摘要 边缘计算和机器人视觉等资源受限的感知系统要求视觉模型在计算和内存开销上足够轻量的同时保持准确。知识蒸馏已被证明是提升轻量级分类模型性能的有效策略,但由于输出形式多变且蒸馏过程涉及复杂的网络内部模块,将其应用于目标检测和实例分割等结构化输出任务仍然十分困难。在本文中,我们提出了一种简单却出奇有效的顺序式知识蒸馏方法,将一组教师检测器的知识渐进地传递给给定的轻量级学生。为了从高精度但结构复杂的教师模型中蒸馏知识,我们构造了一个教师序列,帮助学生逐步适应。我们的渐进策略可以方便地与现有的检测蒸馏机制相结合,在各种设置下持续最大化学生的性能。据我们所知,我们是首个成功将基于Transformer的教师检测器的知识蒸馏到基于卷积的学生模型的工作,并前所未有地将基于ResNet-50的RetinaNet在MS COCO基准上的AP从36.5%提升到42.0%,将Mask R-CNN的AP从38.2%提升到42.5%。
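
The sequential, multi-teacher schedule can be pictured as a simple outer loop over an ordered list of teachers. For clarity the sketch distills classification logits only; real detection distillation (boxes, feature maps, and the paper's specific losses and teacher ordering) is omitted, so treat the loop structure, not the loss, as the point.

```python
import torch
import torch.nn.functional as F

def progressive_distillation(student, teachers, data_loader, optimizer,
                             epochs_per_teacher=1, temperature=2.0):
    """Sequentially distill a list of teachers into one student, one teacher at a time."""
    for teacher in teachers:            # e.g. ordered from closest-to-student to strongest
        teacher.eval()
        for _ in range(epochs_per_teacher):
            for images, _ in data_loader:
                with torch.no_grad():
                    t_logits = teacher(images)
                s_logits = student(images)
                loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                                F.softmax(t_logits / temperature, dim=-1),
                                reduction="batchmean") * temperature ** 2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student
```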

ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.09098
  • repo_url: None
  • paper_authors: Tao Tu, Shun-Po Chuang, Yu-Lun Liu, Cheng Sun, Ke Zhang, Donna Roy, Cheng-Hao Kuo, Min Sun
  • for: 提升基于多视图图像的3D目标检测的精度和效率
  • methods: 使用由图像诱导的几何感知体素表示,从多视图图像中学习几何信息,推理阶段仅需多视图图像即可完成检测
  • results: 在ARKitScenes、ScanNetV2和ScanNet200三个室内数据集上取得了优于当前最先进的多视图图像方法ImVoxelNet的检测精度,并表现出出色的数据效率
    Abstract We propose ImGeoNet, a multi-view image-based 3D object detection framework that models a 3D space by an image-induced geometry-aware voxel representation. Unlike previous methods which aggregate 2D features into 3D voxels without considering geometry, ImGeoNet learns to induce geometry from multi-view images to alleviate the confusion arising from voxels of free space, and during the inference phase, only images from multiple views are required. Besides, a powerful pre-trained 2D feature extractor can be leveraged by our representation, leading to a more robust performance. To evaluate the effectiveness of ImGeoNet, we conduct quantitative and qualitative experiments on three indoor datasets, namely ARKitScenes, ScanNetV2, and ScanNet200. The results demonstrate that ImGeoNet outperforms the current state-of-the-art multi-view image-based method, ImVoxelNet, on all three datasets in terms of detection accuracy. In addition, ImGeoNet shows great data efficiency by achieving results comparable to ImVoxelNet with 100 views while utilizing only 40 views. Furthermore, our studies indicate that our proposed image-induced geometry-aware representation can enable image-based methods to attain superior detection accuracy than the seminal point cloud-based method, VoteNet, in two practical scenarios: (1) scenarios where point clouds are sparse and noisy, such as in ARKitScenes, and (2) scenarios involve diverse object classes, particularly classes of small objects, as in the case in ScanNet200.
    摘要 我们提出ImGeoNet,一种基于多视图图像的3D目标检测框架,它通过由图像诱导的几何感知体素表示来建模3D空间。与以往不考虑几何、直接将2D特征聚合到3D体素的方法不同,ImGeoNet学习从多视图图像中诱导几何信息,以缓解自由空间体素带来的混淆,并且在推理阶段仅需多视图图像。此外,我们的表示可以利用强大的预训练2D特征提取器,从而获得更鲁棒的性能。为评估ImGeoNet的有效性,我们在ARKitScenes、ScanNetV2和ScanNet200三个室内数据集上进行了定量与定性实验。结果表明,ImGeoNet在三个数据集上的检测精度均优于当前最先进的多视图图像方法ImVoxelNet。同时,ImGeoNet展现出极高的数据效率:仅使用40个视角即可达到ImVoxelNet使用100个视角的结果。我们的研究还表明,所提出的图像诱导几何感知表示使基于图像的方法能够在两类实际场景中取得优于经典点云方法VoteNet的检测精度:(1)点云稀疏且含噪的场景,如ARKitScenes;(2)包含多样目标类别、尤其是小目标类别的场景,如ScanNet200。
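
The core geometric step behind lifting multi-view 2D features into a voxel grid is projecting each voxel center into every camera view and sampling features at the resulting pixels. The pinhole-projection sketch below covers only that step; the camera convention (x_cam = R @ x_world + t) and the toy intrinsics are assumptions.

```python
import numpy as np

def project_voxels_to_image(voxel_centers, K, R, t, image_hw):
    """Project 3D voxel centers (N, 3) in world coordinates into one camera view,
    returning pixel coordinates (N, 2) and a validity mask."""
    h, w = image_hw
    cam = voxel_centers @ R.T + t          # points in camera coordinates
    in_front = cam[:, 2] > 1e-3
    uvw = cam @ K.T                        # apply intrinsics
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-3, None)
    valid = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid

# Toy usage: a 4x4x4 voxel grid placed in front of an identity-pose camera.
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 4)] * 3, indexing="ij"), -1).reshape(-1, 3)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
uv, valid = project_voxels_to_image(grid + np.array([0, 0, 3.0]), K, np.eye(3), np.zeros(3), (480, 640))
print(uv.shape, int(valid.sum()))
```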

Edit Temporal-Consistent Videos with Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09091
  • repo_url: None
  • paper_authors: Yuanzhi Wang, Yong Li, Xin Liu, Anbo Dai, Antoni Chan, Zhen Cui
  • for: 本文旨在提出一种能够减少视频时间不一致问题的robust文本导向视频编辑方法,以提高视频编辑效果。
  • methods: 本文提出了一种名为Temporal-Consistent Video Editing(TCVE)方法,其中使用了预训练的2D Unet进行空间内容修改,同时设计了专门用于捕捉视频序列的时间特征的Temporal Unet架构。此外,通过建立空间和时间两个方面的协同关系,提高了视频编辑效果和时间一致性。
  • results: 实验结果表明,TCVE方法在视频时间一致性和视频编辑能力两个方面均达到了领域内最佳性能,超过了现有的标准准则。
    Abstract Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method, to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive joint spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated video output while simultaneously preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.
    摘要 大规模文生图(T2I)扩散模型已被扩展用于文本引导的视频编辑,并取得了令人印象深刻的零样本视频编辑性能。然而,由于视频的时间特性未被忠实建模,生成的视频往往存在空间上的不规则与时间上的不一致。本文提出了一种简洁而有效的时间一致视频编辑(TCVE)方法,以缓解时间不一致问题,实现鲁棒的文本引导视频编辑。除了利用预训练的2D Unet进行空间内容操纵之外,我们还建立了专门的时间Unet结构,以忠实捕捉输入视频序列的时间连贯性。此外,为建立空间分支与时间分支之间的连贯与关联,我们设计了一个联合的空间-时间建模单元,将时间Unet与预训练的2D Unet有效连接起来,在保持视频内容编辑能力的同时增强生成视频的时间一致性。定量实验结果与可视化结果表明,TCVE在视频时间一致性和视频编辑能力两方面均达到了最先进的性能,超越了该领域的现有基准。

Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

  • paper_url: http://arxiv.org/abs/2308.09089
  • repo_url: None
  • paper_authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto
  • for: 这篇论文旨在为视频中的特定时刻检索高质量的声效(SFX)。
  • methods: 论文利用大型语言模型和基础视觉-语言模型来连接高质量音频与视频,构建了可扩展的自动音视频数据整理管道,并使用预训练的音频和视觉编码器训练基于对比学习的检索系统。
  • results: 该系统在以视频检索高质量音频的任务上显著超越基线,并能很好地从干净数据泛化到真实场景数据;在用户研究中,无论是高质量数据还是真实场景数据,67%的情况下用户更偏好本系统检索到的声效。
    Abstract Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio-visual training data, previous work on audio-visual retrieval relies on YouTube (in-the-wild) videos of varied quality for training, where the audio is often noisy and the video of amateur quality. As such it is unclear whether these systems would generalize to the task of matching HQ audio to production-quality video. To address this, we propose a multimodal framework for recommending HQ SFX given a video frame by (1) leveraging large language models and foundational vision-language models to bridge HQ audio and video to create audio-visual pairs, resulting in a highly scalable automatic audio-visual data curation pipeline; and (2) using pre-trained audio and visual encoders to train a contrastive learning-based retrieval system. We show that our system, trained using our automatic data curation pipeline, significantly outperforms baselines trained on in-the-wild data on the task of HQ SFX retrieval for video. Furthermore, while the baselines fail to generalize to this task, our system generalizes well from clean to in-the-wild data, outperforming the baselines on a dataset of YouTube videos despite only being trained on the HQ audio-visual pairs. A user study confirms that people prefer SFX retrieved by our system over the baseline 67% of the time both for HQ and in-the-wild data. Finally, we present ablations to determine the impact of model and data pipeline design choices on downstream retrieval performance. Please visit our project website to listen to and view our SFX retrieval results.
    摘要 为视频中的时刻找到合适的声效(SFX)是一项困难且耗时的任务,并且严重依赖文本元数据的质量与完整性。直接以视频帧作为查询来检索高质量(HQ)声效是一种有吸引力的替代方案,既不再依赖文本元数据,也为非专业用户降低了使用门槛。由于缺乏高质量的音视频训练数据,以往的音视频检索工作通常使用质量参差不齐的YouTube(真实场景)视频进行训练,其音频往往带有噪声,视频也多为业余质量,因此尚不清楚这类系统能否泛化到将高质量音频与制作级视频相匹配的任务。为此,我们提出了一个多模态框架,用于根据视频帧推荐高质量声效:(1)利用大型语言模型和基础视觉-语言模型,通过语言将高质量音频与视频桥接起来构建音视频配对,形成高度可扩展的自动音视频数据整理管道;(2)使用预训练的音频和视觉编码器训练基于对比学习的检索系统。实验表明,使用我们的自动数据整理管道训练的系统,在面向视频的高质量声效检索任务上显著优于在真实场景数据上训练的基线。此外,基线无法泛化到该任务,而我们的系统尽管只在高质量音视频配对上训练,却能很好地从干净数据泛化到真实场景数据,在YouTube视频数据集上同样优于基线。用户研究证实,无论是高质量数据还是真实场景数据,67%的情况下人们更偏好我们系统检索到的声效。最后,我们通过消融实验分析了模型与数据管道设计选择对下游检索性能的影响。欢迎访问我们的项目网站试听并查看声效检索结果。
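
A contrastive retrieval system of this kind is typically trained with a symmetric InfoNCE objective over batches of paired frame/audio embeddings. The sketch below shows that standard objective; the temperature value is a common default, not necessarily the paper's setting.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video-frame and audio embeddings."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)      # match each frame to its SFX
    loss_a2v = F.cross_entropy(logits.t(), targets)  # and each SFX to its frame
    return 0.5 * (loss_v2a + loss_a2v)

# Toy usage with hypothetical 512-d embeddings from frozen encoders.
loss = audio_visual_infonce(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```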

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices

  • paper_url: http://arxiv.org/abs/2308.09084
  • repo_url: None
  • paper_authors: Dongyang Yu, Haoyue Zhang, Zhirui Zhou, Wangpeng An, Yanhong Yang
  • for: 这个研究旨在提供高精度和实时性的人体姿势估测算法,特别适用于CPU型手持式移动设备。
  • methods: MovePose使用优化的轻量级卷积神经网络,包括三种技术:deconvolution、大kernel卷积和坐标分类方法,以提高精度和速度。
  • results: MovePose在COCO验证数据集上获得了67.7的Mean Average Precision(mAP)分数,并在Intel i9-10920x CPU和NVIDIA RTX3090 GPU上显示出了高效性,其中在Android手机上的帧率超过11帧/秒。
    Abstract We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. The network produces 17 keypoints for each individual at a rate exceeding 11 frames per second, making it suitable for real-time applications such as fitness tracking, sign language interpretation, and advanced mobile human posture estimation. Our MovePose algorithm has attained an Mean Average Precision (mAP) score of 67.7 on the COCO \cite{cocodata} validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, marking it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.
    摘要 我们提出了MovePose,一种优化的轻量级卷积神经网络,专为基于CPU的移动设备上的实时人体姿态估计而设计。现有方案无法同时提供令人满意的精度和速度,MovePose弥补了这一空缺,其目标是在保持实时性的同时提升移动设备上人体姿态估计的精度。该网络以超过11帧/秒的速率为每个人输出17个关键点,适用于健身跟踪、手语理解以及高级移动端人体姿态估计等实时应用。MovePose算法在COCO验证集上取得了67.7的平均精度(mAP)。在Intel i9-10920x CPU上,其运行速度超过69帧/秒;在NVIDIA RTX3090 GPU上更是达到452帧/秒以上;在搭载Snapdragon 8 + 4G处理器的Android手机上,帧率也超过11。为提升精度,我们采用了三种技术:反卷积、大核卷积和坐标分类方法。与基础的上采样相比,反卷积是可训练的,能够提升模型容量并扩大感受野;大核卷积则以更低的计算成本进一步强化这些特性。总之,MovePose兼具高精度与实时性,有望应用于包括移动端人体姿态估计在内的多种场景。算法的代码和模型将公开发布。
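
The "coordinate classification" technique mentioned above treats each keypoint coordinate as a classification over discretized positions along each image axis. The decoding sketch below illustrates that general idea with a soft-argmax readout; the bin counts and input resolution are assumptions, not MovePose's exact configuration.

```python
import torch

def decode_coordinate_classification(x_logits, y_logits, image_w, image_h):
    """Decode keypoint locations from per-axis classification logits of shape (B, K, bins)."""
    num_x_bins = x_logits.shape[-1]
    num_y_bins = y_logits.shape[-1]
    x_prob = torch.softmax(x_logits, dim=-1)
    y_prob = torch.softmax(y_logits, dim=-1)
    x_bins = torch.arange(num_x_bins, dtype=torch.float32)
    y_bins = torch.arange(num_y_bins, dtype=torch.float32)
    # Expected bin index, rescaled to pixel coordinates (soft-argmax over each axis).
    x = (x_prob * x_bins).sum(-1) * (image_w / num_x_bins)
    y = (y_prob * y_bins).sum(-1) * (image_h / num_y_bins)
    return torch.stack([x, y], dim=-1)                # (B, K, 2)

# Toy usage: one person, 17 keypoints, 2x upsampled bins for an assumed 192x256 input.
coords = decode_coordinate_classification(torch.randn(1, 17, 384), torch.randn(1, 17, 512), 192, 256)
print(coords.shape)
```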

Pedestrian Environment Model for Automated Driving

  • paper_url: http://arxiv.org/abs/2308.09080
  • repo_url: None
  • paper_authors: Adrian Holzbock, Alexander Tsaregorodtsev, Vasileios Belagiannis
  • for: 这个论文的目的是提供一种能够在自动驾驶车辆与游客之间安全交互的环境模型。
  • methods: 该论文使用了一种基于单目相机图像和车辆定位数据的人体pose估计器,以及一种简单的跟踪算法和egosynchronous抵消。
  • results: 该论文在CARLA simulate器和nuScenes数据集上测试了其人体环境模型,并达到了约16%的相对位置误差。
    Abstract Besides interacting correctly with other vehicles, automated vehicles should also be able to react in a safe manner to vulnerable road users like pedestrians or cyclists. For a safe interaction between pedestrians and automated vehicles, the vehicle must be able to interpret the pedestrian's behavior. Common environment models do not contain information like body poses used to understand the pedestrian's intent. In this work, we propose an environment model that includes the position of the pedestrians as well as their pose information. We only use images from a monocular camera and the vehicle's localization data as input to our pedestrian environment model. We extract the skeletal information with a neural network human pose estimator from the image. Furthermore, we track the skeletons with a simple tracking algorithm based on the Hungarian algorithm and an ego-motion compensation. To obtain the 3D information of the position, we aggregate the data from consecutive frames in conjunction with the vehicle position. We demonstrate our pedestrian environment model on data generated with the CARLA simulator and the nuScenes dataset. Overall, we reach a relative position error of around 16% on both datasets.
    摘要 除了与其他车辆正确交互之外,自动驾驶车辆还应能够对行人、骑行者等弱势道路使用者做出安全反应。为了实现行人与自动驾驶车辆之间的安全交互,车辆必须能够解读行人的行为。常见的环境模型并不包含用于理解行人意图的身体姿态等信息。在本工作中,我们提出了一种同时包含行人位置及其姿态信息的环境模型。我们的行人环境模型仅以单目相机图像和车辆定位数据作为输入:利用神经网络人体姿态估计器从图像中提取骨架信息,并使用基于匈牙利算法和自车运动补偿的简单跟踪算法对骨架进行跟踪;通过结合车辆位置聚合连续帧的数据,获得行人位置的3D信息。我们在CARLA仿真器生成的数据和nuScenes数据集上验证了该行人环境模型,在两个数据集上的相对位置误差均约为16%。
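
The Hungarian-algorithm association step described above can be sketched as a linear assignment between tracked skeletons and new detections. The mean-keypoint-distance cost and the gating threshold are assumptions for illustration; ego-motion compensation of the tracks is assumed to have been applied beforehand.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_skeletons(tracks, detections, max_cost=100.0):
    """Associate tracked skeletons with new detections via the Hungarian algorithm.
    Both inputs are arrays of shape (N, K, 2) of keypoints in pixels."""
    if len(tracks) == 0 or len(detections) == 0:
        return []
    # Cost = mean Euclidean distance between corresponding keypoints of every pair.
    diff = tracks[:, None, :, :] - detections[None, :, :, :]       # (T, D, K, 2)
    cost = np.linalg.norm(diff, axis=-1).mean(axis=-1)             # (T, D)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# Toy usage: two tracked skeletons matched against slightly perturbed detections.
prev = np.random.rand(2, 17, 2) * 100
print(match_skeletons(prev, prev + np.random.randn(2, 17, 2)))
```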