cs.CV - 2023-08-22

SwinFace: A Multi-task Transformer for Face Recognition, Expression Recognition, Age Estimation and Attribute Estimation

  • paper_url: http://arxiv.org/abs/2308.11509
  • repo_url: https://github.com/lxq1000/swinface
  • paper_authors: Lixiong Qin, Mei Wang, Chao Deng, Ke Wang, Xi Chen, Jiani Hu, Weihong Deng
  • for: This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer.
  • methods: The algorithm uses a single shared backbone together with a subnet for each set of related tasks, and integrates a Multi-Level Channel Attention (MLCA) module into each task-specific analysis subnet to resolve conflicts among tasks and meet their differing demands.
  • results: The proposed model achieves excellent performance on all tasks, notably 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results for facial expression recognition and age estimation, respectively.
    Abstract In recent years, vision transformers have been introduced into face recognition and analysis and have achieved performance breakthroughs. However, most previous methods generally train a single model or an ensemble of models to perform the desired task, which ignores the synergy among different tasks and fails to achieve improved prediction accuracy, increased data efficiency, and reduced training time. This paper presents a multi-purpose algorithm for simultaneous face recognition, facial expression recognition, age estimation, and face attribute estimation (40 attributes including gender) based on a single Swin Transformer. Our design, the SwinFace, consists of a single shared backbone together with a subnet for each set of related tasks. To address the conflicts among multiple tasks and meet the different demands of tasks, a Multi-Level Channel Attention (MLCA) module is integrated into each task-specific analysis subnet, which can adaptively select the features from optimal levels and channels to perform the desired tasks. Extensive experiments show that the proposed model has a better understanding of the face and achieves excellent performance for all tasks. Especially, it achieves 90.97% accuracy on RAF-DB and 0.22 $\epsilon$-error on CLAP2015, which are state-of-the-art results on facial expression recognition and age estimation respectively. The code and models will be made publicly available at https://github.com/lxq1000/SwinFace.
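    Illustration — the abstract describes a Multi-Level Channel Attention (MLCA) module that adaptively selects features from different levels and channels. The Python sketch below shows one plausible form of such a block (squeeze-and-excitation-style gating over features pooled from several backbone stages); the class name, reduction ratio, and gating design are assumptions for illustration, not the paper's exact implementation.

    ```python
    import torch
    import torch.nn as nn

    class MultiLevelChannelAttention(nn.Module):
        """Toy multi-level channel attention: gates the channels of features
        collected from several backbone levels, then returns the reweighted maps."""
        def __init__(self, channels_per_level, reduction=4):
            super().__init__()
            total = sum(channels_per_level)
            self.gate = nn.Sequential(
                nn.Linear(total, total // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(total // reduction, total),
                nn.Sigmoid(),
            )

        def forward(self, feats):
            # feats: list of (B, C_i, H_i, W_i) tensors from different backbone levels
            pooled = [f.mean(dim=(2, 3)) for f in feats]      # global average pooling
            desc = torch.cat(pooled, dim=1)                   # (B, sum C_i) descriptor
            weights = self.gate(desc)                         # per-channel attention weights
            out, start = [], 0
            for f in feats:
                c = f.shape[1]
                w = weights[:, start:start + c].view(-1, c, 1, 1)
                out.append(f * w)                             # reweight each level
                start += c
            return out

    # usage with dummy multi-scale features from two backbone stages
    feats = [torch.randn(2, 96, 28, 28), torch.randn(2, 192, 14, 14)]
    reweighted = MultiLevelChannelAttention([96, 192])(feats)
    ```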

LCCo: Lending CLIP to Co-Segmentation

  • paper_url: http://arxiv.org/abs/2308.11506
  • repo_url: None
  • paper_authors: Xin Duan, Yan Yang, Liyuan Pan, Xiabi Liu
  • for: This paper studies co-segmenting the common semantic object in a set of images by leveraging the contrastive language-image pre-training framework (CLIP).
  • methods: The method augments a backbone segmentation network with three CLIP-driven modules: an image set feature correspondence module, a CLIP interaction module, and a CLIP regularization module, which together use CLIP to refine the backbone features in a coarse-to-fine manner and improve segmentation accuracy.
  • results: Experiments on four standard co-segmentation benchmark datasets show that the method outperforms state-of-the-art methods.
    Abstract This paper studies co-segmenting the common semantic object in a set of images. Existing works either rely on carefully engineered networks to mine the implicit semantic information in visual features or require extra data (i.e., classification labels) for training. In this paper, we leverage the contrastive language-image pre-training framework (CLIP) for the task. With a backbone segmentation network that independently processes each image from the set, we introduce semantics from CLIP into the backbone features, refining them in a coarse-to-fine manner with three key modules: i) an image set feature correspondence module, encoding global consistent semantic information of the image set; ii) a CLIP interaction module, using CLIP-mined common semantics of the image set to refine the backbone feature; iii) a CLIP regularization module, drawing CLIP towards this co-segmentation task, identifying the best CLIP semantic and using it to regularize the backbone feature. Experiments on four standard co-segmentation benchmark datasets show that the performance of our method outperforms state-of-the-art methods.

Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition

  • paper_url: http://arxiv.org/abs/2308.11489
  • repo_url: https://github.com/wqtwjt1996/sum-l
  • paper_authors: Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng
  • for: The paper aims to tackle the challenging problem of unpaired multiview video learning, where the model needs to learn comprehensive multiview representations while dealing with variations in cross-view semantic information.
  • methods: The proposed method, called Semantics-based Unpaired Multiview Learning (SUM-L), builds cross-view pseudo-pairs and performs view-invariant alignment by leveraging the semantic information of videos. Additionally, video-text alignment is performed for first-person and third-person videos to improve video representations.
  • results: The method is evaluated on multiple benchmark datasets and outperforms multiple existing view-alignment methods, demonstrating its effectiveness in improving video representations under a more challenging scenario than typical paired or unpaired multimodal or multiview learning.
    Abstract We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantic information of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, to fully leverage the semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods, under the more challenging scenario than typical paired or unpaired multimodal or multiview learning. Our code is available at https://github.com/wqtwjt1996/SUM-L.

Opening the Vocabulary of Egocentric Actions

  • paper_url: http://arxiv.org/abs/2308.11488
  • repo_url: None
  • paper_authors: Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao
  • for: This paper proposes an open vocabulary action recognition task: given the verbs and objects observed during training, generalize the verbs to an open vocabulary of actions with both seen and novel interacting objects.
  • methods: The method decouples verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder, where the prompting leverages CLIP representations to predict an open vocabulary of interacting objects.
  • results: On open vocabulary benchmarks created from the EPIC-KITCHENS-100 and Assembly101 datasets, closed-action methods fail to generalize whereas the proposed method is effective; moreover, the object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.
    Abstract Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations - sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.
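    Illustration — the prompt-based object encoder leverages CLIP representations to score an open vocabulary of interacting objects. The sketch below shows plain zero-shot CLIP scoring with a hand-written prompt template, using the OpenAI clip package; the template, object list, and the file name frame.jpg are placeholders, and the paper's trained encoder goes well beyond this baseline.

    ```python
    import torch
    import clip  # https://github.com/openai/CLIP
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Hypothetical open vocabulary of interacting objects (seen and novel).
    objects = ["knife", "cutting board", "screwdriver", "kettle", "toy wheel"]
    prompts = clip.tokenize(
        [f"a photo of a hand interacting with a {o}" for o in objects]).to(device)

    image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(prompts)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity over the open vocabulary; softmax gives object scores.
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    print(dict(zip(objects, probs[0].tolist())))
    ```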

Free Lunch for Gait Recognition: A Novel Relation Descriptor

  • paper_url: http://arxiv.org/abs/2308.11487
  • repo_url: None
  • paper_authors: Jilong Wang, Saihui Hou, Yan Huang, Chunshui Cao, Xu Liu, Yongzhen Huang, Liang Wang
  • for: This paper focuses on improving gait recognition performance by reconsidering gait representation and emphasizing inter-personal relationships among different subjects' gait features.
  • methods: The proposed method, called Relationship Descriptor (RD), uses reference-anchored gaits to describe each person's gait and emphasizes meaningful features by normalizing the dot product between gait features and classifier weights. To address the dimensionality challenges, the method proposes a Farthest Anchored gaits Selection algorithm and a dimension reduction method.
  • results: The proposed method achieves higher recognition performance than directly using extracted features and consistently outperforms the baselines on four popular gait recognition datasets (GREW, Gait3D, CASIA-B, and OU-MVLP), achieving state-of-the-art performance.
    Abstract Gait recognition is to seek correct matches for query individuals by their unique walking patterns at a long distance. However, current methods focus solely on individual gait features, disregarding inter-personal relationships. In this paper, we reconsider gait representation, asserting that gait is not just an aggregation of individual features, but also the relationships among different subjects' gait features once reference gaits are established. From this perspective, we redefine classifier weights as reference-anchored gaits, allowing each person's gait to be described by their relationship with these references. In our work, we call this novel descriptor Relationship Descriptor (RD). This Relationship Descriptor offers two benefits: emphasizing meaningful features and enhancing robustness. To be specific, The normalized dot product between gait features and classifier weights signifies a similarity relation, where each dimension indicates the similarity between the test sample and each training ID's gait prototype, respectively. Despite its potential, the direct use of relationship descriptors poses dimensionality challenges since the dimension of RD depends on the training set's identity count. To address this, we propose a Farthest Anchored gaits Selection algorithm and a dimension reduction method to boost gait recognition performance. Our method can be built on top of off-the-shelf pre-trained classification-based models without extra parameters. We show that RD achieves higher recognition performance than directly using extracted features. We evaluate the effectiveness of our method on the popular GREW, Gait3D, CASIA-B, and OU-MVLP, showing that our method consistently outperforms the baselines and achieves state-of-the-art performances.
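    Illustration — the Relationship Descriptor is the normalized dot product between a gait feature and the classifier weights, read as similarities to each training identity's gait prototype. A minimal sketch (cosine normalization assumed; the Farthest Anchored gaits Selection and dimension reduction steps are omitted):

    ```python
    import torch
    import torch.nn.functional as F

    def relationship_descriptor(features, classifier_weights):
        """features: (B, D) gait embeddings from a pre-trained classification model.
        classifier_weights: (num_train_ids, D) rows act as reference-anchored gaits.
        Returns an RD of shape (B, num_train_ids): similarity of each probe sample
        to every training identity's gait prototype."""
        f = F.normalize(features, dim=1)
        w = F.normalize(classifier_weights, dim=1)
        return f @ w.t()

    # toy usage: 4 probe samples, 256-d features, 100 training identities
    feats = torch.randn(4, 256)
    weights = torch.randn(100, 256)
    rd = relationship_descriptor(feats, weights)   # (4, 100), used in place of raw features
    ```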

Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

  • paper_url: http://arxiv.org/abs/2308.11485
  • repo_url: https://github.com/ABaldrati/CLIP4Cir
  • paper_authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo
  • for: The goal is composed image retrieval: given a reference image and a relative caption, retrieve images that are visually similar to the reference while integrating the modifications expressed by the caption.
  • methods: The approach builds on features from the OpenAI CLIP model and uses contrastive learning in both training stages: a first stage performs task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features, and a second stage trains a Combiner network that fuses the image and text features into combined features used for retrieval.
  • results: Experiments show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective, outperforming more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging composed image retrieval datasets.
    Abstract Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
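    Illustration — stage one fine-tunes the CLIP encoders with the element-wise sum of image and text features as the query, and stage two trains a Combiner network with a contrastive loss. The sketch below uses a simplified Combiner and an InfoNCE-style loss on dummy CLIP features; the actual architecture is defined in the released code at the repository above.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Combiner(nn.Module):
        """Simplified fusion of CLIP image and text features for composed retrieval.
        The real Combiner in CLIP4Cir is more elaborate; this is an illustrative stand-in."""
        def __init__(self, dim=512, hidden=1024):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

        def forward(self, img_feat, txt_feat):
            cat = torch.cat([img_feat, txt_feat], dim=-1)
            fused = self.mlp(cat)
            g = self.gate(cat)
            # Blend the learned fusion with the simple element-wise sum used in stage one.
            combined = g * fused + (1 - g) * (img_feat + txt_feat)
            return F.normalize(combined, dim=-1)

    def contrastive_loss(query, target, temperature=0.07):
        """InfoNCE over a batch: the i-th query should match the i-th target image feature."""
        logits = query @ target.t() / temperature
        labels = torch.arange(query.size(0))
        return F.cross_entropy(logits, labels)

    # toy batch of CLIP features (reference image + relative caption -> target image)
    ref_img = F.normalize(torch.randn(8, 512), dim=-1)
    caption = F.normalize(torch.randn(8, 512), dim=-1)
    target = F.normalize(torch.randn(8, 512), dim=-1)
    loss = contrastive_loss(Combiner()(ref_img, caption), target)
    ```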

Pose2Gait: Extracting Gait Features from Monocular Video of Individuals with Dementia

  • paper_url: http://arxiv.org/abs/2308.11484
  • repo_url: https://github.com/taatiteam/pose2gait_public
  • paper_authors: Caroline Malin-Mayor, Vida Adeli, Andrea Sabo, Sergey Noritsyn, Carolina Gorodetsky, Alfonso Fasano, Andrea Iaboni, Babak Taati
  • for: The study aims to monitor the gait of older adults with dementia from video so that negative changes in health can be detected early, allowing clinicians and caregivers to intervene before falls or hospitalizations.
  • methods: Computer vision-based pose tracking processes the video automatically and extracts joint locations, and a deep neural network maps the resulting 2D pose sequence to a set of 3D spatiotemporal gait features.
  • results: The model extracts velocity and step length values from video that correlate with features from a depth camera, with Spearman's correlation coefficients of .83 and .60 respectively, showing that 3D spatiotemporal gait features can be predicted from monocular video.
    Abstract Video-based ambient monitoring of gait for older adults with dementia has the potential to detect negative changes in health and allow clinicians and caregivers to intervene early to prevent falls or hospitalizations. Computer vision-based pose tracking models can process video data automatically and extract joint locations; however, publicly available models are not optimized for gait analysis on older adults or clinical populations. In this work we train a deep neural network to map from a two dimensional pose sequence, extracted from a video of an individual walking down a hallway toward a wall-mounted camera, to a set of three-dimensional spatiotemporal gait features averaged over the walking sequence. The data of individuals with dementia used in this work was captured at two sites using a wall-mounted system to collect the video and depth information used to train and evaluate our model. Our Pose2Gait model is able to extract velocity and step length values from the video that are correlated with the features from the depth camera, with Spearman's correlation coefficients of .83 and .60 respectively, showing that three dimensional spatiotemporal features can be predicted from monocular video. Future work remains to improve the accuracy of other features, such as step time and step width, and test the utility of the predicted values for detecting meaningful changes in gait during longitudinal ambient monitoring.
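    Illustration — the reported agreement is Spearman's rank correlation between video-derived gait features and the depth-camera reference. A minimal sketch with dummy values (the .83/.60 figures in the paper come from the real data, not from this toy example):

    ```python
    import numpy as np
    from scipy.stats import spearmanr

    # Dummy per-walk gait features: values predicted from monocular video vs.
    # the reference values measured by the wall-mounted depth camera.
    video_velocity = np.array([0.61, 0.72, 0.55, 0.80, 0.47, 0.66])
    depth_velocity = np.array([0.63, 0.70, 0.50, 0.85, 0.45, 0.69])

    rho, p_value = spearmanr(video_velocity, depth_velocity)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
    ```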

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11681
  • repo_url: None
  • paper_authors: Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang
  • for: This work proposes a new paradigm for weakly supervised video anomaly detection (WSVAD) that applies the CLIP model directly to the WSVAD task.
  • methods: VadCLIP uses the frozen CLIP model without any pre-training or fine-tuning and adopts a dual-branch design: one branch uses visual features for coarse-grained binary classification, while the other fully exploits the fine-grained language-image alignment.
  • results: Experiments on two commonly used benchmarks show that VadCLIP achieves the best performance for both coarse-grained and fine-grained WSVAD, surpassing previous methods by a large margin with 84.51% AP on XD-Violence and 88.02% AUC on UCF-Crime.
    Abstract The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features will be released to facilitate future VAD research.

Multitemporal analysis in Google Earth Engine for detecting urban changes using optical data and machine learning algorithms

  • paper_url: http://arxiv.org/abs/2308.11468
  • repo_url: None
  • paper_authors: Mariapia Rita Iandolo, Francesca Razzano, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: This work performs a multitemporal analysis in the Google Earth Engine (GEE) platform to detect changes in urban areas using optical data and specific machine learning (ML) algorithms.
  • methods: The study uses the GEE platform together with optical data and dedicated ML algorithms to carry out classification and change detection analysis of the region of interest (Cairo City, Egypt, from July 2013 to July 2021).
  • results: The results demonstrate the validity of the proposed method in identifying changed and unchanged urban areas over the selected period, and highlight the growing significance of GEE as an efficient cloud-based solution for managing large quantities of satellite data.
    Abstract The aim of this work is to perform a multitemporal analysis using the Google Earth Engine (GEE) platform for the detection of changes in urban areas using optical data and specific machine learning (ML) algorithms. As a case study, Cairo City has been identified, in Egypt country, as one of the five most populous megacities of the last decade in the world. Classification and change detection analysis of the region of interest (ROI) have been carried out from July 2013 to July 2021. Results demonstrate the validity of the proposed method in identifying changed and unchanged urban areas over the selected period. Furthermore, this work aims to evidence the growing significance of GEE as an efficient cloud-based solution for managing large quantities of satellite data.
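    Illustration — a minimal Earth Engine Python sketch of this kind of multitemporal change detection, assuming Landsat 8 Collection 2 surface-reflectance imagery and a simple built-up-index difference as the change signal; the collection ID, band names, dates, buffer size, and threshold are illustrative choices, and the paper's ML-based classification step is not reproduced here.

    ```python
    import ee
    ee.Initialize()

    roi = ee.Geometry.Point([31.2357, 30.0444]).buffer(20000)  # Cairo, ~20 km radius

    def median_composite(start, end):
        """Cloud-light median composite of Landsat 8 surface reflectance over the ROI."""
        return (ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
                .filterBounds(roi)
                .filterDate(start, end)
                .filter(ee.Filter.lt("CLOUD_COVER", 20))
                .median())

    def ndbi(img):
        # Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)
        return img.normalizedDifference(["SR_B6", "SR_B5"]).rename("NDBI")

    before = ndbi(median_composite("2013-07-01", "2013-12-31"))
    after = ndbi(median_composite("2021-01-01", "2021-07-31"))

    # Pixels whose built-up index increased markedly are flagged as changed.
    change = after.subtract(before).gt(0.1).selfMask().rename("urban_change")
    print(change.bandNames().getInfo())
    ```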

An Analysis of Initial Training Strategies for Exemplar-Free Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.11677
  • repo_url: None
  • paper_authors: Grégoire Petit, Michael Soumm, Eva Feillet, Adrian Popescu, Bertrand Delezoide, David Picard, Céline Hudelot
  • for: This paper studies exemplar-free class-incremental learning (CIL), where classification models are built from data streams and new classes must be integrated at each step without storing examples from past classes.
  • methods: The study examines the factors that shape the incremental learning process: the initial training strategy (using only the first batch of the target dataset versus also using weights pre-trained in a self-supervised way on an auxiliary dataset), the choice of CIL algorithm, the neural architecture, the target task, the class distribution in the stream, and the number of available examples; a statistical analysis framework quantifies the relative contribution of each factor to incremental performance.
  • results: The main finding is that the initial training strategy is the dominant factor influencing average incremental accuracy, whereas the choice of CIL algorithm matters more for preventing forgetting; based on this analysis, the paper proposes practical recommendations for choosing the right initial training strategy for a given incremental learning use case.
    Abstract Class-Incremental Learning (CIL) aims to build classification models from data streams. At each step of the CIL process, new classes must be integrated into the model. Due to catastrophic forgetting, CIL is particularly challenging when examples from past classes cannot be stored, the case on which we focus here. To date, most approaches are based exclusively on the target dataset of the CIL process. However, the use of models pre-trained in a self-supervised way on large amounts of data has recently gained momentum. The initial model of the CIL process may only use the first batch of the target dataset, or also use pre-trained weights obtained on an auxiliary dataset. The choice between these two initial learning strategies can significantly influence the performance of the incremental learning model, but has not yet been studied in depth. Performance is also influenced by the choice of the CIL algorithm, the neural architecture, the nature of the target task, the distribution of classes in the stream and the number of examples available for learning. We conduct a comprehensive experimental study to assess the roles of these factors. We present a statistical analysis framework that quantifies the relative contribution of each factor to incremental performance. Our main finding is that the initial training strategy is the dominant factor influencing the average incremental accuracy, but that the choice of CIL algorithm is more important in preventing forgetting. Based on this analysis, we propose practical recommendations for choosing the right initial training strategy for a given incremental learning use case. These recommendations are intended to facilitate the practical deployment of incremental learning.

Food Image Classification and Segmentation with Attention-based Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2308.11452
  • repo_url: None
  • paper_authors: Valasia Vlachopoulou, Ioannis Sarafis, Alexandros Papadopoulos
  • for: This paper addresses food quantification, driven by the needs of dietary monitoring applications.
  • methods: A weakly supervised methodology trains food image classification and semantic segmentation models without pixel-level annotations, combining a multiple instance learning approach with an attention-based mechanism that generates semantic heat maps used for food class segmentation.
  • results: Experiments on two meta-classes of the FoodSeg103 dataset verify the feasibility of the proposed approach and explore the functioning properties of the attention mechanism.
    Abstract The demand for accurate food quantification has increased in the recent years, driven by the needs of applications in dietary monitoring. At the same time, computer vision approaches have exhibited great potential in automating tasks within the food domain. Traditionally, the development of machine learning models for these problems relies on training data sets with pixel-level class annotations. However, this approach introduces challenges arising from data collection and ground truth generation that quickly become costly and error-prone since they must be performed in multiple settings and for thousands of classes. To overcome these challenges, the paper presents a weakly supervised methodology for training food image classification and semantic segmentation models without relying on pixel-level annotations. The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism. At test time, the models are used for classification and, concurrently, the attention mechanism generates semantic heat maps which are used for food class segmentation. In the paper, we conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach and we explore the functioning properties of the attention mechanism.
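    Illustration — attention-based multiple instance learning pools patch-level (instance) features into an image-level (bag) prediction, and the attention weights double as a coarse heat map, in the spirit of Ilse et al.'s formulation. The dimensions and architecture below are illustrative, not the paper's exact model:

    ```python
    import torch
    import torch.nn as nn

    class AttentionMIL(nn.Module):
        """Bag classifier with attention pooling; the attention weights over instances
        (image patches) can be reshaped into a coarse segmentation heat map."""
        def __init__(self, feat_dim=512, attn_dim=128, num_classes=2):
            super().__init__()
            self.attention = nn.Sequential(
                nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, instances):
            # instances: (N, feat_dim) features of N patches from one image (the "bag")
            a = torch.softmax(self.attention(instances), dim=0)   # (N, 1) attention weights
            bag = (a * instances).sum(dim=0, keepdim=True)        # (1, feat_dim) bag embedding
            return self.classifier(bag), a.squeeze(1)             # class logits, patch weights

    # toy usage: 49 patch features from a 7x7 grid; weights become a heat map
    logits, weights = AttentionMIL()(torch.randn(49, 512))
    heat_map = weights.reshape(7, 7)
    ```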

Towards Discriminative Representations with Contrastive Instances for Real-Time UAV Tracking

  • paper_url: http://arxiv.org/abs/2308.11450
  • repo_url: None
  • paper_authors: Dan Zeng, Mingliang Zou, Xucheng Wang, Shuiwang Li
  • for: To improve both precision and efficiency in UAV tracking, two fundamental challenges arising from the constraints of computing resources, battery capacity, and maximum UAV payload.
  • methods: Discriminative correlation filter (DCF)-based trackers run efficiently on a single CPU but with inferior precision, while lightweight deep learning (DL)-based trackers balance efficiency and precision but their gains are limited by the compression rate; the proposed approach instead learns more discriminative representations with contrastive instances, requiring no manual annotations and allowing a lightweight model to be developed and deployed.
  • results: Extensive experiments on four UAV benchmarks (UAV123@10fps, DTB70, UAVDT, and VisDrone2018) show that the proposed DRCI tracker significantly outperforms state-of-the-art UAV tracking methods.
    Abstract Maintaining high efficiency and high precision are two fundamental challenges in UAV tracking due to the constraints of computing resources, battery capacity, and UAV maximum load. Discriminative correlation filters (DCF)-based trackers can yield high efficiency on a single CPU but with inferior precision. Lightweight Deep learning (DL)-based trackers can achieve a good balance between efficiency and precision but performance gains are limited by the compression rate. High compression rate often leads to poor discriminative representations. To this end, this paper aims to enhance the discriminative power of feature representations from a new feature-learning perspective. Specifically, we attempt to learn more disciminative representations with contrastive instances for UAV tracking in a simple yet effective manner, which not only requires no manual annotations but also allows for developing and deploying a lightweight model. We are the first to explore contrastive learning for UAV tracking. Extensive experiments on four UAV benchmarks, including UAV123@10fps, DTB70, UAVDT and VisDrone2018, show that the proposed DRCI tracker significantly outperforms state-of-the-art UAV tracking methods.

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

  • paper_url: http://arxiv.org/abs/2308.11448
  • repo_url: None
  • paper_authors: Jiantao Wu, Shentong Mo, Muhammad Awais, Sara Atito, Zhenhua Feng, Josef Kittler
  • for: This study evaluates the effectiveness of pure self-supervised learning (SSL) techniques for computer vision tasks without finetuning large models, aiming to emulate human-like generalization and recognition of unseen objects.
  • methods: An evaluation protocol for zero-shot segmentation based on a prompting patch is proposed, together with a simple SSL approach, termed MMC, that combines masked image modelling, momentum-based self-distillation, and global contrast to enhance the discriminative representations of SSL ViTs.
  • results: Experiments show that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
    Abstract Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches, upon that, a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approaches combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
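    Illustration — the zero-shot segmentation protocol prompts one patch and thresholds its similarity to all other patches. A sketch on dummy ViT patch tokens (grid size, cosine similarity, and threshold value are assumptions for illustration):

    ```python
    import torch
    import torch.nn.functional as F

    def prompt_patch_segmentation(patch_tokens, grid_hw, prompt_xy, threshold=0.6):
        """patch_tokens: (H*W, D) ViT patch embeddings of one image.
        grid_hw: (H, W) patch grid size; prompt_xy: (row, col) of the prompted patch.
        Returns a boolean (H, W) mask of patches similar to the prompted one."""
        h, w = grid_hw
        tokens = F.normalize(patch_tokens, dim=-1)
        prompt = tokens[prompt_xy[0] * w + prompt_xy[1]]       # embedding of the prompted patch
        similarity = tokens @ prompt                            # cosine similarity map, (H*W,)
        return similarity.reshape(h, w) > threshold

    # toy usage: 14x14 grid of 384-d patch tokens from a self-supervised ViT
    tokens = torch.randn(14 * 14, 384)
    mask = prompt_patch_segmentation(tokens, (14, 14), prompt_xy=(7, 7))
    ```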

Revisiting and Exploring Efficient Fast Adversarial Training via LAW: Lipschitz Regularization and Auto Weight Averaging

  • paper_url: http://arxiv.org/abs/2308.11443
  • repo_url: None
  • paper_authors: Xiaojun Jia, Yuefeng Chen, Xiaofeng Mao, Ranjie Duan, Jindong Gu, Rong Zhang, Hui Xue, Xiaochun Cao
  • for: The paper aims to improve the robustness of machine learning models against adversarial attacks while reducing the training cost of standard adversarial training.
  • methods: The paper proposes an effective Lipschitz regularization method for fast adversarial training and explores the effect of data augmentation and weight averaging in fast adversarial training.
  • results: The proposed method, FGSM-LAW, demonstrates superior robustness performance compared to state-of-the-art fast adversarial training methods and advanced standard adversarial training methods, as shown in experimental evaluations on four benchmark databases.
    Abstract Fast Adversarial Training (FAT) not only improves the model robustness but also reduces the training cost of standard adversarial training. However, fast adversarial training often suffers from Catastrophic Overfitting (CO), which results in poor robustness performance. Catastrophic Overfitting describes the phenomenon of a sudden and significant decrease in robust accuracy during the training of fast adversarial training. Many effective techniques have been developed to prevent Catastrophic Overfitting and improve the model robustness from different perspectives. However, these techniques adopt inconsistent training settings and require different training costs, i.e, training time and memory costs, leading to unfair comparisons. In this paper, we conduct a comprehensive study of over 10 fast adversarial training methods in terms of adversarial robustness and training costs. We revisit the effectiveness and efficiency of fast adversarial training techniques in preventing Catastrophic Overfitting from the perspective of model local nonlinearity and propose an effective Lipschitz regularization method for fast adversarial training. Furthermore, we explore the effect of data augmentation and weight averaging in fast adversarial training and propose a simple yet effective auto weight averaging method to improve robustness further. By assembling these techniques, we propose a FGSM-based fast adversarial training method equipped with Lipschitz regularization and Auto Weight averaging, abbreviated as FGSM-LAW. Experimental evaluations on four benchmark databases demonstrate the superiority of the proposed method over state-of-the-art fast adversarial training methods and the advanced standard adversarial training methods.
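    Illustration — the single-step FGSM attack and training update at the core of fast adversarial training; the Lipschitz regularization and auto weight averaging that define FGSM-LAW are not reproduced in this sketch.

    ```python
    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y, epsilon=8 / 255):
        """Single-step FGSM: perturb x in the direction of the loss gradient sign."""
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + epsilon * grad.sign()
        return x_adv.clamp(0, 1).detach()

    def fast_adv_training_step(model, optimizer, x, y):
        """One fast adversarial training step: train only on the FGSM examples."""
        model.train()
        x_adv = fgsm_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

    # usage inside a training loop:
    # for x, y in loader: fast_adv_training_step(model, optimizer, x, y)
    ```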

SDeMorph: Towards Better Facial De-morphing from Single Morph

  • paper_url: http://arxiv.org/abs/2308.11442
  • repo_url: None
  • paper_authors: Nitish Shukla
  • for: Face de-morphing for morph attack detection: recovering the bona fide identities used to create a face morph.
  • methods: SDeMorph is a reference-free de-morphing method based on Denoising Diffusion Probabilistic Models (DDPM) and a branched-UNet: the input morphed signal is destroyed and then reconstructed.
  • results: The method accurately recovers the bona fide facial features with high-quality, feature-rich outputs, improving the security of face recognition systems; experiments on the ASML, FRLL-FaceMorph, FRLL-MorDIFF, and SMDD datasets support its effectiveness.
    Abstract Face Recognition Systems (FRS) are vulnerable to morph attacks. A face morph is created by combining multiple identities with the intention to fool FRS and making it match the morph with multiple identities. Current Morph Attack Detection (MAD) can detect the morph but are unable to recover the identities used to create the morph with satisfactory outcomes. Existing work in de-morphing is mostly reference-based, i.e. they require the availability of one identity to recover the other. Sudipta et al. \cite{ref9} proposed a reference-free de-morphing technique but the visual realism of outputs produced were feeble. In this work, we propose SDeMorph (Stably Diffused De-morpher), a novel de-morphing method that is reference-free and recovers the identities of bona fides. Our method produces feature-rich outputs that are of significantly high quality in terms of definition and facial fidelity. Our method utilizes Denoising Diffusion Probabilistic Models (DDPM) by destroying the input morphed signal and then reconstructing it back using a branched-UNet. Experiments on ASML, FRLL-FaceMorph, FRLL-MorDIFF, and SMDD datasets support the effectiveness of the proposed method.

Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection

  • paper_url: http://arxiv.org/abs/2308.11441
  • repo_url: https://github.com/junshengzhou/levelsetudf
  • paper_authors: Junsheng Zhou, Baorui Ma, Shujuan Li, Yu-Shen Liu, Zhizhong Han
  • for: The paper addresses the problem of reconstructing shapes with open surfaces using unsigned distance functions (UDFs).
  • methods: The authors proposed to learn UDFs using neural networks and reconstruct surfaces with the gradients around the zero level set of the UDF. However, they found that the differential networks struggled to learn the zero level set, leading to large errors on unsigned distances and gradients. To resolve this, they proposed to learn a more continuous zero level set using level set projections.
  • results: The authors conducted comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and demonstrated non-trivial improvements over the state-of-the-art methods. They also explored the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF.
    Abstract Latest methods represent shapes with open surfaces using unsigned distance functions (UDFs). They train neural networks to learn UDFs and reconstruct surfaces with the gradients around the zero level set of the UDF. However, the differential networks struggle from learning the zero level set where the UDF is not differentiable, which leads to large errors on unsigned distances and gradients around the zero level set, resulting in highly fragmented and discontinuous surfaces. To resolve this problem, we propose to learn a more continuous zero level set in UDFs with level set projections. Our insight is to guide the learning of zero level set using the rest non-zero level sets via a projection procedure. Our idea is inspired from the observations that the non-zero level sets are much smoother and more continuous than the zero level set. We pull the non-zero level sets onto the zero level set with gradient constraints which align gradients over different level sets and correct unsigned distance errors on the zero level set, leading to a smoother and more continuous unsigned distance field. We conduct comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and further explore the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF, which demonstrate our non-trivial improvements over the state-of-the-art methods. Code is available at https://github.com/junshengzhou/LevelSetUDF .
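    Illustration — the projection idea can be written as pulling a query point q toward the zero level set along the field gradient, q' = q - f(q) * grad f(q) / ||grad f(q)||. A minimal autograd sketch with an untrained MLP standing in for the learned UDF (the paper's actual training losses and level-set-projection constraints are not reproduced):

    ```python
    import torch
    import torch.nn as nn

    # Untrained stand-in for a learned unsigned distance field f: R^3 -> R>=0.
    udf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

    def project_to_zero_level_set(points, steps=5):
        """Iteratively move points toward the zero level set: q <- q - f(q) * grad/|grad|."""
        q = points.clone()
        for _ in range(steps):
            q = q.detach().requires_grad_(True)
            d = udf(q)                                        # (N, 1) unsigned distances
            grad = torch.autograd.grad(d.sum(), q)[0]         # (N, 3) gradients of the field
            grad = grad / (grad.norm(dim=1, keepdim=True) + 1e-8)
            q = q - d * grad                                  # step by the predicted distance
        return q.detach()

    projected = project_to_zero_level_set(torch.rand(1024, 3))  # toy query points in [0, 1]^3
    ```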

PoseGraphNet++: Enriching 3D Human Pose with Orientation Estimation

  • paper_url: http://arxiv.org/abs/2308.11440
  • repo_url: None
  • paper_authors: Soubarna Banik, Edvard Avagyan, Alejandro Mendoza Gracia, Alois Knoll
  • for: This work proposes a graph convolutional network-based 2D-to-3D lifting method that predicts the complete 3D human pose, including both joint positions and bone orientations.
  • methods: The proposed PoseGraphNet++ is a novel 2D-to-3D lifting network that employs node and edge convolutions to exploit joint and bone features.
  • results: Evaluated on multiple benchmark datasets, the model performs on par with or better than the state of the art in both position and rotation metrics; extensive ablation studies show that PoseGraphNet++ benefits from exploiting the mutual relationship between joints and bones.
    Abstract Existing kinematic skeleton-based 3D human pose estimation methods only predict joint positions. Although this is sufficient to compute the yaw and pitch of the bone rotations, the roll around the axis of the bones remains unresolved by these methods. In this paper, we propose a novel 2D-to-3D lifting Graph Convolution Network named PoseGraphNet++ to predict the complete human pose including the joint positions and the bone orientations. We employ node and edge convolutions to utilize the joint and bone features. Our model is evaluated on multiple benchmark datasets, and its performance is either on par with or better than the state-of-the-art in terms of both position and rotation metrics. Through extensive ablation studies, we show that PoseGraphNet++ benefits from exploiting the mutual relationship between the joints and the bones.

ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes

  • paper_url: http://arxiv.org/abs/2308.11417
  • repo_url: None
  • paper_authors: Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, Angela Dai
  • for: The paper introduces a large-scale indoor scene dataset in which each scene is captured with a high-end laser scanner at sub-millimeter resolution for geometry and color, together with registered 33-megapixel DSLR images and RGB-D streams from an iPhone.
  • methods: Scenes are captured with a high-end laser scanner and a DSLR camera, with RGB-D streams recorded on an iPhone; scene reconstructions are further annotated with an open vocabulary of semantics, with label-ambiguous scenarios explicitly annotated for comprehensive semantic understanding.
  • results: ScanNet++ enables a new real-world benchmark for novel view synthesis, both from high-quality RGB capture and from commodity-level images, as well as a new benchmark for 3D semantic scene understanding that comprehensively covers diverse and ambiguous semantic labeling scenarios. ScanNet++ currently contains 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGB-D frames.
    Abstract We present ScanNet++, a large-scale dataset that couples together capture of high-quality and commodity-level geometry and color of indoor scenes. Each scene is captured with a high-end laser scanner at sub-millimeter resolution, along with registered 33-megapixel images from a DSLR camera, and RGB-D streams from an iPhone. Scene reconstructions are further annotated with an open vocabulary of semantics, with label-ambiguous scenarios explicitly annotated for comprehensive semantic understanding. ScanNet++ enables a new real-world benchmark for novel view synthesis, both from high-quality RGB capture, and importantly also from commodity-level images, in addition to a new benchmark for 3D semantic scene understanding that comprehensively encapsulates diverse and ambiguous semantic labeling scenarios. Currently, ScanNet++ contains 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames.

MatFuse: Controllable Material Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.11408
  • repo_url: None
  • paper_authors: Giuseppe Vecchio, Renato Sortino, Simone Palazzo, Concetto Spampinato
  • for: This paper aims to simplify the creation of SVBRDF maps in computer graphics, using a novel unified approach based on diffusion models.
  • methods: The proposed method integrates multiple sources of conditioning, such as color palettes, sketches, and pictures, to enable fine-grained control and flexibility in material synthesis.
  • results: The proposed method yields performance comparable to state-of-the-art approaches in estimating SVBRDF, both qualitatively and quantitatively, under various conditioning settings.
    Abstract Creating high quality and realistic materials in computer graphics is a challenging and time-consuming task, which requires great expertise. In this paper, we present MatFuse, a novel unified approach that harnesses the generative power of diffusion models (DM) to simplify the creation of SVBRDF maps. Our DM-based pipeline integrates multiple sources of conditioning, such as color palettes, sketches, and pictures, enabling fine-grained control and flexibility in material synthesis. This design allows for the combination of diverse information sources (e.g., sketch + image embedding), enhancing creative possibilities in line with the principle of compositionality. We demonstrate the generative capabilities of the proposed method under various conditioning settings; on the SVBRDF estimation task, we show that our method yields performance comparable to state-of-the-art approaches, both qualitatively and quantitatively.

Non-Redundant Combination of Hand-Crafted and Deep Learning Radiomics: Application to the Early Detection of Pancreatic Cancer

  • paper_url: http://arxiv.org/abs/2308.11389
  • repo_url: None
  • paper_authors: Rebeca Vétil, Clément Abi-Nader, Alexandre Bône, Marie-Pierre Vullierme, Marc-Michel Rohé, Pietro Gori, Isabelle Bloch
  • for: This paper addresses the redundancy between deep learning radiomics (DLR) and hand-crafted radiomics (HCR) features in medical imaging.
  • methods: DLR features are extracted with a variational autoencoder (VAE) while their independence from HCR features is enforced by minimizing the mutual information between the two.
  • results: The resulting DLR features can be combined with hand-crafted ones and fed to a classifier to predict early markers of pancreatic cancer; experiments on four early markers, validated on a large independent test set, show that combining non-redundant DLR and HCR features improves the Area Under the Curve over baseline methods.
    Abstract We address the problem of learning Deep Learning Radiomics (DLR) that are not redundant with Hand-Crafted Radiomics (HCR). To do so, we extract DLR features using a VAE while enforcing their independence with HCR features by minimizing their mutual information. The resulting DLR features can be combined with hand-crafted ones and leveraged by a classifier to predict early markers of cancer. We illustrate our method on four early markers of pancreatic cancer and validate it on a large independent test set. Our results highlight the value of combining non-redundant DLR and HCR features, as evidenced by an improvement in the Area Under the Curve compared to baseline methods that do not address redundancy or solely rely on HCR features.
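    Illustration — the paper minimizes the mutual information between VAE-derived DLR features and HCR features; as a much simpler stand-in, the sketch below penalizes their cross-correlation. This decorrelation proxy is not the estimator used in the paper, and in practice it would be added to the usual VAE reconstruction and KL losses with a weighting factor:

    ```python
    import torch

    def decorrelation_penalty(dlr, hcr):
        """Cross-correlation penalty between deep (DLR) and hand-crafted (HCR) features.
        dlr: (B, D1) latent codes from the VAE encoder; hcr: (B, D2) radiomics features.
        A rough, differentiable stand-in for mutual-information minimization."""
        dlr = (dlr - dlr.mean(0)) / (dlr.std(0) + 1e-6)
        hcr = (hcr - hcr.mean(0)) / (hcr.std(0) + 1e-6)
        cross_corr = dlr.t() @ hcr / dlr.shape[0]             # (D1, D2) correlation matrix
        return (cross_corr ** 2).mean()

    # toy usage: 32 patients, 64-d VAE codes, 100 hand-crafted radiomics features
    penalty = decorrelation_penalty(torch.randn(32, 64), torch.randn(32, 100))
    ```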

Targeted Data Augmentation for bias mitigation

  • paper_url: http://arxiv.org/abs/2308.11386
  • repo_url: None
  • paper_authors: Agnieszka Mikołajczyk-Bareła, Maria Ferlin, Michał Grochowski
  • for: This paper aims to address the issue of bias in AI systems by introducing a novel approach called Targeted Data Augmentation (TDA).
  • methods: The TDA method leverages classical data augmentation techniques to insert biases into the training data, which helps to mitigate biases in the models.
  • results: The paper shows that the TDA method can significantly decrease bias measures while maintaining a negligible increase in the error rate, using two diverse datasets of clinical skin lesions and male and female faces.
    Abstract The development of fair and ethical AI systems requires careful consideration of bias mitigation, an area often overlooked or ignored. In this study, we introduce a novel and efficient approach for addressing biases called Targeted Data Augmentation (TDA), which leverages classical data augmentation techniques to tackle the pressing issue of bias in data and models. Unlike the laborious task of removing biases, our method proposes to insert biases instead, resulting in improved performance. To identify biases, we annotated two diverse datasets: a dataset of clinical skin lesions and a dataset of male and female faces. These bias annotations are published for the first time in this study, providing a valuable resource for future research. Through Counterfactual Bias Insertion, we discovered that biases associated with the frame, ruler, and glasses had a significant impact on models. By randomly introducing biases during training, we mitigated these biases and achieved a substantial decrease in bias measures, ranging from two-fold to more than 50-fold, while maintaining a negligible increase in the error rate.
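    Illustration — the bias-insertion idea: with some probability, paste a bias artefact (here a black frame, standing in for the frames, rulers, and glasses the paper identifies) onto a training image. The artefact type, probability, and geometry are illustrative choices, not the paper's exact augmentation:

    ```python
    import random
    import torch

    def insert_frame_bias(image, probability=0.5, thickness=4):
        """image: (C, H, W) tensor in [0, 1]. With the given probability, draw a black
        frame around the image border, mimicking a frame artefact found in the data."""
        if random.random() > probability:
            return image
        augmented = image.clone()
        t = thickness
        augmented[:, :t, :] = 0.0
        augmented[:, -t:, :] = 0.0
        augmented[:, :, :t] = 0.0
        augmented[:, :, -t:] = 0.0
        return augmented

    # toy usage inside a dataset's __getitem__ / transform pipeline
    img = torch.rand(3, 224, 224)
    img = insert_frame_bias(img)
    ```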

DALNet: A Rail Detection Network Based on Dynamic Anchor Line

  • paper_url: http://arxiv.org/abs/2308.11381
  • repo_url: https://github.com/yzichen/mmlanedet
  • paper_authors: Zichen Yu, Quanli Liu, Wei Wang, Liyong Zhang, Xiaoguang Zhao
  • for: To improve rail detection accuracy for intelligent trains.
  • methods: DALNet is a rail detection network based on dynamic anchor lines, comprising a dynamic anchor line generator and a rail detection module; the generator dynamically produces an appropriate anchor line for each rail instance based on the position and shape of the rails in the input image.
  • results: DALNet achieves state-of-the-art performance on the newly provided DL-Rail rail detection dataset as well as on the well-known Tusimple and LLAMAS lane detection benchmarks.
    Abstract Rail detection is one of the key factors for intelligent train. In the paper, motivated by the anchor line-based lane detection methods, we propose a rail detection network called DALNet based on dynamic anchor line. Aiming to solve the problem that the predefined anchor line is image agnostic, we design a novel dynamic anchor line mechanism. It utilizes a dynamic anchor line generator to dynamically generate an appropriate anchor line for each rail instance based on the position and shape of the rails in the input image. These dynamically generated anchor lines can be considered as better position references to accurately localize the rails than the predefined anchor lines. In addition, we present a challenging urban rail detection dataset DL-Rail with high-quality annotations and scenario diversity. DL-Rail contains 7000 pairs of images and annotations along with scene tags, and it is expected to encourage the development of rail detection. We extensively compare DALNet with many competitive lane methods. The results show that our DALNet achieves state-of-the-art performance on our DL-Rail rail detection dataset and the popular Tusimple and LLAMAS lane detection benchmarks. The code will be released at https://github.com/Yzichen/mmLaneDet.

Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images

  • paper_url: http://arxiv.org/abs/2308.11376
  • repo_url: None
  • paper_authors: Weixi Yi, Vasilis Stavrinides, Zachary M. C. Baum, Qianye Yang, Dean C. Barratt, Matthew J. Clarkson, Yipeng Hu, Shaheer U. Saeed
  • for: This work proposes a weakly supervised segmentation method trained with only patch-level labels, framing segmentation as a boundary detection problem rather than the pixel-level classification used in previous work.
  • methods: Reinforcement learning trains a controller function to localise the boundaries of the region of interest, using a reward derived from a pre-trained boundary-presence classifier that signals when an object boundary is encountered within a patch as the controller moves the patch in a sequential Markov decision process.
  • results: On the clinically relevant task of prostate gland segmentation in trans-rectal ultrasound images, the method outperforms other weakly supervised methods trained with the same labels, such as multiple instance learning.
    Abstract We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision the segmentation as a boundary detection problem, rather than a pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios such as where noise artefacts may be present within the region-of-interest (ROI) boundaries, where traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Particularly of interest, ultrasound images, where intensity values represent acoustic impedance differences between boundaries, may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, which are the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also prevents possible false-positive or -negative cases by minimising number of patches passed to the boundary-presence classifier. We evaluate our proposed approach for a clinically relevant task of prostate gland segmentation on trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods, using the same labels e.g., multiple instance learning.
    摘要 我们提出了 Boundary-RL,一种新的弱监督分割方法,仅使用图块级标签进行训练。我们将分割视为边界检测问题,而不是以往工作中的像素级分类。这种视角有望在困难场景下(例如感兴趣区域边界内存在噪声伪影时)完成边界勾画,而传统基于像素级分类的弱监督方法在此类场景中可能难以有效分割感兴趣区域。特别地,超声图像中的强度值反映边界两侧的声阻抗差异,因此也可能从边界勾画方法中获益。我们的方法使用强化学习训练一个控制函数来定位感兴趣区域的边界,奖励来自一个预训练的边界存在分类器:当控制函数在序贯的马尔可夫决策过程中移动图块位置时,该分类器指示图块内是否出现了物体边界。该分类器本身仅使用物体存在与否的二值图块级标签进行训练,这也是整个边界勾画框架训练中使用的唯一标签,并作为弱信号来指导边界勾画。控制函数的使用避免了对整幅图像进行滑动窗口扫描,并通过减少送入边界存在分类器的图块数量来降低假阳性或假阴性的可能。我们在具有临床意义的经直肠超声前列腺分割任务上评估了所提方法,结果表明其性能优于使用相同标签的其他弱监督方法(如多示例学习)。
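A minimal toy sketch of the sequential decision process the abstract describes: a controller moves a patch, and the reward comes from a patch-level boundary-presence classifier. The 1-D image, the hand-coded classifier, and the epsilon-greedy policy are all stand-ins for the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_BOUNDARY = 60          # toy ground truth: object boundary at column 60
PATCH = 8                   # patch width inspected by the classifier

def boundary_present(col):
    """Stand-in for the pre-trained patch-level boundary-presence classifier:
    returns 1 when the current patch straddles the boundary."""
    return int(col <= TRUE_BOUNDARY < col + PATCH)

def run_episode(start_col, max_steps=30, eps=0.2):
    """Sequential decision process: at each step the controller shifts the
    patch and collects a reward from the classifier."""
    col, total_reward, visited = start_col, 0, []
    for _ in range(max_steps):
        visited.append(col)
        r = boundary_present(col)
        total_reward += r
        if r:                       # boundary found: stop early
            break
        # epsilon-greedy stand-in for the learned policy: mostly sweep forward,
        # occasionally explore in the other direction
        step = rng.choice([-PATCH, PATCH]) if rng.random() < eps else PATCH
        col = int(np.clip(col + step, 0, 120 - PATCH))
    return total_reward, visited

reward, path = run_episode(start_col=8)
print("reward:", reward, "patches visited:", len(path))
```

Because the controller only visits a handful of patches before stopping, no sliding window over the whole image is needed, which is the efficiency argument made in the abstract.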

Enhancing Interpretable Object Abstraction via Clustering-based Slot Initialization

  • paper_url: http://arxiv.org/abs/2308.11369
  • repo_url: None
  • paper_authors: Ning Gao, Bernard Hohmann, Gerhard Neumann
  • for: The paper is focused on improving object-centric representations using slots for efficient, flexible, and interpretable abstraction from low-level perceptual features in a compositional scene.
  • methods: The paper proposes using clustering algorithms conditioned on perceptual input features to initialize the slot representations, and designs permutation invariant and permutation equivariant versions of this layer to enable exchangeable slot representations after clustering.
  • results: The paper shows that its method outperforms prior works consistently, especially for complex scenes, through experiments on object discovery and novel view synthesis tasks with various datasets.
    Abstract Object-centric representations using slots have shown the advances towards efficient, flexible and interpretable abstraction from low-level perceptual features in a compositional scene. Current approaches randomize the initial state of slots followed by an iterative refinement. As we show in this paper, the random slot initialization significantly affects the accuracy of the final slot prediction. Moreover, current approaches require a predetermined number of slots from prior knowledge of the data, which limits the applicability in the real world. In our work, we initialize the slot representations with clustering algorithms conditioned on the perceptual input features. This requires an additional layer in the architecture to initialize the slots given the identified clusters. We design permutation invariant and permutation equivariant versions of this layer to enable the exchangeable slot representations after clustering. Additionally, we employ mean-shift clustering to automatically identify the number of slots for a given scene. We evaluate our method on object discovery and novel view synthesis tasks with various datasets. The results show that our method outperforms prior works consistently, especially for complex scenes.
    摘要 基于槽(slot)的以对象为中心的表示,在从组合场景的低层感知特征中进行高效、灵活且可解释的抽象方面已取得进展。现有方法先随机初始化槽的状态,再进行迭代优化。我们在本文中表明,随机的槽初始化会显著影响最终槽预测的准确性。此外,现有方法需要依据对数据的先验知识预先设定槽的数量,这限制了其在真实世界中的适用性。在我们的工作中,我们使用以感知输入特征为条件的聚类算法来初始化槽表示。这需要在架构中增加一层,根据识别出的聚类簇来初始化槽。我们设计了该层的置换不变和置换等变两个版本,以便在聚类后获得可交换的槽表示。此外,我们使用均值漂移(mean-shift)聚类来自动确定给定场景中槽的数量。我们在对象发现和新视图合成任务上使用多个数据集进行评估,结果表明我们的方法持续优于先前的方法,尤其是在复杂场景中。
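A minimal sketch of the clustering-driven initialization idea: per-pixel encoder features are clustered with mean-shift, the cluster centres seed the slots, and the number of clusters fixes the number of slots. The bandwidth quantile and the toy feature map are assumptions; the permutation-invariant/equivariant initialization layer from the paper is not shown.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def init_slots_from_features(feat_map, quantile=0.2):
    """feat_map: (H, W, C) per-pixel encoder features.
    Returns (num_slots, C) slot initializations and the pixel-to-slot labels."""
    h, w, c = feat_map.shape
    pixels = feat_map.reshape(-1, c)
    bandwidth = estimate_bandwidth(pixels, quantile=quantile)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(pixels)
    slots = ms.cluster_centers_            # one slot per discovered mode
    labels = ms.labels_.reshape(h, w)      # which slot owns each pixel
    return slots, labels

# toy feature map with two visually distinct regions
feats = np.zeros((16, 16, 4))
feats[:, :8] = [1, 0, 0, 0]
feats[:, 8:] = [0, 1, 0, 0]
feats += 0.05 * np.random.randn(16, 16, 4)

slots, labels = init_slots_from_features(feats)
print("slots discovered:", slots.shape[0])   # expected: 2
```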

Towards Clip-Free Quantized Super-Resolution Networks: How to Tame Representative Images

  • paper_url: http://arxiv.org/abs/2308.11365
  • repo_url: None
  • paper_authors: Alperen Kalay, Bahri Batuhan Bilecen, Mustafa Ayazoglu
  • for: 这个研究旨在解决 SR 网络中训练后量化 (PTQ) 阶段的一个重要问题,即代表性数据集 (RD)。
  • methods: 我们提出了一个新的无裁剪量化管道(clip-free quantization pipeline, CFQP),并提供了大量实验论证,仅利用 FP32 模型的输出来巧妙地增强 RD 图像。该方法可以去除不必要的裁剪激活层,从而提高整体稳定性,在部分 SR 模型上将推理时间最多减少 54%,并且相比 INT8 裁剪模型获得更好的视觉质量。
  • results: 我们的方法无需针对裁剪激活重新训练,即可在某些 SR 模型上同时提升视觉质量并缩短推理时间;在一些情况下,其运行时间和视觉质量甚至超过了未量化的 FP32 模型。
    Abstract Super-resolution (SR) networks have been investigated for a while, with their mobile and lightweight versions gaining noticeable popularity recently. Quantization, the procedure of decreasing the precision of network parameters (mostly FP32 to INT8), is also utilized in SR networks for establishing mobile compatibility. This study focuses on a very important but mostly overlooked post-training quantization (PTQ) step: representative dataset (RD), which adjusts the quantization range for PTQ. We propose a novel pipeline (clip-free quantization pipeline, CFQP) backed up with extensive experimental justifications to cleverly augment RD images by only using outputs of the FP32 model. Using the proposed pipeline for RD, we can successfully eliminate unwanted clipped activation layers, which nearly all mobile SR methods utilize to make the model more robust to PTQ in return for a large overhead in runtime. Removing clipped activations with our method significantly benefits overall increased stability, decreased inference runtime up to 54% on some SR models, better visual quality results compared to INT8 clipped models - and outperforms even some FP32 non-quantized models, both in runtime and visual quality, without the need for retraining with clipped activation.
    摘要 超分辨率(SR)网络已经被研究多年,其移动端和轻量级版本近来广受关注。量化,即降低网络参数的精度(主要是从 FP32 到 INT8),也被用于 SR 网络以实现移动端兼容。本研究关注一个非常重要但常被忽视的训练后量化(PTQ)步骤:代表性数据集(RD),它用于校准 PTQ 的量化范围。我们提出了一个新的无裁剪量化管道(CFQP),并提供了详细的实验论证,仅利用 FP32 模型的输出来巧妙地增强 RD 图像。借助该管道构建 RD,我们可以成功去除不必要的裁剪激活层;几乎所有移动端 SR 方法都使用这类裁剪激活来增强模型对 PTQ 的鲁棒性,但也因此带来了很大的运行时间开销。去除裁剪激活后,我们的方法显著提升了整体稳定性,在部分 SR 模型上将推理时间最多减少 54%,相比 INT8 裁剪模型获得更好的视觉质量,甚至在运行时间和视觉质量上超过部分未量化的 FP32 模型,而无需针对裁剪激活重新训练。
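For context on the step CFQP targets, here is a generic sketch of how a representative dataset drives activation-range calibration for INT8 PTQ: activations are recorded while running the FP32 model over the RD, a range is estimated, and affine quantization parameters are derived. The percentile clipping, UINT8 layout, and random "RD" are assumptions; the CFQP-specific augmentation of RD images from FP32 outputs is only described in the abstract, not implemented here.

```python
import numpy as np

def calib_range(activations, lo_pct=0.1, hi_pct=99.9):
    """Collect an activation range over the representative dataset.
    Percentile clipping of outliers stands in for range calibration."""
    a = np.concatenate([x.ravel() for x in activations])
    return np.percentile(a, lo_pct), np.percentile(a, hi_pct)

def affine_qparams(rmin, rmax):
    """Asymmetric affine quantization parameters for UINT8."""
    qmin, qmax = 0, 255
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)   # range must contain zero
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

# toy "RD": activations of one layer recorded while running the FP32 model
rd_activations = [np.random.randn(1, 16, 32, 32) * 3.0 for _ in range(8)]
rmin, rmax = calib_range(rd_activations)
scale, zp = affine_qparams(rmin, rmax)
print(f"scale={scale:.4f}, zero_point={zp}")
print(quantize(rd_activations[0], scale, zp).dtype)
```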

Exemplar-Free Continual Transformer with Convolutions

  • paper_url: http://arxiv.org/abs/2308.11357
  • repo_url: None
  • paper_authors: Anurag Roy, Vinay Kumar Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, Abir Das
  • for: 这 paper 的目的是提出一种新的无例子(exemplar-free)的类/任务逐步学习方法,不需要在测试时显式提供任务标识符(task identifier),并且不需要保留之前训练集。
  • methods: 该方法使用 transformer 架构,并通过重新权重 multi-head self-attention 层中的键、查询和值权重来实现类/任务逐步学习。具体来说,通过 convolution 来重新权重这些权重,以便在每个任务上保持低的参数数量。此外,使用图像增强技术来预测任务,而无需在测试时显式提供任务标识符。
  • results: 实验结果表明,该方法可以在四个 benchmark 数据集上超越许多竞争方法,而且需要更少的参数。
    Abstract Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transformers, there is a need to explore their potential for CL. Although there have been some recent CL approaches for vision transformers, they either store training instances of previous tasks or require a task identifier during test time, which can be limiting. This paper proposes a new exemplar-free approach for class/task incremental learning called ConTraCon, which does not require task-id to be explicitly present during inference and avoids the need for storing previous training instances. The proposed approach leverages the transformer architecture and involves re-weighting the key, query, and value weights of the multi-head self-attention layers of a transformer trained on a similar task. The re-weighting is done using convolution, which enables the approach to maintain low parameter requirements per task. Additionally, an image augmentation-based entropic task identification approach is used to predict tasks without requiring task-ids during inference. Experiments on four benchmark datasets demonstrate that the proposed approach outperforms several competitive approaches while requiring fewer parameters.
    摘要 持续学习(CL)指以串行方式训练机器学习模型,在没有先前训练数据的情况下学习新信息,同时保留已学过的任务。尽管 CL 受到了广泛关注,但近期计算机视觉中的大多数 CL 方法只针对卷积架构。随着视觉 Transformer 的成功,有必要探索其在 CL 中的潜力。虽然已有一些面向视觉 Transformer 的 CL 方法,但它们要么需要存储先前任务的训练样本,要么在测试时需要任务标识符,这都具有局限性。本文提出一种新的无需样本存储的类/任务增量学习方法 ConTraCon,它在推理时不需要显式提供任务标识符,也无需保存先前的训练样本。该方法利用 Transformer 架构,通过卷积对在相似任务上训练的 Transformer 的多头自注意力层中的键、查询和值权重进行重新加权,从而使每个任务所需的参数量保持较低。此外,该方法还采用基于图像增强的熵任务识别方式,在推理时无需任务标识符即可预测任务。在四个基准数据集上的实验表明,所提出的方法在参数更少的情况下优于多种有竞争力的方法。

Integration of Sentinel-1 and Sentinel-2 data for Earth surface classification using Machine Learning algorithms implemented on Google Earth Engine

  • paper_url: http://arxiv.org/abs/2308.11340
  • repo_url: None
  • paper_authors: Francesca Razzano, Mariapia Rita Iandolo, Chiara Zarro, G. S. Yogesh, Silvia Liberata Ullo
  • for: 本研究使用Synthetic Aperture Radar (SAR)和光学数据进行地面分类。
  • methods: 通过在Google Earth Engine (GEE)平台上实施监督式机器学习(ML)算法,将Sentinel-1 (S-1)和Sentinel-2 (S-2)数据集成起来,用于地面覆盖分类。
  • results: 研究结果表明,在这种情况下,雷达和光学远程探测提供了补偿信息,有利地面覆盖分类,通常导致映射精度的提高。此外,本研究也证明了GEE在处理大量卫星数据方面的emerging角色。
    Abstract In this study, Synthetic Aperture Radar (SAR) and optical data are both considered for Earth surface classification. Specifically, the integration of Sentinel-1 (S-1) and Sentinel-2 (S-2) data is carried out through supervised Machine Learning (ML) algorithms implemented on the Google Earth Engine (GEE) platform for the classification of a particular region of interest. Achieved results demonstrate how in this case radar and optical remote detection provide complementary information, benefiting surface cover classification and generally leading to increased mapping accuracy. In addition, this paper works in the direction of proving the emerging role of GEE as an effective cloud-based tool for handling large amounts of satellite data.
    摘要 在本研究中,我们同时考虑合成孔径雷达(SAR)和光学数据用于地表分类。具体而言,通过在 Google Earth Engine(GEE)平台上实现的监督式机器学习(ML)算法,将 Sentinel-1(S-1)与 Sentinel-2(S-2)数据进行融合,对特定感兴趣区域进行分类。所得结果表明,在这种情况下,雷达与光学遥感提供了互补信息,有利于地表覆盖分类,并通常带来制图精度的提升。此外,本文也进一步证明了 GEE 作为处理海量卫星数据的有效云端工具的新兴作用。

Object Detection Difficulty: Suppressing Over-aggregation for Faster and Better Video Object Detection

  • paper_url: http://arxiv.org/abs/2308.11327
  • repo_url: https://github.com/bingqingzhang/odd-vod
  • paper_authors: Bingqing Zhang, Sen Wang, Yifan Liu, Brano Kusy, Xue Li, Jiajun Liu
  • for: 提高视频对象检测(VOD)系统的实用性
  • methods: 提出一种图像级对象检测难度(ODD)指标,用于衡量检测图像中对象的难度,并在VOD过程中应用该指标来减少过度聚合。
  • results: 对8种VOD模型进行了广泛的实验,结果表明,当选择全局参照帧时,ODD-VOD可以不断提高全局基于VOD模型的准确率。当用于加速时,ODD-VOD可以不断提高帧数(FPS),并且不会降低准确率。
    Abstract Current video object detection (VOD) models often encounter issues with over-aggregation due to redundant aggregation strategies, which perform feature aggregation on every frame. This results in suboptimal performance and increased computational complexity. In this work, we propose an image-level Object Detection Difficulty (ODD) metric to quantify the difficulty of detecting objects in a given image. The derived ODD scores can be used in the VOD process to mitigate over-aggregation. Specifically, we train an ODD predictor as an auxiliary head of a still-image object detector to compute the ODD score for each image based on the discrepancies between detection results and ground-truth bounding boxes. The ODD score enhances the VOD system in two ways: 1) it enables the VOD system to select superior global reference frames, thereby improving overall accuracy; and 2) it serves as an indicator in the newly designed ODD Scheduler to eliminate the aggregation of frames that are easy to detect, thus accelerating the VOD process. Comprehensive experiments demonstrate that, when utilized for selecting global reference frames, ODD-VOD consistently enhances the accuracy of Global-frame-based VOD models. When employed for acceleration, ODD-VOD consistently improves the frames per second (FPS) by an average of 73.3% across 8 different VOD models without sacrificing accuracy. When combined, ODD-VOD attains state-of-the-art performance when competing with many VOD methods in both accuracy and speed. Our work represents a significant advancement towards making VOD more practical for real-world applications.
    摘要 当前的视频对象检测(VOD)模型经常遇到过度聚合的问题,这是因为 redundancy 的聚合策略在每帧进行特征聚合,从而导致性能下降和计算复杂性增加。在这种情况下,我们提出了一个图像级别的对象检测困难度(ODD)度量,用于衡量检测图像中对象的困难度。这些计算的ODD分数可以在 VOD 过程中使用,以避免过度聚合。我们在 VOD 系统中训练了一个 ODD 预测器,用于计算每幅图像的 ODD 分数,基于检测结果和真实 bounding box 之间的差异。ODD 分数可以在两种方式帮助 VOD 系统:1)选择更高精度的全局参照帧,以提高总体精度;2)作为 ODD 调度器中的指标,以消除容易检测的帧的聚合,从而加速 VOD 过程。我们的实验表明,当用于选择全局参照帧时,ODD-VOD 可以不断提高全球帧基本 VOD 模型的精度。当用于加速时,ODD-VOD 可以平均提高 FPS 值 by 73.3%,无需牺牲精度。当两者结合使用时,ODD-VOD 可以在精度和速度两个方面占据领先地位,代表了对 VOD 实际应用的一项重要进步。
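A small sketch of what an ODD-driven scheduler could look like: frames below a difficulty threshold skip feature aggregation and go straight to the still-image detector, while the remaining frames are aggregated. The threshold value and the rule that the lowest-difficulty frames serve as global reference frames are assumptions for illustration, not the paper's exact scheduler.

```python
import numpy as np

def schedule_frames(odd_scores, tau=0.5, num_refs=4):
    """Split a clip's frames by detection difficulty.

    odd_scores: per-frame ODD values in [0, 1] (higher = harder to detect).
    Easy frames skip feature aggregation; hard frames are aggregated using a
    set of global reference frames.
    """
    odd_scores = np.asarray(odd_scores)
    easy = np.where(odd_scores < tau)[0]            # still-image detector only
    hard = np.where(odd_scores >= tau)[0]           # aggregate with references
    # assumption: low-difficulty frames make cleaner global references
    refs = np.argsort(odd_scores)[:num_refs]
    return easy, hard, refs

easy, hard, refs = schedule_frames([0.1, 0.8, 0.3, 0.9, 0.2, 0.7])
print("easy:", easy, "hard:", hard, "global refs:", refs)
```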

CiteTracker: Correlating Image and Text for Visual Tracking

  • paper_url: http://arxiv.org/abs/2308.11322
  • repo_url: https://github.com/NorahGreen/CiteTracker
  • paper_authors: Xin Li, Yuqing Huang, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang
  • for: 提高视觉跟踪中目标模型和推理的精度,使得在目标呈现大幅变化时仍能准确跟踪。
  • methods: 提出了一种基于图像和文本的目标跟踪方法,通过图像转文本模块将目标图像区域转换为描述性文本,并通过动态描述模块适应目标变化以提高跟踪精度。
  • results: 经过了五种不同的数据集测试,并与现有方法进行比较,研究发现提出的跟踪方法在跟踪目标呈现大幅变化时表现出了优于现有方法的性能。
    Abstract Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking. However, a single image patch cannot provide a complete and precise concept of the target object as images are limited in their ability to abstract and can be ambiguous, which makes it difficult to track targets with drastic variations. In this paper, we propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text. Specifically, we develop a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target. In addition, a dynamic description module is designed to adapt to target variations for more effective target representation. We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference. Extensive experiments on five diverse datasets are conducted to evaluate the proposed algorithm and the favorable performance against the state-of-the-art methods demonstrates the effectiveness of the proposed tracking method.
    摘要 现有的视觉跟踪方法通常使用图像块作为目标的参考点进行跟踪。然而,单个图像块无法提供完整和准确的目标对象概念,因为图像有限制,容易受到歧义和变化的影响,这使得跟踪目标变化具有挑战性。在这篇论文中,我们提出了CiteTracker,用于增强视觉跟踪中目标模型化和推理的方法。具体来说,我们开发了一个文本生成模块,将目标图像块转换为包含类和特征信息的详细文本描述,为目标提供了全面的参考点。此外,我们还设计了一个动态描述模块,以适应目标变化,以更有效地表示目标。然后,我们使用关注机制来关联目标描述和搜索图像,生成相关特征,用于目标状态参考。我们对五种不同的数据集进行了广泛的实验,以评估提出的算法效果。结果表明,与现有方法相比,我们的跟踪方法具有优秀的效果。

Using and Abusing Equivariance

  • paper_url: http://arxiv.org/abs/2308.11316
  • repo_url: None
  • paper_authors: Tom Edixhoven, Attila Lengyel, Jan van Gemert
  • for: 研究群等变卷积神经网络如何学习破坏其对称性的等变性
  • methods: 利用下采样来学习破坏等变性,并针对 2D 旋转和反射进行研究
  • results: 发现输入尺寸仅改变一个像素就足以使常用网络从严格等变退化为近似等变;近似等变网络对未见过的对称变换的泛化能力明显弱于严格等变网络,但当训练数据中的对称性与网络的对称性不一致时,近似等变网络能够放松自身的等变约束,在常用基准数据集上达到或超过严格等变网络的表现
    Abstract In this paper we show how Group Equivariant Convolutional Neural Networks use subsampling to learn to break equivariance to their symmetries. We focus on 2D rotations and reflections and investigate the impact of broken equivariance on network performance. We show that a change in the input dimension of a network as small as a single pixel can be enough for commonly used architectures to become approximately equivariant, rather than exactly. We investigate the impact of networks not being exactly equivariant and find that approximately equivariant networks generalise significantly worse to unseen symmetries compared to their exactly equivariant counterparts. However, when the symmetries in the training data are not identical to the symmetries of the network, we find that approximately equivariant networks are able to relax their own equivariant constraints, causing them to match or outperform exactly equivariant networks on common benchmark datasets.
    摘要 在本文中,我们展示了群等变卷积神经网络如何通过下采样来学习破坏其对称性的等变性。我们关注 2D 旋转和反射,并研究等变性被破坏后对网络性能的影响。我们表明,网络输入尺寸仅改变一个像素,就足以使常用架构从严格等变变为近似等变。我们进一步研究了网络不严格等变所带来的影响,发现近似等变网络在未见过的对称变换上的泛化能力明显差于严格等变网络。然而,当训练数据中的对称性与网络的对称性不一致时,我们发现近似等变网络能够放松自身的等变约束,从而在常用基准数据集上达到或超过严格等变网络的表现。
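The interaction between subsampling and symmetry that the paper studies can be seen in a much simpler setting. The check below is not the paper's group-equivariant architecture; it only shows that whether 2x2 subsampling (here, max-pooling) commutes with a 90-degree rotation flips when the input size changes by a single pixel, because the pooling grid no longer aligns with the rotated grid.

```python
import torch
import torch.nn.functional as F

def pool_commutes_with_rot90(size, seed=0):
    """Check whether 2x2 max-pooling commutes with a 90-degree rotation."""
    torch.manual_seed(seed)
    x = torch.randn(1, 1, size, size)
    rot_then_pool = F.max_pool2d(torch.rot90(x, 1, dims=(2, 3)), 2)
    pool_then_rot = torch.rot90(F.max_pool2d(x, 2), 1, dims=(2, 3))
    return torch.allclose(rot_then_pool, pool_then_rot)

print(pool_commutes_with_rot90(8))   # True : even size, pooling grid aligns
print(pool_commutes_with_rot90(9))   # False: one extra pixel misaligns the grid
```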

Approaching human 3D shape perception with neurally mappable models

  • paper_url: http://arxiv.org/abs/2308.11300
  • repo_url: None
  • paper_authors: Thomas P. O’Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Josh B. Tenenbaum, Vincent Sitzmann, Nancy Kanwisher
  • for: 这研究旨在理解人类如何自然地推测三维形状,以及这种能力是如何被计算机模型重建的?
  • methods: 研究使用了一种新的计算模型,即3D神经场(3D-LFN),该模型基于深度神经网络(DNN),并通过多视角训练和多视角学习目标来实现人类水平的性能。
  • results: 研究发现,3D-LFN支持人类水平的三维匹配判断,并在针对标准DNN模型的攻击性定义比较中表现出色。此外,研究还发现,通过多视角训练和多视角学习目标,even conventional DNN architectures可以更接近人类行为。但是,这些模型在处理新的物体类别时仍有所限制。
    Abstract Humans effortlessly infer the 3D shape of objects. What computations underlie this ability? Although various computational models have been proposed, none of them capture the human ability to match object shape across viewpoints. Here, we ask whether and how this gap might be closed. We begin with a relatively novel class of computational models, 3D neural fields, which encapsulate the basic principles of classic analysis-by-synthesis in a deep neural network (DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D matching judgments well aligned to humans for within-category comparisons, adversarially-defined comparisons that accentuate the 3D failure cases of standard DNN models, and adversarially-defined comparisons for algorithmically generated shapes with no category structure. We then investigate the source of the 3D-LFN's ability to achieve human-aligned performance through a series of computational experiments. Exposure to multiple viewpoints of objects during training and a multi-view learning objective are the primary factors behind model-human alignment; even conventional DNN architectures come much closer to human behavior when trained with multi-view objectives. Finally, we find that while the models trained with multi-view learning objectives are able to partially generalize to new object categories, they fall short of human alignment. This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures and highlights important questions for future work.
    摘要 人类能够毫不费力地推断物体的三维形状。支撑这种能力的是怎样的计算过程?尽管已有多种计算模型被提出,但没有一种能够复现人类跨视角匹配物体形状的能力。在这里,我们探讨这一差距能否以及如何被缩小。我们从一类相对较新的计算模型——3D 神经场入手,它将经典的"分析—综合"基本原理封装在深度神经网络(DNN)中。首先,我们发现 3D 光场网络(3D-LFN)所支持的三维匹配判断与人类高度一致,包括类别内的比较、为凸显标准 DNN 模型三维失败案例而对抗式构造的比较,以及针对无类别结构的算法生成形状的对抗式比较。随后,我们通过一系列计算实验探究 3D-LFN 达到与人类一致表现的来源。训练期间接触物体的多个视角以及多视角学习目标,是模型与人类行为一致的主要因素;即使是传统的 DNN 架构,在采用多视角目标训练后也会明显更接近人类行为。最后,我们发现采用多视角学习目标训练的模型虽然能够部分地泛化到新的物体类别,但仍未达到与人类一致的水平。这项工作为在可神经映射的计算架构中理解人类形状推断奠定了基础,并指出了未来工作的重要问题。

BHSD: A 3D Multi-Class Brain Hemorrhage Segmentation Dataset

  • paper_url: http://arxiv.org/abs/2308.11298
  • repo_url: None
  • paper_authors: Biao Wu, Yutong Xie, Zeyu Zhang, Jinchao Ge, Kaspar Yaxley, Suzan Bahadir, Qi Wu, Yifan Liu, Minh-Son To
  • for: 本研究旨在提供一个3D多类血肿 segmentation dataset(BHSD),以便为血肿 segmentation任务提供支持。
  • methods: 本研究使用了深度学习技术来进行 médical image segmentation,并应用于血肿 segmentation任务。
  • results: 本研究提供了一个包含192个Volume的多类血肿数据集,以及2200个slice-level标注的数据集。通过对这些数据集进行supervised和semi-supervisedsegmentation任务的实验,我们证明了数据集的实用性。
    Abstract Intracranial hemorrhage (ICH) is a pathological condition characterized by bleeding inside the skull or brain, which can be attributed to various factors. Identifying, localizing and quantifying ICH has important clinical implications, in a bleed-dependent manner. While deep learning techniques are widely used in medical image segmentation and have been applied to the ICH segmentation task, existing public ICH datasets do not support the multi-class segmentation problem. To address this, we develop the Brain Hemorrhage Segmentation Dataset (BHSD), which provides a 3D multi-class ICH dataset containing 192 volumes with pixel-level annotations and 2200 volumes with slice-level annotations across five categories of ICH. To demonstrate the utility of the dataset, we formulate a series of supervised and semi-supervised ICH segmentation tasks. We provide experimental results with state-of-the-art models as reference benchmarks for further model developments and evaluations on this dataset.
    摘要 颅内出血(ICH)是一种以颅内或脑内出血为特征的病理状况,可由多种因素引起。对 ICH 进行识别、定位和量化具有重要的临床意义,且与出血类型密切相关。虽然深度学习技术已广泛应用于医学图像分割,并已被用于 ICH 分割任务,但现有公开的 ICH 数据集并不支持多类分割问题。为此,我们构建了脑出血分割数据集(BHSD),提供一个三维多类 ICH 数据集,其中包含 192 个具有像素级标注的体数据和 2200 个具有切片级标注的体数据,涵盖五类 ICH。为展示该数据集的实用性,我们设计了一系列监督和半监督的 ICH 分割任务,并给出了最新模型的实验结果,作为在该数据集上进一步开发和评估模型的参考基准。

PCMC-T1: Free-breathing myocardial T1 mapping with Physically-Constrained Motion Correction

  • paper_url: http://arxiv.org/abs/2308.11281
  • repo_url: None
  • paper_authors: Eyal Hanania, Ilya Volovik, Lilach Barkat, Israel Cohen, Moti Freiman
  • for: The paper is focused on developing a deep-learning-based method for motion correction in free-breathing T1 mapping, which can improve the accuracy and accessibility of diffuse myocardial disease diagnosis.
  • methods: The proposed method, called PCMC-T1, incorporates a physically-constrained signal decay model into a deep-learning network to correct for motion artifacts in free-breathing T1 mapping.
  • results: PCMC-T1 was compared to baseline methods using a 5-fold experimental setup on a public dataset of 210 patients and demonstrated superior model fitting quality and clinical impact, with anatomical alignment results that were comparable to the baseline methods.
    Abstract T1 mapping is a quantitative magnetic resonance imaging (qMRI) technique that has emerged as a valuable tool in the diagnosis of diffuse myocardial diseases. However, prevailing approaches have relied heavily on breath-hold sequences to eliminate respiratory motion artifacts. This limitation hinders accessibility and effectiveness for patients who cannot tolerate breath-holding. Image registration can be used to enable free-breathing T1 mapping. Yet, inherent intensity differences between the different time points make the registration task challenging. We introduce PCMC-T1, a physically-constrained deep-learning model for motion correction in free-breathing T1 mapping. We incorporate the signal decay model into the network architecture to encourage physically-plausible deformations along the longitudinal relaxation axis. We compared PCMC-T1 to baseline deep-learning-based image registration approaches using a 5-fold experimental setup on a publicly available dataset of 210 patients. PCMC-T1 demonstrated superior model fitting quality (R2: 0.955) and achieved the highest clinical impact (clinical score: 3.93) compared to baseline methods (0.941, 0.946 and 3.34, 3.62 respectively). Anatomical alignment results were comparable (Dice score: 0.9835 vs. 0.984, 0.988). Our code and trained models are available at https://github.com/eyalhana/PCMC-T1.
    摘要 T1 映射是一种定量磁共振成像(qMRI)技术,已成为诊断弥漫性心肌疾病的重要工具。然而,现有方法大多依赖屏气序列来消除呼吸运动伪影,这限制了无法耐受屏气的患者的可及性和有效性。图像配准可以实现自由呼吸下的 T1 映射,但不同时间点之间固有的强度差异使配准任务变得困难。我们提出 PCMC-T1,一种用于自由呼吸 T1 映射运动校正的物理约束深度学习模型。我们将信号衰减模型纳入网络结构,以鼓励沿纵向弛豫轴的物理上合理的形变。我们在一个包含 210 名患者的公开数据集上,采用五折实验设置,将 PCMC-T1 与基于深度学习的基线配准方法进行了比较。PCMC-T1 展现出更优的模型拟合质量(R2:0.955),并取得了最高的临床影响评分(3.93),而基线方法分别为(0.941、0.946)和(3.34、3.62);解剖对齐结果则相当(Dice 分数:0.9835 对 0.984、0.988)。我们的代码和训练模型可在 https://github.com/eyalhana/PCMC-T1 获取。
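The physical constraint referred to in the abstract is a longitudinal-relaxation signal decay model. As a standalone illustration (not the PCMC-T1 network, which embeds the model in its architecture), the sketch below fits the widely used three-parameter inversion-recovery model per pixel with scipy; the inversion times, noise level, and initial guess are synthetic assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def ir_signal(t, A, B, T1_star):
    """Three-parameter inversion-recovery model: S(t) = A - B * exp(-t / T1*)."""
    return A - B * np.exp(-t / T1_star)

def fit_t1(inversion_times, signal):
    p0 = [signal.max(), 2 * signal.max(), 1000.0]            # rough initial guess
    (A, B, T1_star), _ = curve_fit(ir_signal, inversion_times, signal,
                                   p0=p0, maxfev=5000)
    return T1_star * (B / A - 1.0)                           # Look-Locker correction

# synthetic pixel with T1 = 1200 ms
ti = np.array([120., 200., 280., 1000., 1800., 2600., 3400., 4200.])
true_A, true_B, true_T1 = 1.0, 2.0, 1200.0
t1_star = true_T1 / (true_B / true_A - 1.0)
sig = ir_signal(ti, true_A, true_B, t1_star) + 0.01 * np.random.randn(len(ti))
print(f"estimated T1 ~ {fit_t1(ti, sig):.0f} ms")
```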

HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations

  • paper_url: http://arxiv.org/abs/2308.11261
  • repo_url: None
  • paper_authors: Sadegh Aliakbarian, Fatemeh Saleh, David Collier, Pashmina Cameron, Darren Cosker
  • for: 这个论文的目的是提出一种能够生成正确和可信的全身动作,即使只有部分手部可见,以提高混合现实场景中的吸引力。
  • methods: 该论文使用了一种名为 HMD-NeMo 的轻量级神经网络,以在线、实时的方式预测全身动作,并利用新颖的时间自适应掩码标记(mask token),在手部不可见时促使生成合理的动作。
  • results: 经过了广泛的分析和评估,该论文在AMASS数据集上达到了新的状态元。
    Abstract Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation.
    摘要 在混合现实场景中,生成既合理又准确的全身虚拟形象动作是沉浸式体验质量的关键。头戴式设备(HMD)通常只提供少量输入信号,例如头部和双手的六自由度位姿。近期,多种方法在仅依据头部和手部信号生成全身动作方面取得了令人印象深刻的效果。然而,据我们所知,现有方法都依赖于手部完全可见。在使用动作控制器时确实如此,但相当一部分混合现实体验并不使用动作控制器,而是依赖以自我为中心的手部追踪,这会由于 HMD 视场受限而带来手部仅部分可见的挑战。在本文中,我们提出了第一个统一的方法 HMD-NeMo,即使手部可能仅部分可见,也能生成合理且准确的全身动作。HMD-NeMo 是一个轻量级神经网络,能够以在线、实时的方式预测全身动作。其核心是带有新颖时间自适应掩码标记的时空编码器,可在缺少手部观测时促使生成合理的动作。我们对 HMD-NeMo 各组成部分的影响进行了深入分析,并在 AMASS 数据集上通过评估取得了新的最优结果。

Video BagNet: short temporal receptive fields increase robustness in long-term action recognition

  • paper_url: http://arxiv.org/abs/2308.11249
  • repo_url: https://github.com/ombretta/videobagnet
  • paper_authors: Ombretta Strafforello, Xin Liu, Klamer Schutte, Jan van Gemert
  • for: 提高视频动作识别模型的 robustness,使其能够更好地承受视频中的子动作顺序变化。
  • methods: 对于现有的深度3D convolutional模型,我们采用了限制时间响应领域大小的方法,从而实现了模型的时间响应领域的缩小。
  • results: 我们在合成和真实世界视频数据集上进行了实验,发现时间感受野较短的模型对子动作顺序变化更为鲁棒,而时间感受野较大的模型则对子动作顺序更为敏感。
    Abstract Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when testing videos have a different sub-action order. In this work, we investigate whether we can improve the model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order.
    摘要 以往的长时程视频动作识别工作依赖于具有较大时间感受野(RF)的深度 3D 卷积模型。我们认为这类模型并不总是视频时间建模的最佳选择:较大的时间感受野使模型能够编码视频中子动作的确切顺序,当测试视频的子动作顺序不同时,性能便会下降。在本工作中,我们研究能否通过缩小动作识别模型的时间感受野来提升模型对子动作顺序的鲁棒性。为此,我们设计了 Video BagNet,它是 3D ResNet-50 的一个变体,其时间感受野被限制为 1、9、17 或 33 帧。我们在合成和真实世界视频数据集上分析了 Video BagNet,并对不同时间感受野的模型进行了实验比较。我们发现,较短的时间感受野对子动作顺序的变化具有鲁棒性,而较大的时间感受野则对子动作顺序敏感。
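The temporal receptive field sizes quoted in the abstract follow from standard receptive-field arithmetic over the temporal kernels and strides of a 3D conv stack. The sketch below is generic; the kernel/stride lists are hypothetical configurations chosen only to reproduce RF sizes of 1, 9, and 17 frames, not the exact Video BagNet layer layout.

```python
def temporal_receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of temporal (1D) convolutions:
    RF = 1 + sum_i (k_i - 1) * prod_{j < i} s_j
    """
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# hypothetical temporal kernel/stride stacks
print(temporal_receptive_field([1, 1, 1, 1], [1, 1, 1, 1]))   # 1  frame
print(temporal_receptive_field([3, 3, 3, 3], [1, 1, 1, 2]))   # 9  frames
print(temporal_receptive_field([3, 3, 3, 3], [1, 2, 2, 2]))   # 17 frames
```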

Are current long-term video understanding datasets long-term?

  • paper_url: http://arxiv.org/abs/2308.11244
  • repo_url: https://github.com/ombretta/longterm_datasets
  • paper_authors: Ombretta Strafforello, Klamer Schutte, Jan van Gemert
  • for: This paper aims to evaluate the suitability of video datasets for long-term action recognition.
  • methods: The proposed method defines long-term actions as those that cannot be recognized using solely short-term information, and tests this definition on three popular real-world datasets.
  • results: The study finds that the existing datasets can be effectively solved using shortcuts based on short-term information, and encourages researchers to use datasets that require long-term information to be solved.
    Abstract Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate if such models actually learn and reason over long-term information. In this work, we propose a method to evaluate how suitable a video dataset is to evaluate models for long-term action recognition. To this end, we define a long-term action as excluding all the videos that can be correctly recognized using solely short-term information. We test this definition on existing long-term classification tasks on three popular real-world datasets, namely Breakfast, CrossTask and LVU, to determine if these datasets are truly evaluating long-term recognition. Our study reveals that these datasets can be effectively solved using shortcuts based on short-term information. Following this finding, we encourage long-term action recognition researchers to make use of datasets that need long-term information to be solved.
    摘要 很多现实世界应用,从运动分析到监测,都会受益于自动长期动作识别。在当前的深度学习框架中,自动动作识别模型的训练和测试通常是基于长期信息的。在这种情况下,我们提议一种方法来评估视频集是否适用于评估长期动作识别模型。为此,我们定义长期动作为排除所有可以通过短期信息来正确地识别的视频。我们在三个流行的实际世界数据集上进行测试,分别是Breakfast、CrossTask和LVU,以确定这些数据集是否真的评估长期认知。我们的研究发现,这些数据集可以通过快捷途径基于短期信息来解决。根据这个发现,我们鼓励长期动作识别研究人员使用需要长期信息来解决的数据集。

LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training

  • paper_url: http://arxiv.org/abs/2308.11239
  • repo_url: None
  • paper_authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Balaji Krishnamurthy
  • for: 本研究旨在无需人工监督下完成图像和视频数据集中的对象分割问题。
  • methods: 我们提出了一种自动化对象发现方法,利用运动和外观信息来生成高质量的对象分割面积。我们在传统图像树剖中添加了运动信息,并与外观信息进行线性组合来生成边权。
  • results: 我们的方法在多个标准视频对象分割、图像吸引力检测和对象分割 benchmark 上达到了与现状对照的性能。我们还通过自我训练来进一步提高性能。在审查实验中,我们的方法在未知领域中的转移性也得到了证明。
    Abstract Learning object segmentation in image and video datasets without human supervision is a challenging problem. Humans easily identify moving salient objects in videos using the gestalt principle of common fate, which suggests that what moves together belongs together. Building upon this idea, we propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. Specifically, we redesign the traditional graph cut on images to include motion information in a linear combination with appearance information to produce edge weights. Remarkably, this step produces object segmentation masks comparable to the current state-of-the-art on multiple benchmarks. To further improve performance, we bootstrap a segmentation network trained on these preliminary masks as pseudo-ground truths to learn from its own outputs via self-training. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks, achieving results on par with and, in many cases surpassing state-of-the-art methods. We also demonstrate the transferability of our approach to novel domains through a qualitative study on in-the-wild images. Additionally, we present extensive ablation analysis to support our design choices and highlight the contribution of each component of our proposed method.
    摘要 学习图像和视频集合中的对象分割无人监督是一个具有挑战性的问题。人类容易通过gestalt原则的共同命运来识别视频中移动的焦点对象,这个原则表明移动 вместе的对象属于同一个集合。基于这个想法,我们提出了一种自动化对象发现方法,利用运动和外观信息生成高质量的对象分割面积。 Specifically,我们重新设计了传统的图像截割方法,在线性组合中包括运动信息和外观信息来生成边重量。这一步生成的对象分割面积与当前状态的各种标准测试 benchmark 相当。为了进一步提高性能,我们使用这些初步的面积作为 pseudo-ground truth 来自我准备一个 segmentation 网络,并通过自我训练来学习自己的输出。我们命名这种方法为 LOCATE,并在多个标准视频对象分割、图像焦点检测和对象分割 bencmarks 上实现了与和超越当前状态的方法。我们还进行了质量研究,以证明我们的方法在新领域中的传输性。此外,我们还提供了广泛的减少分析,以支持我们的设计选择,并高亮每个方法的贡献。
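A minimal sketch of the edge-weight construction described above: each graph node is a patch, and the edge weight is a linear combination of appearance and motion affinities. A spectral (Fiedler-vector) bipartition stands in for the paper's graph cut; the mixing weight λ and the cosine affinities are assumptions for illustration.

```python
import numpy as np

def cosine_affinity(feats):
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return np.clip(f @ f.T, 0.0, 1.0)

def segment_nodes(appearance, flow, lam=0.5):
    """Edge weights = lam * appearance affinity + (1 - lam) * motion affinity,
    then the sign of the Fiedler vector splits nodes into two groups."""
    W = lam * cosine_affinity(appearance) + (1 - lam) * cosine_affinity(flow)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]                     # second-smallest eigenvector
    return (fiedler > 0).astype(int)

# toy scene: 6 patch nodes; the first 3 share appearance and motion, as do the last 3
app = np.vstack([np.tile([1., 0., 0., 0.], (3, 1)), np.tile([0., 1., 0., 0.], (3, 1))])
app += 0.05 * np.random.randn(6, 4)
flow = np.vstack([np.tile([1., 0.], (3, 1)), np.tile([0., 1.], (3, 1))])
flow += 0.05 * np.random.randn(6, 2)
print(segment_nodes(app, flow))     # e.g. [0 0 0 1 1 1] (label order may flip)
```

Nodes that move together end up strongly connected and fall on the same side of the cut, which is the "common fate" intuition the abstract builds on.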

Affordance segmentation of hand-occluded containers from exocentric images

  • paper_url: http://arxiv.org/abs/2308.11233
  • repo_url: None
  • paper_authors: Tommaso Apicella, Alessio Xompero, Edoardo Ragusa, Riccardo Berta, Andrea Cavallaro, Paolo Gastaldo
  • for: 本研究旨在解决手持物体上 occlusion 问题,提高机器人可视且抓取物体的可行性。
  • methods: 提议的模型使用辅助分支处理物体和手部分 separately,学习受 occlusion 影响的可行特征。
  • results: 实验表明,我们的模型在实际和混合现实图像上具有更好的可行分割和泛化能力,比现有模型更好。
    Abstract Visual affordance segmentation identifies the surfaces of an object an agent can interact with. Common challenges for the identification of affordances are the variety of the geometry and physical properties of these surfaces as well as occlusions. In this paper, we focus on occlusions of an object that is hand-held by a person manipulating it. To address this challenge, we propose an affordance segmentation model that uses auxiliary branches to process the object and hand regions separately. The proposed model learns affordance features under hand-occlusion by weighting the feature map through hand and object segmentation. To train the model, we annotated the visual affordances of an existing dataset with mixed-reality images of hand-held containers in third-person (exocentric) images. Experiments on both real and mixed-reality images show that our model achieves better affordance segmentation and generalisation than existing models.
    摘要 视觉可供性(affordance)分割用于识别物体上智能体可与之交互的表面。识别可供性的常见挑战包括这些表面在几何与物理属性上的多样性以及遮挡。本文关注被人手持并操作的物体所产生的遮挡问题。为应对这一挑战,我们提出了一种可供性分割模型,利用辅助分支分别处理物体区域和手部区域。该模型通过手部与物体分割结果对特征图进行加权,从而在手部遮挡下学习可供性特征。为了训练模型,我们在一个现有数据集的第三人称(外部视角)混合现实手持容器图像上标注了视觉可供性。在真实图像和混合现实图像上的实验表明,我们的模型在可供性分割和泛化能力上优于现有模型。

LDP-Feat: Image Features with Local Differential Privacy

  • paper_url: http://arxiv.org/abs/2308.11223
  • repo_url: None
  • paper_authors: Francesco Pittaluga, Bingbing Zhuang
  • for: 保护隐私,防止恶意攻击者通过图像特征恢复原始图像
  • methods: 将图像特征嵌入到一个同时包含原始特征与对抗特征样本的仿射子空间中以隐藏图像特征,并提出了两种新的反演攻击来揭示这种做法仍存在的隐私风险
  • results: 提出了首个基于本地差分隐私的图像特征私有化方法,与以往方法不同,它无论攻击强度如何都能提供有保证的隐私泄露上界,同时在视觉定位这一下游任务中保持了强劲的性能
    Abstract Modern computer vision services often require users to share raw feature descriptors with an untrusted server. This presents an inherent privacy risk, as raw descriptors may be used to recover the source images from which they were extracted. To address this issue, researchers recently proposed privatizing image features by embedding them within an affine subspace containing the original feature as well as adversarial feature samples. In this paper, we propose two novel inversion attacks to show that it is possible to (approximately) recover the original image features from these embeddings, allowing us to recover privacy-critical image content. In light of such successes and the lack of theoretical privacy guarantees afforded by existing visual privacy methods, we further propose the first method to privatize image features via local differential privacy, which, unlike prior approaches, provides a guaranteed bound for privacy leakage regardless of the strength of the attacks. In addition, our method yields strong performance in visual localization as a downstream task while enjoying the privacy guarantee.
    摘要 现代计算机视觉服务经常需要用户将原始特征描述分发到不可信服务器。这会导致隐私风险,因为原始特征可能可以用来恢复源图像。为解决这问题,研究人员最近提出了嵌入图像特征的私有化方法,以确保图像内容的隐私。在这篇论文中,我们提出了两种新的反向攻击,以示它们可以(相对)回归原始图像特征,从而恢复隐私关键的图像内容。另外,我们还提出了首个基于本地差分隐私的图像特征隐私方法,与先前的方法不同,它提供了隐私泄露的保证,无论攻击者的强度如何。此外,我们的方法在视觉本地化任务中具有强表现力,同时享有隐私保证。
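For intuition on why local differential privacy yields an attack-independent guarantee, here is the textbook Laplace mechanism applied to a feature descriptor: clipping bounds the sensitivity, and noise scaled by sensitivity/ε gives ε-LDP no matter what inversion attack the server runs. The clipping bound and ε are assumptions, and this is a generic sketch, not necessarily the paper's specific mechanism.

```python
import numpy as np

def privatize_descriptor(x, epsilon, clip_norm=1.0, rng=None):
    """epsilon-LDP release of a feature descriptor via the Laplace mechanism.

    The descriptor is first clipped to a fixed L1 norm, which bounds the
    sensitivity; Laplace noise with scale sensitivity / epsilon is then added.
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    l1 = np.abs(x).sum()
    if l1 > clip_norm:
        x = x * (clip_norm / l1)
    sensitivity = 2.0 * clip_norm      # max L1 distance between any two clipped inputs
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=x.shape)
    return x + noise

desc = np.random.randn(128)
released = privatize_descriptor(desc, epsilon=5.0)
print(released[:4])
```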

DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment

  • paper_url: http://arxiv.org/abs/2308.11206
  • repo_url: None
  • paper_authors: Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
  • for: 这个论文旨在提高模式融合和修改的方式,帮助时尚设计师更加方便地生成和修改他们的设计。
  • methods: 这个论文使用了 DiffCloth,一种基于扩散模型的管道,通过结构化地对齐跨模态语义来增强扩散模型在时尚领域的组合能力。
  • results: 在 CM-Fashion 基准上的实验表明,DiffCloth 既能取得最先进的服装生成效果,又能在保持区域一致性的同时支持灵活的编辑操作。
    Abstract Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces.Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate the part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (AP) and the visual garment parts which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention to preserve the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows for manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
    摘要 cross-modal 服装合成和修改将会对时尚设计师如何生成服装和修改他们的设计进行深见改进,通过灵活的语言接口。目前的方法采用通用的文本到图像模式,通过简单的跨模态关系抽象模块,忽略了时尚设计领域中的视觉表示结构匹配。在这个工作中,我们发展了DiffCloth,一种基于扩散的渠道管道,用于跨模态服装合成和修改,具有时尚领域的可 compose 性。具体来说,我们将部级跨模态匹配问题定义为语言特征短语(AP)和视觉服装部分之间的对应问题,并通过分词分析和 semantic segmentation 获得视觉服装部分。为了解决特征混淆问题,我们还提出了一种 semantic-bundled 跨注意力,以保持每个AP的注意力地图之间的空间结构相似性。此外,DiffCloth 支持通过简单地更换文本提示中的AP来进行 manipulate 操作,并且识别并保持不变的混合mask。广泛的实验表明,DiffCloth 可以利用内置的结构信息,同时支持灵活的 manipulate 操作,并保持区域一致性。
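The part-level cross-modal alignment described above is formulated as bipartite matching between attribute-phrases and garment parts. A minimal sketch of that matching step, with random embeddings standing in for the AP and part features:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_phrases_to_parts(phrase_emb, part_emb):
    """Part-level cross-modal alignment as bipartite matching.

    phrase_emb: (P, D) embeddings of attribute-phrases from the text prompt
    part_emb:   (Q, D) embeddings of segmented garment parts
    Returns index pairs (phrase_i, part_j) minimising 1 - cosine similarity.
    """
    a = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    b = part_emb / np.linalg.norm(part_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

phrases = np.random.randn(3, 64)     # e.g. "puff sleeves", "v-neck collar", ...
parts = np.random.randn(4, 64)       # sleeve, collar, body, hem regions
print(match_phrases_to_parts(phrases, parts))
```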

Masked Cross-image Encoding for Few-shot Segmentation

  • paper_url: http://arxiv.org/abs/2308.11201
  • repo_url: None
  • paper_authors: Wenbo Xu, Huaxi Huang, Ming Cheng, Litao Yu, Qiang Wu, Jian Zhang
  • for: 这个论文旨在提高几张支持图像上的几个类别的描述,使用少量标注图像进行推断。
  • methods: 该方法使用Masked Cross-Image Encoding(MCE)来捕捉对象细节的共同视觉特征,以及图像之间的相互依赖关系。
  • results: 实验表明,该方法在PASCAL-$5^i$和COCO-$20^i$中表现出色,可以快速学习新类别,并且对于描述对象细节的任务有进一步的改进。
    Abstract Few-shot segmentation (FSS) is a dense prediction task that aims to infer the pixel-wise labels of unseen classes using only a limited number of annotated images. The key challenge in FSS is to classify the labels of query pixels using class prototypes learned from the few labeled support exemplars. Prior approaches to FSS have typically focused on learning class-wise descriptors independently from support images, thereby ignoring the rich contextual information and mutual dependencies among support-query features. To address this limitation, we propose a joint learning method termed Masked Cross-Image Encoding (MCE), which is designed to capture common visual properties that describe object details and to learn bidirectional inter-image dependencies that enhance feature interaction. MCE is more than a visual representation enrichment module; it also considers cross-image mutual dependencies and implicit guidance. Experiments on FSS benchmarks PASCAL-$5^i$ and COCO-$20^i$ demonstrate the advanced meta-learning ability of the proposed method.
    摘要 小样本分割(FSS)是一种密集预测任务,旨在仅利用少量标注图像来推断未见类别的像素级标签。FSS 的关键挑战在于,如何利用从少量带标注支持样本中学习到的类别原型来对查询像素进行分类。以往的 FSS 方法通常只从支持图像中独立地学习各类别的描述子,忽略了支持—查询特征之间丰富的上下文信息和相互依赖关系。为了解决这一局限,我们提出了一种联合学习方法,称为掩码跨图像编码(MCE),旨在捕捉描述物体细节的共同视觉属性,并学习能增强特征交互的双向图像间依赖关系。MCE 不仅是一个视觉表示增强模块,它还考虑了跨图像的相互依赖与隐式引导。在 FSS 基准 PASCAL-$5^i$ 和 COCO-$20^i$ 上的实验展示了所提方法更强的元学习能力。

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views

  • paper_url: http://arxiv.org/abs/2308.11198
  • repo_url: None
  • paper_authors: Wentian Qu, Zhaopeng Cui, Yinda Zhang, Chenyu Meng, Cuixia Ma, Xiaoming Deng, Hongan Wang
  • for: 这个论文主要是关于手对象交互的理解和生成三维手对象交互的方法。
  • methods: 该论文提出了一种基于神经网络的渲染和姿态估计系统,用于从稀疏视图中理解手对象交互。该系统还可以实现3D手对象交互编辑。
  • results: 实验表明,该方法优于现有的最先进方法。
    Abstract Hand-object interaction understanding and the barely addressed novel view synthesis are highly desired in the immersive communication, whereas it is challenging due to the high deformation of hand and heavy occlusions between hand and object. In this paper, we propose a neural rendering and pose estimation system for hand-object interaction from sparse views, which can also enable 3D hand-object interaction editing. We share the inspiration from recent scene understanding work that shows a scene specific model built beforehand can significantly improve and unblock vision tasks especially when inputs are sparse, and extend it to the dynamic hand-object interaction scenario and propose to solve the problem in two stages. We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation at the offline stage. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction with the pre-built hand and object models as well as interaction priors, which thereby overcomes penetration and separation issues between hand and object and also enables novel view synthesis. In order to get stable contact during the hand-object interaction process in a sequence, we propose a stable contact loss to make the contact region to be consistent. Experiments demonstrate that our method outperforms the state-of-the-art methods. Code and dataset are available in project webpage https://iscas3dv.github.io/HO-NeRF.
    摘要 在沉浸式交流中,理解手与物体的交互以及尚少被研究的新视图合成都是非常需要的能力,但由于手部的高度形变以及手与物体之间严重的相互遮挡,这一问题颇具挑战性。本文提出了一种针对稀疏视角下手物交互的神经渲染与位姿估计系统,并且支持三维手物交互编辑。我们受近期场景理解工作的启发:预先构建的场景特定模型能够显著改善并解锁视觉任务,尤其是在输入稀疏的情况下;我们将这一思路扩展到动态手物交互场景,并提出分两个阶段求解。在离线阶段,我们利用神经表示分别学习手和物体的形状与外观先验知识。在在线阶段,我们设计了一个基于渲染的联合模型拟合框架,借助预先构建的手与物体模型以及交互先验来理解动态手物交互,从而克服手与物体之间的穿插和分离问题,并实现新视图合成。为了在交互序列中获得稳定的接触,我们提出了一种稳定接触损失,使接触区域保持一致。实验表明,我们的方法优于现有的最先进方法。代码和数据集见项目主页 https://iscas3dv.github.io/HO-NeRF。

Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.11186
  • repo_url: None
  • paper_authors: Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, Feng Zheng
  • for: 这篇论文旨在提出一种基于外部知识的提示调优方法,以提高视觉-语言模型的泛化能力。
  • methods: 该方法为文本编码器设计了两种互补的知识感知提示:离散提示从物体类别的描述中提取关键信息,可学习的连续提示捕捉整体上下文;此外还为视觉编码器设计了一个自适应头,用于聚合显著的注意力视觉线索,构建具有判别性和任务感知的视觉表示。
  • results: 在 11 个广泛使用的基准数据集上的大量实验表明,KAPT 方法在小样本图像分类、特别是对新类别的泛化上十分有效;与当前最优的 CoCoOp 方法相比,KAPT 在新类别上取得了 3.22% 的绝对增益,调和平均值提升 2.57%。
    Abstract Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great capacity of transfer learning. Recently, learnable prompts achieve state-of-the-art performance, which however are prone to overfit to seen classes, failing to generalize to unseen classes. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts the key information from descriptions of an object category, and the learned continuous prompt captures overall contexts. We further design an adaptation head for the visual encoder to aggregate salient attentive visual cues, which establishes discriminative and task-aware visual representations. We conduct extensive experiments on 11 widely-used benchmark datasets and the results verify the effectiveness in few-shot image classification, especially in generalizing to unseen categories. Compared with the state-of-the-art CoCoOp method, KAPT exhibits favorable performance and achieves an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean.
    摘要 预训练的视觉-语言模型(如 CLIP)配合人工设计的提示词已经展现出强大的迁移学习能力。近期,可学习的提示取得了最先进的性能,但它们容易对已见类别过拟合,难以泛化到未见类别。在本文中,我们为视觉-语言模型提出了一个知识感知提示调优(KAPT)框架。我们的方法借鉴了人类智能:在识别新的物体类别时,人们通常会利用外部知识。具体来说,我们为文本编码器设计了两种互补的知识感知提示,以利用与类别相关的外部知识的不同特性:离散提示从物体类别的描述中提取关键信息,可学习的连续提示则捕捉整体上下文。我们还为视觉编码器设计了一个自适应头,用于聚合显著的注意力视觉线索,从而构建具有判别性和任务感知的视觉表示。我们在 11 个广泛使用的基准数据集上进行了大量实验,结果验证了该方法在小样本图像分类、特别是对未见类别泛化上的有效性。与当前最优的 CoCoOp 方法相比,KAPT 表现更优,在新类别上取得 3.22% 的绝对增益,调和平均值提升 2.57%。

MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation

  • paper_url: http://arxiv.org/abs/2308.11185
  • repo_url: None
  • paper_authors: Najmeh Sadoughi, Xinyu Li, Avijit Vajpayee, David Fan, Bing Shuai, Hector Santos-Villalobos, Vimal Bhat, Rohith MV
  • for: 本研究旨在提高长形视频(>60分钟)的场景和剧情槽分 segmentation 效果。
  • methods: 本研究提出了一种基于多媒体Modalities的 Multimodal alignmEnt aGgregation and distillAtion(MEGA)方法,通过多种输入的粗略对应和模式维度的嵌入来解决长形视频的同步问题。
  • results: 实验结果表明,MEGA方法在 MovieNet 数据集上场景分 segmentation Task 上提高了+1.19%的平均准确率,并在 TRIPOD 数据集上剧情分 segmentation Task 上提高了+5.51%的总一致率。
    Abstract Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19%) and on TRIPOD dataset for act segmentation (with a Total Agreement improvement of +5.51%)
    摘要 以往的研究分别探讨了将电影视频切分为场景以及切分为叙事幕(act)的任务,但忽略了对长视频(超过 60 分钟)进行高效处理所必需的多模态对齐与融合。本文提出了用于电影长视频分割的多模态对齐聚合与蒸馏方法(MEGA)。MEGA 借助多种媒体模态来应对这一挑战:它利用对齐位置编码对长度不一、模态不同的输入进行粗粒度对齐;为了在降低计算量的同时保持时间同步,我们进一步引入了利用时间对齐的增强瓶颈融合层。此外,MEGA 采用一种新颖的对比损失在模态间同步并迁移标签,从而能够利用带标注的剧情概要句子对视频镜头进行幕切分。实验结果表明,MEGA 在 MovieNet 数据集的场景分割任务上优于现有最优方法(平均精度提升 1.19%),在 TRIPOD 数据集的幕分割任务上也优于现有方法(总体一致率提升 5.51%)。

ReFit: Recurrent Fitting Network for 3D Human Recovery

  • paper_url: http://arxiv.org/abs/2308.11184
  • repo_url: https://github.com/yufu-wang/ReFit
  • paper_authors: Yufu Wang, Kostas Daniilidis
  • for: 本研究旨在提出一种基于神经网络的单影像参数3D人体重建方法,以解决人体重建问题。
  • methods: 本方法使用反馈更新循环,通过在每个迭代步骤中将人体模型中的关键点投影到特征图中,并使用回归型更新器来调整模型以更好地适应图像。
  • results: 本研究表明,使用反馈更新循环可以更快速地训练神经网络模型,同时提高了标准benchmark测试数据集上的性能。此外,本方法还可以应用于多视图适应和单视图形状适应等其他优化设定。
    Abstract We present Recurrent Fitting (ReFit), a neural network architecture for single-image, parametric 3D human reconstruction. ReFit learns a feedback-update loop that mirrors the strategy of solving an inverse problem through optimization. At each iterative step, it reprojects keypoints from the human model to feature maps to query feedback, and uses a recurrent-based updater to adjust the model to fit the image better. Because ReFit encodes strong knowledge of the inverse problem, it is faster to train than previous regression models. At the same time, ReFit improves state-of-the-art performance on standard benchmarks. Moreover, ReFit applies to other optimization settings, such as multi-view fitting and single-view shape fitting. Project website: https://yufu-wang.github.io/refit_humans/
    摘要 我们提出了循环拟合网络(ReFit),一种用于单图像参数化三维人体重建的神经网络架构。ReFit 学习一个反馈—更新循环,对应于通过优化求解反问题的策略。在每一步迭代中,它将人体模型的关键点重投影到特征图上以查询反馈信息,并使用基于循环网络的更新器调整模型,使其更好地拟合图像。由于 ReFit 编码了关于反问题的强先验知识,其训练速度快于以往的回归模型,同时在标准基准上取得了最先进的性能。此外,ReFit 还适用于其他优化设置,例如多视角拟合和单视角形状拟合。项目主页:https://yufu-wang.github.io/refit_humans/
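A schematic of the feedback-update loop described above: reproject keypoints from the current parameters, sample the feature map at those locations, and let a recurrent updater regress a residual parameter update. The linear "projection", tensor dimensions, and GRU size are toy stand-ins for illustration, not the released ReFit code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentFitter(nn.Module):
    """Schematic of a ReFit-style feedback-update loop."""
    def __init__(self, num_kpts=17, feat_dim=32, param_dim=85, hidden=256):
        super().__init__()
        self.num_kpts = num_kpts
        # toy stand-in for body model + camera: parameters -> 2D keypoints in [-1, 1]
        self.toy_project = nn.Linear(param_dim, num_kpts * 2)
        self.gru = nn.GRUCell(num_kpts * feat_dim, hidden)
        self.head = nn.Linear(hidden, param_dim)

    def forward(self, feat_map, params, iters=3):
        b = feat_map.shape[0]
        h = feat_map.new_zeros(b, self.gru.hidden_size)
        for _ in range(iters):
            # reproject keypoints and query image features at their locations
            kpts = torch.tanh(self.toy_project(params)).view(b, self.num_kpts, 1, 2)
            feedback = F.grid_sample(feat_map, kpts, align_corners=False)  # (B, C, K, 1)
            h = self.gru(feedback.flatten(1), h)
            params = params + self.head(h)      # residual update, as in iterative fitting
        return params

feat = torch.randn(2, 32, 56, 56)       # backbone feature map
init = torch.zeros(2, 85)               # e.g. flattened pose/shape/camera initialization
print(RecurrentFitter()(feat, init).shape)   # torch.Size([2, 85])
```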

A three in one bottom-up framework for simultaneous semantic segmentation, instance segmentation and classification of multi-organ nuclei in digital cancer histology

  • paper_url: http://arxiv.org/abs/2308.11179
  • repo_url: None
  • paper_authors: Ibtihaj Ahmad, Syed Muhammad Israr, Zain Ul Islam
  • for: 这篇论文研究数字病理组织学中细胞核的同时分割与分类,它们在计算机辅助癌症诊断中起着关键作用,但仍然是一个具有挑战性的问题。
  • methods: 论文采用多头解码器结构,各个头具有独立加权的损失函数,分别生成语义分割、边缘提议和分类图,再对这些输出进行后处理,得到最终的细胞核实例分割与分类结果。
  • results: 该方法实现了高质量的细胞核分割与分类:语义分割的 Dice 分数为 0.841,实例分割的 bPQ 分数为 0.713,细胞核分类的 mPQ 分数为 0.633;并且该方法在 19 种组织上均表现出良好的泛化能力。
    Abstract Simultaneous segmentation and classification of nuclei in digital histology play an essential role in computer-assisted cancer diagnosis; however, it remains challenging. The highest achieved binary and multi-class Panoptic Quality (PQ) remains as low as 0.68 bPQ and 0.49 mPQ, respectively. It is due to the higher staining variability, variability across the tissue, rough clinical conditions, overlapping nuclei, and nuclear class imbalance. The generic deep-learning methods usually rely on end-to-end models, which fail to address these problems associated explicitly with digital histology. In our previous work, DAN-NucNet, we resolved these issues for semantic segmentation with an end-to-end model. This work extends our previous model to simultaneous instance segmentation and classification. We introduce additional decoder heads with independent weighted losses, which produce semantic segmentation, edge proposals, and classification maps. We use the outputs from the three-head model to apply post-processing to produce the final segmentation and classification. Our multi-stage approach utilizes edge proposals and semantic segmentations compared to direct segmentation and classification strategies followed by most state-of-the-art methods. Due to this, we demonstrate a significant performance improvement in producing high-quality instance segmentation and nuclei classification. We have achieved a 0.841 Dice score for semantic segmentation, 0.713 bPQ scores for instance segmentation, and 0.633 mPQ for nuclei classification. Our proposed framework is generalized across 19 types of tissues. Furthermore, the framework is less complex compared to the state-of-the-art.
    摘要 同时分割和分类肿瘤细胞在数字 histology 中扮演着重要的角色,但是它仍然是一个挑战。最高的二分和多分类 Panoptic Quality (PQ) 只有0.68 bPQ和0.49 mPQ,这是因为细胞染色的变化、组织内部的变化、较为恶劣的临床条件、重叠的细胞和核类别偏度。通用的深度学习方法通常采用端到端模型,这些模型无法直接地解决数字 histology 中相关的问题。在我们的前一个工作中,我们提出了 DAN-NucNet 模型,解决了这些问题。本文将 extend 我们的前一个模型,以实现同时的实例分割和分类。我们添加了多个解码头,每个解码头都有独立的权重损失,它们生成的结果包括semantic segmentation、edge proposal 和分类图像。我们使用这些三个头的输出来进行后处理,以生成最终的分割和分类结果。我们的多阶段方法利用了 edge proposal 和 semantic segmentation,而不是直接进行分割和分类的方法,这使得我们可以提高高质量的实例分割和核类分类。我们在19种组织中实现了0.841 Dice 分割率、0.713 bPQ 分割率和0.633 mPQ 分类率。我们提出的框架比现有的状态之一更加简单。

Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration

  • paper_url: http://arxiv.org/abs/2308.11165
  • repo_url: None
  • paper_authors: Di Wang, Jinyuan Liu, Long Ma, Risheng Liu, Xin Fan
  • for: addressing the challenges of misalignments between multi-modality images in image fusion
  • methods: 一种Cross-modality Multi-scale Progressive Dense Registration (C-MPDR) scheme, which uses a one-stage optimization to improve the fusion performance of misaligned multi-modality images
  • results: 提高了多模态图像匹配的混合性能
    Abstract Misalignments between multi-modality images pose challenges in image fusion, manifesting as structural distortions and edge ghosts. Existing efforts commonly resort to registering first and fusing later, typically employing two cascaded stages for registration,i.e., coarse registration and fine registration. Both stages directly estimate the respective target deformation fields. In this paper, we argue that the separated two-stage registration is not compact, and the direct estimation of the target deformation fields is not accurate enough. To address these challenges, we propose a Cross-modality Multi-scale Progressive Dense Registration (C-MPDR) scheme, which accomplishes the coarse-to-fine registration exclusively using a one-stage optimization, thus improving the fusion performance of misaligned multi-modality images. Specifically, two pivotal components are involved, a dense Deformation Field Fusion (DFF) module and a Progressive Feature Fine (PFF) module. The DFF aggregates the predicted multi-scale deformation sub-fields at the current scale, while the PFF progressively refines the remaining misaligned features. Both work together to accurately estimate the final deformation fields. In addition, we develop a Transformer-Conv-based Fusion (TCF) subnetwork that considers local and long-range feature dependencies, allowing us to capture more informative features from the registered infrared and visible images for the generation of high-quality fused images. Extensive experimental analysis demonstrates the superiority of the proposed method in the fusion of misaligned cross-modality images.
    摘要 多模态图像之间的错位给图像融合带来了挑战,表现为结构畸变和边缘伪影。现有工作通常采用"先配准、后融合"的思路,并且一般把配准分为粗配准和精配准两个级联阶段,由这两个阶段分别直接估计各自的目标形变场。本文认为,分离的两阶段配准不够紧凑,而直接估计目标形变场也不够准确。为解决这些问题,我们提出了跨模态多尺度渐进密集配准(C-MPDR)方案,仅通过单阶段优化即可完成由粗到精的配准,从而提升错位多模态图像的融合性能。该方案包含两个关键组件:密集形变场融合(DFF)模块和渐进特征精化(PFF)模块。DFF 在当前尺度上聚合预测得到的多尺度形变子场,PFF 则逐步修正残余的错位特征,二者协同工作以准确估计最终的形变场。此外,我们还设计了一个基于 Transformer-卷积的融合(TCF)子网络,兼顾局部与长程特征依赖,从配准后的红外与可见光图像中提取更具信息量的特征,以生成高质量的融合图像。大量实验分析表明,所提方法在错位跨模态图像融合任务上具有优越性。

Decoupled Contrastive Multi-view Clustering with High-order Random Walks

  • paper_url: http://arxiv.org/abs/2308.11164
  • repo_url: None
  • paper_authors: Yiding Lu, Yijie Lin, Mouxing Yang, Dezhong Peng, Peng Hu, Xi Peng
  • for: 提高多视图集群(MvC)的稳定性和抗衰落性,解决false negative和false positive问题。
  • methods: 提出了一种新的多视图集群方法(DIVIDE),通过随机游走进行全球性地标识数据对,从而可以正确地标识内部负样本和外部正样本。同时,DIVIDE采用了一种新的多视图集群架构,在不同的嵌入空间进行对比学习,以提高集群性能和抗衰落性。
  • results: 通过对四个标准测试集进行广泛的实验,证明了DIVIDE在完整的MvC设定下和缺失视图的MvC设定下都有较高的性能,并且在不同的缺失视图情况下保持稳定性。
    Abstract In recent, some robust contrastive multi-view clustering (MvC) methods have been proposed, which construct data pairs from neighborhoods to alleviate the false negative issue, i.e., some intra-cluster samples are wrongly treated as negative pairs. Although promising performance has been achieved by these methods, the false negative issue is still far from addressed and the false positive issue emerges because all in- and out-of-neighborhood samples are simply treated as positive and negative, respectively. To address the issues, we propose a novel robust method, dubbed decoupled contrastive multi-view clustering with high-order random walks (DIVIDE). In brief, DIVIDE leverages random walks to progressively identify data pairs in a global instead of local manner. As a result, DIVIDE could identify in-neighborhood negatives and out-of-neighborhood positives. Moreover, DIVIDE embraces a novel MvC architecture to perform inter- and intra-view contrastive learning in different embedding spaces, thus boosting clustering performance and embracing the robustness against missing views. To verify the efficacy of DIVIDE, we carry out extensive experiments on four benchmark datasets comparing with nine state-of-the-art MvC methods in both complete and incomplete MvC settings.
    摘要 近年来，一些鲁棒的对比式多视图聚类（MvC）方法相继被提出，它们从邻域中构建数据对，以缓解假负例问题，即部分同簇样本被错误地当作负对。尽管这些方法取得了可观的性能，但假负例问题仍远未解决，同时又引入了假正例问题，因为邻域内与邻域外的样本被简单地分别全部视为正例和负例。为了解决上述问题，我们提出了一种新的鲁棒方法——基于高阶随机游走的解耦对比多视图聚类（DIVIDE）。简而言之，DIVIDE 利用随机游走以全局而非局部的方式逐步识别数据对，因而能够识别邻域内的负例与邻域外的正例。此外，DIVIDE 采用了一种新的 MvC 架构，在不同的嵌入空间中分别进行视图间与视图内的对比学习，从而提升聚类性能并增强对缺失视图的鲁棒性。为验证 DIVIDE 的有效性，我们在四个基准数据集上与九种最先进的 MvC 方法在完整与不完整 MvC 设定下进行了大量实验。
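
The abstract above leaves the random-walk step abstract. The snippet below is a minimal sketch, not the authors' implementation, of how multi-step random walks on a cosine-similarity graph can score pairs globally: the function name `high_order_affinity`, the decay factor, and the thresholding are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def high_order_affinity(features, n_steps=4, decay=0.5, temperature=0.1):
    """Accumulate multi-step random-walk transition probabilities on a
    cosine-similarity graph, so that pairs connected through many short paths
    receive a high score even when they are not direct neighbors."""
    z = F.normalize(features, dim=1)                       # (N, D) unit-norm embeddings
    sim = (z @ z.t()) / temperature                        # pairwise similarities
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    P = sim.masked_fill(eye, float("-inf")).softmax(dim=1) # row-stochastic, no self-loops
    walk, acc = P.clone(), P.clone()
    for t in range(1, n_steps):                            # accumulate P + a*P^2 + ...
        walk = walk @ P
        acc = acc + (decay ** t) * walk
    return acc / acc.sum(dim=1, keepdim=True)              # renormalize rows

# Toy usage: pairs with high accumulated transition probability can be treated
# as positives, low-probability in-neighborhood pairs as negatives.
feats = torch.randn(32, 64)
A = high_order_affinity(feats)
positives = A > (1.5 / feats.size(0))                      # illustrative threshold only
```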

A Preliminary Investigation into Search and Matching for Tumour Discrimination in WHO Breast Taxonomy Using Deep Networks

  • paper_url: http://arxiv.org/abs/2308.11162
  • repo_url: None
  • paper_authors: Abubakr Shafique, Ricardo Gonzalez, Liron Pantanowitz, Puay Hoon Tan, Alberto Machado, Ian A Cree, Hamid R. Tizhoosh
  • for: 这项研究旨在开发一个基于深度学习的搜索able数字Atlas,用于帮助病理学家对悉数据库中的罕见癌症进行查找和匹配。
  • methods: 该研究使用了一个国际知名的TCGA数据库,并使用了一个国际顶尖的深度学习模型,对 millions of diagnostic histopathology images进行了预训练。然后,对WHO乳腺分类系统(第5版)中的35种肿瘤类型进行了索引和分析,并使用了深度特征来Visualize所有肿瘤类型。
  • results: 该研究发现,使用深度学习模型对WHO乳腺分类系统数据进行索引和分析,可以达到88%的准确率,并且使用top-n肿瘤类型进行验证可以达到91%的准确率。这些结果表明,使用索引数字Archive可以investigate complex relationships among common and rare breast lesions。
    Abstract Breast cancer is one of the most common cancers affecting women worldwide. They include a group of malignant neoplasms with a variety of biological, clinical, and histopathological characteristics. There are more than 35 different histological forms of breast lesions that can be classified and diagnosed histologically according to cell morphology, growth, and architecture patterns. Recently, deep learning, in the field of artificial intelligence, has drawn a lot of attention for the computerized representation of medical images. Searchable digital atlases can provide pathologists with patch matching tools allowing them to search among evidently diagnosed and treated archival cases, a technology that may be regarded as computational second opinion. In this study, we indexed and analyzed the WHO breast taxonomy (Classification of Tumours 5th Ed.) spanning 35 tumour types. We visualized all tumour types using deep features extracted from a state-of-the-art deep learning model, pre-trained on millions of diagnostic histopathology images from the TCGA repository. Furthermore, we test the concept of a digital "atlas" as a reference for search and matching with rare test cases. The patch similarity search within the WHO breast taxonomy data reached over 88% accuracy when validating through "majority vote" and more than 91% accuracy when validating using top-n tumour types. These results show for the first time that complex relationships among common and rare breast lesions can be investigated using an indexed digital archive.
    摘要 乳腺癌是全球女性中最常见的恶性肿瘤之一，包含一组在生物学、临床和组织病理学特征上各不相同的恶性肿瘤。根据细胞形态、生长方式和结构模式，可对超过 35 种乳腺病变进行组织学分类与诊断。近年来，人工智能领域的深度学习在医学图像的计算机化表示方面受到广泛关注。可检索的数字图谱可以为病理学家提供图块匹配工具，使其能够在已明确诊断和治疗的存档病例中进行检索，这类技术可被视为一种“计算式第二意见”。在本研究中，我们对 WHO 乳腺肿瘤分类（第 5 版）涵盖的 35 种肿瘤类型进行了索引与分析，并使用在 TCGA 库中数百万张诊断组织病理图像上预训练的先进深度学习模型提取深度特征，对所有肿瘤类型进行可视化。此外，我们还测试了将数字“图谱”作为罕见测试病例检索与匹配参考的概念。在 WHO 乳腺分类数据中进行图块相似性检索时，采用“多数投票”验证的准确率超过 88%，采用 top-n 肿瘤类型验证的准确率超过 91%。这些结果首次表明，可以利用索引化的数字档案库来研究常见与罕见乳腺病变之间的复杂关系。
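
As a rough illustration of the patch search and "majority vote" validation described above, the sketch below runs a cosine nearest-neighbour search over pre-extracted deep features and takes the top-k majority label; the function name, feature dimensionality, and data layout are assumptions, not the study's pipeline.

```python
import numpy as np
from collections import Counter

def majority_vote_search(query_feats, archive_feats, archive_labels, k=5):
    """Nearest-neighbour search over deep patch features, with the predicted
    tumour type taken as the majority label of the top-k most similar archive
    patches (cosine similarity)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    a = archive_feats / np.linalg.norm(archive_feats, axis=1, keepdims=True)
    sims = q @ a.T                                    # (n_query, n_archive)
    preds = []
    for row in sims:
        topk = np.argsort(-row)[:k]                   # indices of most similar patches
        votes = Counter(archive_labels[i] for i in topk)
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)

# Toy usage with random stand-ins for deep features of WHO-taxonomy patches.
rng = np.random.default_rng(0)
archive = rng.normal(size=(500, 1024))
labels = rng.integers(0, 35, size=500)                # 35 tumour types
queries = rng.normal(size=(10, 1024))
print(majority_vote_search(queries, archive, labels, k=5))
```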

SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection

  • paper_url: http://arxiv.org/abs/2308.11159
  • repo_url: None
  • paper_authors: Dalong Zheng, Zebin Wu, Jia Liu, Zhihui Wei
  • for: 本研究旨在提出一种端到端的复合密集网络（SwinV2DNet），综合继承 transformer 与 CNN 的优点，克服现有网络在特征学习上的不足。
  • methods: 该网络由 Swin V2 与 VGG16 两部分组成：通过密集连接的 Swin V2 主干捕捉变化关系特征，由 CNN 分支提供低层的变化前后特征，并借助混合特征金字塔（MFP）提供层间交互信息与层内多尺度信息，实现完整的特征学习。
  • results: 在四个常用的公共遥感数据集上，该网络取得了 state-of-the-art 的变化检测分数与更精细的变化图，并通过自监督策略缓解了 CNN 分支的训练问题。
    Abstract Among the current mainstream change detection networks, transformer is deficient in the ability to capture accurate low-level details, while convolutional neural network (CNN) is wanting in the capacity to understand global information and establish remote spatial relationships. Meanwhile, both of the widely used early fusion and late fusion frameworks are not able to well learn complete change features. Therefore, based on swin transformer V2 (Swin V2) and VGG16, we propose an end-to-end compounded dense network SwinV2DNet to inherit the advantages of both transformer and CNN and overcome the shortcomings of existing networks in feature learning. Firstly, it captures the change relationship features through the densely connected Swin V2 backbone, and provides the low-level pre-changed and post-changed features through a CNN branch. Based on these three change features, we accomplish accurate change detection results. Secondly, combined with transformer and CNN, we propose mixed feature pyramid (MFP) which provides inter-layer interaction information and intra-layer multi-scale information for complete feature learning. MFP is a plug and play module which is experimentally proven to be also effective in other change detection networks. Further more, we impose a self-supervision strategy to guide a new CNN branch, which solves the untrainable problem of the CNN branch and provides the semantic change information for the features of encoder. The state-of-the-art (SOTA) change detection scores and fine-grained change maps were obtained compared with other advanced methods on four commonly used public remote sensing datasets. The code is available at https://github.com/DalongZ/SwinV2DNet.
    摘要 当前主流的变化检测网络中,变换器缺乏捕捉准确的低级细节的能力,而卷积神经网络(CNN)缺乏建立远程空间关系和全局信息的能力。同时,现有的早期融合和晚期融合框架都不能良好地学习完整的变化特征。因此,基于Swin transformer V2(Swin V2)和VGG16,我们提出了一种端到端融合密集网络SwinV2DNet,继承了变换器和CNN的优点,并超越了现有网络在特征学习方面的缺陷。首先,SwinV2DNet通过密集连接的Swin V2脊梁捕捉变化关系特征,并提供低级预变和后变特征通过一个CNN分支。基于这三个变化特征,我们实现了准确的变化检测结果。其次,我们提出了混合特征阶梯(MFP),该模块通过卷积神经网络和变换器的结合,为完整的特征学习提供了间层交互信息和多尺度内层信息。MFP是一个可插入的模块,实验证明其效果也可以应用于其他变化检测网络。此外,我们对新的CNN分支进行自我超vision策略,解决了CNN分支的训练不可能问题,并为encoder的特征提供了semantic变化信息。与其他先进方法相比,我们在四个常用的公共遥感数据集上获得了状态前的变化检测分数和细化变化地图。代码可以在https://github.com/DalongZ/SwinV2DNet上获取。

Domain Generalization via Rationale Invariance

  • paper_url: http://arxiv.org/abs/2308.11158
  • repo_url: https://github.com/liangchen527/ridg
  • paper_authors: Liang Chen, Yong Zhang, Yibing Song, Anton van den Hengel, Lingqiao Liu
  • for: 提高领域总结的稳定性,以便在未见 environments 中维持良好的结果。
  • methods: 通过关注最终分类层的决策过程来应对领域泛化挑战。具体来说，我们提议将每个样本对最终结果的逐元素贡献视为决策依据（rationale），并将其表示为一个矩阵。为确保模型具有良好的泛化能力，我们要求同一类别样本的依据矩阵彼此相似，这表明模型依赖领域不变的线索做出决策。
  • results: 实验表明，通过引入 rationale invariance loss（仅需数行代码即可实现）来落实这一思想，尽管方法简单，仍能在多个数据集上取得具有竞争力的结果。代码可以在 \url{https://github.com/liangchen527/RIDG} 上找到。
    Abstract This paper offers a new perspective to ease the challenge of domain generalization, which involves maintaining robust results even in unseen environments. Our design focuses on the decision-making process in the final classifier layer. Specifically, we propose treating the element-wise contributions to the final results as the rationale for making a decision and representing the rationale for each sample as a matrix. For a well-generalized model, we suggest the rationale matrices for samples belonging to the same category should be similar, indicating the model relies on domain-invariant clues to make decisions, thereby ensuring robust results. To implement this idea, we introduce a rationale invariance loss as a simple regularization technique, requiring only a few lines of code. Our experiments demonstrate that the proposed approach achieves competitive results across various datasets, despite its simplicity. Code is available at \url{https://github.com/liangchen527/RIDG}.
    摘要 本文为缓解领域泛化（即在未见环境中仍保持稳健结果）这一挑战提供了新的视角。我们的设计聚焦于最终分类层的决策过程：将各元素对最终结果的逐元素贡献视为做出决策的依据（rationale），并将每个样本的依据表示为一个矩阵。对于泛化良好的模型，同一类别样本的依据矩阵应当相似，表明模型依赖领域不变的线索进行决策，从而保证结果的稳健性。为实现这一思想，我们引入了依据不变性损失（rationale invariance loss）作为一种简单的正则化手段，仅需数行代码即可实现。实验表明，尽管方法简单，所提方案仍在多个数据集上取得了具有竞争力的结果。代码见 \url{https://github.com/liangchen527/RIDG}。
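
Since the abstract describes the rationale as the element-wise contributions to the final classifier output and the regularizer as only a few lines of code, the following is a hedged sketch of one way such a rationale-invariance loss could look in PyTorch; the running class-mean update, momentum, and loss weight are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class RationaleInvarianceLoss(nn.Module):
    """Sketch of a rationale-invariance regularizer: the 'rationale' of a sample
    is the element-wise contribution of its penultimate features to the logit of
    its ground-truth class, and rationales of the same class are pulled towards
    a running class-mean rationale."""
    def __init__(self, num_classes, feat_dim, momentum=0.9):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("mean_rationale", torch.zeros(num_classes, feat_dim))

    def forward(self, features, classifier_weight, labels):
        # features: (B, D), classifier_weight: (C, D), labels: (B,)
        rationale = features * classifier_weight[labels]   # per-element logit contributions
        with torch.no_grad():                               # update running class means
            for c in labels.unique():
                batch_mean = rationale[labels == c].mean(dim=0)
                self.mean_rationale[c] = (self.momentum * self.mean_rationale[c]
                                          + (1 - self.momentum) * batch_mean)
        return ((rationale - self.mean_rationale[labels]) ** 2).mean()

# Usage alongside the usual cross-entropy on the final linear classifier.
feat, w, y = torch.randn(8, 128), torch.randn(10, 128), torch.randint(0, 10, (8,))
reg = RationaleInvarianceLoss(num_classes=10, feat_dim=128)
loss = nn.functional.cross_entropy(feat @ w.t(), y) + 0.1 * reg(feat, w, y)
```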

TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes

  • paper_url: http://arxiv.org/abs/2308.11157
  • repo_url: https://github.com/holmescao/TOPICTrack
  • paper_authors: Xiaoyan Cao, Yiyao Zheng, Yao Yao, Huapeng Qin, Xiaoyu Cao, Shihui Guo
  • for: 这篇论文提出了一个新的多目标跟踪（MOT）数据集 BEE23，以解决现有数据集忽略复杂运动模式的问题。
  • methods: 论文提出了一种并行关联范式，并给出其实现——两轮并行匹配机制（TOPIC）：同时利用运动与外观特征，并根据运动水平自适应地选择分配度量；此外还提出了基于注意力的外观重建模块（AARM），以增强外观特征的表示。
  • results: 该方法在四个公开数据集和新提出的 BEE23 数据集上取得了领先的性能；与单特征关联范式相比，所提并行范式可将漏检（false negatives）减少 12% 至 51%。
    Abstract Video data and algorithms have been driving advances in multi-object tracking (MOT). While existing MOT datasets focus on occlusion and appearance similarity, complex motion patterns are widespread yet overlooked. To address this issue, we introduce a new dataset called BEE23 to highlight complex motions. Identity association algorithms have long been the focus of MOT research. Existing trackers can be categorized into two association paradigms: single-feature paradigm (based on either motion or appearance feature) and serial paradigm (one feature serves as secondary while the other is primary). However, these paradigms are incapable of fully utilizing different features. In this paper, we propose a parallel paradigm and present the Two rOund Parallel matchIng meChanism (TOPIC) to implement it. The TOPIC leverages both motion and appearance features and can adaptively select the preferable one as the assignment metric based on motion level. Moreover, we provide an Attention-based Appearance Reconstruct Module (AARM) to reconstruct appearance feature embeddings, thus enhancing the representation of appearance features. Comprehensive experiments show that our approach achieves state-of-the-art performance on four public datasets and BEE23. Notably, our proposed parallel paradigm surpasses the performance of existing association paradigms by a large margin, e.g., reducing false negatives by 12% to 51% compared to the single-feature association paradigm. The introduced dataset and association paradigm in this work offers a fresh perspective for advancing the MOT field. The source code and dataset are available at https://github.com/holmescao/TOPICTrack.
    摘要 视频数据和算法在多对目标跟踪(MOT)领域取得了重大进步。现有的MOT数据集集中焦点在 occlusion 和外观相似性上,然而复杂的运动模式却被忽略。为了解决这个问题,我们提出了一个新的数据集called BEE23,以强调复杂的运动。目标跟踪算法的研究总是围绕着 Identity association 问题进行,现有的跟踪器可以分为两种联系思维:单一特征思维(基于 either motion 或 appearance feature)以及串行思维(一个特征服务为次要,另一个特征服务为主要)。然而,这些思维方法无法完全利用不同的特征。在这篇论文中,我们提议了并行联系思维,并通过 Two rOund Parallel matchIng meChanism(TOPIC)来实现。TOPIC 利用了运动和外观特征,并可以根据运动水平选择适合的一个作为分配度量。此外,我们还提供了 Attention-based Appearance Reconstruct Module(AARM)来重建外观特征嵌入,从而提高外观特征的表示。我们对四个公共数据集和 BEE23 进行了广泛的实验,结果显示我们的方法在这些数据集上达到了当前最佳性能。尤其是,我们提出的并行联系思维在现有的联系思维方法之上减少了12%至51%的假阳性。在这篇论文中,我们还提供了一个新的数据集和联系思维方法,这将为 MOT 领域带来新的视角,并且代码和数据集可以在 上获取。

High Dynamic Range Imaging of Dynamic Scenes with Saturation Compensation but without Explicit Motion Compensation

  • paper_url: http://arxiv.org/abs/2308.11140
  • repo_url: https://github.com/haesoochung/hdri-saturation-compensation
  • paper_authors: Haesoo Chung, Nam Ik Cho
  • for: 提高高动态范围(HDR)图像的获得和修复,解决由相机传感器的限制导致的信息损失问题。
  • methods: 使用改进的运动补偿和灰度调整问题的解决方案,通过 Contextual attention 技术来修复过度曝光区域。
  • results: 比对 state-of-the-art 方法,示出了更高的质量和量化评价结果。
    Abstract High dynamic range (HDR) imaging is a highly challenging task since a large amount of information is lost due to the limitations of camera sensors. For HDR imaging, some methods capture multiple low dynamic range (LDR) images with altering exposures to aggregate more information. However, these approaches introduce ghosting artifacts when significant inter-frame motions are present. Moreover, although multi-exposure images are given, we have little information in severely over-exposed areas. Most existing methods focus on motion compensation, i.e., alignment of multiple LDR shots to reduce the ghosting artifacts, but they still produce unsatisfying results. These methods also rather overlook the need to restore the saturated areas. In this paper, we generate well-aligned multi-exposure features by reformulating a motion alignment problem into a simple brightness adjustment problem. In addition, we propose a coarse-to-fine merging strategy with explicit saturation compensation. The saturated areas are reconstructed with similar well-exposed content using adaptive contextual attention. We demonstrate that our method outperforms the state-of-the-art methods regarding qualitative and quantitative evaluations.
    摘要 高动态范围(HDR)摄影是一项非常具有挑战性的任务,因为摄像头传感器的限制会导致大量信息的丢失。为实现HDR摄影,一些方法会 capture多个低动态范围(LDR)图像,并将它们进行不同的曝光设定来聚集更多的信息。然而,这些方法会导致在 significative 的运动误差存在时出现幻影artefacts。此外,虽然我们有多个曝光图像,但我们在严重过曝光区域中具有少量信息。大多数现有方法将焦点放在运动补做上,即将多个LDR拍摄Alignment 来减少幻影artefacts,但这些方法仍然生成不满足的结果。这些方法还往往忽略了重建过曝光区域的需求。在这篇论文中,我们将生成准确尺度的多个曝光特征,通过将运动Alignment 转换为简单的明亮调整问题来实现。此外,我们还提出了一种粗略到细节的合并策略,并且使用明确的过曝光补做。在过曝光区域中,我们使用适应的上下文关注来重建相似的准确曝光内容。我们示示了我们的方法在质量和量度上的超越现有方法。

Efficient View Synthesis with Neural Radiance Distribution Field

  • paper_url: http://arxiv.org/abs/2308.11130
  • repo_url: https://github.com/yushuang-wu/NeRDF
  • paper_authors: Yushuang Wu, Xiao Li, Jinglu Wang, Xiaoguang Han, Shuguang Cui, Yan Lu
  • For: 高品质视角合成
  • Methods: 使用与 NeRF 规模相近的小型网络，以频率基函数对每条光线上的辐射分布建模并由网络预测频率权重；每个像素只需一次网络前向，再通过体渲染计算像素值
  • Results: 在速度、质量与网络规模之间取得了更好的权衡：在网络规模相近、质量相当的情况下，相比 NeRF 获得约 254 倍的速度提升。
    Abstract Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only one single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient view synthesis in real-time. Specifically, we use a small network similar to NeRF while preserving the rendering speed with a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with frequency basis and predict frequency weights using the network. Pixel values are then computed via volume rendering on radiance distributions. Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a ~254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF.
    摘要 最近的神经辐射场（NeRF）研究在高质量视图合成方面取得了重要进展。然而，NeRF 渲染单个像素需要多次网络前向计算，渲染效率较低。现有改进方法要么减少所需采样数，要么优化实现以加速网络前向，但由于辐射场的固有表示方式，多次采样的问题依然存在。相比之下，神经光场（NeLF）每个像素只需一次网络前向，从而降低了计算成本；但为了达到接近 NeRF 的视觉质量，现有 NeLF 方法需要明显更大的网络容量，这在实际中限制了渲染效率。在本工作中，我们提出了一种名为神经辐射分布场（NeRDF）的新表示，旨在实现高效的实时视图合成。具体而言，我们使用与 NeRF 相近的小型网络，同时像 NeLF 一样保持每个像素一次网络前向的渲染速度。其关键在于用频率基函数对每条光线上的辐射分布建模，并由网络预测频率权重，随后通过对辐射分布进行体渲染来计算像素值。实验表明，与现有方法相比，我们的方法在速度、质量和网络规模之间取得了更好的折中：在网络规模相近的情况下，相比 NeRF 获得约 254 倍的加速，而性能仅有轻微下降。项目主页：yushuang-wu.github.io/NeRDF。
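
To make the frequency-basis idea concrete, here is a minimal, illustrative sketch (not the released NeRDF code): a small MLP maps a ray descriptor to frequency weights, the radiance/density distribution along the ray is reconstructed from a cosine basis, and pixels are composited with standard volume rendering. Layer sizes, the ray parameterization, and the basis choice are assumptions.

```python
import math
import torch
import torch.nn as nn

class TinyNeRDF(nn.Module):
    """One forward pass per ray predicts K frequency weights; the distribution
    along the ray is reconstructed from a cosine basis and composited with
    standard volume rendering."""
    def __init__(self, ray_dim=6, n_freq=16, hidden=128):
        super().__init__()
        self.n_freq = n_freq
        self.mlp = nn.Sequential(
            nn.Linear(ray_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq * 4),            # weights for (sigma, r, g, b)
        )

    def forward(self, rays, n_samples=64):
        B = rays.shape[0]
        w = self.mlp(rays).view(B, 4, self.n_freq)                 # frequency weights
        t = torch.linspace(0.0, 1.0, n_samples, device=rays.device)
        k = torch.arange(self.n_freq, device=rays.device)
        basis = torch.cos(math.pi * k[None, :] * t[:, None])       # (n_samples, n_freq)
        signal = torch.einsum("bcf,sf->bcs", w, basis)             # (B, 4, n_samples)
        sigma = torch.relu(signal[:, 0])                           # densities along the ray
        rgb = torch.sigmoid(signal[:, 1:])                         # colours along the ray
        alpha = 1.0 - torch.exp(-sigma / n_samples)
        trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                         1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
        weights = alpha * trans                                    # volume-rendering weights
        return (weights.unsqueeze(1) * rgb).sum(dim=-1)            # (B, 3) pixel colours

pixels = TinyNeRDF()(torch.randn(4, 6))                            # 4 rays -> 4 RGB values
```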

Hey That’s Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs

  • paper_url: http://arxiv.org/abs/2308.11123
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: 保护内容在线分享
  • methods: 使用隐形水印技术训练生成模型,并测试其能够在生成图像中检测水印。
  • results: 通过统计测试,确定了模型是否训练使用水印数据,以及水印数据中的特征与生成图像之间的相关性。
    Abstract Generative models have seen an explosion in popularity with the release of huge generative Diffusion models like Midjourney and Stable Diffusion to the public. Because of this new ease of access, questions surrounding the automated collection of data and issues regarding content ownership have started to build. In this paper we present new work which aims to provide ways of protecting content when shared to the public. We show that a generative Diffusion model trained on data that has been imperceptibly watermarked will generate new images with these watermarks present. We further show that if a given watermark is correlated with a certain feature of the training data, the generated images will also have this correlation. Using statistical tests we show that we are able to determine whether a model has been trained on marked data, and what data was marked. As a result our system offers a solution to protect intellectual property when sharing content online.
    摘要 随着 Midjourney 和 Stable Diffusion 等大型生成式扩散模型向公众发布，生成模型的热度出现了爆发式增长。由于获取门槛的降低，围绕数据自动收集以及内容所有权的问题开始显现。在本文中，我们提出了旨在保护公开分享内容的新工作。我们证明，在带有不可感知水印的数据上训练的生成式扩散模型，其生成的新图像中也会保留这些水印。此外，如果某个水印与训练数据中的某一特征相关，生成的图像也会呈现这种相关性。通过统计检验，我们能够判断模型是否在打了水印的数据上训练，以及具体哪些数据被打了水印。因此，我们的系统为在线分享内容时保护知识产权提供了一种解决方案。

Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11119
  • repo_url: None
  • paper_authors: Masato Tamura
  • for: 这个研究是为了开发一个 zero-shot anomaly detection 方法,利用 CLIP 的视觉语言模型来提供数据源。
  • methods: 这个方法使用 CLIP 的 prompt-guided classification 技术,将每个图像分成多个部分,并将每个部分作为 input 进行类别。此外,还使用了一些随机生成的词语,以增加训练数据的多样性。
  • results: 实验结果显示,这个方法可以在 zero-shot 设定下 achieves state-of-the-art 性能,不需要耗费很多时间进行训练。
    Abstract This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection. Tremendous efforts have been put towards developing anomaly detectors due to their potential industrial applications. Considering the difficulty in acquiring various anomalous samples for training, most existing methods train models with only normal samples and measure discrepancies from the distribution of normal samples during inference, which requires training a model for each object category. The problem of this inefficient training requirement has been tackled by designing a CLIP-based anomaly detector that applies prompt-guided classification to each part of an image in a sliding window manner. However, the method still suffers from the labor of careful prompt ensembling with known object categories. To overcome the issues above, we propose leveraging CLIP as a data source for training. Our method generates text embeddings with the text encoder in CLIP with typical prompts that include words of normal and anomaly. In addition to these words, we insert several randomly generated words into prompts, which enables the encoder to generate a diverse set of normal and anomalous samples. Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomaly from CLIP's embeddings, and as a result, a category-agnostic anomaly detector can be obtained without any training images. Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.
    摘要 本文提出了一种新方法，利用视觉-语言模型 CLIP 作为零样本异常检测的数据来源。鉴于异常检测器潜在的工业应用价值，相关研究投入巨大。考虑到获取多样化异常样本用于训练的困难，现有方法大多仅用正常样本训练模型，并在推理时度量与正常样本分布的偏差，这需要为每个物体类别分别训练模型。针对这种低效的训练需求，已有工作设计了基于 CLIP 的异常检测器，以滑动窗口方式对图像的各个局部进行提示引导的分类；然而，该方法仍需针对已知物体类别进行繁琐的提示集成。为克服上述问题，我们提出将 CLIP 作为训练数据的来源：利用 CLIP 的文本编码器，以包含“正常”与“异常”词汇的典型提示生成文本嵌入，并在提示中插入若干随机生成的词汇，使编码器能够生成多样的正常与异常样本。以这些嵌入作为训练数据，一个前馈神经网络学习从 CLIP 嵌入中提取正常与异常特征，从而无需任何训练图像即可得到与类别无关的异常检测器。实验结果表明，该方法在零样本设定下无需繁琐的提示集成即可达到最先进的性能。
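
A hedged sketch of the prompt-building step described above: "normal"/"anomalous" prompts padded with random words are embedded with CLIP's text encoder and used as training data for a small classifier. The prompt templates, random-word generator, and model variant are assumptions; the `clip` package calls (`clip.load`, `clip.tokenize`, `encode_text`) refer to the public OpenAI CLIP library.

```python
import random
import string
import torch
import clip  # OpenAI CLIP package; used here only to embed text prompts

def random_word(n=6):
    return "".join(random.choices(string.ascii_lowercase, k=n))

@torch.no_grad()
def make_training_embeddings(n_per_class=256, device="cpu"):
    """Build 'normal' / 'anomalous' prompts padded with random words and encode
    them with CLIP's text encoder; the embeddings then serve as training data
    for a small feed-forward anomaly classifier (no images required)."""
    model, _ = clip.load("ViT-B/32", device=device)
    prompts, labels = [], []
    for label, state in enumerate(["normal", "anomalous"]):
        for _ in range(n_per_class):
            fillers = " ".join(random_word() for _ in range(3))
            prompts.append(f"a photo of a {state} {fillers} object")
            labels.append(label)
    tokens = clip.tokenize(prompts).to(device)
    embeddings = model.encode_text(tokens).float()
    return embeddings, torch.tensor(labels)

# A small classifier trained on these text embeddings can later score CLIP
# image embeddings, since text and image features live in the same space.
```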

LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction

  • paper_url: http://arxiv.org/abs/2308.11116
  • repo_url: https://github.com/haesoochung/lan-hdr
  • paper_authors: Haesoo Chung, Nam Ik Cho
  • for: 提高高清晰度和动态范围(HDR)影像技术,以满足用户对高质量视频的需求。
  • methods: 基于特征空间的灵活抽象网络(LAN-HDR),包括对适应模块和梦想模块。对模块使用灵活抽象来减少流量估计错误。
  • results: 比较现有方法表现更好或相当,在多个标准测试数据集上进行了广泛的实验。
    Abstract As demands for high-quality videos continue to rise, high-resolution and high-dynamic range (HDR) imaging techniques are drawing attention. To generate an HDR video from low dynamic range (LDR) images, one of the critical steps is the motion compensation between LDR frames, for which most existing works employed the optical flow algorithm. However, these methods suffer from flow estimation errors when saturation or complicated motions exist. In this paper, we propose an end-to-end HDR video composition framework, which aligns LDR frames in the feature space and then merges aligned features into an HDR frame, without relying on pixel-domain optical flow. Specifically, we propose a luminance-based alignment network for HDR (LAN-HDR) consisting of an alignment module and a hallucination module. The alignment module aligns a frame to the adjacent reference by evaluating luminance-based attention, excluding color information. The hallucination module generates sharp details, especially for washed-out areas due to saturation. The aligned and hallucinated features are then blended adaptively to complement each other. Finally, we merge the features to generate a final HDR frame. In training, we adopt a temporal loss, in addition to frame reconstruction losses, to enhance temporal consistency and thus reduce flickering. Extensive experiments demonstrate that our method performs better or comparable to state-of-the-art methods on several benchmarks.

Development of a Novel Quantum Pre-processing Filter to Improve Image Classification Accuracy of Neural Network Models

  • paper_url: http://arxiv.org/abs/2308.11112
  • repo_url: https://github.com/hajimesuzuki999/qpf
  • paper_authors: Farina Riaz, Shahab Abdulla, Hajime Suzuki, Srinjoy Ganguly, Ravinesh C. Deo, Susan Hopkins
  • for: 提高图像分类准确率
  • methods: 使用量子预处理筛选器(QPF),应用于图像分类神经网络模型中
  • results: 在MNIST和EMNIST数据集上,图像分类准确率提高至95.4%和75.9%,分别提高了2.9%和7.1%,无需添加额外参数或优化机器学习过程。
    Abstract This paper proposes a novel quantum pre-processing filter (QPF) to improve the image classification accuracy of neural network (NN) models. A simple four qubit quantum circuit that uses Y rotation gates for encoding and two controlled NOT gates for creating correlation among the qubits is applied as a feature extraction filter prior to passing data into the fully connected NN architecture. By applying the QPF approach, the results show that the image classification accuracy based on the MNIST (handwritten 10 digits) and the EMNIST (handwritten 47 class digits and letters) datasets can be improved, from 92.5% to 95.4% and from 68.9% to 75.9%, respectively. These improvements were obtained without introducing extra model parameters or optimizations in the machine learning process. However, tests performed on the developed QPF approach against a relatively complex GTSRB dataset with 43 distinct class real-life traffic sign images showed a degradation in the classification accuracy. Considering this result, further research into the understanding and the design of a more suitable quantum circuit approach for image classification neural networks could be explored utilizing the baseline method proposed in this paper.
    摘要 本文提出了一种新的量子预处理滤波器（QPF），用于提升神经网络（NN）模型的图像分类准确率。该方法在数据送入全连接 NN 结构之前，先经过一个简单的四量子比特量子电路作为特征提取滤波器：使用 Y 旋转门进行编码，并用两个受控非门（CNOT）在量子比特之间建立关联。实验结果表明，采用 QPF 方法后，基于 MNIST（10 类手写数字）和 EMNIST（47 类手写数字与字母）数据集的图像分类准确率分别从 92.5% 提升至 95.4%、从 68.9% 提升至 75.9%，且无需引入额外的模型参数或对机器学习过程进行额外优化。然而，在包含 43 类真实交通标志图像、相对复杂的 GTSRB 数据集上的测试显示分类准确率有所下降。鉴于此结果，可在本文提出的基线方法基础上，进一步研究并设计更适合图像分类神经网络的量子电路方案。
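
The abstract specifies a four-qubit circuit with Y-rotation encoding and two CNOT gates; the Qiskit sketch below reproduces that structure for illustration only. The particular qubit pairing of the CNOTs and the use of full measurement are assumptions.

```python
import numpy as np
from qiskit import QuantumCircuit

def qpf_block(pixel_values):
    """Four-qubit quantum pre-processing filter block: each of four (normalized)
    pixel values is encoded with an Ry rotation, and two CNOT gates create
    correlations among the qubits. The CNOT pairing here is an assumption."""
    assert len(pixel_values) == 4
    qc = QuantumCircuit(4)
    for i, v in enumerate(pixel_values):
        qc.ry(np.pi * float(v), i)      # angle encoding of a pixel in [0, 1]
    qc.cx(0, 1)                         # first entangling gate
    qc.cx(2, 3)                         # second entangling gate
    qc.measure_all()
    return qc

print(qpf_block([0.1, 0.5, 0.9, 0.3]).draw())
```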

Classification of the lunar surface pattern by AI architectures: Does AI see a rabbit in the Moon?

  • paper_url: http://arxiv.org/abs/2308.11107
  • repo_url: None
  • paper_authors: Daigo Shoji
  • for: 这篇论文的目的是研究月面的颜色模式是否类似于兔子。
  • methods: 这篇论文使用了七种人工智能架构来评估月面颜色模式与兔子之间的相似性。
  • results: 测试结果显示,在某些地区,月面颜色模式更容易被识别为兔子,而不是人脸。此外,使用ImageNet权重时,ConvNeXt和CLIP occasionally可以归类月面颜色模式为兔子。
    Abstract In Asian countries, there is a tradition that a rabbit (the Moon rabbit) lives on the Moon. As the origin of this tradition, usually, two reasons are mentioned. One reason is that the color pattern of the lunar surface is similar to the shape of a rabbit. The other reason is that both the Moon and rabbit are symbols of fertility because the Moon appears and disappears (i.e., waxing and waning) cyclically, and rabbits bear children frequently. Considering the latter reason, is the lunar surface color pattern not similar to a rabbit? Here, the similarity between rabbit and the lunar surface pattern was evaluated using seven AI architectures. In the test by CLIP, assuming that people look at the Moon in the early evening frequently, the lunar surface is more similar to a rabbit than a face at low latitude regions, while it can be classified as face as latitude increases, which is consistent with that the oldest literature about the Moon rabbit was written in India and that there is a culture of human's face in the Moon in Europe. Tested with ImageNet weights, ConvNeXt and CLIP sometimes classified the lunar surface pattern into rabbit with relatively high probabilities. Cultures are generated by our attitude to the environment. Both dynamic and static similarities may be required to induce our imagination.
    摘要 在亚洲国家流传着月亮上住着一只兔子（月兔）的传说。关于这一传说的起源，通常有两种说法：其一是月面的明暗花纹与兔子的形状相似；其二是月亮和兔子都是生育的象征，因为月亮会周期性地出现与消失（即盈亏），而兔子则频繁产仔。若考虑后一种说法，月面花纹难道不像兔子吗？本文使用七种 AI 架构评估了兔子与月面花纹之间的相似性。在 CLIP 测试中，假设人们常在傍晚观月，则在低纬度地区月面更像兔子而非人脸，而随着纬度升高则更容易被分类为人脸；这与最早记载月兔的文献出自印度、而欧洲存在“月中人脸”文化的事实相一致。在使用 ImageNet 权重的测试中，ConvNeXt 和 CLIP 有时会以较高的概率将月面花纹分类为兔子。文化源自我们对环境的态度，而要激发人们的想象，可能既需要动态相似性，也需要静态相似性。

Recursive Video Lane Detection

  • paper_url: http://arxiv.org/abs/2308.11106
  • repo_url: https://github.com/dongkwonjin/rvld
  • paper_authors: Dongkwon Jin, Dahyun Kim, Chang-Su Kim
  • for: 这篇论文提出了一种用于视频中检测路面线的新算法,即回归视频lane检测器(RVLD),用于在视频中检测路面线。
  • methods: 该算法包括一个内部lane检测器(ILD)和一个预测lane检测器(PLD)。首先,我们设计了ILD来在当前帧中地址路面线。然后,我们开发了PLD,以利用上一帧的信息来在当前帧中更可靠地检测路面线。为此,我们估算了运动场和将上一帧的输出折叠到当前帧中。使用折叠后的信息,我们精细地修改当前帧的特征图以更好地检测路面线。
  • results: 实验结果表明,RVLD在视频路面线数据集上的性能明显超过了现有的检测器。我们的代码可以在https://github.com/dongkwonjin/RVLD中下载。
    Abstract A novel algorithm to detect road lanes in videos, called recursive video lane detector (RVLD), is proposed in this paper, which propagates the state of a current frame recursively to the next frame. RVLD consists of an intra-frame lane detector (ILD) and a predictive lane detector (PLD). First, we design ILD to localize lanes in a still frame. Second, we develop PLD to exploit the information of the previous frame for lane detection in a current frame. To this end, we estimate a motion field and warp the previous output to the current frame. Using the warped information, we refine the feature map of the current frame to detect lanes more reliably. Experimental results show that RVLD outperforms existing detectors on video lane datasets. Our codes are available at https://github.com/dongkwonjin/RVLD.
    摘要 “本文提出了一种新的算法检测视频中的路线,即回归视频车道检测器(RVLD)。这种算法在当前帧中进行回归状态,并在下一帧中使用这些状态来提高车道检测的准确性。RVLD由内帧车道检测器(ILD)和预测车道检测器(PLD)两部分组成。首先,我们设计了 ILD 以确定视频帧中的车道。其次,我们开发了 PLD,以利用上一帧的信息来提高当前帧中的车道检测。为此,我们对上一帧的视频进行了运动场景的估算,并将上一帧的输出折叠到当前帧中。使用折叠后的信息,我们可以更加精确地修改当前帧的特征图,以更好地检测车道。实验结果表明,RVLD 在视频车道数据集上的性能比既有的检测器更高。我们的代码可以在 GitHub 上找到:https://github.com/dongkwonjin/RVLD。”
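
A minimal sketch of the warping step described above (estimating a motion field and warping the previous output to the current frame): given a dense flow in pixels, the previous lane map is resampled with `grid_sample`. The flow estimator itself and how the warped map refines the current frame's features are left abstract.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev_map, flow):
    """Warp the previous frame's output (e.g. a lane probability map) into the
    current frame using a dense motion field `flow` (in pixels, shape (B,2,H,W)).
    The warped map can then be used to refine the current frame's features."""
    B, _, H, W = prev_map.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=prev_map.device),
                            torch.arange(W, device=prev_map.device), indexing="ij")
    grid_x = xs[None] + flow[:, 0]                      # where each target pixel samples from
    grid_y = ys[None] + flow[:, 1]
    grid = torch.stack([2.0 * grid_x / (W - 1) - 1.0,   # normalize to [-1, 1] for grid_sample
                        2.0 * grid_y / (H - 1) - 1.0], dim=-1)
    return F.grid_sample(prev_map, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

warped = warp_with_flow(torch.rand(1, 1, 64, 64), torch.zeros(1, 2, 64, 64))
```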

MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers

  • paper_url: http://arxiv.org/abs/2308.11096
  • repo_url: None
  • paper_authors: Daniel Silver, Tirthak Patel, William Cutler, Aditya Ranjan, Harshitta Gandhi, Devesh Tiwari
  • for: 研究量子机器学习和视觉技术,尤其是量子图像生成技术,以提高图像质量和可靠性。
  • methods: 我们提出了一个名为MosaiQ的高质量量子图像生成GAN框架,可以在当今的中期级量子计算机(NISQ)上执行。
  • results: MosaiQ可以生成高质量的图像,并且可以在不同的图像生成任务中实现高度的可靠性和稳定性。
    Abstract Quantum machine learning and vision have come to the fore recently, with hardware advances enabling rapid advancement in the capabilities of quantum machines. Recently, quantum image generation has been explored with many potential advantages over non-quantum techniques; however, previous techniques have suffered from poor quality and robustness. To address these problems, we introduce, MosaiQ, a high-quality quantum image generation GAN framework that can be executed on today's Near-term Intermediate Scale Quantum (NISQ) computers.
    摘要 量子机器学习和视觉在最近几年来得到了更多的关注,各种硬件进步使得量子机器的能力得到了快速提升。近期,量子图像生成得到了广泛研究,但以前的技术受到了低质量和稳定性的限制。为了解决这些问题,我们介绍了 MosaiQ,一个高质量量子图像生成GAN框架,可以在当今的中等规模量子计算机上执行。

Addressing Fairness and Explainability in Image Classification Using Optimal Transport

  • paper_url: http://arxiv.org/abs/2308.11090
  • repo_url: None
  • paper_authors: Philipp Ratz, François Hu, Arthur Charpentier
  • for: 本研究旨在提高人工智能系统的可信worthiness和公平性,使其在医疗和警察等领域中建立信任和责任感。
  • methods: 本研究使用优化的运输理论来揭示图像中偏见的起源和后果,这种方法可以轻松扩展到表格数据中。
  • results: 研究发现,通过使用拟合度量来评估模型的偏见,可以独立地保持预测准确性和揭示偏见的起源。这些发现对于建立可信worthiness和公平性的人工智能系统具有重要意义。
    Abstract Algorithmic Fairness and the explainability of potentially unfair outcomes are crucial for establishing trust and accountability of Artificial Intelligence systems in domains such as healthcare and policing. Though significant advances have been made in each of the fields separately, achieving explainability in fairness applications remains challenging, particularly so in domains where deep neural networks are used. At the same time, ethical data-mining has become ever more relevant, as it has been shown countless times that fairness-unaware algorithms result in biased outcomes. Current approaches focus on mitigating biases in the outcomes of the model, but few attempts have been made to try to explain \emph{why} a model is biased. To bridge this gap, we propose a comprehensive approach that leverages optimal transport theory to uncover the causes and implications of biased regions in images, which easily extends to tabular data as well. Through the use of Wasserstein barycenters, we obtain scores that are independent of a sensitive variable but keep their marginal orderings. This step ensures predictive accuracy but also helps us to recover the regions most associated with the generation of the biases. Our findings hold significant implications for the development of trustworthy and unbiased AI systems, fostering transparency, accountability, and fairness in critical decision-making scenarios across diverse domains.
    摘要 算法公平和可解释性是建立人工智能系统信任和负责任的关键因素,尤其在医疗和警察领域。虽然在每个领域 separately 有所进步,但在公平应用中实现可解释性仍然是挑战,特别是在使用深度神经网络时。同时,伦理数据挖掘已成为非常重要,因为无数次证明了不公平的算法会导致偏见的结果。现有的方法主要是减轻模型的偏见结果,但几乎没有尝试解释模型为何偏见。为了bridging这个差距,我们提出了一种全面的方法,利用最优运输理论来揭示偏见区域在图像中的原因和后果,这种方法可以轻松扩展到表格数据上。通过使用拓扑 Wasserstein 中心,我们可以获得不виси于敏感变量的分数,但保持其排序。这一步确保预测精度,同时帮助我们回归偏见区域的生成。我们的发现对于开发可靠、无偏的人工智能系统的发展有着深远的意义,推动了诚实、负责任和公平在多个领域中的决策过程中的透明度和公平。
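
In one dimension, the Wasserstein-2 barycenter of several distributions has a quantile function equal to the average of their quantile functions, which is what makes "independent of the sensitive variable but order-preserving within each group" achievable. The sketch below applies this to per-group score distributions; it illustrates the barycenter idea, not the paper's pipeline, and the quantile grid size is arbitrary.

```python
import numpy as np

def barycenter_repair(scores, groups, n_quantiles=1000):
    """Map each group's score distribution to their (equal-weight) Wasserstein-2
    barycenter. In 1D the barycenter's quantile function is the average of the
    groups' quantile functions, so within-group orderings are preserved while
    the repaired scores become independent of the sensitive attribute."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    group_ids = np.unique(groups)
    quantiles = {g: np.quantile(scores[groups == g], qs) for g in group_ids}
    barycenter = np.mean([quantiles[g] for g in group_ids], axis=0)
    repaired = np.empty_like(scores, dtype=float)
    for g in group_ids:
        mask = groups == g
        # Rank (empirical CDF value) of each score within its own group ...
        ranks = np.searchsorted(np.sort(scores[mask]), scores[mask], side="right") / mask.sum()
        # ... pushed through the barycenter's quantile function.
        repaired[mask] = np.interp(ranks, qs, barycenter)
    return repaired

rng = np.random.default_rng(0)
s = np.concatenate([rng.normal(0.4, 0.10, 500), rng.normal(0.6, 0.15, 500)])
g = np.array([0] * 500 + [1] * 500)
print(barycenter_repair(s, g)[:5])
```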

Long-Term Prediction of Natural Video Sequences with Robust Video Predictors

  • paper_url: http://arxiv.org/abs/2308.11079
  • repo_url: None
  • paper_authors: Luke Ditria, Tom Drummond
  • for: 预测高维视频序列是一个非常困难的问题,因为可能的未来场景的数量会 exponential 增长随着时间的推移。特别是在从有限的世界Snapshot中预测自然的视频场景时,内在的不确定性会快速增加,使长期预测变得非常困难。
  • methods: 我们在这篇论文中引入了一些改进了现有工作,以创建Robust Video Predictors (RoViPs)。我们使用深度Perceptual和 uncertainty-based reconstructionloss来创建高质量短期预测。使用Attention-based skip connections以实现跨距离空间特征输入的长距离移动,以进一步提高性能。
  • results: 我们显示了使用单步预测任务iterated 可以生成非常长、自然的视频序列。
    Abstract Predicting high dimensional video sequences is a curiously difficult problem. The number of possible futures for a given video sequence grows exponentially over time due to uncertainty. This is especially evident when trying to predict complicated natural video scenes from a limited snapshot of the world. The inherent uncertainty accumulates the further into the future you predict making long-term prediction very difficult. In this work we introduce a number of improvements to existing work that aid in creating Robust Video Predictors (RoViPs). We show that with a combination of deep Perceptual and uncertainty-based reconstruction losses we are able to create high quality short-term predictions. Attention-based skip connections are utilised to allow for long range spatial movement of input features to further improve performance. Finally, we show that by simply making the predictor robust to its own prediction errors, it is possible to produce very long, realistic natural video sequences using an iterated single-step prediction task.
    摘要 预测高维视频序列是一个极为困难的问题。由于不确定性，给定视频序列的可能未来数量会随时间呈指数级增长；当试图从对世界的有限快照出发预测复杂的自然视频场景时，这一点尤为明显。预测得越远，固有的不确定性累积得越多，使长期预测变得非常困难。在这项工作中，我们在已有工作的基础上引入了若干改进，用于构建鲁棒视频预测器（RoViPs）。我们证明，结合深度感知损失与基于不确定性的重建损失，可以生成高质量的短期预测；并利用基于注意力的跳跃连接，实现输入特征的长距离空间移动，进一步提升性能。最后，我们证明，只需让预测器对其自身的预测误差具有鲁棒性，就可以通过迭代的单步预测任务生成非常长且逼真的自然视频序列。
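
The last point above, making the predictor robust to its own prediction errors through iterated single-step prediction, can be illustrated with a scheduled-sampling-style training loop: with some probability the next input is the model's own detached prediction instead of the ground-truth frame. The loss, feedback probability, and toy model below are placeholders, not the RoViP architecture.

```python
import random
import torch
import torch.nn as nn

def train_step(predictor, video, optimizer, feedback_prob=0.5):
    """One training step of an iterated single-step video predictor that is made
    robust to its own errors: with probability `feedback_prob` the next input is
    the model's own (detached) prediction instead of the ground-truth frame."""
    criterion = nn.MSELoss()                      # stand-in for perceptual/uncertainty losses
    loss, frame = 0.0, video[:, 0]
    for t in range(1, video.shape[1]):
        pred = predictor(frame)                   # single-step prediction of frame t
        loss = loss + criterion(pred, video[:, t])
        use_own = random.random() < feedback_prob
        frame = pred.detach() if use_own else video[:, t]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy usage with a trivial per-frame model on an 8-frame clip.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clip_frames = torch.rand(2, 8, 3, 32, 32)         # (batch, time, channels, H, W)
print(train_step(model, clip_frames, opt))
```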

Audio-Visual Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.11073
  • repo_url: https://github.com/weiguopian/av-cil_iccv2023
  • paper_authors: Weiguo Pian, Shentong Mo, Yunhui Guo, Yapeng Tian
  • for: 这篇论文提出了一种 audio-visual 类增 learning 问题,即在 audio-visual 视频认知中进行类增 learning。
  • methods: 该论文提出了一种叫做 AV-CIL 的方法,该方法通过 dual-audio-visual 相似性约束 (D-AVSC) 和视觉注意力练化 (VAD) 来保持音频视频modalities之间的semantic similarity,并且能够在类增 learning 过程中 preserved previously learned audio-guided visual attentive ability。
  • results: 该论文的实验结果表明,AV-CIL 方法在 audio-visual 类增 learning 中 Significantly outperforms 现有的类增 learning 方法。
    Abstract In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as incremental step grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between audio-visual modalities and Visual Attention Distillation (VAD) to retain previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
    摘要 在这篇论文中,我们介绍了音频视频类增长学习(Audio-Visual Class-Incremental Learning,AVCIL),这是一种类增长学习场景 для音频视频识别。我们示示了 joint audio-visual 模型可以提高类增长学习性能,但现有方法无法保持音频视频特征之间的semantic similarity,特别是在增量步骤增长时。此外,我们发现在前一个任务学习的音频视频相关性可以在后续任务学习过程中被忘记,导致性能下降。为了解决这些挑战,我们提出了 Dual-Audio-Visual Similarity Constraint(D-AVSC)和 Visual Attention Distillation(VAD)两种方法。我们创建了三个音频视频类增长数据集: AVE-Class-Incremental(AVE-CI)、Kinetics-Sounds-Class-Incremental(K-S-CI)和 VGGSound100-Class-Incremental(VS100-CI),基于 AVE、Kinetics-Sounds 和 VGGSound 数据集。我们在 AVE-CI、K-S-CI 和 VS100-CI 上进行了实验,并证明了 AV-CIL 在音频视频类增长学习中表现出色,超过了现有的类增长学习方法。代码和数据可以在 GitHub 上获取:https://github.com/weiguoPian/AV-CIL_ICCV2023。

TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.11072
  • repo_url: https://github.com/UCF-CRCV/TeD-SPAD
  • paper_authors: Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah
  • for: 这个研究旨在提出一种具有隐私保护能力的视频异常检测方法，以解决人工监控不足和隐私泄露的问题。
  • methods: 该方法使用自监督的时序区分三元组损失来增强时间特征的判别性，并对视觉隐私信息进行自监督的匿名化处理，以防止隐私泄露。
  • results: 该方法在三个弱监督视频异常检测数据集（UCF-Crime、XD-Violence 和 ShanghaiTech）上，在隐私保护与异常检测性能之间取得了良好的平衡。
    Abstract Video anomaly detection (VAD) without human monitoring is a complex computer vision task that can have a positive impact on society if implemented successfully. While recent advances have made significant progress in solving this task, most existing approaches overlook a critical real-world concern: privacy. With the increasing popularity of artificial intelligence technologies, it becomes crucial to implement proper AI ethics into their development. Privacy leakage in VAD allows models to pick up and amplify unnecessary biases related to people's personal information, which may lead to undesirable decision making. In this paper, we propose TeD-SPAD, a privacy-aware video anomaly detection framework that destroys visual private information in a self-supervised manner. In particular, we propose the use of a temporally-distinct triplet loss to promote temporally discriminative features, which complements current weakly-supervised VAD methods. Using TeD-SPAD, we achieve a positive trade-off between privacy protection and utility anomaly detection performance on three popular weakly supervised VAD datasets: UCF-Crime, XD-Violence, and ShanghaiTech. Our proposed anonymization model reduces private attribute prediction by 32.25% while only reducing frame-level ROC AUC on the UCF-Crime anomaly detection dataset by 3.69%. Project Page: https://joefioresi718.github.io/TeD-SPAD_webpage/
    摘要 “视频异常检测(VAD)无人监测是一项复杂的计算机视觉任务,如果成功实施,它将对社会产生积极的影响。然而,现有的大多数方法忽略了一个重要的现实问题:隐私。随着人工智能技术的普及,它们的开发中需要落实合适的人工智能伦理。VAD模型可以捕捉和强调人们个人信息的无关的偏见,导致不良决策。在这篇论文中,我们提出了一种隐私意识视频异常检测框架,即TeD-SPAD。具体来说,我们提出使用时间特征分别的 triplet损失来促进时间特征分别,这与现有的弱监测VAD方法相 complement。使用TeD-SPAD,我们在三个流行的弱监测VAD数据集上实现了隐私保护和异常检测性能的积极负作用。我们的提出的匿名化模型可以降低人员特征预测值32.25%,只减某些数据集的异常检测精度3.69%。项目页面:https://joefioresi718.github.io/TeD-SPAD_webpage/”
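
One plausible (and deliberately simplified) construction of a temporally-distinct triplet loss is sketched below: features of the same time step form the anchor/positive pair while a different time step of the same video acts as the negative, pushing temporally distinct clips apart. The sampling scheme, the jittered positive, and the margin are assumptions, not the TeD-SPAD formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_triplet_loss(clip_feats, margin=1.0):
    """One plausible form of a temporally-distinct triplet loss for features of
    consecutive clips from the same video (shape (B, T, D)): the anchor and a
    jittered view of the same time step are pulled together, while a different,
    randomly chosen time step of the same video is pushed away."""
    B, T, D = clip_feats.shape
    z = F.normalize(clip_feats, dim=-1)
    anchor_t = torch.randint(0, T, (B,))
    offsets = torch.randint(1, T, (B,))                  # guarantees a different time step
    negative_t = (anchor_t + offsets) % T
    idx = torch.arange(B)
    anchor = z[idx, anchor_t]
    positive = anchor + 0.01 * torch.randn_like(anchor)  # stand-in for an augmented view
    negative = z[idx, negative_t]
    return nn.TripletMarginLoss(margin=margin)(anchor, positive, negative)

loss = temporal_triplet_loss(torch.randn(4, 16, 256))
```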

MetaGCD: Learning to Continually Learn in Generalized Category Discovery

  • paper_url: http://arxiv.org/abs/2308.11063
  • repo_url: None
  • paper_authors: Yanan Wu, Zhixiang Chi, Yang Wang, Songhe Feng
  • for: 本研究的目的是解决一种实际场景,在训练过程中遇到未标注的数据,该数据包含已知和新类的混合类。目标是不断发现新类,同时保持已知类的性能。我们称之为 Continual Generalized Category Discovery (C-GCD) Setting。
  • methods: 我们提出了一种方法,即 MetaGCD,以便在 C-GCD Setting 中不断发现新类而不忘记已知类。我们使用了一个元学习框架,并利用了在线标注数据来模拟测试增量学习过程。我们定义了一个元目标,旨在同时解决两个矛盾的学习目标,以实现不断发现新类而不忘记已知类。此外,我们还提出了一种软邻域基于对比网络,以便区分无关图像而吸引相关图像。
  • results: 我们在三个广泛使用的标准测试benchmark上建立了强大的基准,并进行了广泛的实验。我们的方法在 C-GCD Setting 中表现出色,可以不断发现新类而不忘记已知类。
    Abstract In this paper, we consider a real-world scenario where a model that is trained on pre-defined classes continually encounters unlabeled data that contains both known and novel classes. The goal is to continually discover novel classes while maintaining the performance in known classes. We name the setting Continual Generalized Category Discovery (C-GCD). Existing methods for novel class discovery cannot directly handle the C-GCD setting due to some unrealistic assumptions, such as the unlabeled data only containing novel classes. Furthermore, they fail to discover novel classes in a continual fashion. In this work, we lift all these assumptions and propose an approach, called MetaGCD, to learn how to incrementally discover with less forgetting. Our proposed method uses a meta-learning framework and leverages the offline labeled data to simulate the testing incremental learning process. A meta-objective is defined to revolve around two conflicting learning objectives to achieve novel class discovery without forgetting. Furthermore, a soft neighborhood-based contrastive network is proposed to discriminate uncorrelated images while attracting correlated images. We build strong baselines and conduct extensive experiments on three widely used benchmarks to demonstrate the superiority of our method.
    摘要 在这篇论文中,我们考虑了一个真实世界场景,在这个场景中,一个已经训练过的模型在不断接触到预先定义的类和未经标注的数据中遇到了新类。目标是同时发现新类并保持已知类的性能。我们称这种设定为总类发现(C-GCD)。现有的新类发现方法无法直接处理这种设定,这是因为它们假设了尚未标注的数据只包含新类。此外,它们也无法在不断发现新类的情况下保持已知类的性能。在这种工作中,我们终止了这些假设,并提出了一种方法,称为MetaGCD,以incremental learning来发现新类而减少忘记。我们的提出的方法使用了meta-学框架,利用了在线标注数据来模拟测试增量学习过程。我们定义了一个meta-目标,旨在在新类发现和已知类性能之间协调两个不同的学习目标。此外,我们还提出了一种软邻域基于的对比网络,以便在不同类型之间分辨细分图像。我们建立了强大的基elines并进行了广泛的实验,以证明我们的方法的优越性。

UnLoc: A Unified Framework for Video Localization Tasks

  • paper_url: http://arxiv.org/abs/2308.11062
  • repo_url: https://github.com/google-research/scenic
  • paper_authors: Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
  • for: 这个研究的目的是提出一种新的视频地理ocalization方法,用于在未经trim的视频中进行时间地理ocalization。
  • methods: 这个方法使用预训练的图像和文本楼层,并将 tokens 传递给一个视频-文本融合模型。输出的融合模型输出将用于构建一个特征 пирамид,每个层与一个头相连,以预测每帧的相关性分数和开始/结束时间偏移。
  • results: 这个方法可以实现视频地理ocalization、时间地理ocalization和动作分割等三个任务,并且在所有三个任务中达到了现有最佳Result。
    Abstract While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code will be available at: \url{https://github.com/google-research/scenic}.
    摘要 大规模图像文本预训练模型,如CLIP,已经在剪辑后的视频上进行多种任务,但是它们在未剪辑视频中的时间本地化仍然是一个未解决的问题。我们设计了一种新的方法called UnLoc,它使用预训练的图像和文本楼层,并将Token传递给视频-文本融合模型。融合模块的输出被用construct一个功能PYRAMID,每个层与一个头连接,以预测每帧的相关性分数和开始/结束时间偏移。与之前的方法不同,我们的体系允许场景检索、时间本地化和动作分割使用单一的阶段模型,不需要动作提案、运动基于预训练特征或表示掩码。与专门的模型不同,我们在所有三种本地化任务中达到了状态arta的结果,代码将在:\url{https://github.com/google-research/scenic}上提供。

Harmonization Across Imaging Locations(HAIL): One-Shot Learning for Brain MRI

  • paper_url: http://arxiv.org/abs/2308.11047
  • repo_url: None
  • paper_authors: Abhijeet Parida, Zhifan Jiang, Syed Muhammad Anwar, Nicholas Foreman, Nicholas Stence, Michael J. Fisher, Roger J. Packer, Robert A. Avery, Marius George Linguraru
  • for: 这篇论文旨在提出一种用于类比较罕见疾病,如儿童脑肿瘤的机器学习基于医疗影像诊断和预测方法。
  • methods: 本论文使用了生成对抗网络(GANs)进行深度学习预测,并提出了一种一击学习方法,使用神经风格转移来调整医疗影像的强度标准。
  • results: 实验结果显示,该方法可以保持病人解剖结构,并调整影像强度以适应新的诊疗所需。此外,该方法可以在未见到训练数据的情况下进行应用,因此具有实际应用和临床试验的价值。
    Abstract For machine learning-based prognosis and diagnosis of rare diseases, such as pediatric brain tumors, it is necessary to gather medical imaging data from multiple clinical sites that may use different devices and protocols. Deep learning-driven harmonization of radiologic images relies on generative adversarial networks (GANs). However, GANs notoriously generate pseudo structures that do not exist in the original training data, a phenomenon known as "hallucination". To prevent hallucination in medical imaging, such as magnetic resonance images (MRI) of the brain, we propose a one-shot learning method where we utilize neural style transfer for harmonization. At test time, the method uses one image from a clinical site to generate an image that matches the intensity scale of the collaborating sites. Our approach combines learning a feature extractor, neural style transfer, and adaptive instance normalization. We further propose a novel strategy to evaluate the effectiveness of image harmonization approaches with evaluation metrics that both measure image style harmonization and assess the preservation of anatomical structures. Experimental results demonstrate the effectiveness of our method in preserving patient anatomy while adjusting the image intensities to a new clinical site. Our general harmonization model can be used on unseen data from new sites, making it a valuable tool for real-world medical applications and clinical trials.
    摘要 若要基于机器学习对儿童脑肿瘤等罕见疾病进行预后与诊断，就需要从多个临床站点收集医学影像数据，而这些站点使用的设备和协议可能各不相同。基于深度学习的影像协调（harmonization）通常依赖生成对抗网络（GAN），但 GAN 以产生训练数据中并不存在的伪结构（即“幻觉”现象）而著称。为了避免脑部磁共振（MRI）等医学影像中的幻觉，我们提出了一种基于神经风格迁移的单样本（one-shot）学习协调方法：在测试时，仅需使用来自某一临床站点的一张图像，即可生成与协作站点强度尺度相匹配的图像。我们的方法结合了特征提取器的学习、神经风格迁移和自适应实例归一化。我们还提出了一种新的评估策略，其指标既衡量图像风格的协调程度，也评估解剖结构的保持情况。实验结果表明，所提方法能够在将图像强度调整到新临床站点的同时保留患者的解剖结构。我们的通用协调模型可直接用于来自新站点的未见数据，因而对真实世界的医疗应用和临床试验具有价值。
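
Adaptive instance normalization, one of the three ingredients listed above, is the operation that shifts intensity statistics while leaving spatial structure untouched. Below is a standard AdaIN sketch over encoder feature maps; the surrounding feature extractor and decoder, and how a single reference image is chosen at test time, are omitted.

```python
import torch

def adaptive_instance_norm(content_feat, style_feat, eps=1e-5):
    """AdaIN: re-scale the content features so their per-channel mean/std match
    those of the style (reference-site) features, which shifts intensity
    statistics while leaving spatial structure - i.e. patient anatomy - intact."""
    # content_feat, style_feat: (B, C, H, W) feature maps from a shared encoder
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

# Toy usage: harmonize encoder features of an MRI slice towards one reference
# image from the collaborating site, then decode back to image space.
harmonized = adaptive_instance_norm(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```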

Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction

  • paper_url: http://arxiv.org/abs/2308.11025
  • repo_url: https://github.com/machineperceptionlab/cq-nir
  • paper_authors: Sijia Jiang, Jing Hua, Zhizhong Han
  • for: 用于学习神经隐式表示法从多视图图像中获取3D重建
  • methods: 使用量化坐标为神经网络中的输入,并使用离散坐标和其позициональ编码来学习隐式函数
  • results: 提高了多视图一致性约束,并且不会增加计算负担,在最新的方法上显示了超过状态艺术的优势
    Abstract In recent years, huge progress has been made on learning neural implicit representations from multi-view images for 3D reconstruction. As an additional input complementing coordinates, using sinusoidal functions as positional encodings plays a key role in revealing high frequency details with coordinate-based neural networks. However, high frequency positional encodings make the optimization unstable, which results in noisy reconstructions and artifacts in empty space. To resolve this issue in a general sense, we introduce to learn neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization. Instead of continuous coordinates, we discretize continuous coordinates into discrete coordinates using nearest interpolation among quantized coordinates which are obtained by discretizing the field in an extremely high resolution. We use discrete coordinates and their positional encodings to learn implicit functions through volume rendering. This significantly reduces the variations in the sample space, and triggers more multi-view consistency constraints on intersections of rays from different views, which enables to infer implicit function in a more effective way. Our quantized coordinates do not bring any computational burden, and can seamlessly work upon the latest methods. Our evaluations under the widely used benchmarks show our superiority over the state-of-the-art. Our code is available at https://github.com/MachinePerceptionLab/CQ-NIR.
    摘要 Recently, there have been significant advancements in learning neural implicit representations from multi-view images for 3D reconstruction. Using sinusoidal functions as positional encodings has proven to be crucial in revealing high-frequency details with coordinate-based neural networks. However, the use of high-frequency positional encodings can lead to unstable optimization, resulting in noisy reconstructions and artifacts in empty space. To address this issue, we propose learning neural implicit representations with quantized coordinates, which reduces uncertainty and ambiguity in the field during optimization. Instead of using continuous coordinates, we discretize them into discrete coordinates using nearest interpolation among quantized coordinates, which are obtained by discretizing the field in an extremely high resolution. We then use these discrete coordinates and their positional encodings to learn implicit functions through volume rendering, significantly reducing variations in the sample space and triggering more multi-view consistency constraints on intersections of rays from different views, enabling more effective inference of implicit functions. Our quantized coordinates do not increase computational burden and can seamlessly work with the latest methods. Our evaluations on widely used benchmarks show our superiority over the state-of-the-art. Our code is available at https://github.com/MachinePerceptionLab/CQ-NIR.
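
A hedged sketch of the coordinate quantization described above: continuous query points are snapped to the nearest node of a very fine grid before the usual sinusoidal positional encoding is applied. The grid resolution and number of frequencies here are illustrative, not the paper's settings.

```python
import math
import torch

def quantize_coords(x, resolution=2 ** 12, bound=1.0):
    """Snap continuous coordinates in [-bound, bound]^3 to the nearest node of a
    very fine uniform grid (nearest interpolation among quantized coordinates)."""
    step = 2.0 * bound / (resolution - 1)
    return torch.round((x + bound) / step) * step - bound

def positional_encoding(x, n_freqs=10):
    """Standard sinusoidal encoding applied to the (quantized) coordinates."""
    freqs = (2.0 ** torch.arange(n_freqs, device=x.device, dtype=x.dtype)) * math.pi
    angles = x[..., None] * freqs                       # (..., 3, n_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(-2)

pts = torch.rand(1024, 3) * 2 - 1                       # continuous query points
enc = positional_encoding(quantize_coords(pts))         # input to the implicit MLP
print(enc.shape)                                        # torch.Size([1024, 60])
```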

Multi-Task Hypergraphs for Semi-supervised Learning using Earth Observations

  • paper_url: http://arxiv.org/abs/2308.11021
  • repo_url: None
  • paper_authors: Mihai Pirvu, Alina Marcu, Alexandra Dobrescu, Nabil Belbachir, Marius Leordeanu
  • for: 这个论文是为了解决多任务学习中数据缺失问题,特别是在地球观测领域,where ground-truth data is often missing.
  • methods: 该论文提出了一种多任务hypergraphSelf-supervised learning方法,其中每个节点是一个任务,不同的路径通过hypergraph到达给定任务都成为了无监督教师,并通过 ensemble 学习生成可靠的pseudolabels。
  • results: 经过对NASA NEO数据集的广泛实验,论文示出了其多任务半监督方法的价值,包括在强基elines和最近的工作上的一致提升。此外,论文还表明了hypergraph可以适应不监督数据分布变化,并可靠地恢复缺失数据,以及其可以在多个观测层次上适应数据缺失情况。
    Abstract There are many ways of interpreting the world and they are highly interdependent. We exploit such complex dependencies and introduce a powerful multi-task hypergraph, in which every node is a task and different paths through the hypergraph reaching a given task become unsupervised teachers, by forming ensembles that learn to generate reliable pseudolabels for that task. Each hyperedge is part of an ensemble teacher for a given task and it is also a student of the self-supervised hypergraph system. We apply our model to one of the most important problems of our times, that of Earth Observation, which is highly multi-task and it often suffers from missing ground-truth data. By performing extensive experiments on the NASA NEO Dataset, spanning a period of 22 years, we demonstrate the value of our multi-task semi-supervised approach, by consistent improvements over strong baselines and recent work. We also show that the hypergraph can adapt unsupervised to gradual data distribution shifts and reliably recover, through its multi-task self-supervision process, the missing data for several observational layers for up to seven years.
    摘要 解释世界的方式多种多样，且它们之间高度相互依赖。我们利用这种复杂的依赖关系，引入一个强大的多任务超图：其中每个节点都是一个任务，而通过超图到达某个任务的不同路径则构成该任务的无监督教师——它们组成集成，学习为该任务生成可靠的伪标签。每条超边既是某个任务集成教师的一部分，也是这一自监督超图系统的学生。我们将该模型应用于当今最重要的问题之一——地球观测；该问题具有高度的多任务性，且经常缺失真值数据。通过在跨越 22 年的 NASA NEO 数据集上进行大量实验，我们展示了所提多任务半监督方法的价值：相较于强基线和近期工作均取得了一致的提升。我们还表明，超图能够以无监督方式适应渐变的数据分布漂移，并通过其多任务自监督过程，可靠地恢复多个观测层长达七年的缺失数据。

Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images

  • paper_url: http://arxiv.org/abs/2308.11015
  • repo_url: https://github.com/eldentse/Spectral-Graphormer
  • paper_authors: Tze Ho Elden Tse, Franziska Mueller, Zhengyang Shen, Danhang Tang, Thabo Beeler, Mingsong Dou, Yinda Zhang, Sasa Petrovic, Hyung Jin Chang, Jonathan Taylor, Bardia Doosti
  • for: 从多视图 RGB 图像中重建两只高保真的手
  • methods: 基于 transformer 的框架，配合谱图卷积（spectral graph convolution）解码器和基于优化的细化阶段
  • results: 生成物理上合理、逼真的双手重建结果，能够泛化到真实数据，并支持实时 AR/VR 应用。
    Abstract We propose a novel transformer-based framework that reconstructs two high fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, where one typically trains a deep network to regress hand model parameters from single RGB image, we consider a more challenging problem setting where we directly regress the absolute root poses of two-hands with extended forearm at high resolution from egocentric view. As existing datasets are either infeasible for egocentric viewpoints or lack background variations, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset from multi-calibrated camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder to smoothen the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetrations. Through extensive quantitative and qualitative evaluations, we show that our framework is able to produce realistic two-hand reconstructions and demonstrate the generalisation of synthetic-trained models to real data, as well as real-time AR/VR applications.
    摘要 我们提出了一种基于转换器的新框架,可以从多视角RGB图像中重建高精度两只手。与现有的手姿估计方法不同,我们处理一个更加复杂的问题Setting,即从 egocentric 视角直接将两只手的绝对根姿 Parameters 高分辨率RGB图像中进行重建。由于现有的数据集不可能进行 egocentric 视角或缺乏背景变化,我们创建了一个大规模的 simulate 数据集和实际数据集,以验证我们的多视角图像特征融合策略。为使重建 Physically plausible,我们提出了两种策略:(i)一种粗到细 spectral 图像 conv 嵌入器来平滑网格时的抗锯齿,以及(ii)在推理时进行优化基于反射 stage 以避免自身穿孔。经过广泛的量化和质量评估,我们表明我们的框架可以生成真实的两只手重建,并示出了 synthetic 训练的模型在实际数据上的普适性,以及实时 AR/VR 应用。

Autonomous Detection of Methane Emissions in Multispectral Satellite Data Using Deep Learning

  • paper_url: http://arxiv.org/abs/2308.11003
  • repo_url: None
  • paper_authors: Bertrand Rouet-Leduc, Thomas Kerdreux, Alexandre Tuel, Claudia Hulbert
  • for: 自动监测甲烷排放；甲烷大气半衰期短，是快速遏制全球变暖的首要目标
  • methods: 使用深度学习方法在Sentinel-2多光谱卫星图像中自动检测甲烷泄漏
  • results: 与现有多光谱甲烷数据产品相比显著降低假阳性率，且无需预先了解潜在泄漏点位置
    Abstract Methane is one of the most potent greenhouse gases, and its short atmospheric half-life makes it a prime target to rapidly curb global warming. However, current methane emission monitoring techniques primarily rely on approximate emission factors or self-reporting, which have been shown to often dramatically underestimate emissions. Although initially designed to monitor surface properties, satellite multispectral data has recently emerged as a powerful method to analyze atmospheric content. However, the spectral resolution of multispectral instruments is poor, and methane measurements are typically very noisy. Methane data products are also sensitive to absorption by the surface and other atmospheric gases (water vapor in particular) and therefore provide noisy maps of potential methane plumes, that typically require extensive human analysis. Here, we show that the image recognition capabilities of deep learning methods can be leveraged to automatize the detection of methane leaks in Sentinel-2 satellite multispectral data, with dramatically reduced false positive rates compared with state-of-the-art multispectral methane data products, and without the need for a priori knowledge of potential leak sites. Our proposed approach paves the way for the automated, high-definition and high-frequency monitoring of point-source methane emissions across the world.
    摘要 甲烷是最强效的温室气体之一，其大气半衰期短，因此是快速遏制全球变暖的首要目标。然而，目前的甲烷排放监测技术主要依赖近似排放因子或自我报告，而这些方式往往严重低估排放量。卫星多光谱数据最初是为监测地表属性而设计的，近来已成为分析大气成分的有力手段。但多光谱仪器的光谱分辨率较低，甲烷测量通常噪声很大；甲烷数据产品还会受到地表以及其他大气气体（尤其是水汽）吸收的影响，因此给出的潜在甲烷羽流图噪声明显，通常需要大量人工分析。在本文中，我们表明可以利用深度学习的图像识别能力，在Sentinel-2卫星多光谱数据中自动检测甲烷泄漏：相比最先进的多光谱甲烷数据产品，假阳性率大幅降低，且无需预先知道潜在泄漏点。我们提出的方法为全球点源甲烷排放的自动化、高清晰度和高频率监测铺平了道路。

Switched auxiliary loss for robust training of transformer models for histopathological image segmentation

  • paper_url: http://arxiv.org/abs/2308.10994
  • repo_url: None
  • paper_authors: Mustaffa Hussain, Saharsh Barve
  • For: 本研究旨在提供一种分割多器官功能组织单元（FTU，即器官内执行其主要功能的细胞群邻域）的模型，帮助病理学家在细胞层面理解疾病对器官的影响。
  • Methods: 模型使用 HuBMAP + HPA - Hacking the Human Body 竞赛数据集进行训练，并提出 shifted auxiliary loss，以缓解 Transformer 等深度模型训练中的梯度消失问题。
  • Results: 模型在公共数据集上取得 0.793 的 dice 分数，在私有数据集上取得 0.778 的 dice 分数，所提方法带来约 1% 的提升。这些结果也支持将 Transformer 模型用于医学图像分析中的稠密预测任务，有助于理解细胞与组织结构之间的关系以及细胞功能对人体健康的影响。
    Abstract Functional tissue Units (FTUs) are cell population neighborhoods local to a particular organ performing its main function. The FTUs provide crucial information to the pathologist in understanding the disease affecting a particular organ by providing information at the cellular level. In our research, we have developed a model to segment multi-organ FTUs across 5 organs namely: the kidney, large intestine, lung, prostate and spleen by utilizing the HuBMAP + HPA - Hacking the Human Body competition dataset. We propose adding shifted auxiliary loss for training models like the transformers to overcome the diminishing gradient problem which poses a challenge towards optimal training of deep models. Overall, our model achieved a dice score of 0.793 on the public dataset and 0.778 on the private dataset and shows a 1% improvement with the use of the proposed method. The findings also bolster the use of transformers models for dense prediction tasks in the field of medical image analysis. The study assists in understanding the relationships between cell and tissue organization thereby providing a useful medium to look at the impact of cellular functions on human health.
    摘要 功能组织单元（FTU）是器官内执行其主要功能的细胞群邻域，能够在细胞层面为病理学家理解器官疾病提供关键信息。在本研究中，我们利用 HuBMAP + HPA - Hacking the Human Body 竞赛数据集，开发了一个跨5种器官（肾脏、大肠、肺、前列腺和脾脏）分割FTU的模型。我们提出在训练 Transformer 等模型时加入 shifted auxiliary loss，以缓解阻碍深度模型充分训练的梯度消失问题。总体而言，我们的模型在公共数据集上取得了0.793的 dice 分数，在私有数据集上取得了0.778的 dice 分数，使用所提方法带来约1%的提升。这些结果也支持将 Transformer 模型用于医学图像分析中的稠密预测任务。本研究有助于理解细胞与组织结构之间的关系，从而为考察细胞功能对人体健康的影响提供有用的途径。
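One plausible reading of the "shifted/switched auxiliary loss" is an extra segmentation loss computed from an intermediate decoder head whose weight is switched during training, keeping gradients flowing into early transformer blocks. The head placement, loss mix, and schedule below are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss for binary FTU masks; logits and target are (B, 1, H, W)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(main_logits, aux_logits, target, epoch, switch_epoch=20):
    """Main-head loss plus an auxiliary-head loss whose weight is switched mid-training
    (illustrative schedule: strong early on to combat vanishing gradients, weak later)."""
    aux_weight = 0.4 if epoch < switch_epoch else 0.1
    main = dice_loss(main_logits, target) + F.binary_cross_entropy_with_logits(main_logits, target)
    aux = dice_loss(aux_logits, target)
    return main + aux_weight * aux

# Toy usage with random tensors standing in for decoder outputs.
B, H, W = 2, 64, 64
main_logits = torch.randn(B, 1, H, W, requires_grad=True)
aux_logits = torch.randn(B, 1, H, W, requires_grad=True)
target = (torch.rand(B, 1, H, W) > 0.5).float()
loss = segmentation_loss(main_logits, aux_logits, target, epoch=5)
loss.backward()
print(float(loss))
```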

Debiasing Counterfactuals In the Presence of Spurious Correlations

  • paper_url: http://arxiv.org/abs/2308.10984
  • repo_url: None
  • paper_authors: Amar Kumar, Nima Fathi, Raghav Mehta, Brennan Nichyporuk, Jean-Pierre R. Falet, Sotirios Tsaftaris, Tal Arbel
  • for: The paper is written for the task of medical imaging classification, specifically addressing the issue of deep learning models relying on spurious correlations in the training data.
  • methods: The paper proposes an end-to-end training framework that integrates popular debiasing classifiers (such as distributionally robust optimization) with counterfactual image generation to expose generalizable imaging markers of relevance to the task, and a novel metric (Spurious Correlation Latching Score) to quantify the extent of classifier reliance on spurious correlations.
  • results: The paper demonstrates through comprehensive experiments on two public datasets (with simulated and real visual artifacts) that the debiasing method (i) learns generalizable markers across the population and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
  • for: 这篇论文面向医学影像分类任务，旨在解决深度学习模型依赖训练数据中虚假相关（混杂因素）的问题。
  • methods: 论文提出一种端到端训练框架，将常用的去偏分类器（如分布鲁棒优化）与反事实图像生成相结合，以揭示与任务相关、可泛化的影像标记，并提出一种新指标（Spurious Correlation Latching Score）来量化分类器对虚假相关的依赖程度。
  • results: 论文在两个公共数据集（含模拟与真实视觉伪影）上的大量实验表明，该去偏方法 (一) 学习到在整个人群上可泛化的标记，(二) 成功忽略虚假相关并专注于潜在的疾病病理。
    Abstract Deep learning models can perform well in complex medical imaging classification tasks, even when basing their conclusions on spurious correlations (i.e. confounders), should they be prevalent in the training dataset, rather than on the causal image markers of interest. This would thereby limit their ability to generalize across the population. Explainability based on counterfactual image generation can be used to expose the confounders but does not provide a strategy to mitigate the bias. In this work, we introduce the first end-to-end training framework that integrates both (i) popular debiasing classifiers (e.g. distributionally robust optimization (DRO)) to avoid latching onto the spurious correlations and (ii) counterfactual image generation to unveil generalizable imaging markers of relevance to the task. Additionally, we propose a novel metric, Spurious Correlation Latching Score (SCLS), to quantify the extent of the classifier reliance on the spurious correlation as exposed by the counterfactual images. Through comprehensive experiments on two public datasets (with the simulated and real visual artifacts), we demonstrate that the debiasing method: (i) learns generalizable markers across the population, and (ii) successfully ignores spurious correlations and focuses on the underlying disease pathology.
    摘要 深度学习模型在复杂的医学影像分类任务中可以表现很好，但当训练数据中普遍存在虚假相关（即混杂因素）时，模型的结论可能建立在这些混杂因素而非真正关心的因果影像标记之上，从而限制其在整个人群上的泛化能力。基于反事实图像生成的可解释性方法可以揭示这些混杂因素，但并不能给出缓解偏差的策略。在本工作中，我们提出了首个端到端训练框架，同时整合 (i) 常用的去偏分类器（如分布鲁棒优化，DRO），避免模型依附于虚假相关；以及 (ii) 反事实图像生成，用于揭示与任务相关、可泛化的影像标记。此外，我们提出了一种新指标——Spurious Correlation Latching Score（SCLS），用以量化反事实图像所揭示的分类器对虚假相关的依赖程度。通过在两个公共数据集（含模拟与真实视觉伪影）上的全面实验，我们证明该去偏方法：(i) 学习到在整个人群上可泛化的标记；(ii) 成功忽略虚假相关并专注于潜在的疾病病理。
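The debiasing classifiers mentioned above (e.g. distributionally robust optimization) up-weight the worst-performing group so the model cannot latch onto a spurious attribute shared by the majority group. A minimal group-DRO-style objective is sketched below; the group definitions, temperature, and the integration with counterfactual generation are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def group_dro_loss(logits, labels, group_ids, num_groups, temperature=1.0):
    """Softmax-weighted worst-group risk: groups with higher loss receive more weight,
    discouraging reliance on spurious correlations present only in the easy groups."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        if mask.any():
            group_losses.append(per_sample[mask].mean())
        else:
            group_losses.append(per_sample.new_zeros(()))
    group_losses = torch.stack(group_losses)
    weights = F.softmax(group_losses / temperature, dim=0).detach()
    return (weights * group_losses).sum()

# Toy usage: 8 samples, 2 classes, 2 groups (e.g. with/without a visual artifact).
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
groups = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = group_dro_loss(logits, labels, groups, num_groups=2)
loss.backward()
print(float(loss))
```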

VQA Therapy: Exploring Answer Differences by Visually Grounding Answers

  • paper_url: http://arxiv.org/abs/2308.11662
  • repo_url: https://github.com/ccychongyanchen/vqatherapycrowdsourcing
  • paper_authors: Chongyan Chen, Samreen Anjum, Danna Gurari
  • for: 这篇论文是关于视觉问答任务的研究,旨在更好地理解不同人对同一张图片的问题提出不同的答案的原因。
  • methods: 该论文引入了第一个可视地将每个答案与每个视觉问题相关联的数据集,称为VQAAnswerTherapy。然后,该论文提出了两个新的问题:一是判断视觉问题是否有唯一的答案基础,二是找到所有答案基础的地方。
  • results: 该论文使用现代算法对这两个新问题进行了评估,以示其在这些问题上的成功和缺点。数据集和评估服务器可以在https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/上公开获取。
    Abstract Visual question answering is a task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/.
    摘要 视觉问答是预测针对图像提出的问题答案的任务。鉴于不同的人可能对同一个视觉问题给出不同答案，我们希望通过答案的视觉定位来更好地理解其原因。我们提出了第一个将每个视觉问题的每个独特答案进行视觉定位的数据集，称为 VQAAnswerTherapy。随后我们提出了两个新问题：判断一个视觉问题是否只有唯一的答案定位，以及定位所有答案的视觉依据。我们对现代算法在这两个新问题上进行了基准测试，以展示它们的成功与不足之处。数据集和评测服务器可在 https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/ 公开获取。

SupEuclid: Extremely Simple, High Quality OoD Detection with Supervised Contrastive Learning and Euclidean Distance

  • paper_url: http://arxiv.org/abs/2308.10973
  • repo_url: None
  • paper_authors: Jarrod Haas
  • for: 本研究旨在提出一种简单而有效的Out-of-Distribution(OoD)检测方法,可以在标准benchmark上达到state-of-the-art的结果。
  • methods: 本研究使用Supervised Contrastive Learning(SCL)将ResNet18进行训练,并使用Euclidean distance作为分数规则进行评价。
  • results: 研究发现,使用SCL训练的ResNet18可以在近和远OoD检测benchmark上达到state-of-the-art的结果,无需使用更复杂的方法或更大的模型。
    Abstract Out-of-Distribution (OoD) detection has developed substantially in the past few years, with available methods approaching, and in a few cases achieving, perfect data separation on standard benchmarks. These results generally involve large or complex models, pretraining, exposure to OoD examples or extra hyperparameter tuning. Remarkably, it is possible to achieve results that can exceed many of these state-of-the-art methods with a very simple method. We demonstrate that ResNet18 trained with Supervised Contrastive Learning (SCL) produces state-of-the-art results out-of-the-box on near and far OoD detection benchmarks using only Euclidean distance as a scoring rule. This may obviate the need in some cases for more sophisticated methods or larger models, and at the very least provides a very strong, easy to use baseline for further experimentation and analysis.
    摘要 近年来分布外（OoD）检测取得了长足进展，现有方法在标准基准上已接近（个别情况下达到）完美的数据分离。这些结果通常依赖大型或复杂的模型、预训练、接触OoD样本或额外的超参数调优。值得注意的是，一个非常简单的方法也能取得超过许多最先进方法的结果。我们证明，使用监督对比学习（SCL）训练的ResNet18，仅以欧氏距离作为评分规则，即可在近分布外和远分布外检测基准上直接取得最先进的结果。这在某些情况下可能免去对更复杂方法或更大模型的需求，至少也为后续实验与分析提供了一个非常强且易用的基线。
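The scoring rule described in the abstract is simple enough to show directly: embed samples with the SCL-trained ResNet18, compute per-class means on in-distribution training data, and score a test sample by its Euclidean distance to the nearest class mean (larger distance means more likely OoD). The snippet below is a generic sketch with placeholder features; the backbone, the layer the features come from, and the decision threshold are left to the user.

```python
import numpy as np

def fit_class_means(features, labels):
    """features: (N, D) penultimate-layer embeddings from an SCL-trained encoder."""
    classes = np.unique(labels)
    return {c: features[labels == c].mean(axis=0) for c in classes}

def euclid_ood_score(features, class_means):
    """OoD score = Euclidean distance to the nearest in-distribution class mean."""
    means = np.stack(list(class_means.values()))            # (C, D)
    dists = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=-1)
    return dists.min(axis=1)                                 # higher = more OoD

# Toy usage with random embeddings standing in for ResNet18+SCL features.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 16))
train_labels = rng.integers(0, 10, size=100)
means = fit_class_means(train_feats, train_labels)

in_dist = rng.normal(size=(5, 16))
far_ood = rng.normal(loc=8.0, size=(5, 16))                  # shifted -> larger distances
print(euclid_ood_score(in_dist, means).mean(), euclid_ood_score(far_ood, means).mean())
```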

MRI Field-transfer Reconstruction with Limited Data: Regularization by Neural Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10968
  • repo_url: None
  • paper_authors: Guoyao Shen, Yancheng Zhu, Hernan Jara, Sean B. Andersson, Chad W. Farris, Stephan Anderson, Xin Zhang
  • for: 高品质的MRI重建方法
  • methods: 基于神经风格迁移的正则化（RNST）方法，利用神经风格迁移与去噪引擎提供的先验
  • results: 在1.5T和3T临床MRI扫描数据上验证，可显著提升图像质量
    Abstract Recent works have demonstrated success in MRI reconstruction using deep learning-based models. However, most reported approaches require training on a task-specific, large-scale dataset. Regularization by denoising (RED) is a general pipeline which embeds a denoiser as a prior for image reconstruction. The potential of RED has been demonstrated for multiple image-related tasks such as denoising, deblurring and super-resolution. In this work, we propose a regularization by neural style transfer (RNST) method to further leverage the priors from the neural transfer and denoising engine. This enables RNST to reconstruct a high-quality image from a noisy low-quality image with different image styles and limited data. We validate RNST with clinical MRI scans from 1.5T and 3T and show that RNST can significantly boost image quality. Our results highlight the capability of the RNST framework for MRI reconstruction and the potential for reconstruction tasks with limited data.
    摘要 近期研究表明，基于深度学习的模型可以成功用于MRI重建。然而，大多数已报道的方法需要在特定任务的大规模数据集上训练。去噪正则化（RED）是一种通用流程，它将去噪器作为图像重建的先验。RED 的潜力已在去噪、去模糊和超分辨率等多种图像任务中得到验证。在本工作中，我们提出一种基于神经风格迁移的正则化（RNST）方法，进一步利用神经风格迁移与去噪引擎中的先验。这使得 RNST 能够在图像风格不同且数据有限的情况下，从低质量含噪图像中重建高质量图像。我们使用1.5T和3T的临床MRI扫描数据验证了RNST，结果表明其能显著提升图像质量。我们的结果凸显了RNST框架在MRI重建中的能力，以及在数据有限的重建任务中的潜力。
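Regularization by denoising (RED), which RNST builds on, alternates a data-consistency gradient step with a step that pulls the estimate toward the output of a prior "engine" D: x <- x - step * (A^T(Ax - y) + lam * (x - D(x))). The sketch below uses a plain moving-average smoother as a stand-in for the neural style transfer / denoising network, purely to show the update structure on a toy 1-D problem.

```python
import numpy as np

def red_reconstruct(y, A, denoiser, lam=0.1, step=0.1, iters=300):
    """Generic RED iteration: x <- x - step * (A^T (A x - y) + lam * (x - D(x))).

    A is an (M, N) measurement matrix, y the (M,) measurements, and `denoiser`
    any image-prior engine (here a toy smoother standing in for RNST's network)."""
    x = A.T @ y                                   # simple initialization
    for _ in range(iters):
        data_grad = A.T @ (A @ x - y)
        prior_grad = lam * (x - denoiser(x))
        x = x - step * (data_grad + prior_grad)
    return x

def toy_denoiser(x, k=3):
    """Moving-average smoother used only as a placeholder prior engine."""
    pad = np.pad(x, k // 2, mode="edge")
    return np.convolve(pad, np.ones(k) / k, mode="valid")

# Toy 1-D example: undersampled random measurements of a smooth signal.
rng = np.random.default_rng(1)
n, m = 64, 32
truth = np.sin(np.linspace(0, 3 * np.pi, n))
A = rng.normal(size=(m, n)) / np.sqrt(n)
y = A @ truth
x_hat = red_reconstruct(y, A, toy_denoiser)
print(np.linalg.norm(x_hat - truth) / np.linalg.norm(truth))  # relative error
```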

BundleSeg: A versatile, reliable and reproducible approach to white matter bundle segmentation

  • paper_url: http://arxiv.org/abs/2308.10958
  • repo_url: None
  • paper_authors: Etienne St-Onge, Kurt G Schilling, Francois Rheault
  • for: 提供一种可靠、可重复且快速的白质纤维束提取方法
  • methods: 将迭代配准过程与新近提出的精确流线搜索算法相结合，无需 tractogram 聚类或简化假设即可高效分割流线
  • results: 与最先进的分割方法相比具有更好的重复性和可再现性，并显著提升速度；更高的精度和更小的变异为神经信息学研究提供了有价值的工具，提高了基于纤维束成像的白质通路研究的敏感性和特异性
    Abstract This work presents BundleSeg, a reliable, reproducible, and fast method for extracting white matter pathways. The proposed method combines an iterative registration procedure with a recently developed precise streamline search algorithm that enables efficient segmentation of streamlines without the need for tractogram clustering or simplifying assumptions. We show that BundleSeg achieves improved repeatability and reproducibility than state-of-the-art segmentation methods, with significant speed improvements. The enhanced precision and reduced variability in extracting white matter connections offer a valuable tool for neuroinformatic studies, increasing the sensitivity and specificity of tractography-based studies of white matter pathways.
    摘要 本工作提出了 BundleSeg，一种可靠、可重复且快速的白质纤维束提取方法。该方法将迭代配准过程与新近提出的精确流线搜索算法相结合，无需 tractogram 聚类或简化假设即可高效地分割流线。我们表明，BundleSeg 的重复性和可再现性优于最先进的分割方法，且速度显著提升。更高的精度和更小的变异使白质连接的提取成为神经信息学研究的有力工具，提高了基于纤维束成像的白质通路研究的敏感性和特异性。

CamP: Camera Preconditioning for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.10902
  • repo_url: None
  • paper_authors: Keunhong Park, Philipp Henzler, Ben Mildenhall, Jonathan T. Barron, Ricardo Martin-Brualla
  • for: 高保真3D场景重建
  • methods: 通过一个代理问题计算白化变换，消除相机参数之间的相关性并归一化其影响，并将该变换用作联合优化中相机参数的预条件器
  • results: 在Mip-NeRF 360数据集的场景上，与不优化相机的最新NeRF方法（如Zip-NeRF）相比误差（RMSE）降低67%，与使用SCNeRF相机参数化的最新联合优化方法相比降低29%。
  • for: The paper is written for optimizing Neural Radiance Fields (NeRF) to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes.
  • methods: The paper proposes using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and using this transform as a preconditioner for the camera parameters during joint optimization.
  • results: The paper shows that the proposed approach significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset, reducing error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF.
    Abstract Neural Radiance Fields (NeRF) can be optimized to obtain high-fidelity 3D scene reconstructions of objects and large-scale scenes. However, NeRFs require accurate camera parameters as input -- inaccurate camera parameters result in blurry renderings. Extrinsic and intrinsic camera parameters are usually estimated using Structure-from-Motion (SfM) methods as a pre-processing step to NeRF, but these techniques rarely yield perfect estimates. Thus, prior works have proposed jointly optimizing camera parameters alongside a NeRF, but these methods are prone to local minima in challenging settings. In this work, we analyze how different camera parameterizations affect this joint optimization problem, and observe that standard parameterizations exhibit large differences in magnitude with respect to small perturbations, which can lead to an ill-conditioned optimization problem. We propose using a proxy problem to compute a whitening transform that eliminates the correlation between camera parameters and normalizes their effects, and we propose to use this transform as a preconditioner for the camera parameters during joint optimization. Our preconditioned camera optimization significantly improves reconstruction quality on scenes from the Mip-NeRF 360 dataset: we reduce error rates (RMSE) by 67% compared to state-of-the-art NeRF approaches that do not optimize for cameras like Zip-NeRF, and by 29% relative to state-of-the-art joint optimization approaches using the camera parameterization of SCNeRF. Our approach is easy to implement, does not significantly increase runtime, can be applied to a wide variety of camera parameterizations, and can straightforwardly be incorporated into other NeRF-like models.
    摘要 神经辐射场（NeRF）经过优化可以获得物体和大规模场景的高保真3D重建。然而，NeRF需要精确的相机参数作为输入——相机参数不准确会导致渲染模糊。相机的外参和内参通常在NeRF之前通过运动恢复结构（SfM）方法估计，但这些技术很少能给出完美的估计。因此，已有工作提出在优化NeRF的同时联合优化相机参数，但这些方法在困难场景下容易陷入局部极小。在本工作中，我们分析了不同的相机参数化方式对该联合优化问题的影响，并观察到标准参数化在小扰动下的影响幅度差异很大，这会导致病态的优化问题。我们提出通过一个代理问题计算白化变换，消除相机参数之间的相关性并归一化其影响，并将该变换用作联合优化期间相机参数的预条件器。经过预条件的相机优化显著提升了Mip-NeRF 360数据集场景的重建质量：与不优化相机的最新NeRF方法（如Zip-NeRF）相比，误差（RMSE）降低67%；与使用SCNeRF相机参数化的最新联合优化方法相比，降低29%。我们的方法易于实现，不会显著增加运行时间，可应用于多种相机参数化方式，并可直接集成到其他类NeRF模型中。
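The preconditioning idea can be shown in isolation: given a proxy Jacobian J that maps small camera-parameter perturbations to changes in projected points, form the covariance C = J^T J and reparameterize the camera parameters as theta = theta0 + P u with P = C^(-1/2), so each component of u has a decorrelated, unit-scale effect. Everything below (the pinhole proxy, the damping value, the parameter layout) is an illustrative assumption rather than the paper's exact construction.

```python
import numpy as np

def whitening_preconditioner(jacobian, damping=1e-4):
    """Compute P = (J^T J + damping*I)^(-1/2) so that reparameterized camera
    parameters theta = theta0 + P @ u have decorrelated, unit-scale effects."""
    cov = jacobian.T @ jacobian + damping * np.eye(jacobian.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def numerical_jacobian(project, theta, eps=1e-5):
    """Finite-difference proxy Jacobian of projected points w.r.t. camera params."""
    base = project(theta)
    cols = []
    for i in range(theta.size):
        t = theta.copy()
        t[i] += eps
        cols.append((project(t) - base) / eps)
    return np.stack(cols, axis=1)                   # (num_outputs, num_params)

# Toy pinhole proxy: params = [focal, tx, ty]; project a fixed set of 3D points.
points = np.random.default_rng(2).normal(size=(20, 3)) + np.array([0, 0, 5.0])

def project(theta):
    f, tx, ty = theta
    x = f * points[:, 0] / points[:, 2] + tx
    y = f * points[:, 1] / points[:, 2] + ty
    return np.concatenate([x, y])

theta0 = np.array([500.0, 0.0, 0.0])
J = numerical_jacobian(project, theta0)
P = whitening_preconditioner(J)
print(np.round((J @ P).T @ (J @ P), 3))             # ~ identity: decorrelated parameters
```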

Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation

  • paper_url: http://arxiv.org/abs/2308.10898
  • repo_url: None
  • paper_authors: Xueyi Liu, Bin Wang, He Wang, Li Yi
  • for: 本研究旨在解决少样本、物理感知的铰接网格生成问题：仅通过观察包含少量样例的铰接物体数据集，学习一个能够生成多样、视觉保真且物理有效的网格的模型。
  • methods: 我们提出两项关键创新：1）基于分而治之思想的层次网格变形生成模型，通过从大规模刚性网格中借用可迁移的变形模式来缓解少样本难题；2）物理感知的变形校正方案，以促进物理上合理的生成。
  • results: 我们在6个铰接物体类别上进行了大量实验，证明所提方法在少样本设定下生成的铰接网格在多样性、视觉保真度和物理有效性方面均优于此前方法；消融实验进一步验证了两项创新的贡献。项目页面及代码见 https://meowuu7.github.io/few-arti-obj-gen。
    Abstract We study the problem of few-shot physically-aware articulated mesh generation. By observing an articulated object dataset containing only a few examples, we wish to learn a model that can generate diverse meshes with high visual fidelity and physical validity. Previous mesh generative models either have difficulties in depicting a diverse data space from only a few examples or fail to ensure physical validity of their samples. Regarding the above challenges, we propose two key innovations, including 1) a hierarchical mesh deformation-based generative model based upon the divide-and-conquer philosophy to alleviate the few-shot challenge by borrowing transferrable deformation patterns from large scale rigid meshes and 2) a physics-aware deformation correction scheme to encourage physically plausible generations. We conduct extensive experiments on 6 articulated categories to demonstrate the superiority of our method in generating articulated meshes with better diversity, higher visual fidelity, and better physical validity over previous methods in the few-shot setting. Further, we validate solid contributions of our two innovations in the ablation study. Project page with code is available at https://meowuu7.github.io/few-arti-obj-gen.
    摘要 我们研究少样本、物理感知的铰接网格生成问题。通过观察仅包含少量样例的铰接物体数据集，我们希望学习一个能够生成多样化、高视觉保真度且物理有效的网格的模型。以往的网格生成模型要么难以仅凭少量样例刻画多样的数据空间，要么无法保证所生成样本的物理有效性。针对上述挑战，我们提出两项关键创新：1）基于分而治之思想的层次网格变形生成模型，通过从大规模刚性网格中借用可迁移的变形模式来缓解少样本难题；2）物理感知的变形校正方案，以促进物理上合理的生成。我们在6个铰接物体类别上进行了大量实验，证明所提方法在少样本设定下生成的铰接网格在多样性、视觉保真度和物理有效性方面均优于此前方法。消融实验进一步验证了两项创新的切实贡献。项目页面及代码见 https://meowuu7.github.io/few-arti-obj-gen。

Can Language Models Learn to Listen?

  • paper_url: http://arxiv.org/abs/2308.10897
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar
  • For: The paper is written for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker’s words.
  • Methods: The approach uses an autoregressive model that predicts a response of a listener, which is a sequence of listener facial gestures, quantized using a VQ-VAE. The model treats the quantized atomic motion elements as additional language token inputs to a transformer-based large language model.
  • Results: The generated listener motion is fluent and reflective of language semantics, as shown through quantitative metrics and a qualitative user study. The model demonstrates the ability to utilize temporal and semantic aspects of spoken text.
    Abstract We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
    摘要 我们提出了一个框架，用于在二人社交互动中根据说话人的话语生成倾听者的恰当面部反应。给定说话人话语的转写文本及其时间戳，我们的方法自回归地预测倾听者的反应：一段用VQ-VAE量化的倾听者面部动作序列。由于动作姿态是语言的组成部分，我们提出将量化后的原子运动元素作为额外的语言token输入到基于Transformer的大语言模型中。用仅在文本上预训练的语言模型权重初始化我们的Transformer，可以得到明显优于从零训练的倾听者反应质量。我们通过定量指标和定性用户研究表明，生成的倾听者动作流畅且反映语言语义。在评估中，我们分析了模型利用口语文本时间和语义信息的能力。项目页面：https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

Differentiable Shadow Mapping for Efficient Inverse Graphics

  • paper_url: http://arxiv.org/abs/2308.10896
  • repo_url: https://github.com/mworchel/differentiable-shadow-mapping
  • paper_authors: Markus Worchel, Marc Alexa
  • for: 研究如何在三角网格的可微渲染中高效地生成阴影
  • methods: 将预滤波阴影贴图（一种基于从光源视角渲染来近似阴影的技术）与现有的可微光栅化器相结合，得到可微的可见性信息
  • results: 在多个逆向图形问题上，可微阴影贴图比精度相近的可微光线传输模拟快若干数量级，而不带阴影的可微光栅化往往无法收敛
    Abstract We show how shadows can be efficiently generated in differentiable rendering of triangle meshes. Our central observation is that pre-filtered shadow mapping, a technique for approximating shadows based on rendering from the perspective of a light, can be combined with existing differentiable rasterizers to yield differentiable visibility information. We demonstrate at several inverse graphics problems that differentiable shadow maps are orders of magnitude faster than differentiable light transport simulation with similar accuracy -- while differentiable rasterization without shadows often fails to converge.
    摘要 我们展示了如何在三角网格的可微渲染中高效地生成阴影。我们的核心观察是：预滤波阴影贴图——一种通过从光源视角渲染来近似阴影的技术——可以与现有的可微光栅化器相结合，从而得到可微的可见性信息。我们在多个逆向图形问题上证明，可微阴影贴图比精度相近的可微光线传输模拟快若干数量级，而不带阴影的可微光栅化往往无法收敛。
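Pre-filtered shadow mapping replaces the hard depth comparison of classic shadow maps with a filtered statistic that varies smoothly and therefore admits gradients. A common instance is variance shadow mapping, where the light-space depth map stores (depth, depth^2), both are blurred, and visibility is estimated with Chebyshev's inequality. The sketch below shows only that soft visibility test on a toy depth map; it is not the authors' renderer, and their exact pre-filter may differ.

```python
import numpy as np

def box_blur(img, k=3):
    """Separable-free box filter used to pre-filter the light-space depth moments."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def soft_visibility(depth_map, receiver_depth, k=5, min_variance=1e-4):
    """Variance-shadow-map visibility: smooth in its inputs, so autodiff frameworks
    can propagate gradients through it (unlike a hard depth comparison)."""
    m1 = box_blur(depth_map, k)                        # E[d]
    m2 = box_blur(depth_map ** 2, k)                   # E[d^2]
    var = np.maximum(m2 - m1 ** 2, min_variance)
    d = receiver_depth - m1
    p_max = var / (var + d ** 2)                       # Chebyshev upper bound
    return np.where(receiver_depth <= m1, 1.0, p_max)  # fully lit if in front of blockers

# Toy light-space depth map: a near occluder (depth 2) over a far floor (depth 10).
depth = np.full((32, 32), 10.0)
depth[8:16, 8:16] = 2.0
vis = soft_visibility(depth, receiver_depth=9.5)
print(vis[12, 12], vis[0, 0])    # deep shadow under the occluder vs. fully lit floor
```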

Unlocking Accuracy and Fairness in Differentially Private Image Classification

  • paper_url: http://arxiv.org/abs/2308.10888
  • repo_url: None
  • paper_authors: Leonard Berrada, Soham De, Judy Hanwen Shen, Jamie Hayes, Robert Stanforth, David Stutz, Pushmeet Kohli, Samuel L. Smith, Borja Balle
  • For: 这项研究的目的是在不泄露敏感信息的前提下，让机器学习模型在隐私数据上进行训练。
  • Methods: 研究采用差分隐私（Differential Privacy）这一提供形式化隐私保证的金标准框架，对预训练的基础模型进行微调。
  • Results: 研究发现，用差分隐私微调预训练的基础模型，即使在预训练数据与下游任务存在显著分布偏移的情况下，也能达到与非隐私模型相近的准确率。
    Abstract Privacy-preserving machine learning aims to train models on private data without leaking sensitive information. Differential privacy (DP) is considered the gold standard framework for privacy-preserving training, as it provides formal privacy guarantees. However, compared to their non-private counterparts, models trained with DP often have significantly reduced accuracy. Private classifiers are also believed to exhibit larger performance disparities across subpopulations, raising fairness concerns. The poor performance of classifiers trained with DP has prevented the widespread adoption of privacy preserving machine learning in industry. Here we show that pre-trained foundation models fine-tuned with DP can achieve similar accuracy to non-private classifiers, even in the presence of significant distribution shifts between pre-training data and downstream tasks. We achieve private accuracies within a few percent of the non-private state of the art across four datasets, including two medical imaging benchmarks. Furthermore, our private medical classifiers do not exhibit larger performance disparities across demographic groups than non-private models. This milestone to make DP training a practical and reliable technology has the potential to widely enable machine learning practitioners to train safely on sensitive datasets while protecting individuals' privacy.
    摘要 隐私保护机器学习旨在不泄露敏感信息的前提下用隐私数据训练模型。差分隐私（DP）提供形式化的隐私保证，被视为隐私保护训练的金标准框架。然而，与非隐私模型相比，用DP训练的模型准确率往往显著下降；隐私分类器还被认为在不同子人群之间存在更大的性能差异，引发公平性担忧。DP训练的分类器性能欠佳，阻碍了隐私保护机器学习在工业界的广泛应用。在本文中，我们表明，用DP微调预训练的基础模型，即使预训练数据与下游任务之间存在显著的分布偏移，也能达到与非隐私分类器相近的准确率。我们在四个数据集（包括两个医学影像基准）上取得了与非隐私最优水平相差仅几个百分点的隐私准确率。此外，我们的隐私医学分类器在不同人口群体之间的性能差异并不比非隐私模型更大。这一使DP训练成为实用可靠技术的里程碑，有望让机器学习从业者在保护个人隐私的同时，安全地在敏感数据集上进行训练。
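Differentially private fine-tuning typically follows DP-SGD: clip each per-example gradient to a norm bound C, add Gaussian noise scaled to C, then average. The following is a generic, framework-agnostic sketch of that step for a logistic-regression head on features from a frozen pre-trained backbone; the hyperparameters are arbitrary and the privacy accountant (needed to state an actual epsilon) is omitted.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD step for logistic regression on (frozen) backbone features.

    Per-example gradients are clipped to `clip_norm`, summed, Gaussian noise with
    std = noise_multiplier * clip_norm is added, then the sum is averaged."""
    if rng is None:
        rng = np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    per_example_grads = (p - y)[:, None] * X                    # (B, D)

    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)

# Toy usage: 2-class features standing in for a pre-trained foundation model's output.
rng = np.random.default_rng(3)
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(8)
for _ in range(100):
    w = dp_sgd_step(w, X, y, rng=rng)
print((((X @ w) > 0).astype(float) == y).mean())   # training accuracy under noisy updates
```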

Vision Transformer Pruning Via Matrix Decomposition

  • paper_url: http://arxiv.org/abs/2308.10839
  • repo_url: None
  • paper_authors: Tianyi Sun
  • for: 降低存储、运行时内存和计算需求
  • methods: 使用矩阵分解方法(包括各种QR分解和LU分解)来降低矩阵维度和复杂度
  • results: 结果表明使用Singular Value Decomposition方法可以保持重要特征的生成,同时降低存储、运行时内存和计算需求。
    Abstract This is a further development of Vision Transformer Pruning via matrix decomposition. The purpose of Vision Transformer Pruning is to prune the dimension of the linear projection of the dataset by learning the associated importance scores, in order to reduce storage, run-time memory, and computational demands. In this paper we further reduce the dimension and complexity of the linear projection by implementing and comparing several matrix decomposition methods while preserving the generated important features. We ultimately select Singular Value Decomposition as the method to achieve this goal, by comparing the accuracy scores reported in the original GitHub repository with those obtained using the matrix decomposition methods, including Singular Value Decomposition, four versions of QR Decomposition, and LU factorization.
    摘要 本文是基于矩阵分解的 Vision Transformer 剪枝的进一步发展。Vision Transformer 剪枝的目的是通过学习线性投影各维度的重要性得分来裁剪其维度，从而降低存储、运行时内存和计算需求。在本文中，我们通过实现并比较多种矩阵分解方法，在保留重要特征的同时进一步降低线性投影的维度和复杂度。通过比较原始 GitHub 仓库中的准确率与使用各种矩阵分解方法（包括奇异值分解、四种QR分解和LU分解）得到的准确率，我们最终选择奇异值分解来实现这一目标。
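The core operation the paper studies, compressing a linear projection via matrix decomposition, can be illustrated with truncated SVD: W ≈ U_r S_r V_r^T replaces one dense layer with two thinner ones, cutting parameters whenever the retained rank r is small. The layer sizes and rank below are arbitrary examples, not the paper's configuration.

```python
import torch

def svd_compress_linear(linear: torch.nn.Linear, rank: int):
    """Replace a Linear layer with two low-rank Linears using a truncated SVD of W."""
    W = linear.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    first = torch.nn.Linear(linear.in_features, rank, bias=False)
    second = torch.nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data = S_r.sqrt().unsqueeze(1) * Vh_r          # (rank, in)
    second.weight.data = U_r * S_r.sqrt().unsqueeze(0)          # (out, rank)
    if linear.bias is not None:
        second.bias.data = linear.bias.data.clone()
    return torch.nn.Sequential(first, second)

# Toy usage: compress a 384->384 projection (ViT-small-like width) to rank 64.
layer = torch.nn.Linear(384, 384)
compressed = svd_compress_linear(layer, rank=64)
x = torch.randn(2, 197, 384)                    # (batch, tokens, dim)
err = (layer(x) - compressed(x)).norm() / layer(x).norm()
orig_params = sum(p.numel() for p in layer.parameters())
new_params = sum(p.numel() for p in compressed.parameters())
print(float(err), orig_params, new_params)      # approximation error vs. parameter savings
```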

EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2308.10832
  • repo_url: https://github.com/gmberton/eigenplaces
  • paper_authors: Gabriele Berton, Gabriele Trivigno, Barbara Caputo, Carlo Masone
  • For: The paper is written for the task of visual place recognition, specifically to improve the robustness of the model to different viewpoints.
  • Methods: The proposed method, called EigenPlaces, uses a new approach to train the neural network on images from different viewpoints, which embeds viewpoint robustness into the learned global descriptors. The method clusters the training data to explicitly present the model with different views of the same points of interest, without the need for extra supervision.
  • Results: The paper presents experiments on the most comprehensive set of datasets in literature, showing that EigenPlaces outperforms previous state-of-the-art methods on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors.
  • For: 这篇论文面向视觉地点识别任务，旨在提高模型对不同视点的鲁棒性。
  • Methods: 所提方法 EigenPlaces 采用一种新方式，在不同视点的图像上训练神经网络，将视点鲁棒性嵌入学习到的全局描述符中。该方法对训练数据进行聚类，显式地向模型呈现同一兴趣点的不同视角，且无需额外监督。
  • Results: 论文在文献中最全面的数据集上进行实验，发现 EigenPlaces 在大多数数据集上超越了此前的最先进方法，同时训练所需GPU内存减少60%，描述符减小50%。
    Abstract Visual Place Recognition is a task that aims to predict the place of an image (called query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, to train our neural network on images from different points of view, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. The selection of these points of interest is done without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in literature, finding that EigenPlaces is able to outperform previous state of the art on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors. The code and trained models for EigenPlaces are available at https://github.com/gmberton/EigenPlaces, while results with any other baseline can be computed with the codebase at https://github.com/gmberton/auto_VPR.
    摘要 视觉地点识别任务的目标是仅凭图像（称为查询）的视觉特征预测其所在地点。通常的做法是图像检索：利用学习到的全局描述符，将查询与大规模带地理标注照片库中最相似的图像进行匹配。该任务的一个主要挑战是识别从不同视点观察到的地点。为克服这一限制，我们提出一种新方法 EigenPlaces，在不同视点的图像上训练神经网络，将视点鲁棒性嵌入学习到的全局描述符中。其基本思想是对训练数据进行聚类，从而显式地向模型呈现同一兴趣点的不同视角；兴趣点的选取无需额外监督。我们在文献中最全面的数据集上进行实验，发现 EigenPlaces 在大多数数据集上超越此前的最先进方法，同时训练所需GPU内存减少60%，描述符减小50%。EigenPlaces 的代码和训练好的模型见 https://github.com/gmberton/EigenPlaces，其他基线的结果可用 https://github.com/gmberton/auto_VPR 中的代码计算。

Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10820
  • repo_url: None
  • paper_authors: Miaoyu Li, Ying Fu, Ji Liu, Yulun Zhang
  • for: 本文提出一种基于深度展开框架的高光谱图像（HSI）重建方法，以解决现有方法与HSI数据匹配不足的问题。
  • methods: 在数据模块中采用像素自适应的下降步长，并引入非局部光谱Transformer（NST）来强调HSI的3D特性；此外，利用快速傅里叶变换（FFT）改进不同阶段和深度之间的特征交互。
  • results: 实验结果表明，与现有HSI重建方法相比，所提方法在模拟和真实场景中均取得更优的重建性能。代码见 https://github.com/MyuLi/PADUT。
    Abstract Hyperspectral Image (HSI) reconstruction has made gratifying progress with the deep unfolding framework by formulating the problem into a data module and a prior module. Nevertheless, existing methods still face the problem of insufficient matching with HSI data. The issues lie in three aspects: 1) fixed gradient descent step in the data module while the degradation of HSI is agnostic in the pixel-level. 2) inadequate prior module for 3D HSI cube. 3) stage interaction ignoring the differences in features at different stages. To address these issues, in this work, we propose a Pixel Adaptive Deep Unfolding Transformer (PADUT) for HSI reconstruction. In the data module, a pixel adaptive descent step is employed to focus on pixel-level agnostic degradation. In the prior module, we introduce the Non-local Spectral Transformer (NST) to emphasize the 3D characteristics of HSI for recovering. Moreover, inspired by the diverse expression of features in different stages and depths, the stage interaction is improved by the Fast Fourier Transform (FFT). Experimental results on both simulated and real scenes exhibit the superior performance of our method compared to state-of-the-art HSI reconstruction methods. The code is released at: https://github.com/MyuLi/PADUT.
    摘要 借助深度展开框架（将问题划分为数据模块与先验模块），高光谱图像（HSI）重建已取得令人满意的进展。然而，现有方法仍存在与HSI数据匹配不足的问题，体现在三个方面：1）数据模块采用固定的梯度下降步长，而HSI的退化在像素级上是未知的；2）先验模块对3D HSI立方体的建模不足；3）阶段交互忽略了不同阶段特征的差异。针对这些问题，本文提出像素自适应深度展开Transformer（PADUT）：在数据模块中采用像素自适应的下降步长，以应对像素级未知退化；在先验模块中引入非局部光谱Transformer（NST），强调HSI的3D特性以利于恢复；此外，受不同阶段和深度特征表达差异的启发，通过快速傅里叶变换（FFT）改进阶段交互。在模拟和真实场景上的实验结果表明，所提方法优于最先进的HSI重建方法。代码发布于：https://github.com/MyuLi/PADUT。
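The "pixel adaptive descent step" in the data module can be pictured as replacing the scalar step size of a gradient step x <- x - rho * grad f(x) with a per-pixel map rho(x) predicted from the current estimate, so heavily degraded pixels can take different step sizes than clean ones. The tiny sketch below uses a small convolutional predictor and a generic masked-sensing data term purely to show the shape of the computation; it is not the paper's degradation model or network.

```python
import torch
import torch.nn as nn

class PixelAdaptiveStep(nn.Module):
    """Predict a per-pixel step-size map from the current estimate (illustrative only)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1), nn.Softplus())  # positive steps

    def forward(self, x):
        return self.net(x)

def data_module_step(x, y, mask, step_net):
    """One unfolded iteration with a masked-sensing data term:
    grad = M^T (M x - y); x <- x - rho(x) * grad, with rho a per-pixel map."""
    grad = mask * (mask * x - y)
    rho = step_net(x)
    return x - rho * grad

# Toy usage: a 4-channel "spectral" cube of size 32x32 with a random sampling mask.
B, C, H, W = 1, 4, 32, 32
x = torch.rand(B, C, H, W)
mask = (torch.rand(B, C, H, W) > 0.5).float()
y = mask * torch.rand(B, C, H, W)
step_net = PixelAdaptiveStep(C)
x_next = data_module_step(x, y, mask, step_net)
print(x_next.shape)
```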

Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.10814
  • repo_url: None
  • paper_authors: Natalia Frumkin, Dibakar Gope, Diana Marculescu
  • for: 提高量化神经网络的精度和效率
  • methods: 使用进化搜索和infoNCE损失遍历非平滑的测试损失面
  • results: 在不同量化级别(3-bit、4-bit、8-bit)下,提高全量化ViT-Base的top-1准确率,并在极端量化场景下保持稳定性和可靠性
    Abstract Quantization scale and bit-width are the most important parameters when considering how to quantize a neural network. Prior work focuses on optimizing quantization scales in a global manner through gradient methods (gradient descent \& Hessian analysis). Yet, when applying perturbations to quantization scales, we observe a very jagged, highly non-smooth test loss landscape. In fact, small perturbations in quantization scale can greatly affect accuracy, yielding a $0.5-0.8\%$ accuracy boost in 4-bit quantized vision transformers (ViTs). In this regime, gradient methods break down, since they cannot reliably reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to effectively traverse the non-smooth landscape. Additionally, we propose using an infoNCE loss, which not only helps combat overfitting on the small calibration dataset ($1,000$ images) but also makes traversing such a highly non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully quantized ViT-Base by $10.30\%$, $0.78\%$, and $0.15\%$ for $3$-bit, $4$-bit, and $8$-bit weight quantization levels. Extensive experiments on a variety of CNN and ViT architectures further demonstrate its robustness in extreme quantization scenarios. Our code is available at https://github.com/enyac-group/evol-q
    摘要 量化尺度和位宽是神经网络量化时最重要的参数。已有工作侧重于通过梯度方法（梯度下降和Hessian分析）对量化尺度进行全局优化。然而，当对量化尺度施加扰动时，我们观察到测试损失面非常崎岖、高度不平滑。事实上，量化尺度的微小扰动就能显著影响准确率，在4比特量化的视觉Transformer（ViT）上可带来0.5-0.8%的准确率提升。在这种情况下，梯度方法会失效，因为它们无法可靠地到达局部极小值。在我们称为 Evol-Q 的工作中，我们使用进化搜索来有效地遍历这种非平滑的损失面。此外，我们提出使用 infoNCE 损失，它不仅有助于对抗在小规模校准集（1,000张图像）上的过拟合，也使遍历这种高度非平滑的表面更加容易。Evol-Q 在3比特、4比特和8比特权重量化下，分别将全量化 ViT-Base 的 top-1 准确率提升 10.30%、0.78% 和 0.15%。在多种 CNN 和 ViT 架构上的大量实验进一步证明了其在极端量化场景下的鲁棒性。我们的代码见 https://github.com/enyac-group/evol-q
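The search component can be reduced to a toy example: maintain a small population of candidate quantization scales, mutate them, and keep the candidate with the lowest calibration loss, jumping between local minima that gradient descent would get stuck in. The fitness below is a plain weight-reconstruction error rather than the paper's block-wise infoNCE objective, and all hyperparameters are made up for illustration.

```python
import numpy as np

def fake_quantize(w, scale, bits=4):
    """Uniform symmetric fake-quantization of weights with a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def evolutionary_scale_search(w, generations=50, pop=8, sigma=0.05, rng=None):
    """Evolve the quantization scale of one layer to minimize a proxy fitness.
    (Evol-Q uses a calibration-set, infoNCE-based fitness; this is a simplified stand-in.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    best = np.abs(w).max() / 7.0                       # naive initial scale
    best_fit = np.mean((w - fake_quantize(w, best)) ** 2)
    for _ in range(generations):
        candidates = best * (1.0 + sigma * rng.standard_normal(pop))
        for s in candidates:
            if s <= 0:
                continue
            fit = np.mean((w - fake_quantize(w, s)) ** 2)
            if fit < best_fit:                         # jump to a better local minimum
                best, best_fit = s, fit
    return best, best_fit

w = np.random.default_rng(4).normal(size=(768, 768)) * 0.02
scale, err = evolutionary_scale_search(w)
print(scale, err)
```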