cs.CV - 2023-10-29

3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets

  • paper_url: http://arxiv.org/abs/2310.19188
  • repo_url: None
  • paper_authors: Ta-Ying Cheng, Matheus Gadelha, Soren Pirk, Thibault Groueix, Radomir Mech, Andrew Markham, Niki Trigoni
  • for: Mining 3D shapes from large-scale unannotated image datasets.
  • methods: Uses self-supervised image representations to cluster images with geometrically similar shapes and find common correspondences between them; these correspondences yield rough camera estimates that initialize bundle adjustment, and a progressive bundle-adjusting reconstruction then learns a neural occupancy field for each image cluster.
  • results: On Pix3D chairs the method produces significantly better results than state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively and qualitatively; the authors also demonstrate in-the-wild reconstruction on images from the LAION-5B dataset.
    Abstract We present 3DMiner -- a pipeline for mining 3D shapes from challenging large-scale unannotated image datasets. Unlike other unsupervised 3D reconstruction methods, we assume that, within a large-enough dataset, there must exist images of objects with similar shapes but varying backgrounds, textures, and viewpoints. Our approach leverages the recent advances in learning self-supervised image representations to cluster images with geometrically similar shapes and find common image correspondences between them. We then exploit these correspondences to obtain rough camera estimates as initialization for bundle-adjustment. Finally, for every image cluster, we apply a progressive bundle-adjusting reconstruction method to learn a neural occupancy field representing the underlying shape. We show that this procedure is robust to several types of errors introduced in previous steps (e.g., wrong camera poses, images containing dissimilar shapes, etc.), allowing us to obtain shape and pose annotations for images in-the-wild. When using images from Pix3D chairs, our method is capable of producing significantly better results than state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively and qualitatively. Furthermore, we show how 3DMiner can be applied to in-the-wild data by reconstructing shapes present in images from the LAION-5B dataset. Project Page: https://ttchengab.github.io/3dminerOfficial
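The first stage clusters images whose objects have geometrically similar shapes using self-supervised representations. A minimal sketch of that stage, assuming a DINO ViT-S/16 backbone from torch.hub as a stand-in for the representation network and k-means for clustering (backbone choice and cluster count are illustrative, not necessarily the authors' exact setup):

```python
import torch
from sklearn.cluster import KMeans

# Self-supervised ViT (DINO) as a stand-in for the representation network
# used to group images of geometrically similar objects.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

@torch.no_grad()
def embed(images):              # images: (N, 3, 224, 224), ImageNet-normalized
    return model(images)        # (N, 384) global [CLS] embeddings

def cluster_images(images, n_clusters=50):
    feats = embed(images).cpu().numpy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    return labels               # each cluster is later reconstructed jointly
```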

Fast Trainable Projection for Robust Fine-Tuning

  • paper_url: http://arxiv.org/abs/2310.19182
  • repo_url: https://github.com/gt-ripl/ftp
  • paper_authors: Junjiao Tian, Yen-Cheng Liu, James Seale Smith, Zsolt Kira
  • for: Improve the out-of-distribution robustness of pre-trained models when fine-tuning them on downstream tasks, while maintaining in-distribution (ID) performance.
  • methods: Fast Trainable Projection (FTP), a projection-based fine-tuning method with learnable per-layer projection constraints that improve efficiency and scalability; FTP can be combined with existing optimizers such as AdamW and is a special instance of hyper-optimizers that tune optimizer hyper-parameters in a learnable manner.
  • results: Superior robustness on out-of-distribution (OOD) datasets, including domain shifts and natural corruptions, across four vision tasks with five pre-trained models; FTP is also broadly applicable to other settings such as low-label and continual learning. Code: https://github.com/GT-RIPL/FTP.git.
    Abstract Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. Recently, projected gradient descent has been successfully used in robust fine-tuning by constraining the deviation from the initialization of the fine-tuned model explicitly through projection. However, algorithmically, two limitations prevent this method from being adopted more widely, scalability and efficiency. In this paper, we propose a new projection-based fine-tuning algorithm, Fast Trainable Projection (FTP) for computationally efficient learning of per-layer projection constraints, resulting in an average $35\%$ speedup on our benchmarks compared to prior works. FTP can be combined with existing optimizers such as AdamW, and be used in a plug-and-play fashion. Finally, we show that FTP is a special instance of hyper-optimizers that tune the hyper-parameters of optimizers in a learnable manner through nested differentiation. Empirically, we show superior robustness on OOD datasets, including domain shifts and natural corruptions, across four different vision tasks with five different pre-trained models. Additionally, we demonstrate that FTP is broadly applicable and beneficial to other learning scenarios such as low-label and continual learning settings thanks to its easy adaptability. The code will be available at https://github.com/GT-RIPL/FTP.git.
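FTP constrains how far the fine-tuned weights may drift from the pre-trained initialization via per-layer projections. A minimal sketch of the underlying projection step with *fixed* per-layer radii; FTP's contribution is to make these constraints learnable and cheap, which this sketch does not implement:

```python
import torch

def project_to_init(model, init_params, radii):
    """After each optimizer step, project each layer's weights back onto a
    ball of radius radii[name] around the pre-trained initialization.
    FTP *learns* these per-layer radii; here they are fixed hyper-parameters
    for illustration."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            delta = p - init_params[name]
            norm = delta.norm()
            r = radii[name]
            if norm > r:
                p.copy_(init_params[name] + delta * (r / norm))

# usage sketch:
# init_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# ... optimizer.step() ...
# project_to_init(model, init_params, radii)
```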

BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping

  • paper_url: http://arxiv.org/abs/2310.19168
  • repo_url: https://github.com/mvrl/birdsat
  • paper_authors: Srikumar Sastry, Subash Khanal, Aayush Dhakal, Di Huang, Nathan Jacobs
  • for: Develop a metadata-aware self-supervised learning (SSL) framework for fine-grained classification and ecological mapping of bird species.
  • methods: Unifies two SSL strategies, Contrastive Learning (CL) and Masked Image Modeling (MIM), while enriching the embedding space with metadata attached to ground-level bird imagery; uni-modal and cross-modal Vision Transformers are trained on a cross-view global bird species dataset containing ground-level images, metadata (location, time), and corresponding satellite imagery.
  • results: Evaluations on two downstream tasks, fine-grained visual classification (FGVC) and cross-modal retrieval, show the models learn fine-grained, geographically conditioned features of birds; the pre-trained models reach state-of-the-art FGVC performance and transfer well, and the strong cross-modal retrieval enables building species distribution maps across any geographic region.
    Abstract We propose a metadata-aware self-supervised learning~(SSL)~framework useful for fine-grained classification and ecological mapping of bird species around the world. Our framework unifies two SSL strategies: Contrastive Learning~(CL) and Masked Image Modeling~(MIM), while also enriching the embedding space with metadata available with ground-level imagery of birds. We separately train uni-modal and cross-modal ViT on a novel cross-view global bird species dataset containing ground-level imagery, metadata (location, time), and corresponding satellite imagery. We demonstrate that our models learn fine-grained and geographically conditioned features of birds, by evaluating on two downstream tasks: fine-grained visual classification~(FGVC) and cross-modal retrieval. Pre-trained models learned using our framework achieve SotA performance on FGVC of iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and NABirds datasets. Moreover, the impressive cross-modal retrieval performance of our model enables the creation of species distribution maps across any geographic region. The dataset and source code will be released at https://github.com/mvrl/BirdSAT}.

Out-of-distribution Object Detection through Bayesian Uncertainty Estimation

  • paper_url: http://arxiv.org/abs/2310.19119
  • repo_url: None
  • paper_authors: Tianhao Zhang, Shenglin Wang, Nidhal Bouaynaya, Radu Calinescu, Lyudmila Mihaylova
  • for: Propose a novel Bayesian object detection method that improves detector performance on out-of-distribution (OOD) data.
  • methods: Distinguishes in-distribution (ID) from OOD data by sampling weight parameters from proposed Gaussian distributions built on pre-trained networks; unlike other uncertainty-modeling approaches, it requires neither expensive inference of weight distributions nor training on synthetic outlier data.
  • results: Trained on BDD100k and VOC as ID datasets and evaluated on COCO2017 as the OOD dataset, the Bayesian detector achieves satisfactory OOD identification, reducing the FPR95 score by up to 8.19% and increasing the AUROC score by up to 13.94%.
    Abstract The superior performance of object detectors is often established under the condition that the test samples are in the same distribution as the training data. However, in many practical applications, out-of-distribution (OOD) instances are inevitable and usually lead to uncertainty in the results. In this paper, we propose a novel, intuitive, and scalable probabilistic object detection method for OOD detection. Unlike other uncertainty-modeling methods that either require huge computational costs to infer the weight distributions or rely on model training through synthetic outlier data, our method is able to distinguish between in-distribution (ID) data and OOD data via weight parameter sampling from proposed Gaussian distributions based on pre-trained networks. We demonstrate that our Bayesian object detector can achieve satisfactory OOD identification performance by reducing the FPR95 score by up to 8.19% and increasing the AUROC score by up to 13.94% when trained on BDD100k and VOC datasets as the ID datasets and evaluated on COCO2017 dataset as the OOD dataset.
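The core idea is to sample weights from Gaussian distributions centered at the pre-trained values and use the resulting predictive disagreement as an OOD score. A minimal sketch of that weight-sampling step on a generic classifier; the noise scale, number of samples, and variance-based score are illustrative, not the paper's exact per-parameter formulation for detection:

```python
import torch

@torch.no_grad()
def predictive_uncertainty(model, x, sigma=0.01, n_samples=10):
    """Perturb the pre-trained weights with Gaussian noise (mean = the
    pre-trained value) and measure how much the class scores disagree.
    High variance -> likely OOD. sigma and n_samples are illustrative."""
    base = {n: p.detach().clone() for n, p in model.named_parameters()}
    outputs = []
    for _ in range(n_samples):
        for n, p in model.named_parameters():
            p.copy_(base[n] + sigma * torch.randn_like(p))
        outputs.append(torch.softmax(model(x), dim=-1))
    for n, p in model.named_parameters():   # restore original weights
        p.copy_(base[n])
    probs = torch.stack(outputs)             # (n_samples, B, C)
    return probs.var(dim=0).sum(dim=-1)      # per-sample uncertainty score
```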

CrossEAI: Using Explainable AI to generate better bounding boxes for Chest X-ray images

  • paper_url: http://arxiv.org/abs/2310.19835
  • repo_url: None
  • paper_authors: Jinze Zhao
  • for: Improve the accuracy of bounding box generation for chest X-ray diagnosis using post-hoc explainable-AI methods.
  • methods: The proposed CrossEAI combines heatmap and gradient map information: a weighted average of Guided Backpropagation and Grad-CAM++ is used to generate bounding boxes that are closer to the ground truth.
  • results: Achieves a significant improvement over the state-of-the-art model under the same setting, with an average 9% improvement across all diseases over all IoU thresholds; without using any ground-truth bounding box information for training, it matches a model trained with 80% of the ground-truth boxes.
    Abstract Explainability is critical for deep learning applications in healthcare which are mandated to provide interpretations to both patients and doctors according to legal regulations and responsibilities. Explainable AI methods, such as feature importance using integrated gradients, model approximation using LIME, or neuron activation and layer conductance to provide interpretations for certain health risk predictions. In medical imaging diagnosis, disease classification usually achieves high accuracy, but generated bounding boxes have much lower Intersection over Union (IoU). Different methods with self-supervised or semi-supervised learning strategies have been proposed, but few improvements have been identified for bounding box generation. Previous work shows that bounding boxes generated by these methods are usually larger than ground truth and contain major non-disease area. This paper utilizes the advantages of post-hoc AI explainable methods to generate bounding boxes for chest x-ray image diagnosis. In this work, we propose CrossEAI which combines heatmap and gradient map to generate more targeted bounding boxes. By using weighted average of Guided Backpropagation and Grad-CAM++, we are able to generate bounding boxes which are closer to the ground truth. We evaluate our model on a chest x-ray dataset. The performance has significant improvement over the state of the art model with the same setting, with $9\%$ improvement in average of all diseases over all IoU. Moreover, as a model that does not use any ground truth bounding box information for training, we achieve same performance in general as the model that uses $80\%$ of the ground truth bounding box information for training
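The fusion described in the abstract is a weighted average of a Guided Backpropagation map and a Grad-CAM++ map, followed by extracting a bounding box from the fused heatmap. A minimal sketch assuming both attribution maps are already computed by any attribution library; the weighting coefficient and threshold are illustrative, not the paper's tuned values:

```python
import numpy as np

def crosseai_bbox(guided_bp, gradcam_pp, alpha=0.5, thresh=0.4):
    """Fuse two HxW attribution maps (values roughly in [0, 1]) by a
    weighted average, threshold the result, and return the tight box
    around the kept pixels as (x1, y1, x2, y2)."""
    heat = alpha * guided_bp + (1.0 - alpha) * gradcam_pp
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    ys, xs = np.where(heat >= thresh)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```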

Reward Finetuning for Faster and More Accurate Unsupervised Object Discovery

  • paper_url: http://arxiv.org/abs/2310.19080
  • repo_url: None
  • paper_authors: Katie Z Luo, Zhenzhen Liu, Xiangyu Chen, Yurong You, Sagie Benaim, Cheng Perng Phoo, Mark Campbell, Wen Sun, Bharath Hariharan, Kilian Q. Weinberger
  • for: Adapt RLHF-style reinforcement learning methods, which align models with human preferences, to autonomous-driving research, specifically unsupervised object discovery from LiDAR points.
  • methods: Instead of labels, simple heuristics mimic human feedback: multiple heuristics are combined into a reward function whose score correlates positively with bounding box accuracy, and boxes with high rewards are reinforced through gradient updates starting from the detector's own predictions.
  • results: More accurate than prior work on object discovery, and orders of magnitude faster to train.
    Abstract Recent advances in machine learning have shown that Reinforcement Learning from Human Feedback (RLHF) can improve machine learning models and align them with human preferences. Although very successful for Large Language Models (LLMs), these advancements have not had a comparable impact in research for autonomous vehicles -- where alignment with human expectations can be imperative. In this paper, we propose to adapt similar RL-based methods to unsupervised object discovery, i.e. learning to detect objects from LiDAR points without any training labels. Instead of labels, we use simple heuristics to mimic human feedback. More explicitly, we combine multiple heuristics into a simple reward function that positively correlates its score with bounding box accuracy, \ie, boxes containing objects are scored higher than those without. We start from the detector's own predictions to explore the space and reinforce boxes with high rewards through gradient updates. Empirically, we demonstrate that our approach is not only more accurate, but also orders of magnitudes faster to train compared to prior works on object discovery.
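The reward combines simple heuristics so that boxes likely to contain an object score higher than empty or implausible ones. A toy sketch under assumed heuristics; the point-count and volume thresholds are invented for illustration and the paper's actual heuristics and weighting differ:

```python
import numpy as np

def box_reward(points_in_box, box_volume, min_points=5, max_volume=60.0):
    """Score a candidate LiDAR box: require a minimum number of interior
    points and a plausible size, then reward higher point density.
    All constants here are illustrative stand-ins."""
    density = points_in_box / max(box_volume, 1e-6)
    has_points = float(points_in_box >= min_points)
    plausible_size = float(box_volume <= max_volume)
    return has_points * plausible_size * np.log1p(density)
```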

Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.19070
  • repo_url: None
  • paper_authors: Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo
  • for: Propose a new large multimodal model (Myriad) for industrial anomaly detection (IAD) that provides definite anomaly detection together with detailed anomaly descriptions.
  • methods: Uses MiniGPT-4 as the base large multimodal model; an Expert Perception module embeds prior knowledge from vision experts as tokens intelligible to the LLM, a domain adapter bridges the visual representation gap between generic and industrial images, and a Vision Expert Instructor enables the Q-Former to generate IAD-domain vision-language tokens guided by the vision experts.
  • results: Experiments on the MVTec-AD and VisA benchmarks show the method performs favorably against state-of-the-art approaches under 1-class and few-shot settings, while also providing definite anomaly predictions and detailed descriptions in the IAD domain.
    Abstract Existing industrial anomaly detection (IAD) methods predict anomaly scores for both anomaly detection and localization. However, they struggle to perform a multi-turn dialog and detailed descriptions for anomaly regions, e.g., color, shape, and categories of industrial anomalies. Recently, large multimodal (i.e., vision and language) models (LMMs) have shown eminent perception abilities on multiple vision tasks such as image captioning, visual understanding, visual reasoning, etc., making it a competitive potential choice for more comprehensible anomaly detection. However, the knowledge about anomaly detection is absent in existing general LMMs, while training a specific LMM for anomaly detection requires a tremendous amount of annotated data and massive computation resources. In this paper, we propose a novel large multi-modal model by applying vision experts for industrial anomaly detection (dubbed Myriad), which leads to definite anomaly detection and high-quality anomaly description. Specifically, we adopt MiniGPT-4 as the base LMM and design an Expert Perception module to embed the prior knowledge from vision experts as tokens which are intelligible to Large Language Models (LLMs). To compensate for the errors and confusions of vision experts, we introduce a domain adapter to bridge the visual representation gaps between generic and industrial images. Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former to generate IAD domain vision-language tokens according to vision expert prior. Extensive experiments on MVTec-AD and VisA benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods under the 1-class and few-shot settings, but also provide definite anomaly prediction along with detailed descriptions in IAD domain.

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

  • paper_url: http://arxiv.org/abs/2310.19061
  • repo_url: https://github.com/zhilingyan/gpt4v-medical-report
  • paper_authors: Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun
  • for: Evaluate the capabilities of the state-of-the-art multimodal large language model GPT-4 with Vision (GPT-4V) on medical Visual Question Answering (VQA).
  • methods: Textual prompts direct GPT-4V to combine visual and textual information for questions paired with pathology and radiology images spanning 11 modalities and fifteen objects of interest.
  • results: The current version of GPT-4V is not recommended for real-world diagnostics due to its unreliable and suboptimal accuracy on diagnostic medical questions; the paper also delineates seven facets of GPT-4V's behavior in medical VQA that highlight its constraints. Detailed evaluation cases are available at https://github.com/ZhilingYan/GPT4V-Medical-Report.
    Abstract In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model, i.e., GPT-4 with Vision (GPT-4V), on Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's proficiency in answering questions paired with images using both pathology and radiology datasets from 11 modalities (e.g. Microscopy, Dermoscopy, X-ray, CT, etc.) and fifteen objects of interests (brain, liver, lung, etc.). Our datasets encompass a comprehensive range of medical inquiries, including sixteen distinct question types. Throughout our evaluations, we devised textual prompts for GPT-4V, directing it to synergize visual and textual information. The experiments with accuracy score conclude that the current version of GPT-4V is not recommended for real-world diagnostics due to its unreliable and suboptimal accuracy in responding to diagnostic medical questions. In addition, we delineate seven unique facets of GPT-4V's behavior in medical VQA, highlighting its constraints within this complex arena. The complete details of our evaluation cases are accessible at https://github.com/ZhilingYan/GPT4V-Medical-Report.

Boosting Decision-Based Black-Box Adversarial Attack with Gradient Priors

  • paper_url: http://arxiv.org/abs/2310.19038
  • repo_url: None
  • paper_authors: Han Liu, Xingshuo Huang, Xiaotong Zhang, Qimai Li, Fenglong Ma, Wei Wang, Hongyang Chen, Hong Yu, Xianchao Zhang
  • for: Improve the efficiency and accuracy of decision-based black-box adversarial attacks.
  • methods: Integrates a data-dependent gradient prior and a time-dependent prior into gradient estimation: a joint bilateral filter is applied to each random perturbation so that perturbations at edge locations are hardly smoothed (alleviating the edge gradient discrepancy and preserving the characteristics of the original image), and a new gradient updating strategy automatically adjusts the successive iteration gradient direction to accelerate convergence and improve query efficiency.
  • results: Extensive experiments show the method significantly outperforms other strong baselines.
    Abstract Decision-based methods have shown to be effective in black-box adversarial attacks, as they can obtain satisfactory performance and only require to access the final model prediction. Gradient estimation is a critical step in black-box adversarial attacks, as it will directly affect the query efficiency. Recent works have attempted to utilize gradient priors to facilitate score-based methods to obtain better results. However, these gradient priors still suffer from the edge gradient discrepancy issue and the successive iteration gradient direction issue, thus are difficult to simply extend to decision-based methods. In this paper, we propose a novel Decision-based Black-box Attack framework with Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent gradient prior and time-dependent prior into the gradient estimation procedure. First, by leveraging the joint bilateral filter to deal with each random perturbation, DBA-GP can guarantee that the generated perturbations in edge locations are hardly smoothed, i.e., alleviating the edge gradient discrepancy, thus remaining the characteristics of the original image as much as possible. Second, by utilizing a new gradient updating strategy to automatically adjust the successive iteration gradient direction, DBA-GP can accelerate the convergence speed, thus improving the query efficiency. Extensive experiments have demonstrated that the proposed method outperforms other strong baselines significantly.

FPGAN-Control: A Controllable Fingerprint Generator for Training with Synthetic Data

  • paper_url: http://arxiv.org/abs/2310.19024
  • repo_url: None
  • paper_authors: Alon Shoshan, Nadav Bhonker, Emanuel Ben Baruch, Ori Nizan, Igor Kviatkovsky, Joshua Engelsma, Manoj Aggarwal, Gerard Medioni
  • for: Train fingerprint recognition models with synthetic data instead of sensitive personal data.
  • methods: FPGAN-Control, an identity-preserving image generation framework that controls the appearance of generated fingerprints (e.g., fingerprint type, acquisition device, pressure level); a novel appearance loss encourages disentanglement between the fingerprint's identity and appearance properties.
  • results: Trained on the publicly available NIST SD302 (N2N) dataset, FPGAN-Control shows quantitative and qualitative advantages in identity preservation, degree of appearance control, and a low synthetic-to-real domain gap; recognition models trained only on its synthetic data match or even surpass models trained on real data.
    Abstract Training fingerprint recognition models using synthetic data has recently gained increased attention in the biometric community as it alleviates the dependency on sensitive personal data. Existing approaches for fingerprint generation are limited in their ability to generate diverse impressions of the same finger, a key property for providing effective data for training recognition models. To address this gap, we present FPGAN-Control, an identity preserving image generation framework which enables control over the fingerprint's image appearance (e.g., fingerprint type, acquisition device, pressure level) of generated fingerprints. We introduce a novel appearance loss that encourages disentanglement between the fingerprint's identity and appearance properties. In our experiments, we used the publicly available NIST SD302 (N2N) dataset for training the FPGAN-Control model. We demonstrate the merits of FPGAN-Control, both quantitatively and qualitatively, in terms of identity preservation level, degree of appearance control, and low synthetic-to-real domain gap. Finally, training recognition models using only synthetic datasets generated by FPGAN-Control lead to recognition accuracies that are on par or even surpass models trained using real data. To the best of our knowledge, this is the first work to demonstrate this.

Efficient Test-Time Adaptation for Super-Resolution with Second-Order Degradation and Reconstruction

  • paper_url: http://arxiv.org/abs/2310.19011
  • repo_url: https://github.com/dengzeshuai/srtta
  • paper_authors: Zeshuai Deng, Zhuokun Chen, Shuaicheng Niu, Thomas H. Li, Bohan Zhuang, Mingkui Tan
  • for: Propose an efficient test-time adaptation method for image super-resolution that reconstructs high-quality SR images from test images with different/unknown degradation types.
  • methods: A second-order degradation scheme constructs paired data according to the degradation type of the test image, predicted by a pre-trained degradation classifier; the SR model is then adapted through feature-level reconstruction learning from the initial test image to its second-order degraded counterparts.
  • results: Extensive experiments on newly synthesized corrupted DIV2K datasets with 8 degradation types and several real-world datasets show impressive improvements over existing methods at satisfactory speed.
    Abstract Image super-resolution (SR) aims to learn a mapping from low-resolution (LR) to high-resolution (HR) using paired HR-LR training images. Conventional SR methods typically gather the paired training data by synthesizing LR images from HR images using a predetermined degradation model, e.g., Bicubic down-sampling. However, the realistic degradation type of test images may mismatch with the training-time degradation type due to the dynamic changes of the real-world scenarios, resulting in inferior-quality SR images. To address this, existing methods attempt to estimate the degradation model and train an image-specific model, which, however, is quite time-consuming and impracticable to handle rapidly changing domain shifts. Moreover, these methods largely concentrate on the estimation of one degradation type (e.g., blur degradation), overlooking other degradation types like noise and JPEG in real-world test-time scenarios, thus limiting their practicality. To tackle these problems, we present an efficient test-time adaptation framework for SR, named SRTTA, which is able to quickly adapt SR models to test domains with different/unknown degradation types. Specifically, we design a second-order degradation scheme to construct paired data based on the degradation type of the test image, which is predicted by a pre-trained degradation classifier. Then, we adapt the SR model by implementing feature-level reconstruction learning from the initial test image to its second-order degraded counterparts, which helps the SR model generate plausible HR images. Extensive experiments are conducted on newly synthesized corrupted DIV2K datasets with 8 different degradations and several real-world datasets, demonstrating that our SRTTA framework achieves an impressive improvement over existing methods with satisfying speed. The source code is available at https://github.com/DengZeshuai/SRTTA.
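Pair construction follows the second-order degradation scheme: the test image, whose degradation type is predicted by the pre-trained classifier, is degraded once more with that same type, and the model adapts by reconstructing the test image from its second-order counterpart. A minimal sketch with simplified, illustrative degradations (the paper's degradation implementations and feature-level loss are not reproduced):

```python
import torch
import torch.nn.functional as F

def second_order_pair(test_img, degradation_type):
    """Build a self-supervised training pair for test-time adaptation: the
    (already degraded) test image is the reconstruction target, and a second
    degradation of the predicted type is applied on top of it as the input.
    The degradations below are simplified stand-ins."""
    if degradation_type == "noise":
        degraded = test_img + 0.05 * torch.randn_like(test_img)
    elif degradation_type == "blur":
        degraded = F.avg_pool2d(test_img, kernel_size=3, stride=1, padding=1)
    else:  # e.g. "jpeg" and other degradation types would be handled here
        degraded = test_img.clone()
    return degraded.clamp(0, 1), test_img  # (second-order input, target)
```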

DynPoint: Dynamic Neural Point For View Synthesis

  • paper_url: http://arxiv.org/abs/2310.18999
  • repo_url: None
  • paper_authors: Kaichen Zhou, Jia-Xing Zhong, Sangyun Shin, Kai Lu, Yiyuan Yang, Andrew Markham, Niki Trigoni
  • for: Rapid view synthesis for unconstrained monocular videos of dynamic scenes.
  • methods: Predicts explicit 3D correspondences between neighboring frames via consistent depth and scene flow estimation, then aggregates information from multiple reference frames to a target frame by constructing hierarchical neural point clouds.
  • results: Training is accelerated, typically by an order of magnitude, while achieving results comparable to prior approaches; the method is also robust on long-duration videos without learning a canonical representation of the video content.
    Abstract The introduction of neural radiance fields has greatly improved the effectiveness of view synthesis for monocular videos. However, existing algorithms face difficulties when dealing with uncontrolled or lengthy scenarios, and require extensive training time specific to each new scenario. To tackle these limitations, we propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entirety of the scenario information into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence prediction is achieved through the estimation of consistent depth and scene flow information across frames. Subsequently, the acquired correspondence is utilized to aggregate information from multiple reference frames to a target frame, by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. The experimental results obtained demonstrate the considerable acceleration of training time achieved - typically an order of magnitude - by our proposed method while yielding comparable outcomes compared to prior approaches. Furthermore, our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.

Controllable Group Choreography using Contrastive Diffusion

  • paper_url: http://arxiv.org/abs/2310.18986
  • repo_url: https://github.com/aioz-ai/GCD
  • paper_authors: Nhat Le, Tuong Do, Khoa Do, Hien Nguyen, Erman Tjiputra, Quang D. Tran, Anh Nguyen
  • for: Generate high-quality, customizable group dance animations driven by music.
  • methods: A diffusion-based generative approach synthesizes a flexible number of dancers and long-term group dances coherent with the input music; a Group Contrastive Diffusion (GCD) strategy strengthens the connection between dancers and their group, with classifier-guidance sampling controlling the consistency or diversity level of the synthesized animation.
  • results: Produces visually captivating, coherent group dance motions with controllable levels of consistency or diversity while maintaining overall choreography quality.
    Abstract Music-driven group choreography poses a considerable challenge but holds significant potential for a wide range of industrial applications. The ability to generate synchronized and visually appealing group dance motions that are aligned with music opens up opportunities in many fields such as entertainment, advertising, and virtual performances. However, most of the recent works are not able to generate high-fidelity long-term motions, or fail to enable controllable experience. In this work, we aim to address the demand for high-quality and customizable group dance generation by effectively governing the consistency and diversity of group choreographies. In particular, we utilize a diffusion-based generative approach to enable the synthesis of flexible number of dancers and long-term group dances, while ensuring coherence to the input music. Ultimately, we introduce a Group Contrastive Diffusion (GCD) strategy to enhance the connection between dancers and their group, presenting the ability to control the consistency or diversity level of the synthesized group animation via the classifier-guidance sampling technique. Through intensive experiments and evaluation, we demonstrate the effectiveness of our approach in producing visually captivating and consistent group dance motions. The experimental results show the capability of our method to achieve the desired levels of consistency and diversity, while maintaining the overall quality of the generated group choreography.

Blacksmith: Fast Adversarial Training of Vision Transformers via a Mixture of Single-step and Multi-step Methods

  • paper_url: http://arxiv.org/abs/2310.18975
  • repo_url: None
  • paper_authors: Mahdi Salmani, Alireza Dehghanpour Farashah, Mohammad Azizmalayeri, Mahdi Amiri, Navid Eslami, Mohammad Taghi Manzuri, Mohammad Hossein Rohban
  • for: Prevent Catastrophic Overfitting (CO) when adversarially training deep models, specifically Vision Transformers.
  • methods: Randomly applies either PGD-2 or FGSM to each mini-batch during adversarial training to increase attack diversity; to manage the extra training time, the PGD-2 attack is crafted from only the first half of the layers while FGSM is applied end-to-end.
  • results: Effectively prevents CO, reaches PGD-2-level performance, and outperforms existing techniques including N-FGSM.
    Abstract Despite the remarkable success achieved by deep learning algorithms in various domains, such as computer vision, they remain vulnerable to adversarial perturbations. Adversarial Training (AT) stands out as one of the most effective solutions to address this issue; however, single-step AT can lead to Catastrophic Overfitting (CO). This scenario occurs when the adversarially trained network suddenly loses robustness against multi-step attacks like Projected Gradient Descent (PGD). Although several approaches have been proposed to address this problem in Convolutional Neural Networks (CNNs), we found out that they do not perform well when applied to Vision Transformers (ViTs). In this paper, we propose Blacksmith, a novel training strategy to overcome the CO problem, specifically in ViTs. Our approach utilizes either of PGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the adversarial training of the neural network. This will increase the diversity of our training attacks, which could potentially mitigate the CO issue. To manage the increased training time resulting from this combination, we craft the PGD-2 attack based on only the first half of the layers, while FGSM is applied end-to-end. Through our experiments, we demonstrate that our novel method effectively prevents CO, achieves PGD-2 level performance, and outperforms other existing techniques including N-FGSM, which is the state-of-the-art method in fast training for CNNs.
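The training strategy picks, per mini-batch, either single-step FGSM or multi-step PGD-2 to craft the adversarial examples. A minimal sketch of that mixing step; the paper's additional trick of computing PGD-2 from only the first half of the layers is omitted here, and the 50/50 choice probability is an assumption:

```python
import random
import torch
import torch.nn.functional as F

def blacksmith_batch_attack(model, x, y, eps, alpha):
    """Return adversarial inputs for one training step: with probability 0.5
    use single-step FGSM, otherwise two-step PGD. Call model.zero_grad()
    before the real parameter update, since gradients accumulate here."""
    if random.random() < 0.5:                      # FGSM branch
        delta = torch.zeros_like(x, requires_grad=True)
        F.cross_entropy(model(x + delta), y).backward()
        delta = eps * delta.grad.sign()
    else:                                          # PGD-2 branch
        delta = torch.zeros_like(x)
        for _ in range(2):
            delta = delta.detach().requires_grad_(True)
            F.cross_entropy(model(x + delta), y).backward()
            delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1).detach()
```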

AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.18961
  • repo_url: https://github.com/zqhang/anomalyclip
  • paper_authors: Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen
  • for: Propose a zero-shot anomaly detection (ZSAD) method that accurately detects anomalies in images without any training samples from the target dataset.
  • methods: Adapts CLIP by learning object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects, so the model focuses on abnormal image regions rather than object semantics.
  • results: Large-scale experiments on 17 real-world anomaly detection datasets show superior zero-shot performance in detecting and segmenting anomalies across highly diverse defect inspection and medical imaging domains.
    Abstract Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, \eg, data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.
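AnomalyCLIP learns object-agnostic prompt embeddings so that scoring depends on normality versus abnormality rather than on object class. The sketch below only illustrates the underlying CLIP normal-vs-abnormal scoring with fixed, hand-written prompts, assuming the OpenAI `clip` package; the learned prompts and the anomaly segmentation components of the paper are not reproduced:

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
# Fixed prompts standing in for AnomalyCLIP's *learned* object-agnostic prompts.
prompts = clip.tokenize(["a photo of a flawless object",
                         "a photo of a damaged object"])

@torch.no_grad()
def anomaly_score(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T       # (1, 2)
    return logits.softmax(dim=-1)[0, 1].item()      # probability of "abnormal"
```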

TIC-TAC: A Framework To Learn And Evaluate Your Covariance

  • paper_url: http://arxiv.org/abs/2310.18953
  • repo_url: https://github.com/vita-epfl/TIC-TAC
  • paper_authors: Megh Shukla, Mathieu Salzmann, Alexandre Alahi
  • for: Unsupervised heteroscedastic covariance estimation: learning the multivariate target distribution $\mathcal{N}(y, \Sigma_y | x)$ given an observation $x$.
  • methods: State-of-the-art methods predict the mean $f_{\theta}(x)$ and covariance $\textrm{Cov}(f_{\theta}(x))$ of the target distribution with two neural networks trained via the negative log-likelihood; this paper asks whether the predicted covariance truly captures the randomness of the predicted mean, and how covariance estimation can be evaluated without ground-truth annotation.
  • results: TIC (Taylor Induced Covariance) captures the randomness of the multivariate $f_{\theta}(x)$ by incorporating its gradient and curvature around $x$ through a second-order Taylor polynomial, and TAC (Task Agnostic Correlations) evaluates the covariance by leveraging conditioning of the normal distribution; experiments show TIC outperforms the state of the art in learning the covariance, as quantified by TAC.
    Abstract We study the problem of unsupervised heteroscedastic covariance estimation, where the goal is to learn the multivariate target distribution $\mathcal{N}(y, \Sigma_y | x )$ given an observation $x$. This problem is particularly challenging as $\Sigma_{y}$ varies for different samples (heteroscedastic) and no annotation for the covariance is available (unsupervised). Typically, state-of-the-art methods predict the mean $f_{\theta}(x)$ and covariance $\textrm{Cov}(f_{\theta}(x))$ of the target distribution through two neural networks trained using the negative log-likelihood. This raises two questions: (1) Does the predicted covariance truly capture the randomness of the predicted mean? (2) In the absence of ground-truth annotation, how can we quantify the performance of covariance estimation? We address (1) by deriving TIC: Taylor Induced Covariance, which captures the randomness of the multivariate $f_{\theta}(x)$ by incorporating its gradient and curvature around $x$ through the second order Taylor polynomial. Furthermore, we tackle (2) by introducing TAC: Task Agnostic Correlations, a metric which leverages conditioning of the normal distribution to evaluate the covariance. We verify the effectiveness of TIC through multiple experiments spanning synthetic (univariate, multivariate) and real-world datasets (UCI Regression, LSP, and MPII Human Pose Estimation). Our experiments show that TIC outperforms state-of-the-art in accurately learning the covariance, as quantified through TAC.
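TIC derives the covariance from the gradient and curvature of $f_{\theta}$ around $x$ via a second-order Taylor polynomial. A hedged sketch of how such an expansion makes gradient and curvature enter the variance, shown for a single output component under an isotropic Gaussian perturbation; the paper's exact multivariate TIC formulation differs in detail:

```latex
% Second-order Taylor expansion of one output component f around x:
f(x+\delta) \approx f(x) + \nabla f(x)^{\top}\delta
                  + \tfrac{1}{2}\,\delta^{\top} \nabla^{2} f(x)\,\delta,
\qquad \delta \sim \mathcal{N}(0, \sigma^{2} I).

% Propagating the Gaussian perturbation through this expansion yields a
% variance depending on both the gradient and the curvature:
\operatorname{Var}\!\left[f(x+\delta)\right]
  \approx \sigma^{2}\,\lVert \nabla f(x)\rVert_{2}^{2}
        + \tfrac{\sigma^{4}}{2}\,\lVert \nabla^{2} f(x)\rVert_{F}^{2}.
```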

Customize StyleGAN with One Hand Sketch

  • paper_url: http://arxiv.org/abs/2310.18949
  • repo_url: None
  • paper_authors: Shaocong Zhang
  • for: Control StyleGAN image generation with a single user sketch.
  • methods: An energy-based learning approach with CLIP: a conditional distribution is learned in the latent space of a pre-trained StyleGAN model, using two novel energy functions that leverage CLIP for cross-domain semantic supervision.
  • results: Generates multi-modal images semantically aligned with the input sketch, improving significantly over previous methods in the one-shot regime and performing well across a wide range of human sketches of diverse styles and poses, despite using no extra training data and only a single sketch input.
    Abstract Generating images from human sketches typically requires dedicated networks trained from scratch. In contrast, the emergence of the pre-trained Vision-Language models (e.g., CLIP) has propelled generative applications based on controlling the output imagery of existing StyleGAN models with text inputs or reference images. Parallelly, our work proposes a framework to control StyleGAN imagery with a single user sketch. In particular, we learn a conditional distribution in the latent space of a pre-trained StyleGAN model via energy-based learning and propose two novel energy functions leveraging CLIP for cross-domain semantic supervision. Once trained, our model can generate multi-modal images semantically aligned with the input sketch. Quantitative evaluations on synthesized datasets have shown that our approach improves significantly from previous methods in the one-shot regime. The superiority of our method is further underscored when experimenting with a wide range of human sketches of diverse styles and poses. Surprisingly, our models outperform the previous baseline regarding both the range of sketch inputs and image qualities despite operating with a stricter setting: with no extra training data and single sketch input.

Video Frame Interpolation with Many-to-many Splatting and Spatial Selective Refinement

  • paper_url: http://arxiv.org/abs/2310.18946
  • repo_url: None
  • paper_authors: Ping Hu, Simon Niklaus, Lu Zhang, Stan Sclaroff, Kate Saenko
  • for: Propose a fully differentiable Many-to-Many (M2M) splatting framework for efficient frame interpolation.
  • methods: Multiple bidirectional flows directly forward-warp pixels to the desired time step before fusing overlapping pixels, so each source pixel renders multiple target pixels and each target pixel is synthesized from a larger area of visual context; the extended M2M++ framework adds a flexible Spatial Selective Refinement (SSR) component that only refines difficult regions selected under the guidance of an estimated error map.
  • results: Interpolating an arbitrary number of in-between frames adds only a minuscule overhead per input frame pair, enabling fast multi-frame interpolation; because warping and fusing pixels in the intensity domain is sensitive to motion estimation quality, SSR lets users trade computational efficiency for interpolation quality. Benchmarks show improved efficiency while maintaining competitive interpolation quality.
    Abstract In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context, establishing a many-to-many splatting scheme with robustness to undesirable artifacts. For each input frame pair, M2M has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. However, directly warping and fusing pixels in the intensity domain is sensitive to the quality of motion estimation and may suffer from less effective representation capacity. To improve interpolation accuracy, we further extend an M2M++ framework by introducing a flexible Spatial Selective Refinement (SSR) component, which allows for trading computational efficiency for interpolation quality and vice versa. Instead of refining the entire interpolated frame, SSR only processes difficult regions selected under the guidance of an estimated error map, thereby avoiding redundant computation. Evaluation on multiple benchmark datasets shows that our method is able to improve the efficiency while maintaining competitive video interpolation quality, and it can be adjusted to use more or less compute as needed.

Adversarial Examples Are Not Real Features

  • paper_url: http://arxiv.org/abs/2310.18936
  • repo_url: https://github.com/pku-ml/advnotrealfeatures
  • paper_authors: Ang Li, Yifei Wang, Yiwen Guo, Yisen Wang
  • for: Re-examine why adversarial examples exist and whether non-robust features are genuinely useful.
  • methods: Incorporates multiple learning paradigms, including supervised learning, contrastive learning, masked image modeling, and diffusion models, to test the usefulness of non-robust features beyond the supervised setting.
  • results: Non-robust features transfer poorly to other self-supervised paradigms, whereas robust features transfer well; moreover, naturally trained encoders built from robust features are largely non-robust under AutoAttack. The conclusion is that non-robust features behave more like paradigm-wise shortcuts than genuinely useful features, and robust features alone may be insufficient for reliable model robustness.
    Abstract The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by \citet{ilyas2019adversarial} explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, non-robust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained encoders from robust features are largely non-robust under AutoAttack. Our cross-paradigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness. Code is available at \url{https://github.com/PKU-ML/AdvNotRealFeatures}.

Label Poisoning is All You Need

  • paper_url: http://arxiv.org/abs/2310.18933
  • repo_url: https://github.com/MarkipTheMudkip/in-class-project-2
  • paper_authors: Rishi D. Jha, Jonathan Hayase, Sewoong Oh
  • for: Investigate whether a successful backdoor attack can be launched by corrupting only the training labels, rather than the images themselves.
  • methods: A novel approach called FLIP, which uses trajectory matching to design label-only backdoor attacks.
  • results: Demonstrated on three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, ResNet-18, VGG-19, and Vision Transformer); with only 2% of CIFAR-10 labels corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while the clean test accuracy drops by only 1.8%.
    Abstract In a backdoor attack, an adversary injects corrupted data into a model's training dataset in order to gain control over its predictions on images with a specific attacker-defined trigger. A typical corrupted training example requires altering both the image, by applying the trigger, and the label. Models trained on clean images, therefore, were considered safe from backdoor attacks. However, in some common machine learning scenarios, the training labels are provided by potentially malicious third-parties. This includes crowd-sourced annotation and knowledge distillation. We, hence, investigate a fundamental question: can we launch a successful backdoor attack by only corrupting labels? We introduce a novel approach to design label-only backdoor attacks, which we call FLIP, and demonstrate its strengths on three datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, ResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while suffering only a 1.8% drop in the clean test accuracy. Our approach builds upon the recent advances in trajectory matching, originally introduced for dataset distillation.
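The attack corrupts only labels under a small budget (2% of CIFAR-10 in the headline result). A minimal sketch of applying such a label-only corruption; FLIP's actual contribution, selecting *which* indices to flip via trajectory matching, is not implemented here and is replaced by a random fallback for illustration:

```python
import numpy as np

def corrupt_labels(labels, target_class, budget=0.02, flip_idx=None, seed=0):
    """Flip at most `budget` of the training labels to the attacker's target
    class. If a precomputed flip set (e.g. from trajectory matching) is
    given, use it; otherwise fall back to a random selection."""
    labels = np.asarray(labels).copy()
    n_flip = int(budget * len(labels))
    if flip_idx is None:
        rng = np.random.default_rng(seed)
        flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    flip_idx = np.asarray(flip_idx)[:n_flip]   # enforce the budget
    labels[flip_idx] = target_class
    return labels, flip_idx
```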

A transfer learning approach with convolutional neural network for Face Mask Detection

  • paper_url: http://arxiv.org/abs/2310.18928
  • repo_url: None
  • paper_authors: Abolfazl Younesi, Reza Afrouzian, Yousef Seyfari
  • for: Propose a face mask recognition system based on transfer learning and the Inception v3 architecture, for monitoring mask use in crowded places.
  • methods: Two datasets are used simultaneously for training, the Simulated Mask Face Dataset (SMFD) and MaskedFace-Net (MFN); accuracy is improved by optimally setting the hyper-parameters and carefully designing the fully connected layers.
  • results: Experiments show high accuracy and efficiency, reaching 99.47% on training data and 99.33% on test data.
    Abstract Due to the epidemic of the coronavirus (Covid-19) and its rapid spread around the world, the world has faced an enormous crisis. To prevent the spread of the coronavirus, the World Health Organization (WHO) has introduced the use of masks and keeping social distance as the best preventive method. So, developing an automatic monitoring system for detecting facemasks in some crowded places is essential. To do this, we propose a mask recognition system based on transfer learning and Inception v3 architecture. In the proposed method, two datasets are used simultaneously for training including the Simulated Mask Face Dataset (SMFD) and MaskedFace-Net (MFN) This paper tries to increase the accuracy of the proposed system by optimally setting hyper-parameters and accurately designing the fully connected layers. The main advantage of the proposed method is that in addition to masked and unmasked faces, it can also detect cases of incorrect use of mask. Therefore, the proposed method classifies the input face images into three categories. Experimental results show the high accuracy and efficiency of the proposed method; so, this method has achieved an accuracy of 99.47% and 99.33% in training and test data respectively
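The system is a transfer-learning classifier: a pre-trained Inception v3 backbone with a re-designed fully connected head over three classes (mask, no mask, incorrectly worn mask). A minimal sketch using torchvision as a stand-in; the paper's exact head design and training hyper-parameters may differ:

```python
import torch.nn as nn
from torchvision import models

def build_mask_classifier(num_classes=3, freeze_backbone=True):
    """Inception v3 pre-trained on ImageNet with new classification heads
    for the three mask-usage classes. Train with 299x299 inputs and
    cross-entropy loss."""
    model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)                # main head
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)
    return model
```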

Improving Multi-Person Pose Tracking with A Confidence Network

  • paper_url: http://arxiv.org/abs/2310.18920
  • repo_url: None
  • paper_authors: Zehua Fu, Wenhang Zuo, Zhenghui Hu, Qingjie Liu, Yunhong Wang
  • for: Improve human detection and pose estimation in top-down multi-person pose tracking, addressing occlusion and missed detections.
  • methods: A novel keypoint confidence network determines whether each keypoint is occluded and is incorporated into the pose estimation module; the tracking pipeline adds a Bbox-revision module to reduce missed detections and an ID-retrieve module to recover lost trajectories.
  • results: Experiments show the approach is universal for human detection and pose estimation, achieving state-of-the-art performance on the PoseTrack 2017 and 2018 datasets.
    Abstract Human pose estimation and tracking are fundamental tasks for understanding human behaviors in videos. Existing top-down framework-based methods usually perform three-stage tasks: human detection, pose estimation and tracking. Although promising results have been achieved, these methods rely heavily on high-performance detectors and may fail to track persons who are occluded or miss-detected. To overcome these problems, in this paper, we develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation in top-down approaches. Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded, and it is incorporated into the pose estimation module. In the tracking pipeline, we propose the Bbox-revision module to reduce missing detection and the ID-retrieve module to correct lost trajectories, improving the performance of the detection stage. Experimental results show that our approach is universal in human detection and pose estimation, achieving state-of-the-art performance on both PoseTrack 2017 and 2018 datasets.

TiV-NeRF: Tracking and Mapping via Time-Varying Representation with Dynamic Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.18917
  • repo_url: None
  • paper_authors: Chengyao Duan, Zhiliu Yang
  • for: tracking and reconstructing dynamic scenes within a SLAM framework
  • methods: a time-varying representation, self-supervised training, distinct sampling strategies for dynamic and static regions, and a keyframe selection strategy
  • results: more effective than current state-of-the-art dynamic mapping methods
    Abstract Previous attempts to integrate Neural Radiance Fields (NeRF) into the Simultaneous Localization and Mapping (SLAM) framework either rely on the assumption of static scenes or treat dynamic objects as outliers. However, most real-world scenarios are dynamic. In this paper, we propose a time-varying representation to track and reconstruct dynamic scenes. Our system simultaneously maintains two processes: a tracking process and a mapping process. For the tracking process, all input images are uniformly sampled and the RGB images are trained in a self-supervised manner. For the mapping process, we leverage known masks to differentiate dynamic objects from static backgrounds, and we apply distinct sampling strategies to the two types of areas. Parameter optimization for both processes proceeds in two stages: the first associates time with 3D positions to convert the deformation field to the canonical field, and the second associates time with 3D positions in the canonical field to obtain colors and the Signed Distance Function (SDF). In addition, we propose a novel keyframe selection strategy based on the overlapping rate. We evaluate our approach on two publicly available synthetic datasets and validate that our method is more effective than current state-of-the-art dynamic mapping methods.
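The two-stage mapping described in the abstract (time and 3D position mapped into a canonical field, then the canonical field queried for color and SDF) could be sketched roughly as follows; the MLP sizes, the offset-based deformation, and the omission of positional encoding are assumptions made only for illustration.

```python
# Hedged sketch of a time-varying field: a deformation MLP warps (x, t) into a
# canonical frame, and a canonical MLP returns color and SDF. Layer sizes are
# illustrative assumptions; no positional encoding is shown for brevity.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128, depth=4):
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class TimeVaryingField(nn.Module):
    def __init__(self):
        super().__init__()
        self.deform = mlp(3 + 1, 3)        # (x, t) -> offset into canonical frame
        self.canonical = mlp(3 + 1, 3 + 1)  # (x_canonical, t) -> (rgb, sdf)

    def forward(self, x: torch.Tensor, t: torch.Tensor):
        # x: (N, 3) sample positions, t: (N, 1) normalized timestamps
        x_canon = x + self.deform(torch.cat([x, t], dim=-1))
        out = self.canonical(torch.cat([x_canon, t], dim=-1))
        rgb, sdf = torch.sigmoid(out[..., :3]), out[..., 3:]
        return rgb, sdf

field = TimeVaryingField()
rgb, sdf = field(torch.rand(1024, 3), torch.rand(1024, 1))
```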

Identifiable Contrastive Learning with Automatic Feature Importance Discovery

  • paper_url: http://arxiv.org/abs/2310.18904
  • repo_url: https://github.com/pku-ml/tri-factor-contrastive-learning
  • paper_authors: Qi Zhang, Yifei Wang, Yisen Wang
  • for: Proposing a new contrastive learning method, tri-factor contrastive learning (triCL), to obtain data representations that are more interpretable from a human perspective.
  • methods: triCL uses a 3-factor contrast of the form $z_x^\top S z_{x'}$, where $S$ is a learnable diagonal matrix that automatically captures the importance of each feature.
  • results: triCL is shown to obtain not only identifiable but also more interpretable features through contrastive learning; high-importance features have good interpretability and capture common classwise characteristics.
    Abstract Existing contrastive learning methods rely on a pairwise sample contrast $z_x^\top z_{x'}$ to learn data representations, but the learned features often lack clear interpretability from a human perspective. Theoretically, such features lack identifiability, and different initializations may lead to totally different features. In this paper, we study a new method named tri-factor contrastive learning (triCL) that involves a 3-factor contrast of the form $z_x^\top S z_{x'}$, where $S=\text{diag}(s_1,\dots,s_k)$ is a learnable diagonal matrix that automatically captures the importance of each feature. We show that with this simple extension, triCL not only obtains identifiable features that eliminate randomness but also obtains more interpretable features that are ordered according to the importance matrix $S$. We show that high-importance features have nice interpretability, capturing common classwise features, and yield superior performance when only a few features are used for image retrieval. The proposed triCL objective is general and can be applied to different contrastive learning methods such as SimCLR and CLIP. We believe that it is a better alternative to existing 2-factor contrastive learning, improving identifiability and interpretability with minimal overhead. Code is available at https://github.com/PKU-ML/Tri-factor-Contrastive-Learning.
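Since the abstract gives the tri-factor similarity $z_x^\top S z_{x'}$ explicitly, a minimal sketch of how it could be dropped into an InfoNCE-style loss is shown below; the softplus parametrization of the diagonal of $S$, the temperature value, and the symmetric loss are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of a tri-factor contrastive similarity z_x^T S z_x' with a
# learnable diagonal S. The softplus parametrization (keeping S positive)
# and the temperature are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriFactorContrast(nn.Module):
    def __init__(self, dim: int = 128, temperature: float = 0.1):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))  # diagonal of S (pre-activation)
        self.temperature = temperature

    def forward(self, z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
        # z1, z2: (batch, dim) embeddings of two augmented views.
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        S = F.softplus(self.s)                         # feature-importance weights
        logits = (z1 * S) @ z2.t() / self.temperature  # z1^T diag(S) z2 for all pairs
        labels = torch.arange(z1.size(0), device=z1.device)
        # Symmetric InfoNCE over the tri-factor similarities.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

loss_fn = TriFactorContrast()
loss = loss_fn(torch.randn(32, 128), torch.randn(32, 128))
```

Sorting features by the learned diagonal of $S$ would then give the importance ordering that the abstract describes using for few-feature image retrieval.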

Multi-task deep learning for large-scale building detail extraction from high-resolution satellite imagery

  • paper_url: http://arxiv.org/abs/2310.18899
  • repo_url: https://github.com/chanceqz/buildingdetails-multitask
  • paper_authors: Zhen Qian, Min Chen, Zhuo Sun, Fan Zhang, Qingsong Xu, Jinzhao Guo, Zhiwei Xie, Zhixin Zhang
  • for: Advancing the understanding of urban dynamics and sustainable development by extracting detailed building information from high-resolution satellite imagery.
  • methods: An adaptable neural network, the Multi-task Building Refiner (MT-BR), that simultaneously extracts spatial and attributional building details such as rooftops, urban functional types, and roof architectural types; MT-BR can also be fine-tuned to cover additional building attributes as needed.
  • results: A novel spatial sampling scheme selects limited but representative satellite image samples, improving extraction efficiency, and advanced augmentation techniques enhance predictive performance and generalization. Experiments show higher predictive accuracy across metrics, and a real-world application over Shanghai produces a unified dataset combining spatial and attributional building details.
    Abstract Understanding urban dynamics and promoting sustainable development requires comprehensive insights about buildings. While geospatial artificial intelligence has advanced the extraction of such details from Earth observational data, existing methods often suffer from computational inefficiencies and inconsistencies when compiling unified building-related datasets for practical applications. To bridge this gap, we introduce the Multi-task Building Refiner (MT-BR), an adaptable neural network tailored for simultaneous extraction of spatial and attributional building details from high-resolution satellite imagery, exemplified by building rooftops, urban functional types, and roof architectural types. Notably, MT-BR can be fine-tuned to incorporate additional building details, extending its applicability. For large-scale applications, we devise a novel spatial sampling scheme that strategically selects limited but representative image samples. This process optimizes both the spatial distribution of samples and the urban environmental characteristics they contain, thus enhancing extraction effectiveness while curtailing data preparation expenditures. We further enhance MT-BR's predictive performance and generalization capabilities through the integration of advanced augmentation techniques. Our quantitative results highlight the efficacy of the proposed methods. Specifically, networks trained with datasets curated via our sampling method demonstrate improved predictive accuracy relative to those using alternative sampling approaches, with no alterations to network architecture. Moreover, MT-BR consistently outperforms other state-of-the-art methods in extracting building details across various metrics. The real-world practicality is also demonstrated in an application across Shanghai, generating a unified dataset that encompasses both the spatial and attributional details of buildings.
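A compact way to picture the multi-task design (one shared encoder with separate heads for rooftop extraction, urban functional type, and roof architectural type) is sketched below; the ResNet-18 backbone, head shapes, and class counts are illustrative assumptions, not MT-BR's actual architecture.

```python
# Hedged sketch of a shared-encoder, multi-head network for building details.
# Backbone choice, head designs, and class counts are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class MultiTaskBuildingNet(nn.Module):
    def __init__(self, n_function_types: int = 6, n_roof_types: int = 4):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # (B, 512, H/32, W/32)
        # Dense head: building-rooftop mask (coarse binary segmentation logits).
        self.rooftop_head = nn.Conv2d(512, 1, kernel_size=1)
        # Image-level heads: urban functional type and roof architectural type.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.function_head = nn.Linear(512, n_function_types)
        self.roof_head = nn.Linear(512, n_roof_types)

    def forward(self, x: torch.Tensor):
        feats = self.encoder(x)
        rooftop_logits = self.rooftop_head(feats)  # segmentation logits
        pooled = self.pool(feats).flatten(1)
        return rooftop_logits, self.function_head(pooled), self.roof_head(pooled)

net = MultiTaskBuildingNet()
masks, func_logits, roof_logits = net(torch.randn(2, 3, 256, 256))
```

Adding another building attribute, in this simplified picture, amounts to attaching one more head to the shared features, which is consistent with the abstract's claim that MT-BR can be fine-tuned to incorporate additional details.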

Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity

  • paper_url: http://arxiv.org/abs/2310.18894
  • repo_url: https://github.com/crazy-jack/nips2023_shape_vs_texture
  • paper_authors: Tianqin Li, Ziqi Wen, Yangfan Li, Tai Sing Lee
  • for: Explaining why deep learning models favor texture while the human visual system favors shape and structure.
  • methods: Applying the sparse coding principle via a non-differentiable Top-K operation to introduce shape bias into the network.
  • results: Enforcing the sparse coding constraint in convolutional neural networks causes structural encoding to emerge in neurons, endowing the networks with a stronger shape bias that improves robustness and decomposability across datasets. Code: https://github.com/Crazy-Jack/nips2023_shape_vs_texture.
    Abstract Current deep-learning models for object recognition are known to be heavily biased toward texture. In contrast, human visual systems are known to be biased toward shape and structure. What could be the design principles in human visual systems that led to this difference? How could we introduce more shape bias into deep learning models? In this paper, we report that sparse coding, a ubiquitous principle in the brain, can in itself introduce shape bias into the network. We found that enforcing the sparse coding constraint using a non-differentiable Top-K operation can lead to the emergence of structural encoding in the neurons of convolutional neural networks, resulting in a smooth decomposition of objects into parts and subparts and endowing the networks with shape bias. We demonstrated this emergence of shape bias and its functional benefits for different network structures on various datasets. For object-recognition convolutional neural networks, the shape bias leads to greater robustness against distraction from style and pattern changes. For image-synthesis generative adversarial networks, the emergent shape bias leads to more coherent and decomposable structures in the synthesized images. Ablation studies suggest that sparse codes tend to encode structure, whereas more distributed codes tend to favor texture. Our code is hosted at the GitHub repository: \url{https://github.com/Crazy-Jack/nips2023_shape_vs_texture}
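The Top-K sparsity constraint can be pictured as an activation layer that keeps only the K strongest responses and zeros out the rest; the sketch below applies it per spatial location as a rough illustration, which is an assumption on our part (the authors' exact layer is in the linked repository).

```python
# Hedged sketch of a Top-K sparsity layer: for each spatial location, keep the
# k largest channel activations and zero the rest. Applying it per location
# (rather than per channel or per layer) is an illustrative assumption.
import torch
import torch.nn as nn

class TopKChannelSparsity(nn.Module):
    def __init__(self, k: int):
        super().__init__()
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a convolutional layer.
        b, c, h, w = x.shape
        flat = x.permute(0, 2, 3, 1).reshape(-1, c)   # one row per spatial location
        topk_idx = flat.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(flat).scatter_(-1, topk_idx, 1.0)
        sparse = flat * mask                          # non-top-k entries -> 0
        return sparse.reshape(b, h, w, c).permute(0, 3, 1, 2)

layer = TopKChannelSparsity(k=16)
out = layer(torch.randn(2, 64, 32, 32))
```

Because the hard selection is non-differentiable, gradients flow only through the retained activations, which is the usual behavior of such Top-K layers.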

Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes

  • paper_url: http://arxiv.org/abs/2310.18887
  • repo_url: None
  • paper_authors: Yihong Sun, Bharath Hariharan
  • for: Addressing the difficulty that object motion poses for monocular depth estimation in dynamic scenes, by jointly learning depth, an independent flow field, and motion segmentation from unlabeled monocular videos.
  • methods: A joint learning scheme for depth and independent flow, built on the key insight that a good initial estimate of motion segmentation suffices to disambiguate depth from independent motion.
  • results: State-of-the-art monocular depth estimation on the Waymo Open and nuScenes datasets, with significant improvement in the depth of moving objects.
    Abstract Unsupervised monocular depth estimation techniques have demonstrated encouraging results but typically assume that the scene is static. These techniques suffer when trained on dynamic scenes, where apparent object motion can equally be explained by hypothesizing the object's independent motion or by altering its depth. This ambiguity causes depth estimators to predict erroneous depth for moving objects. To resolve this issue, we introduce Dynamo-Depth, a unifying approach that disambiguates dynamic motion by jointly learning monocular depth, a 3D independent flow field, and motion segmentation from unlabeled monocular videos. Specifically, we offer the key insight that a good initial estimate of motion segmentation is sufficient for jointly learning depth and independent motion despite the fundamental underlying ambiguity. Our proposed method achieves state-of-the-art performance on monocular depth estimation on the Waymo Open and nuScenes datasets, with significant improvement in the depth of moving objects. Code and additional results are available at https://dynamo-depth.github.io.
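One way to picture the decomposition implied by the abstract (rigid flow induced by depth and ego-motion, plus an independent 3D flow field gated by a motion-segmentation mask) is the small sketch below; the tensor shapes and the simple additive composition are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of composing total scene flow from rigid (ego-motion) flow and
# a motion-mask-gated independent flow, as in joint depth / independent-flow /
# motion-segmentation training. Shapes and the additive form are assumptions.
import torch

def compose_scene_flow(rigid_flow: torch.Tensor,
                       independent_flow: torch.Tensor,
                       motion_mask: torch.Tensor) -> torch.Tensor:
    """rigid_flow, independent_flow: (B, 3, H, W); motion_mask: (B, 1, H, W) in [0, 1].

    Static pixels (mask ~ 0) move only with the camera; dynamic pixels (mask ~ 1)
    additionally carry their own independent 3D motion.
    """
    return rigid_flow + motion_mask * independent_flow

flow = compose_scene_flow(torch.randn(1, 3, 64, 64),
                          torch.randn(1, 3, 64, 64),
                          torch.rand(1, 1, 64, 64))
```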