cs.CV - 2023-08-19

DPL: Decoupled Prompt Learning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.10061
  • repo_url: None
  • paper_authors: Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin Chen, Gangshan Wu, Limin Wang
  • for: This work aims to improve the transfer of CLIP to downstream tasks; however, existing prompt-learning methods tend to overfit to seen categories, thereby limiting their generalization ability for unseen classes.
  • methods: We propose a new method, Decoupled Prompt Learning (DPL), which reformulates the attention in prompt learning to alleviate this problem. Specifically, we theoretically investigate the collaborative process between prompts and instances (i.e., image patches/text tokens) by reformulating the original self-attention into four separate sub-processes (a code sketch follows this entry).
  • results: Our method achieves state-of-the-art performance on three representative benchmarks encompassing 15 image recognition datasets, without requiring any auxiliary regularization task or extra training data, further demonstrating its remarkable generalization ability.
    Abstract Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks. However, current methods tend to overfit to seen categories, thereby limiting their generalization ability for unseen classes. In this paper, we propose a new method, Decoupled Prompt Learning (DPL), which reformulates the attention in prompt learning to alleviate this problem. Specifically, we theoretically investigate the collaborative process between prompts and instances (i.e., image patches/text tokens) by reformulating the original self-attention into four separate sub-processes. Through detailed analysis, we observe that certain sub-processes can be strengthened to bolster robustness and generalizability by some approximation techniques. Furthermore, we introduce language-conditioned textual prompting based on decoupled attention to naturally preserve the generalization of text input. Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning. By combining the proposed techniques, our approach achieves state-of-the-art performance on three representative benchmarks encompassing 15 image recognition datasets, while remaining parameter-efficient. Moreover, our DPL does not rely on any auxiliary regularization task or extra training data, further demonstrating its remarkable generalization ability.
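
The abstract above describes reformulating the self-attention between prompts and instances into four sub-processes. A minimal PyTorch sketch of that four-way decomposition follows; it normalizes each sub-process separately (itself an approximation of the joint softmax), and the function name, shapes, and the way the four outputs would be recombined are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_attention(prompts, instances, d_k=None):
    """Decompose attention over [prompts; instances] into four sub-processes.

    prompts:   (B, P, D) learnable prompt tokens
    instances: (B, N, D) image patches or text tokens
    Returns the four attended outputs separately so each sub-process
    (prompt->prompt, prompt->instance, instance->prompt, instance->instance)
    can be strengthened or approximated independently.
    """
    d_k = d_k or prompts.shape[-1]
    scale = d_k ** -0.5

    def attend(q, k, v):
        attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return attn @ v

    p2p = attend(prompts, prompts, prompts)        # prompts attending to prompts
    p2i = attend(prompts, instances, instances)    # prompts attending to instances
    i2p = attend(instances, prompts, prompts)      # instances attending to prompts
    i2i = attend(instances, instances, instances)  # instances attending to instances
    return p2p, p2i, i2p, i2i

# toy usage
B, P, N, D = 2, 4, 16, 64
out = decoupled_attention(torch.randn(B, P, D), torch.randn(B, N, D))
print([o.shape for o in out])
```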

R-C-P Method: An Autonomous Volume Calculation Method Using Image Processing and Machine Vision

  • paper_url: http://arxiv.org/abs/2308.10058
  • repo_url: None
  • paper_authors: MA Muktadir, Sydney Parker, Sun Yi
  • for: The goal of this paper is to obtain real-time volumetric and change information with multiple 2D cameras instead of a depth camera.
  • methods: The R-C-P (row-column-pixel) method is developed using image processing and edge detection, measuring object surface areas and their changes in real time (a generic code sketch follows this entry).
  • results: Experiments show that the R-C-P method accurately measures surface areas and their changes, and can also detect discontinuous edges or volumes.
    Abstract Machine vision and image processing are often used with sensors for situation awareness in autonomous systems, from industrial robots to self-driving cars. 3D depth sensors, such as LiDAR (Light Detection and Ranging) and radar, are great inventions for autonomous systems. Due to the complexity of the setup, LiDAR may not be suitable for some operational environments, for example, a space environment. This study was motivated by a desire to get real-time volumetric and change information with multiple 2D cameras instead of a depth camera. Two cameras were used to measure the dimensions of a rectangular object in real-time. The R-C-P (row-column-pixel) method is developed using image processing and edge detection. In addition to the surface areas, the R-C-P method also detects discontinuous edges or volumes. Lastly, experimental work is presented for illustration of the R-C-P method, which provides the equations for calculating surface area dimensions. Using the equations with given distance information between the object and the camera, the vision system provides the dimensions of actual objects.
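
The paper derives its own equations for surface-area dimensions; they are not reproduced in the abstract, so the sketch below uses the standard pinhole-camera relation (real size ≈ pixel extent × distance / focal length in pixels) together with Canny edge detection as a generic stand-in. Function names, thresholds, and the focal-length parameter are assumptions for illustration only.

```python
import cv2
import numpy as np

def estimate_rect_dimensions(image_bgr, distance_m, focal_px):
    """Rough sketch: measure a rectangular object's width/height from one view.

    Uses Canny edges and the pixel extent of the largest contour, then converts
    pixels to metres with the pinhole relation  real = pixels * distance / focal.
    (Generic approximation; the paper's R-C-P equations may differ.)
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w_px, h_px = cv2.boundingRect(max(contours, key=cv2.contourArea))
    width_m = w_px * distance_m / focal_px
    height_m = h_px * distance_m / focal_px
    return width_m, height_m, width_m * height_m  # surface area of the visible face

# example with a synthetic 300x200 px rectangle, 1.5 m away, focal length 800 px
img = np.zeros((480, 640, 3), np.uint8)
cv2.rectangle(img, (100, 100), (400, 300), (255, 255, 255), -1)
print(estimate_rect_dimensions(img, distance_m=1.5, focal_px=800.0))
```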

ControlCom: Controllable Image Composition using Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.10040
  • repo_url: https://github.com/bcmi/controlcom-image-composition
  • paper_authors: Bo Zhang, Yuxuan Duan, Jun Lan, Yan Hong, Huijia Zhu, Weiqiang Wang, Li Niu
  • for: This work aims at controllable image composition that generates realistic and natural composite images.
  • methods: The method builds on a large pretrained diffusion model and unifies four tasks in one model: image blending, image harmonization, view synthesis, and generative composition.
  • results: On both public benchmarks and real-world data, the method generates more faithful and controllable composite images than existing approaches.
    Abstract Image composition targets at synthesizing a realistic composite image from a pair of foreground and background images. Recently, generative composition methods are built on large pretrained diffusion models to generate composite images, considering their great potential in image generation. However, they suffer from lack of controllability on foreground attributes and poor preservation of foreground identity. To address these challenges, we propose a controllable image composition method that unifies four tasks in one diffusion model: image blending, image harmonization, view synthesis, and generative composition. Meanwhile, we design a self-supervised training framework coupled with a tailored pipeline of training data preparation. Moreover, we propose a local enhancement module to enhance the foreground details in the diffusion model, improving the foreground fidelity of composite images. The proposed method is evaluated on both public benchmark and real-world data, which demonstrates that our method can generate more faithful and controllable composite images than existing approaches. The code and model will be available at https://github.com/bcmi/ControlCom-Image-Composition.

CRC-ICM: Colorectal Cancer Immune Cell Markers Pattern Dataset

  • paper_url: http://arxiv.org/abs/2308.10033
  • repo_url: None
  • paper_authors: Zahra Mokhtari, Elham Amjadi, Hamidreza Bolhasani, Zahra Faghih, AmirReza Dehghanian, Marzieh Rezaei
  • for: The paper explores the differences in immune checkpoint expression in primary tumors located in the right and left sides of the colon, and investigates the prognostic value of these checkpoints in colorectal cancer (CRC).
  • methods: The study uses a dataset of 1756 images related to 136 patients, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3, and Tim3.
  • results: The study found that tumors on the left and right sides of the colon have different immune landscapes, with differences in the expression of immune checkpoints such as PD-1, LAG3, and Tim3. These differences may have implications for the prognosis of CRC patients.
    Abstract Colorectal Cancer (CRC) is the second most common cause of cancer death in the world, and can be identified by the location of the primary tumor in the large intestine: right and left colon, and rectum. Based on the location, CRC shows differences in chromosomal and molecular characteristics, microbiome incidence, pathogenesis, and outcome. It has been shown that tumors on the left and right sides also have different immune landscapes, so the prognosis may differ based on the primary tumor location. It is widely accepted that immune components of the tumor microenvironment (TME) play a critical role in tumor development. One of the critical regulatory molecules in the TME is immune checkpoints, which, as the gatekeepers of immune responses, regulate the infiltrated immune cell functions. Inhibitory immune checkpoints such as PD-1, Tim3, and LAG3, the main mechanism of immune suppression in the TME, are overexpressed and result in further development of the tumor. The images of this dataset have been taken from colon tissues of patients with CRC, stained with specific antibodies for CD3, CD8, CD45RO, PD-1, LAG3 and Tim3. The dataset is named CRC-ICM and contains 1756 images related to 136 patients. The initial version of CRC-ICM is published on the Elsevier Mendeley dataset portal, and the latest version is accessible via: https://databiox.com

Single Image Reflection Separation via Component Synergy

  • paper_url: http://arxiv.org/abs/2308.10027
  • repo_url: https://github.com/mingcv/dsrnet
  • paper_authors: Qiming Hu, Xiaojie Guo
  • for: This work proposes a more general superposition model to better capture residual information, making the separated layers more complete.
  • methods: Based on an investigation of the weaknesses of existing models, a learnable residue term is introduced to capture residual information during decomposition (a code sketch follows this entry). The network structure is further designed elaborately, including a novel dual-stream interaction mechanism and a powerful decomposition network.
  • results: Extensive experiments and ablation studies show that the method outperforms state-of-the-art approaches on multiple real-world benchmark datasets. The code is available at https://github.com/mingcv/DSRNet.
    Abstract The reflection superposition phenomenon is complex and widely distributed in the real world, which derives various simplified linear and nonlinear formulations of the problem. In this paper, based on the investigation of the weaknesses of existing models, we propose a more general form of the superposition model by introducing a learnable residue term, which can effectively capture residual information during decomposition, guiding the separated layers to be complete. In order to fully capitalize on its advantages, we further design the network structure elaborately, including a novel dual-stream interaction mechanism and a powerful decomposition network with a semantic pyramid encoder. Extensive experiments and ablation studies are conducted to verify our superiority over state-of-the-art approaches on multiple real-world benchmark datasets. Our code is publicly available at https://github.com/mingcv/DSRNet.
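
A minimal sketch of the generalized superposition formulation described above, I ≈ T + R + residue, with a placeholder decomposition network and a learnable residue head; the architecture and loss are illustrative assumptions, not DSRNet itself.

```python
import torch
import torch.nn as nn

class ResidueSuperposition(nn.Module):
    """I ≈ T + R + residue: generalized superposition with a learnable residue term.

    `backbone` is a stand-in network predicting transmission T and reflection R
    from the mixed image; the residue head captures what the linear model misses.
    (Placeholder architecture, not the paper's DSRNet.)
    """
    def __init__(self, ch=3, hidden=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * ch, 3, padding=1),
        )
        self.residue_head = nn.Sequential(
            nn.Conv2d(3 * ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, ch, 3, padding=1),
        )

    def forward(self, mixed):
        t, r = self.backbone(mixed).chunk(2, dim=1)
        residue = self.residue_head(torch.cat([mixed, t, r], dim=1))
        recon = t + r + residue
        return t, r, residue, recon

model = ResidueSuperposition()
mixed = torch.rand(2, 3, 64, 64)
t, r, residue, recon = model(mixed)
loss = nn.functional.l1_loss(recon, mixed)   # reconstruction term of a training loss
print(loss.item())
```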

Interpretation on Multi-modal Visual Fusion

  • paper_url: http://arxiv.org/abs/2308.10019
  • repo_url: None
  • paper_authors: Hao Chen, Haoran Zhou, Yongjian Deng
  • for: This paper presents an analytical framework and a novel metric to shed light on interpretation in the multimodal vision community.
  • methods: The approach measures the proposed semantic variance and feature similarity across modalities and levels, and conducts semantic and quantitative analyses through comprehensive experiments (a simplified code sketch follows this entry).
  • results: The study reveals discrepancies in cross-modal features and a hybrid multi-modal cooperation rule, findings that help rethink and design multi-modal visual fusion models.
    Abstract In this paper, we present an analytical framework and a novel metric to shed light on the interpretation of the multimodal vision community. Our approach involves measuring the proposed semantic variance and feature similarity across modalities and levels, and conducting semantic and quantitative analyses through comprehensive experiments. Specifically, we investigate the consistency and speciality of representations across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a multi-modality model. Our studies reveal several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and speciality simultaneously for complementary inference. Through our dissection and findings on multi-modal fusion, we facilitate a rethinking of the reasonability and necessity of popular multi-modal vision fusion strategies. Furthermore, our work lays the foundation for designing a trustworthy and universal multi-modal fusion model for a variety of tasks in the future.
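
The exact definitions of the proposed semantic variance and feature similarity are given in the paper; the sketch below uses simple stand-ins (per-channel spatial variance and mean cosine similarity between spatially aligned features of two modalities) to illustrate the kind of measurements involved.

```python
import torch
import torch.nn.functional as F

def semantic_variance(feat):
    """Variance of channel activations across spatial positions, averaged over channels.

    feat: (B, C, H, W). A simple stand-in for the paper's semantic-variance metric.
    """
    flat = feat.flatten(2)                  # (B, C, H*W)
    return flat.var(dim=-1).mean().item()

def cross_modal_similarity(feat_a, feat_b):
    """Mean cosine similarity between spatially aligned features of two modalities."""
    a = F.normalize(feat_a.flatten(2), dim=1)
    b = F.normalize(feat_b.flatten(2), dim=1)
    return (a * b).sum(dim=1).mean().item()

rgb_feat, depth_feat = torch.randn(4, 256, 14, 14), torch.randn(4, 256, 14, 14)
print(semantic_variance(rgb_feat), cross_modal_similarity(rgb_feat, depth_feat))
```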

Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2308.10016
  • repo_url: https://github.com/yanghai-1218/pseudoflow
  • paper_authors: Yang Hai, Rui Song, Jiaojiao Li, David Ferstl, Yinlin Hu
  • for: This paper targets self-supervised 6D object pose estimation without any auxiliary information.
  • methods: The method first obtains a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh, then introduces a refinement strategy based on a geometry constraint over synthetic-to-real image pairs from multiple different views, formulated as pixel-level flow consistency with dynamically generated pseudo labels (a code sketch follows this entry).
  • results: Evaluated on three challenging datasets, the method significantly outperforms previous self-supervised methods, with neither 2D annotations nor additional depth images.
    Abstract Most self-supervised 6D object pose estimation methods can only work with additional depth information or rely on the accurate annotation of 2D segmentation masks, limiting their application range. In this paper, we propose a 6D object pose estimation method that can be trained with pure RGB images without any auxiliary information. We first obtain a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh. Then, we introduce a refinement strategy leveraging the geometry constraint in synthetic-to-real image pairs from multiple different views. We formulate this geometry constraint as pixel-level flow consistency between the training images with dynamically generated pseudo labels. We evaluate our method on three challenging datasets and demonstrate that it outperforms state-of-the-art self-supervised methods significantly, with neither 2D annotations nor additional depth images.
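
A generic sketch of a pixel-level flow-consistency loss of the kind described above: features of one view are backward-warped with a (pseudo-labeled) flow field and compared with the other view under a validity mask. The warping helper, mask usage, and L1 penalty are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Backward-warp feat (B,C,H,W) with a dense flow field (B,2,H,W) in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                              # sampling coordinates
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)

def flow_consistency_loss(feat_a, feat_b, flow_ab, valid_mask):
    """Penalize disagreement between feat_a and feat_b warped by the pseudo flow."""
    warped_b = warp_with_flow(feat_b, flow_ab)
    diff = (feat_a - warped_b).abs().mean(dim=1, keepdim=True)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)

fa, fb = torch.randn(2, 16, 32, 32), torch.randn(2, 16, 32, 32)
flow = torch.zeros(2, 2, 32, 32)          # zero flow: views already aligned
mask = torch.ones(2, 1, 32, 32)           # pseudo-label confidence mask
print(flow_consistency_loss(fa, fb, flow, mask).item())
```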

DyFFPAD: Dynamic Fusion of Convolutional and Handcrafted Features for Fingerprint Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2308.10015
  • repo_url: None
  • paper_authors: Anuj Rai, Parsheel Kumar Tiwari, Jyotishna Baishya, Ram Prakash Sharma, Somnath Dey
  • for: The paper is written for the purpose of detecting presentation attacks in automatic fingerprint recognition systems, which are a threat to their wide range of applications in areas including national borders and commercial applications.
  • methods: The paper proposes a dynamic ensemble of deep learning and handcrafted features to detect presentation attacks in known-material and unknown-material protocols. The proposed model combines both deep CNN and handcrafted features, and learns their parameters together to exhibit better performance than individual results.
  • results: The proposed model is validated using benchmark LivDet 2015, 2017, and 2019 databases, and achieves an overall accuracy of 96.10%, 96.49%, and 95.99% on them, respectively. The proposed model outperforms state-of-the-art methods in benchmark protocols of presentation attack detection in terms of classification accuracy.
    Abstract Automatic fingerprint recognition systems suffer from the threat of presentation attacks due to their wide range of applications in areas including national borders and commercial applications. Presentation attacks can be performed by fabricating the fake fingerprint of a user with or without the intention of the subject. This paper presents a dynamic ensemble of deep learning and handcrafted features to detect presentation attacks in known-material and unknown-material protocols. The proposed model is a dynamic ensemble of a deep CNN and handcrafted-feature-empowered deep neural networks, both of which learn their parameters together. The proposed presentation attack detection model, in this way, utilizes the capabilities of both classification techniques and exhibits better performance than their individual results. The proposed model's performance is validated using benchmark LivDet 2015, 2017, and 2019 databases, with an overall accuracy of 96.10%, 96.49%, and 95.99% attained on them, respectively. The proposed model outperforms state-of-the-art methods in benchmark protocols of presentation attack detection in terms of classification accuracy.
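
A toy sketch of jointly training a deep CNN branch and a handcrafted-feature branch through a shared classifier, the general idea behind the ensemble described above; the tiny CNN, the 59-dimensional handcrafted descriptor (e.g., an LBP histogram), and the fusion by concatenation are placeholders, not the DyFFPAD architecture.

```python
import torch
import torch.nn as nn

class FusionPAD(nn.Module):
    """Toy fusion of deep and handcrafted features for presentation attack detection.

    `handcrafted_dim` is the length of a precomputed descriptor (e.g. an LBP
    histogram); both branches are trained jointly through the shared classifier.
    (Illustrative placeholder, not the DyFFPAD architecture.)
    """
    def __init__(self, handcrafted_dim=59, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.handcrafted_mlp = nn.Sequential(nn.Linear(handcrafted_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, fingerprint_img, handcrafted_feat):
        deep = self.cnn(fingerprint_img)
        hand = self.handcrafted_mlp(handcrafted_feat)
        return self.classifier(torch.cat([deep, hand], dim=1))

model = FusionPAD()
logits = model(torch.rand(4, 1, 96, 96), torch.rand(4, 59))
print(logits.shape)  # (4, 2): live vs. presentation attack
```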

Partition-and-Debias: Agnostic Biases Mitigation via A Mixture of Biases-Specific Experts

  • paper_url: http://arxiv.org/abs/2308.10005
  • repo_url: https://github.com/Jiaxuan-Li/PnD
  • paper_authors: Jiaxuan Li, Duc Minh Vo, Hideki Nakayama
  • for: Bias mitigation in image classification, especially when the type or number of biases in the dataset is unknown.
  • methods: The Partition-and-Debias (PnD) method uses a mixture of bias-specific experts to implicitly divide the bias space into multiple subspaces, and a gating module to find a consensus among experts for debiased classification (a generic mixture-of-experts sketch follows this entry).
  • results: Experiments on both public and constructed benchmarks demonstrate the efficacy of PnD.
    Abstract Bias mitigation in image classification has been widely researched, and existing methods have yielded notable results. However, most of these methods implicitly assume that a given image contains only one type of known or unknown bias, failing to consider the complexities of real-world biases. We introduce a more challenging scenario, agnostic biases mitigation, aiming at bias removal regardless of whether the type of bias or the number of types is unknown in the datasets. To address this difficult task, we present the Partition-and-Debias (PnD) method that uses a mixture of biases-specific experts to implicitly divide the bias space into multiple subspaces and a gating module to find a consensus among experts to achieve debiased classification. Experiments on both public and constructed benchmarks demonstrated the efficacy of the PnD. Code is available at: https://github.com/Jiaxuan-Li/PnD.
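
A generic mixture-of-experts sketch matching the description above: bias-specific experts classify from a shared feature and a gating module weighs them into a consensus prediction. The training objective that makes the experts bias-specific in PnD is not reproduced here.

```python
import torch
import torch.nn as nn

class BiasExpertMixture(nn.Module):
    """Mixture of bias-specific experts with a gating module (generic MoE sketch).

    Each expert classifies from the shared feature; the gate weighs the experts,
    acting as a consensus over implicit bias subspaces. (Conceptual only; PnD's
    partitioning objective is described in the paper.)
    """
    def __init__(self, feat_dim=128, num_experts=4, num_classes=10):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, feat):
        weights = torch.softmax(self.gate(feat), dim=-1)                      # (B, E)
        expert_logits = torch.stack([e(feat) for e in self.experts], dim=1)   # (B, E, C)
        return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)             # consensus logits

moe = BiasExpertMixture()
print(moe(torch.randn(8, 128)).shape)  # (8, 10)
```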

Efficient Multi-View Inverse Rendering Using a Hybrid Differentiable Rendering Method

  • paper_url: http://arxiv.org/abs/2308.10003
  • repo_url: https://github.com/HsiangYangChu/DRBIR
  • paper_authors: Xiangyang Zhu, Yiling Pan, Bailin Deng, Bin Wang
  • for: The paper aims to efficiently recover the 3D geometry and reflectance of real-world objects from multi-view images using a hybrid differentiable rendering method.
  • methods: The method has two phases: an initialization phase that reconstructs a virtual scene roughly matching the real one with traditional SfM and MVS, and an optimization phase that refines geometry and reflectance, where geometry is optimized with an approximate differentiable rendering method and reflectance with a physically-based one.
  • results: On both synthetic and real data, the method produces reconstructions of similar or higher quality than state-of-the-art methods while being more efficient.
    Abstract Recovering the shape and appearance of real-world objects from natural 2D images is a long-standing and challenging inverse rendering problem. In this paper, we introduce a novel hybrid differentiable rendering method to efficiently reconstruct the 3D geometry and reflectance of a scene from multi-view images captured by conventional hand-held cameras. Our method follows an analysis-by-synthesis approach and consists of two phases. In the initialization phase, we use traditional SfM and MVS methods to reconstruct a virtual scene roughly matching the real scene. Then in the optimization phase, we adopt a hybrid approach to refine the geometry and reflectance, where the geometry is first optimized using an approximate differentiable rendering method, and the reflectance is optimized afterward using a physically-based differentiable rendering method. Our hybrid approach combines the efficiency of approximate methods with the high-quality results of physically-based methods. Extensive experiments on synthetic and real data demonstrate that our method can produce reconstructions with similar or higher quality than state-of-the-art methods while being more efficient.

AltNeRF: Learning Robust Neural Radiance Field via Alternating Depth-Pose Optimization

  • paper_url: http://arxiv.org/abs/2308.10001
  • repo_url: None
  • paper_authors: Kun Wang, Zhiqiang Yan, Huang Tian, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang
  • for: High-quality novel view synthesis from sparse scene images.
  • methods: Self-supervised monocular depth estimation (SMDE) learns depth and pose priors from monocular videos without known camera poses, and an alternating algorithm refines these priors together with the NeRF representation.
  • results: The method produces high-fidelity and robust novel views, and copes with imprecise camera poses and the lack of explicit 3D supervision.
    Abstract Neural Radiance Fields (NeRF) have shown promise in generating realistic novel views from sparse scene images. However, existing NeRF approaches often encounter challenges due to the lack of explicit 3D supervision and imprecise camera poses, resulting in suboptimal outcomes. To tackle these issues, we propose AltNeRF -- a novel framework designed to create resilient NeRF representations using self-supervised monocular depth estimation (SMDE) from monocular videos, without relying on known camera poses. SMDE in AltNeRF masterfully learns depth and pose priors to regulate NeRF training. The depth prior enriches NeRF's capacity for precise scene geometry depiction, while the pose prior provides a robust starting point for subsequent pose refinement. Moreover, we introduce an alternating algorithm that harmoniously melds NeRF outputs into SMDE through a consistence-driven mechanism, thus enhancing the integrity of depth priors. This alternation empowers AltNeRF to progressively refine NeRF representations, yielding the synthesis of realistic novel views. Additionally, we curate a distinctive dataset comprising indoor videos captured via mobile devices. Extensive experiments showcase the compelling capabilities of AltNeRF in generating high-fidelity and robust novel views that closely resemble reality.

TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras

  • paper_url: http://arxiv.org/abs/2308.09993
  • repo_url: None
  • paper_authors: Hongwei Ren, Yue Zhou, Haotian Fu, Yulong Huang, Renjing Xu, Bojun Cheng
  • for: This work proposes TTPOINT, a lightweight and generalized point cloud network for action recognition with event cameras.
  • methods: The model processes event data as point clouds and uses tensor-train compressed feature extractors to reduce computational complexity and parameter count (a minimal tensor-train layer sketch follows this entry).
  • results: TTPOINT reaches state-of-the-art (SOTA) performance on three datasets and is SOTA among point cloud methods on all five datasets; with tensor-train decomposition, accuracy is almost unaffected while the parameter size is compressed by 55%.
    Abstract Event cameras have gained popularity in computer vision due to their data sparsity, high dynamic range, and low latency. As a bio-inspired sensor, event cameras generate sparse and asynchronous data, which is inherently incompatible with the traditional frame-based method. Alternatively, the point-based method can avoid additional modality transformation and naturally adapt to the sparsity of events. Still, it typically cannot reach a comparable accuracy as the frame-based method. We propose a lightweight and generalized point cloud network called TTPOINT which achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in action recognition tasks while only using 1.5 % of the computational resources. The model is adept at abstracting local and global geometry by hierarchy structure. By leveraging tensor-train compressed feature extractors, TTPOINT can be designed with minimal parameters and computational complexity. Additionally, we developed a straightforward downsampling algorithm to maintain the spatio-temporal feature. In the experiment, TTPOINT emerged as the SOTA method on three datasets while also attaining SOTA among point cloud methods on all five datasets. Moreover, by using the tensor-train decomposition method, the accuracy of the proposed TTPOINT is almost unaffected while compressing the parameter size by 55 % in all five datasets.
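
A minimal two-core tensor-train linear layer, illustrating the kind of tensor-train compression the abstract refers to: a dense weight of shape (m1·m2) x (n1·n2) is replaced by two small cores contracted with einsum. Shapes, rank, and initialization are illustrative; TTPOINT's actual factorization is described in the paper.

```python
import torch
import torch.nn as nn

class TTLinear2(nn.Module):
    """Two-core tensor-train linear layer (a minimal sketch of TT compression).

    A dense weight of shape (m1*m2, n1*n2) is replaced by cores
    G1: (m1, n1, rank) and G2: (rank, m2, n2), cutting parameters from
    m1*m2*n1*n2 down to m1*n1*rank + rank*m2*n2.
    """
    def __init__(self, m=(16, 16), n=(8, 8), rank=4):
        super().__init__()
        (m1, m2), (n1, n2) = m, n
        self.m, self.n = m, n
        self.g1 = nn.Parameter(torch.randn(m1, n1, rank) * 0.1)
        self.g2 = nn.Parameter(torch.randn(rank, m2, n2) * 0.1)

    def forward(self, x):                        # x: (B, m1*m2)
        b = x.shape[0]
        x3 = x.view(b, self.m[0], self.m[1])     # (B, m1, m2)
        y = torch.einsum("bik,ijr,rkl->bjl", x3, self.g1, self.g2)
        return y.reshape(b, self.n[0] * self.n[1])

layer = TTLinear2()
print(layer(torch.randn(4, 256)).shape)          # (4, 64)
print(sum(p.numel() for p in layer.parameters()), "params vs", 256 * 64, "dense")
```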

AltDiffusion: A Multilingual Text-to-Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09991
  • repo_url: https://github.com/superhero-7/altdiffuson
  • paper_authors: Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
  • for: This paper extends text-to-image (T2I) diffusion models to a multilingual setting so that users beyond English, Chinese, and Japanese are better served.
  • methods: A multilingual text encoder is trained via knowledge distillation (KD), plugged into a pretrained English-only diffusion model, and trained with a two-stage schema (concept alignment and quality improvement) on a large-scale multilingual dataset to enhance multilingual capability.
  • results: The model outperforms state-of-the-art T2I models such as Stable Diffusion on the Multilingual-General-18 and Multilingual-Cultural-18 benchmarks, especially on culture-specific concepts, while retaining comparable image quality.
    Abstract Large Text-to-Image(T2I) diffusion models have shown a remarkable capability to produce photorealistic and diverse images based on text inputs. However, existing works only support limited language input, e.g., English, Chinese, and Japanese, leaving users beyond these languages underserved and blocking the global expansion of T2I models. Therefore, this paper presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages. Specifically, we first train a multilingual text encoder based on the knowledge distillation. Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage schema to enhance the multilingual capability, including concept alignment and quality improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new benchmark, which includes Multilingual-General-18(MG-18) and Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I diffusion models for generating high-quality images and capturing culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion in multilingual understanding, especially with respect to culture-specific concepts, while still having comparable capability for generating high-quality images. All source code and checkpoints could be found in https://github.com/superhero-7/AltDiffuson.

TSAR-MVS: Textureless-aware Segmentation and Correlative Refinement Guided Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2308.09990
  • repo_url: None
  • paper_authors: Zhenlong Yuan, Jiakai Cao, Hao Jiang, Zhaoqi Wang, Zhaoxin Li
  • for: Addressing the reconstruction of textureless areas in multi-view stereo (MVS).
  • methods: The proposed method consists of (1) joint hypothesis filtering, (2) iterative correlation refinement, and (3) textureless-aware segmentation (a RANSAC plane-fitting sketch follows this entry).
  • results: Experiments on extensive datasets show that the method significantly outperforms most non-learning methods, exhibiting robustness to textureless areas while preserving fine details.
    Abstract The reconstruction of textureless areas has long been a challenging problem in MVS due to lack of reliable pixel correspondences between images. In this paper, we propose the Textureless-aware Segmentation And Correlative Refinement guided Multi-View Stereo (TSAR-MVS), a novel method that effectively tackles challenges posed by textureless areas in 3D reconstruction through filtering, refinement and segmentation. First, we implement joint hypothesis filtering, a technique that merges a confidence estimator with a disparity discontinuity detector to eliminate incorrect depth estimations. Second, to spread the pixels with confident depth, we introduce an iterative correlation refinement strategy that leverages RANSAC to generate superpixels, succeeded by a median filter for broadening the influence of accurately determined pixels. Finally, we present a textureless-aware segmentation method that leverages edge detection and line detection to accurately identify large textureless regions to be fitted using 3D planes. Experiments on extensive datasets demonstrate that our method significantly outperforms most non-learning methods and exhibits robustness to textureless areas while preserving fine details.
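
Fitting large textureless regions with 3D planes, as described above, can be sketched with a small RANSAC plane fit over depth samples of one region; thresholds, iteration counts, and the synthetic data below are arbitrary illustration values, not the paper's settings.

```python
import numpy as np

def ransac_plane(points, iters=200, thresh=0.01, seed=0):
    """Fit a plane z = a*x + b*y + c to 3D points with RANSAC.

    points: (N, 3) array of (x, y, z) samples from one textureless region.
    Returns (a, b, c) of the best model and the inlier mask.
    """
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), bool)
    for _ in range(iters):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        A = np.c_[sample[:, :2], np.ones(3)]
        try:
            a, b, c = np.linalg.solve(A, sample[:, 2])
        except np.linalg.LinAlgError:
            continue                          # degenerate (collinear) sample
        residual = np.abs(points[:, 0] * a + points[:, 1] * b + c - points[:, 2])
        inliers = residual < thresh
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers

# synthetic region: z = 0.1x - 0.2y + 0.5 with noise and a few outliers
xy = np.random.default_rng(1).uniform(0, 1, (500, 2))
z = 0.1 * xy[:, 0] - 0.2 * xy[:, 1] + 0.5 + 0.002 * np.random.default_rng(2).standard_normal(500)
z[:20] += 0.5                                 # outliers
model, inliers = ransac_plane(np.c_[xy, z])
print(model, int(inliers.sum()), "inliers")
```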

Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

  • paper_url: http://arxiv.org/abs/2308.09983
  • repo_url: None
  • paper_authors: Yichen Zhang, Yifang Yin, Ying Zhang, Zhenguang Liu, Zheng Wang, Roger Zimmermann
  • for: The goal is to improve the accuracy of automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings for cervical cancer care.
  • methods: Cross-domain cervical images collected in different but related clinical studies are leveraged to improve performance on the targeted cervix dataset; a prototype-based knowledge filtering method estimates the transferability of cross-domain samples (a simplified sketch follows this entry), and the shared feature space is aligned at both the domain and class level.
  • results: On three real-world benchmark cervical image datasets, the method outperforms state-of-the-art cervical dysplasia visual inspection by absolute improvements of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.
    Abstract Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images that were collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method to estimate the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously on domain level with early alignment and class level with supervised contrastive learning, which endows model training and knowledge transfer with stronger robustness. The empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms the state-of-the-art cervical dysplasia visual inspection by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.
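
A simplified sketch of prototype-based transferability estimation in the spirit of the description above: class prototypes are built from target-domain features and cross-domain samples are scored by cosine similarity to their class prototype. The median-based filtering rule at the end is an assumption, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Mean (L2-normalized) feature per class: simple class prototypes."""
    protos = torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(protos, dim=-1)

def transferability_scores(src_features, src_labels, prototypes):
    """Cosine similarity of each source sample to its own-class target prototype.

    Higher scores suggest the sample transfers well; a threshold or top-k rule
    can then filter the cross-domain data.
    """
    feats = F.normalize(src_features, dim=-1)
    return (feats * prototypes[src_labels]).sum(dim=-1)

num_classes = 2
tgt_feats, tgt_labels = torch.randn(100, 64), torch.randint(0, num_classes, (100,))
src_feats, src_labels = torch.randn(500, 64), torch.randint(0, num_classes, (500,))
protos = class_prototypes(tgt_feats, tgt_labels, num_classes)
scores = transferability_scores(src_feats, src_labels, protos)
keep = scores > scores.median()               # keep the more transferable half
print(keep.sum().item(), "source samples kept")
```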

Breast Lesion Diagnosis Using Static Images and Dynamic Video

  • paper_url: http://arxiv.org/abs/2308.09980
  • repo_url: None
  • paper_authors: Yunwen Huang, Hongyu Hu, Ying Zhu, Yi Xu
  • for: This work develops a deep learning based computer-aided diagnosis system to assist breast lesion diagnosis.
  • methods: The system fuses multi-modality features from static breast ultrasound images and dynamic videos, aggregating the dynamic video features under the guidance of the static images carefully selected by radiologists.
  • results: Validated on a dataset of 897 sets of ultrasound images and videos, the model boosts benign/malignant classification, achieving 90.0% AUC and 81.7% accuracy.
    Abstract Deep learning based Computer Aided Diagnosis (CAD) systems have been developed to treat breast ultrasound. Most of them focus on a single ultrasound imaging modality, either using representative static images or the dynamic video of a real-time scan. In fact, these two image modalities are complementary for lesion diagnosis. Dynamic videos provide detailed three-dimensional information about the lesion, while static images capture the typical sections of the lesion. In this work, we propose a multi-modality breast tumor diagnosis model to imitate the diagnosing process of radiologists, which learns the features of both static images and dynamic video and explores the potential relationship between the two modalities. Considering that static images are carefully selected by professional radiologists, we propose to aggregate dynamic video features under the guidance of domain knowledge from static images before fusing multi-modality features. Our work is validated on a breast ultrasound dataset composed of 897 sets of ultrasound images and videos. Experimental results show that our model boosts the performance of Benign/Malignant classification, achieving 90.0% in AUC and 81.7% in accuracy.

Whether you can locate or not? Interactive Referring Expression Generation

  • paper_url: http://arxiv.org/abs/2308.09977
  • repo_url: https://github.com/superhero-7/ireg
  • paper_authors: Fulong Ye, Yuxing Long, Fangxiang Feng, Xiaojie Wang
  • for: This work aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with the dual task of Referring Expression Comprehension (REC) to locate the referred object.
  • methods: An Interactive REG (IREG) model is proposed that interacts with a real REC model, using signals indicating whether the object is located, and the visual region returned by the REC model, to gradually modify the REs.
  • results: Experiments on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics, and a human evaluation shows that IREG generates better REs with the capability of interaction.
    Abstract Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object. Existing methods construct REG models independently by using only the REs as ground truth for model training, without considering the potential interaction between REG and REC models. In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. Our experimental results on three RE benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics. Furthermore, a human evaluation shows that IREG generates better REs with the capability of interaction.

DESOBAv2: Towards Large-scale Real-world Dataset for Shadow Generation

  • paper_url: http://arxiv.org/abs/2308.09972
  • repo_url: None
  • paper_authors: Qingyang Liu, Jianting Wang, Li Niu
  • for: This work focuses on generating plausible shadows for inserted foreground objects so that composite images look more realistic.
  • methods: Object-shadow detection and inpainting techniques are used: a pretrained inpainting model fills in the shadow regions of collected outdoor scenes, yielding deshadowed images from which pairs of synthetic composite images and ground-truth targets are constructed.
  • results: A large-scale dataset, DESOBAv2, is created, supplementing the small-scale DESOBA dataset for training and evaluating shadow generation.
    Abstract Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating plausible shadow for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset DESOBA, we create a large-scale dataset called DESOBAv2 by using object-shadow detection and inpainting techniques. Specifically, we collect a large number of outdoor scene images with object-shadow pairs. Then, we use pretrained inpainting model to inpaint the shadow region, resulting in the deshadowed images. Based on real images and deshadowed images, we can construct pairs of synthetic composite images and ground-truth target images. Dataset is available at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.

NeutrEx: A 3D Quality Component Measure on Facial Expression Neutrality

  • paper_url: http://arxiv.org/abs/2308.09963
  • repo_url: None
  • paper_authors: Marcel Grimmer, Christian Rathgeb, Raymond Veldhuis, Christoph Busch
  • for: The paper proposes an expression-neutrality quality measure based on 3D face reconstruction, to ensure that low-quality face images do not degrade recognition accuracy.
  • methods: The NeutrEx measure accumulates the distances between a 3D face reconstruction and a neutral-expression anchor (a minimal sketch follows this entry); baselines are Support Vector Machines trained on face embeddings extracted from a pretrained convolutional neural network for facial expression classification.
  • results: The proposed measure outperforms the face-embedding baselines and is explainable: per-vertex distances reveal the most impactful face regions and allow operators to give actionable feedback to subjects.
    Abstract Accurate face recognition systems are increasingly important in sensitive applications like border control or migration management. Therefore, it becomes crucial to quantify the quality of facial images to ensure that low-quality images are not affecting recognition accuracy. In this context, the current draft of ISO/IEC 29794-5 introduces the concept of component quality to estimate how single factors of variation affect recognition outcomes. In this study, we propose a quality measure (NeutrEx) based on the accumulated distances of a 3D face reconstruction to a neutral expression anchor. Our evaluations demonstrate the superiority of our proposed method compared to baseline approaches obtained by training Support Vector Machines on face embeddings extracted from a pre-trained Convolutional Neural Network for facial expression classification. Furthermore, we highlight the explainable nature of our NeutrEx measures by computing per-vertex distances to unveil the most impactful face regions and allow operators to give actionable feedback to subjects.
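
A minimal sketch of the accumulated-distance idea described above: per-vertex Euclidean distances between a reconstructed 3D face and a neutral-expression anchor, summed into a score and kept per vertex for explainability. It assumes both meshes share vertex correspondence and alignment; any normalization NeutrEx applies is not shown.

```python
import numpy as np

def neutrality_score(reconstructed_vertices, neutral_anchor_vertices):
    """Accumulated per-vertex distance to a neutral-expression anchor mesh.

    Both arrays are (V, 3), assumed to be in correspondence and aligned.
    Returns the scalar score (lower = more neutral) and the per-vertex distances,
    which can be visualized to show the most impactful face regions.
    """
    per_vertex = np.linalg.norm(reconstructed_vertices - neutral_anchor_vertices, axis=1)
    return per_vertex.sum(), per_vertex

rng = np.random.default_rng(0)
anchor = rng.standard_normal((5023, 3))            # e.g. a FLAME-topology mesh
probe = anchor + 0.01 * rng.standard_normal((5023, 3))
score, per_vertex = neutrality_score(probe, anchor)
print(round(float(score), 2), int(per_vertex.argmax()), "= most displaced vertex")
```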

UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

  • paper_url: http://arxiv.org/abs/2308.09953
  • repo_url: https://github.com/rese1f/UniAP
  • paper_authors: Meiqi Sun, Zhonghan Zhao, Wenhao Chai, Hanjun Luo, Shidong Cao, Yanting Zhang, Jenq-Neng Hwang, Gaoang Wang
  • for: Automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research.
  • methods: Few-shot learning enables cross-species perception across visual tasks, transferring knowledge through shared visual characteristics among different animals and tasks.
  • results: The model generalizes and adapts across animal perception tasks, quickly adapting to new species with only a few labeled examples.
    Abstract Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples.

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

  • paper_url: http://arxiv.org/abs/2308.09951
  • repo_url: None
  • paper_authors: Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin
  • for: This paper aims to strengthen object-centric representations for unsupervised video object discovery and dense label propagation.
  • methods: It builds on self-supervised learning, using query slot attention to extract semantic components and random-sampling-based slot attention to exploit temporal correspondence cues between frames, and proposes a semantic-aware masked slot attention on top of the fused semantic features and correspondence maps (a generic slot-attention sketch follows this entry).
  • results: Experiments show the model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery and state-of-the-art performance on dense label propagation, demonstrating the potential for object-centric analysis.
    Abstract Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
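
A compact sketch of standard slot attention with slots sampled from a learnable Gaussian, the building block the method above extends; the semantic masking, per-semantic random sampling, and two-stage design of the paper are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Compact slot attention: slots sampled from a learnable Gaussian, then
    refined by iterative attention over input features (standard formulation)."""
    def __init__(self, num_slots=6, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slot = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs):                       # inputs: (B, N, D)
        b, _, d = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.mu + self.log_sigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slot(slots))
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)   # slots compete
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # weighted mean
            updates = attn @ v
            slots = self.gru(updates.reshape(-1, d), slots.reshape(-1, d)).view(b, -1, d)
        return slots

slots = SlotAttention()(torch.randn(2, 196, 64))
print(slots.shape)                                   # (2, 6, 64)
```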

Scene-Aware Feature Matching

  • paper_url: http://arxiv.org/abs/2308.09949
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Xiaoyong Lu, Yaping Yan, Tong Wei, Songlin Du
  • for: The paper is written for improving the performance of feature matching in computer vision tasks, particularly in handling challenging scenes with large viewpoint and illumination changes.
  • methods: The paper proposes a novel model named SAM, which applies attentional grouping to guide Scene-Aware feature Matching. The model uses attention layers to handle multi-level features, including image tokens and group tokens, and groups the image tokens with the proposed token grouping module.
  • results: The paper achieves state-of-the-art performance on various applications, including homography estimation, pose estimation, and image matching, and demonstrates that the proposed model is more accurate, robust, and interpretable than conventional feature matching models.
    Abstract Current feature matching methods focus on point-level matching, pursuing better representation learning of individual features, but lacking further understanding of the scene. This results in significant performance degradation when handling challenging scenes such as scenes with large viewpoint and illumination changes. To tackle this problem, we propose a novel model named SAM, which applies attentional grouping to guide Scene-Aware feature Matching. SAM handles multi-level features, i.e., image tokens and group tokens, with attention layers, and groups the image tokens with the proposed token grouping module. Our model can be trained by ground-truth matches only and produce reasonable grouping results. With the sense-aware grouping guidance, SAM is not only more accurate and robust but also more interpretable than conventional feature matching models. Sufficient experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that our model achieves state-of-the-art performance.

Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling

  • paper_url: http://arxiv.org/abs/2308.09946
  • repo_url: None
  • paper_authors: Guiqin Wang, Peng Zhao, Cong Zhao, Shusen Yang, Jie Cheng, Luziwei Leng, Jianxing Liao, Qinghai Guo
  • for: This paper targets weakly-supervised action localization, i.e., recognizing and localizing action instances in untrimmed videos with only video-level labels.
  • methods: A novel attention-based, hierarchically-structured latent model learns the temporal variations of feature semantics. It has two components: an unsupervised change-point detection module that learns latent representations of video features in a temporal hierarchy based on their rates of change, and an attention-based classification model that selects the change-points of the foreground as boundaries.
  • results: Extensive experiments on the THUMOS-14 and ActivityNet-v1.3 benchmarks show that the method outperforms current state-of-the-art methods and even achieves comparable performance with fully-supervised methods.
    Abstract Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. The MIL-based methods are relatively well studied with cogent performance achieved on classification but not on localization. Generally, they locate temporal regions by the video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components, the first is an unsupervised change-points detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.

Dual Branch Deep Learning Network for Detection and Stage Grading of Diabetic Retinopathy

  • paper_url: http://arxiv.org/abs/2308.09945
  • repo_url: None
  • paper_authors: Hossein Shakibania, Sina Raoufi, Behnam Pourafkham, Hassan Khotanlou, Muharram Mansoorizadeh
  • for: This paper proposes a deep learning method for the detection and stage grading of diabetic retinopathy, to support early diagnosis and treatment of this diabetes complication.
  • methods: Two state-of-the-art pretrained models serve as feature extractors and are fine-tuned on a new dataset; the model is trained on a large multi-center dataset including APTOS 2019.
  • results: On APTOS 2019 the method outperforms the established literature: for binary classification it achieves 98.50% accuracy, 99.46% sensitivity, and 97.51% specificity; for stage grading it achieves a quadratic weighted kappa of 93.00%, 89.60% accuracy, 89.60% sensitivity, and 97.72% specificity (a small kappa implementation follows this entry).
    Abstract Diabetic retinopathy is a severe complication of diabetes that can lead to permanent blindness if not treated promptly. Early and accurate diagnosis of the disease is essential for successful treatment. This paper introduces a deep learning method for the detection and stage grading of diabetic retinopathy, using a single fundus retinal image. Our model utilizes transfer learning, employing two state-of-the-art pre-trained models as feature extractors and fine-tuning them on a new dataset. The proposed model is trained on a large multi-center dataset, including the APTOS 2019 dataset, obtained from publicly available sources. It achieves remarkable performance in diabetic retinopathy detection and stage classification on the APTOS 2019, outperforming the established literature. For binary classification, the proposed approach achieves an accuracy of 98.50%, a sensitivity of 99.46%, and a specificity of 97.51%. In stage grading, it achieves a quadratic weighted kappa of 93.00%, an accuracy of 89.60%, a sensitivity of 89.60%, and a specificity of 97.72%. The proposed approach serves as a reliable screening and stage grading tool for diabetic retinopathy, offering significant potential to enhance clinical decision-making and patient care.
    摘要 糖尿病视网膜病变是糖尿病的严重并发症,如果不及时治疗,可能会导致永久失明。早期和准确的诊断是成功治疗的关键。本文介绍了一种深度学习方法,用于检测和分级糖尿病视网膜病变,只需要一张眼底视网膜图像。我们的模型使用了迁移学习,利用两个最先进的预训练模型作为特征提取器,并在新的数据集上进行了微调。我们的模型在一个大型多中心数据集上进行了训练,其中包括公开可用的APTOS 2019数据集。它在糖尿病视网膜病变的检测和分级中实现了显著的表现,超过了现有文献。对二分类问题,我们的方法实现了98.50%的准确率,99.46%的敏感度和97.51%的特异度。在分级方面,我们的方法实现了93.00%的二次加权kappa值,89.60%的准确率,89.60%的敏感度和97.72%的特异度。我们的方法可以作为糖尿病视网膜病变的可靠筛查和分级工具,为临床决策和患者护理提供了显著的潜在价值。
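
A minimal sketch of the dual-branch transfer-learning idea described above: two pretrained backbones serve as feature extractors, their pooled features are concatenated, and a small head performs detection or stage grading. The specific backbones (ResNet-50 / DenseNet-121) and head sizes are illustrative assumptions, not necessarily the paper's exact choices.

```python
# Hypothetical sketch of a dual-branch DR classifier; the paper's actual
# backbones, pooling, and head design may differ.
import torch
import torch.nn as nn
from torchvision import models

class DualBranchDR(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        # Two pretrained feature extractors (assumed backbone choices).
        self.branch_a = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.branch_a.fc = nn.Identity()          # -> 2048-d features
        self.branch_b = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        self.branch_b.classifier = nn.Identity()  # -> 1024-d features
        # Fusion head over the concatenated features (binary detection: num_classes=2).
        self.head = nn.Sequential(
            nn.Linear(2048 + 1024, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        feats = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
        return self.head(feats)

logits = DualBranchDR(num_classes=5)(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 5])
```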

On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion

  • paper_url: http://arxiv.org/abs/2308.09942
  • repo_url: https://github.com/yushu-li/owttt
  • paper_authors: Yushu Li, Xun Xu, Yongyi Su, Kui Jia
  • for: 这篇论文旨在提高 unknown target domain distribution 下的 deep learning 模型鲁棒性,并且可以在低延迟下进行 test-time training/adaptation (TTT/TTA)。
  • methods: 该论文提出了一种 adaptive strong OOD pruning 技术,以及一种动态扩展 prototype 的机制,以区分强 OOD 样本和弱 OOD 样本。此外,论文还提出了一种 distribution alignment 正则化,以提高 self-training 的效果。
  • results: 该论文在 5 个 OWTTT benchmark 上达到了 state-of-the-art 性能。
    Abstract Generalizing deep learning models to unknown target domain distribution with low latency has motivated research into test-time training/adaptation (TTT/TTA). Existing approaches often focus on improving test-time training performance under well-curated target domain data. As figured out in this work, many state-of-the-art methods fail to maintain the performance when the target domain is contaminated with strong out-of-distribution (OOD) data, a.k.a. open-world test-time training (OWTTT). The failure is mainly due to the inability to distinguish strong OOD samples from regular weak OOD samples. To improve the robustness of OWTTT we first develop an adaptive strong OOD pruning which improves the efficacy of the self-training TTT method. We further propose a way to dynamically expand the prototypes to represent strong OOD samples for an improved weak/strong OOD data separation. Finally, we regularize self-training with distribution alignment and the combination yields the state-of-the-art performance on 5 OWTTT benchmarks. The code is available at https://github.com/Yushu-Li/OWTTT.
    摘要 将深度学习模型泛化到未知的目标领域分布并保持低延迟,推动了测试时训练/自适应(Test-Time Training/Adaptation,TTT/TTA)的研究。现有的方法通常专注于在精心整理的目标领域数据下提高测试时训练性能。在这项工作中,我们发现许多最先进的方法在目标领域受到强分布外(Out-of-Distribution,OOD)数据污染时失效(即开放世界测试时训练,OWTTT),主要原因是不能正确分辨强OOD样本和普通的弱OOD样本。为了改善OWTTT的鲁棒性,我们首先开发了自适应的强OOD剪枝,该方法可以提高基于自训练的TTT方法的效果。我们还提出了动态扩展prototype来表示强OOD样本,以提高弱/强OOD数据的分离。最后,我们在自训练中加入分布对齐正则化,两者结合在5个OWTTT benchmark上取得了最先进的性能。代码可以在https://github.com/Yushu-Li/OWTTT中找到。
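
The abstract above describes separating strong from weak OOD samples and dynamically expanding prototypes; a rough sketch of one way this could look is given below. The cosine-similarity threshold and the expansion rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of prototype-based weak/strong OOD separation with
# dynamic prototype expansion; thresholds and update rules are illustrative.
import torch
import torch.nn.functional as F

def separate_and_expand(feats, class_protos, ood_protos, tau=0.5):
    """feats: (B, D) test-time features; class_protos: (C, D); ood_protos: list of (D,)."""
    feats = F.normalize(feats, dim=1)
    class_protos = F.normalize(class_protos, dim=1)
    best_cls, _ = (feats @ class_protos.t()).max(dim=1)   # best similarity to known classes

    is_strong_ood = best_cls < tau                        # far from every known class
    for f in feats[is_strong_ood]:
        if not ood_protos or torch.stack([f @ p for p in ood_protos]).max() < tau:
            ood_protos.append(f.clone())                  # expand the strong-OOD prototype pool
    return is_strong_ood, ood_protos

mask, ood = separate_and_expand(torch.randn(8, 16), torch.randn(4, 16), [])
print(mask.sum().item(), len(ood))
```

Samples flagged as strong OOD would then be pruned from self-training, while the remaining (weak OOD) samples are adapted.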

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

  • paper_url: http://arxiv.org/abs/2308.09936
  • repo_url: https://github.com/mlpc-ucsd/bliva
  • paper_authors: Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
  • for: 提高实际图像中文本的理解和处理能力,以便更好地解决实际场景中的视觉问答任务。
  • methods: combines InstructBLIP和Visual Assistant,Directly project encoded patch embeddings into the LLM,以帮助模型更好地捕捉图像中的细节。
  • results: 在处理文本含有图像的VQA benchmark tasks上显著提高性能(up to 17.76% in OCR-VQA benchmark),并在典型VQA benchmark tasks上也获得了显著提高(up to 7.9% in Visual Spatial Reasoning benchmark),并且可以在实际图像中处理文本不存在的情况下也表现出色。
    Abstract Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76\% in OCR-VQA benchmark) and in undertaking typical VQA benchmarks (up to 7.9\% in Visual Spatial Reasoning benchmark), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 13 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.git
    摘要 视觉语言模型(VLM)通过在大语言模型(LLM)中结合视觉理解能力来解决开放式视觉问答(VQA)任务,但是这些模型无法正确地理解包含文本的图像,而这在现实世界中非常常见。标准的图像信息抽取方法通常包括学习固定的查询嵌入。这些嵌入用于捕捉图像上下文,并在LLM中作为软提示输入。然而,这种过程受到token数量的限制,可能会妨碍对文本丰富场景的识别。为了解决这个问题,本研究提出了BLIVA:一种基于InstructBLIP的增强版,它既包括InstructBLIP的查询嵌入,又直接将编码后的图像补丁嵌入投影到LLM中,后者的做法灵感于LLaVA。这种方法帮助模型捕捉在查询解码过程中可能被遗漏的细节信息。实验证明,我们的模型BLIVA在处理文本丰富的VQA benchmark时表现出色,与基线InstructBLIP相比,提高了17.76%(在OCR-VQA benchmark中)和7.9%(在Visual Spatial Reasoning benchmark中)。BLIVA展示了解读现实世界图像的能力,无论其中是否包含文本。为了展示BLIVA在广泛的工业应用中的可能性,我们使用了一个新的数据集,其中包含YouTube缩略图及其对应的问答对,覆盖13种不同的类别。如果您想进一步探索,我们的代码和模型都可以在https://github.com/mlpc-ucsd/BLIVA.git免费获取。
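
A small sketch of the prompt-assembly idea described above: query embeddings (InstructBLIP-style) and directly projected patch embeddings (LLaVA-style) are both mapped into the LLM's embedding space and concatenated as soft prompts. All dimensions and module names below are illustrative assumptions.

```python
# Hypothetical sketch: combine query embeddings with directly projected ViT
# patch embeddings as soft prompts for the LLM.
import torch
import torch.nn as nn

class VisualPromptAssembler(nn.Module):
    def __init__(self, vit_dim=1024, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # learned query embeddings
        self.patch_proj = nn.Linear(vit_dim, llm_dim)      # direct patch projection

    def forward(self, query_emb, patch_emb):
        # query_emb: (B, n_query, qformer_dim); patch_emb: (B, n_patch, vit_dim)
        prompts = torch.cat([self.query_proj(query_emb),
                             self.patch_proj(patch_emb)], dim=1)
        return prompts  # (B, n_query + n_patch, llm_dim), prepended to the text embeddings

prompts = VisualPromptAssembler()(torch.randn(1, 32, 768), torch.randn(1, 257, 1024))
print(prompts.shape)  # torch.Size([1, 289, 4096])
```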

TDG: Text-guided Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.09931
  • repo_url: None
  • paper_authors: Geng Liu, Yuxi Wang
  • for: 本文旨在推广基于单个或多个源领域的模型到未见的目标领域。
  • methods: 我们提出了一种新的文本引导领域泛化(TDG)方法,包括三个方面:首先,我们开发了一种自动生成域特有词汇的方法,以扩展当前领域的描述。然后,我们使用提案学习基于文本特征生成方法,将生成的域信息与图像特征共同卷积在同一个表示空间中。最后,我们使用输入图像特征和生成的文本特征来训练一个特殊的分类器,以便在未见目标领域中进行泛化。
  • results: 我们的实验结果表明,通过在TDG中引入生成的文本信息,可以提高领域泛化的性能,而且这种方法的实现非常容易。我们在多个领域泛化 benchmark 上进行了实验,并证明了我们的提出的框架可以在不同的领域中达到更高的性能。
    Abstract Domain generalization (DG) attempts to generalize a model trained on single or multiple source domains to the unseen target domain. Benefiting from the success of Visual-and-Language Pre-trained models in recent years, we argue that it is crucial for domain generalization by introducing extra text information. In this paper, we develop a novel Text-guided Domain Generalization (TDG) paradigm for domain generalization, which includes three following aspects. Specifically, we first devise an automatic words generation method to extend the description of current domains with novel domain-relevant words. Then, we embed the generated domain information into the text feature space, by the proposed prompt learning-based text feature generation method, which shares a common representation space with the image feature. Finally, we utilize both input image features and generated text features to train a specially designed classifier that generalizes well on unseen target domains, while the image encoder is also updated under the supervision of gradients back propagated from the classifier. Our experimental results show that the techniques incorporated by TDG contribute to the performance in an easy implementation manner. Experimental results on several domain generalization benchmarks show that our proposed framework achieves superior performance by effectively leveraging generated text information in domain generalization.
    摘要 域间泛化(DG)目标是将单个或多个源域模型泛化到未看过的目标域。受最近几年视觉语言预训模型的成功影响,我们认为在域间泛化中具有重要作用的是引入文本信息。在这篇论文中,我们提出了一种新的文本准导域泛化(TDG)方法,它包括以下三个方面:首先,我们开发了一种自动生成域关键词方法,以扩展当前域的描述,并添加新域相关的词语。然后,我们将生成的域信息embedded到文本特征空间中,通过我们提出的提示学习基于文本特征生成方法,这个方法与图像特征空间共享表示。最后,我们利用输入图像特征和生成的文本特征来训练一个专门设计的分类器,这个分类器可以在未看过的目标域上进行广泛的泛化,而图像编码器也在分类器的监督下更新。我们的实验结果表明,TDG方法可以在易于实现的情况下提高域泛化的性能。我们在多个域泛化 benchmark 上进行了实验,并证明了我们提出的方法可以有效地利用生成的文本信息,以提高域泛化的性能。

MDCS: More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition

  • paper_url: http://arxiv.org/abs/2308.09922
  • repo_url: https://github.com/fistyee/mdcs
  • paper_authors: Qihao Zhao, Chen Jiang, Wei Hu, Fan Zhang, Jun Liu
  • for: 提高长尾识别(LTR)精度
  • methods: 使用更多样化的专家和一致性自蒸馏(CS)来提高模型的多样性并降低方差
  • results: 与先前方法比较,MDCS方法可以提高识别精度,降低模型方差,并提高专家的多样性。在五个流行的长尾识别benchmark上,MDCS方法比先前最佳方法提高精度1% 至 2%。
    Abstract Recently, multi-expert methods have led to significant improvements in long-tail recognition (LTR). We summarize two aspects that need further enhancement to contribute to LTR boosting: (1) More diverse experts; (2) Lower model variance. However, the previous methods didn't handle them well. To this end, we propose More Diverse experts with Consistency Self-distillation (MDCS) to bridge the gap left by earlier methods. Our MDCS approach consists of two core components: Diversity Loss (DL) and Consistency Self-distillation (CS). In detail, DL promotes diversity among experts by controlling their focus on different categories. To reduce the model variance, we employ KL divergence to distill the richer knowledge of weakly augmented instances for the experts' self-distillation. In particular, we design Confident Instance Sampling (CIS) to select the correctly classified instances for CS to avoid biased/noisy knowledge. In the analysis and ablation study, we demonstrate that our method compared with previous work can effectively increase the diversity of experts, significantly reduce the variance of the model, and improve recognition accuracy. Moreover, the roles of our DL and CS are mutually reinforcing and coupled: the diversity of experts benefits from the CS, and the CS cannot achieve remarkable results without the DL. Experiments show our MDCS outperforms the state-of-the-art by 1% $\sim$ 2% on five popular long-tailed benchmarks, including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The code is available at https://github.com/fistyee/MDCS.
    摘要 近来,多专家方法在长尾识别(LTR)中带来了显著改进。我们总结了两个需要进一步改进以提升LTR的方面:(1)更多样化的专家;(2)更低的模型方差。然而,之前的方法没有很好地处理这两个方面。为此,我们提出了更多样化专家与一致性自蒸馏(MDCS),以填补之前方法留下的差距。我们的MDCS方法包括两个核心组成部分:多样性损失(DL)和一致性自蒸馏(CS)。具体而言,DL通过控制专家对不同类别的关注来提高专家间的多样性。为了降低模型方差,我们使用KL散度,将弱增强实例中更丰富的知识蒸馏给各个专家进行自蒸馏。特别是,我们设计了信心实例选择(CIS),只选择被正确分类的实例用于CS,以避免偏倚/噪音知识。我们的分析和消融研究表明,与之前的工作相比,我们的方法可以有效增加专家的多样性,显著降低模型方差,并提高识别精度。此外,我们的DL和CS之间互相强化、相互耦合:专家的多样性受益于CS,而没有DL,CS也无法取得显著成果。实验表明,我们的MDCS在五个流行的长尾benchmark(包括CIFAR10-LT、CIFAR100-LT、ImageNet-LT、Places-LT和iNaturalist 2018)上比此前最佳方法提高1% $\sim$ 2%。代码可以在https://github.com/fistyee/MDCS上获取。
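
A rough sketch of the two ingredients described above. The consistency self-distillation with Confident Instance Sampling follows the abstract fairly directly; the diversity term below is only a crude stand-in (per-expert logit scaling), since the paper's exact Diversity Loss is not given here.

```python
# Hypothetical sketch of MDCS-style losses; temperatures, scales and the exact
# diversity formulation are assumptions.
import torch
import torch.nn.functional as F

def consistency_self_distillation(logits_weak, logits_strong, labels, T=2.0):
    """KL-distill weak-view predictions into strong-view predictions, restricted to
    instances the weak view classifies correctly (Confident Instance Sampling)."""
    confident = logits_weak.argmax(dim=1).eq(labels)            # CIS mask
    if confident.sum() == 0:
        return logits_weak.new_zeros(())
    p_teacher = F.softmax(logits_weak[confident] / T, dim=1).detach()
    log_p_student = F.log_softmax(logits_strong[confident] / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def diversity_loss(expert_logits, labels, scales=(1.0, 0.5, 2.0)):
    """Crude stand-in: train each expert with differently scaled logits so that
    experts focus on different parts of the class distribution."""
    return sum(F.cross_entropy(l * s, labels)
               for l, s in zip(expert_logits, scales)) / len(expert_logits)

B, C = 16, 10
labels = torch.randint(0, C, (B,))
experts = [torch.randn(B, C, requires_grad=True) for _ in range(3)]
weak = torch.randn(B, C, requires_grad=True)
strong = torch.randn(B, C, requires_grad=True)
(diversity_loss(experts, labels)
 + consistency_self_distillation(weak, strong, labels)).backward()
```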

VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations

  • paper_url: http://arxiv.org/abs/2308.09916
  • repo_url: https://github.com/jiehonglin/vi-net
  • paper_authors: Jiehong Lin, Zewei Wei, Yabin Zhang, Kui Jia
  • for: 类别级6D对象位姿(pose)的高精度估计,即使没有可用的CAD模型。
  • methods: 提议一种新的旋转估计网络,名为VI-Net,它通过分解旋转为视点旋转和平面旋转的组合来简化非线性空间SO(3)中的学习。
  • results: 实验表明,提议的VI-Net方法在高精度 regime下可以大幅超过现有方法。
    Abstract Rotation estimation of high precision from an RGB-D object observation is a huge challenge in 6D object pose estimation, due to the difficulty of learning in the non-linear space of SO(3). In this paper, we propose a novel rotation estimation network, termed as VI-Net, to make the task easier by decoupling the rotation as the combination of a viewpoint rotation and an in-plane rotation. More specifically, VI-Net bases the feature learning on the sphere with two individual branches for the estimates of two factorized rotations, where a V-Branch is employed to learn the viewpoint rotation via binary classification on the spherical signals, while another I-Branch is used to estimate the in-plane rotation by transforming the signals to view from the zenith direction. To process the spherical signals, a Spherical Feature Pyramid Network is constructed based on a novel design of SPAtial Spherical Convolution (SPA-SConv), which settles the boundary problem of spherical signals via feature padding and realizesviewpoint-equivariant feature extraction by symmetric convolutional operations. We apply the proposed VI-Net to the challenging task of category-level 6D object pose estimation for predicting the poses of unknown objects without available CAD models; experiments on the benchmarking datasets confirm the efficacy of our method, which outperforms the existing ones with a large margin in the regime of high precision.
    摘要 由于在非线性空间SO(3)中学习的困难,从RGB-D对象观测中进行高精度的旋转估计是6D对象位姿估计中的巨大挑战。在本文中,我们提出了一种新的旋转估计网络,称为VI-Net,通过将旋转分解为视点旋转与平面内旋转的组合来简化任务。更具体地说,VI-Net在球面上进行特征学习,并设置两个独立分支分别估计两个分解后的旋转:其中V-Branch通过对球面信号进行二分类来学习视点旋转,而I-Branch则通过将信号变换到从天顶方向观察来估计平面内旋转。为了处理球面信号,我们基于新颖的SPAtial Spherical Convolution(SPA-SConv)设计构建了球面特征金字塔网络(Spherical Feature Pyramid Network),该设计通过特征填充解决了球面信号的边界问题,并借助对称卷积操作实现了视点等变的特征提取。我们将提出的VI-Net应用于类别级6D对象位姿估计这一具有挑战性的任务,用于在没有CAD模型的情况下预测未知对象的位姿;在标准测试集上的实验证明了我们的方法在高精度要求下明显优于现有方法。
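
The decoupling described above factors a rotation into a viewpoint rotation (aligning the zenith axis with a predicted viewing direction) composed with an in-plane rotation about that axis. The tiny sketch below illustrates the composition; the azimuth/elevation parameterization is an assumption, not the network's actual output head.

```python
# Hypothetical sketch of "viewpoint rotation x in-plane rotation" composition.
import numpy as np

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def viewpoint_rotation(azimuth, elevation):
    """Rotation taking the zenith (+z) axis to the predicted viewing direction."""
    c, s = np.cos(elevation), np.sin(elevation)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return rot_z(azimuth) @ rot_y

def compose(azimuth, elevation, inplane):
    # Full rotation = viewpoint rotation composed with a rotation about the zenith axis.
    return viewpoint_rotation(azimuth, elevation) @ rot_z(inplane)

R = compose(0.3, 0.8, 1.2)
print(np.allclose(R @ R.T, np.eye(3)))  # True: the composition is a valid rotation
```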

EGANS: Evolutionary Generative Adversarial Network Search for Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2308.09915
  • repo_url: None
  • paper_authors: Shiming Chen, Shihuang Chen, Wenjin Hou, Weiping Ding, Xinge You
  • for: 这篇论文的目的是提出一种基于进化的生成对抗网络搜索方法(EGANS),以自动设计具有良好适应性和稳定性的生成网络,在不同数据集/场景下进行可靠的视觉特征样本生成,从而提高零样本学习(ZSL)的性能。
  • methods: 这篇论文使用了协同双重进化(cooperative dual evolution)来进行神经网络架构搜索,包括生成器和判别器两个部分。在进化生成器架构搜索阶段,采用了多对一对抗训练策略来进化生成器。然后,使用类似的进化搜索算法来进行判别器架构搜索。
  • results: 实验结果显示,EGANS可以稳定地提高现有的生成ZSL方法的性能,在标准的CUB、SUN、AWA2和FLO数据集上均有显著的表现提升。这些表现提升显示了进化性的神经架构搜索在ZSL领域中的可能性。
    Abstract Zero-shot learning (ZSL) aims to recognize the novel classes which cannot be collected for training a prediction model. Accordingly, generative models (e.g., generative adversarial network (GAN)) are typically used to synthesize the visual samples conditioned by the class semantic vectors and achieve remarkable progress for ZSL. However, existing GAN-based generative ZSL methods are based on hand-crafted models, which cannot adapt to various datasets/scenarios and fails to model instability. To alleviate these challenges, we propose evolutionary generative adversarial network search (termed EGANS) to automatically design the generative network with good adaptation and stability, enabling reliable visual feature sample synthesis for advancing ZSL. Specifically, we adopt cooperative dual evolution to conduct a neural architecture search for both generator and discriminator under a unified evolutionary adversarial framework. EGANS is learned by two stages: evolution generator architecture search and evolution discriminator architecture search. During the evolution generator architecture search, we adopt a many-to-one adversarial training strategy to evolutionarily search for the optimal generator. Then the optimal generator is further applied to search for the optimal discriminator in the evolution discriminator architecture search with a similar evolution search algorithm. Once the optimal generator and discriminator are searched, we entail them into various generative ZSL baselines for ZSL classification. Extensive experiments show that EGANS consistently improve existing generative ZSL methods on the standard CUB, SUN, AWA2 and FLO datasets. The significant performance gains indicate that the evolutionary neural architecture search explores a virgin field in ZSL.
    摘要 零样本学习(zero-shot learning,ZSL)的目标是识别那些无法收集用于训练预测模型的新类。因此,通常使用生成模型(如生成对抗网络(GAN))来合成以类语义向量为条件的视觉样本,并已取得显著进展。然而,现有的基于GAN的生成式ZSL方法都是基于手工设计的模型,无法适应不同的数据集/场景,并且容易出现模型不稳定的问题。为了解决这些挑战,我们提出了进化生成对抗网络搜索(EGANS),用于自动设计具有良好适应性和稳定性的生成网络,从而实现可靠的视觉特征样本生成,以推动ZSL的发展。EGANS采用协同双重进化,在统一的进化对抗框架下对生成器和判别器进行神经架构搜索。EGANS的学习分为两个阶段:进化生成器架构搜索和进化判别器架构搜索。在进化生成器架构搜索阶段,我们采用多对一对抗训练策略进行进化搜索,以找到最优的生成器。随后,将最优的生成器用于进化判别器架构搜索阶段,通过类似的进化搜索算法找到最优的判别器。一旦找到了最优的生成器和判别器,我们便将它们嵌入到多种生成式ZSL基线方法中用于ZSL分类。实验结果表明,EGANS在标准的CUB、SUN、AWA2和FLO数据集上持续提升了现有生成式ZSL方法的性能,显著的性能提升表明进化神经架构搜索在ZSL中探索了一个全新的领域。

Noisy-Correspondence Learning for Text-to-Image Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.09911
  • repo_url: https://github.com/tencentyouturesearch/personretrieval-ivt
  • paper_authors: Yang Qin, Yingke Chen, Dezhong Peng, Xi Peng, Joey Tianyi Zhou, Peng Hu
  • for: 提高 Text-to-image person re-identification(TIReID)方法的Robustness,以便在实际场景中处理受到干扰的图像和文本对应关系。
  • methods: 提出了一种名为 Robust Dual Embedding(RDE)的新方法,包括两个主要组成部分:1)一个Confident Consensus Division(CCD)模块,利用双嵌入模块的双粒度决策来获取一个干净的训练数据集,以便学习正确且可靠的视觉-语义关联;2)一个Triplet-Alignment Loss(TAL),将传统的triplet排序损失放松为对所有负样本的log-exponential上界,以避免模型过度拟合噪声的图像-文本对应关系。
  • results: 通过在三个公共评测 dataset(CUHK-PEDES、ICFG-PEDES和RSTPReID)进行广泛的实验,证明了我们的RDE方法在不受NC干扰的情况下和受NC干扰的情况下均 achieve 状态的最佳Result。
    Abstract Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs inevitably exist under-correlated or even false-correlated, a.k.a noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet-Alignment Loss (TAL) relaxes the conventional triplet-ranking loss with hardest negatives, which tends to rapidly overfit NC, to a log-exponential upper bound over all negatives, thus preventing the model from overemphasizing false image-text pairs. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets.
    摘要 文本到图像行人重识别(TIReID)是跨模态社区中引人关注的课题,其目的是基于文本查询来检索目标行人。虽然已有多种TIReID方法被提出并取得了可观的表现,但它们都隐含地假设训练图像-文本对是正确对齐的,而在实际场景中,由于低质量图像和注释错误,图像-文本对不可避免地存在欠相关甚至错误相关的情况,即噪声对应(NC)。为解决这一问题,我们提出了一种鲁棒双嵌入方法(RDE),即使存在NC也能学习稳健的视觉-语义关联。RDE包括两个主要组件: 1. 置信共识划分(CCD)模块,利用双嵌入模块的双粒度决策来获得一个干净的训练数据集,使模型可以学习正确且可靠的视觉-语义关联。 2. 三元组对齐损失(TAL),将基于最难负样本的传统triplet排序损失(其容易快速过拟合NC)放松为对所有负样本的log-exponential上界,以防止模型过度强调错误的图像-文本对。我们在三个公共benchmark(CUHK-PEDES、ICFG-PEDES和RSTPReID)上进行了广泛的实验,以评估RDE的性能和鲁棒性。我们的方法在有无合成噪声对应的情况下,均在所有三个数据集上取得了最先进的表现。
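
A sketch of the Triplet-Alignment Loss idea from the abstract: the hardest-negative hinge is relaxed to a log-exponential upper bound over all negatives, computed in both image-to-text and text-to-image directions. The margin and temperature values are assumptions.

```python
# Hypothetical sketch of a log-exponential relaxation of the triplet-ranking loss;
# the paper's exact formulation may differ.
import torch

def _one_direction(sim, margin, tau):
    pos = sim.diag().unsqueeze(1)                       # (B, 1) positive similarities
    logits = (margin - pos + sim) / tau                 # (B, B) hinge arguments
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    logits = logits.masked_fill(eye, float("-inf"))     # exclude the positive pair
    zeros = torch.zeros_like(pos)
    # tau * log(1 + sum_neg exp(./tau)) upper-bounds max(0, hardest-negative hinge)
    return tau * torch.logsumexp(torch.cat([zeros, logits], dim=1), dim=1)

def triplet_alignment_loss(sim, margin=0.2, tau=0.05):
    """sim: (B, B) image-text similarity matrix; sim[i, i] is the matched pair."""
    return (_one_direction(sim, margin, tau) + _one_direction(sim.t(), margin, tau)).mean()

sim = torch.randn(4, 4, requires_grad=True)
triplet_alignment_loss(sim).backward()
```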

Physics-Guided Human Motion Capture with Pose Probability Modeling

  • paper_url: http://arxiv.org/abs/2308.09910
  • repo_url: https://github.com/me-ditto/physics-guided-mocap
  • paper_authors: Jingyi Ju, Buzhen Huang, Chen Zhu, Zhihao Li, Yangang Wang
  • for: 提高人体动作捕捉的精度和成功率,避免漂浮、脚部滑动和地面凿入等误差。
  • methods: 以物理为引导,在反向扩散过程中利用物理信息进行去噪,从建模的姿态概率分布中重建物理上最合理的人体动作。
  • results: 与之前基于物理的方法相比,本方法在关节精度和成功率方面均有显著提高,可以更好地捕捉人体动作。更多信息可以查看 \url{https://github.com/Me-Ditto/Physics-Guided-Mocap}。
    Abstract Incorporating physics in human motion capture to avoid artifacts like floating, foot sliding, and ground penetration is a promising direction. Existing solutions always adopt kinematic results as reference motions, and the physics is treated as a post-processing module. However, due to the depth ambiguity, monocular motion capture inevitably suffers from noises, and the noisy reference often leads to failure for physics-based tracking. To address the obstacles, our key-idea is to employ physics as denoising guidance in the reverse diffusion process to reconstruct physically plausible human motion from a modeled pose probability distribution. Specifically, we first train a latent gaussian model that encodes the uncertainty of 2D-to-3D lifting to facilitate reverse diffusion. Then, a physics module is constructed to track the motion sampled from the distribution. The discrepancies between the tracked motion and image observation are used to provide explicit guidance for the reverse diffusion model to refine the motion. With several iterations, the physics-based tracking and kinematic denoising promote each other to generate a physically plausible human motion. Experimental results show that our method outperforms previous physics-based methods in both joint accuracy and success rate. More information can be found at \url{https://github.com/Me-Ditto/Physics-Guided-Mocap}.
    摘要 将物理融入人体动作捕捉,以避免漂浮、脚部滑动和穿透地面等瑕疵,是一个很有前景的方向。现有的解决方案通常采用运动学结果作为参考动作,并将物理当作后处理模块。然而,由于深度的歧义,单目动作捕捉不可避免地带有噪声,而带噪的参考动作经常导致基于物理的跟踪失败。为了解决这些障碍,我们的关键思想是将物理作为反向扩散过程中的去噪指导,从建模的姿态概率分布中重建物理上合理的人体动作。具体来说,我们首先训练了一个潜在高斯模型,用于编码2D到3D提升的不确定性,以便进行反向扩散。然后,我们构建了一个物理模块,以跟踪从该分布中采样的动作。跟踪得到的动作与图像观测之间的差异被用于为反向扩散模型提供显式的去噪指导,从而改进动作。经过若干次迭代,基于物理的跟踪与运动学去噪相互促进,生成物理上合理的人体动作。实验结果表明,我们的方法在关节精度和成功率方面都高于以往基于物理的方法。更多信息可以参考 \url{https://github.com/Me-Ditto/Physics-Guided-Mocap}。

DiffusionTrack: Diffusion Model For Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.09905
  • repo_url: https://github.com/rainbowluocs/diffusiontrack
  • paper_authors: Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang
  • for: 这 paper 的目的是提出一种简单 yet robust 的多目标跟踪 (MOT) 方法,以解决现有 MOT 方法 的一些问题,如 global or local inconsistency, 模型复杂性和灵活性不足。
  • methods: 这 paper 使用的方法是通过 Paired Noise Boxes 到 Paired Ground-Truth Boxes 的一种逐步减噪演化策略,来实现对象检测和跟踪的一体化。在训练阶段,模型通过减噪演化过程来学习检测和跟踪,而在测试阶段,模型通过一种灵活的一步或多步减噪演化来更新检测和跟踪结果。
  • results: 这 paper 的实验结果表明,使用这种方法可以在三个常用的 MOT benchmark 上达到与当前状态的识别方法相当的性能,包括 MOT17, MOT20 和 Dancetrack。
    Abstract Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and Dancetrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods.
    摘要 多目标跟踪(MOT)是一项具有挑战性的视觉任务,旨在在单帧内检测个体对象并在多帧之间进行关联。现有MOT方法可以分为两阶段的检测后跟踪(TBD)方法和一体化的联合检测与跟踪(JDT)方法。尽管这些方法取得了成功,但它们也存在一些共同的问题,如有害的全局或局部不一致、鲁棒性与模型复杂度之间难以权衡,以及在同一视频的不同场景中缺乏灵活性。在这篇论文中,我们提出了一个简单而鲁棒的框架,将对象检测和关联共同表述为一个一致的去噪扩散过程,从成对的噪声框逐步去噪到成对的真实框。这种新颖的渐进式去噪扩散策略显著增强了跟踪器的有效性,使其能够区分不同的对象。在训练阶段,成对的对象框从成对的真实框逐渐扩散至随机分布,模型通过逆转这一加噪过程来同时学习检测和跟踪。在推断阶段,模型以灵活的一步或多步去噪扩散过程,将一组随机生成的成对对象框细化为检测和跟踪结果。我们在MOT17、MOT20和Dancetrack这三个常用的MOT标准benchmark上进行了广泛的实验,结果显示我们的方法与当前最先进的方法相比具有竞争力的性能。
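
A heavily simplified sketch of one training step of the paired-box denoising idea: ground-truth boxes for the same object in two adjacent frames are diffused toward noise, and the model learns to recover them. The box encoding, noise schedule, and the toy denoiser below are placeholders, not the paper's actual detector.

```python
# Hypothetical sketch of a DDPM-style training step on paired boxes.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

model = nn.Sequential(nn.Linear(8 + 1, 128), nn.ReLU(), nn.Linear(128, 8))  # toy denoiser

def train_step(paired_gt_boxes):
    """paired_gt_boxes: (N, 8) = (x, y, w, h) of the same object in two frames."""
    t = torch.randint(0, T, (paired_gt_boxes.size(0),))
    noise = torch.randn_like(paired_gt_boxes)
    a = alphas_bar[t].unsqueeze(1)
    noisy = a.sqrt() * paired_gt_boxes + (1 - a).sqrt() * noise     # forward diffusion
    inp = torch.cat([noisy, t.float().unsqueeze(1) / T], dim=1)     # real model also sees image features
    pred = model(inp)                                               # predict clean paired boxes
    return nn.functional.l1_loss(pred, paired_gt_boxes)

train_step(torch.rand(16, 8)).backward()
```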

Scalable Video Object Segmentation with Simplified Framework

  • paper_url: http://arxiv.org/abs/2308.09903
  • repo_url: None
  • paper_authors: Qiangqiang Wu, Tianyu Yang, Wei WU, Antoni Chan
  • for: 这个论文主要针对视频对象分割(VOS)领域的问题,即如何使用简单的模型来实现高效的目标检测和分割。
  • methods: 这篇论文提出了一种可扩展的Simplified VOS(SimVOS)框架,利用单一的变换器底层来同时进行特征提取和匹配。此外,论文还提出了一种在框架中使用的快速注意力机制和新的токен精细化模块,以提高运行速度和避免计算成本增加。
  • results: 实验表明,我们的SimVOS可以在流行的视频对象分割数据集上达到最佳效果,包括DAVIS-2017(88.0% J&F)、DAVIS-2016(92.9% J&F)和YouTube-VOS 2019(84.2% J&F)等数据集,而无需应用任何 sintética video 或 BL30K 预训练。
    Abstract The current popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that separately perform feature extraction and matching. However, the above hand-crafted designs empirically cause insufficient target interaction, thus limiting the dynamic target-aware feature learning in VOS. To tackle these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework to perform joint feature extraction and matching by leveraging a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-ware features for accurate mask prediction. More importantly, SimVOS could directly apply well-pretrained ViT backbones (e.g., MAE) for VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve the running speed and save computational cost. Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or BL30K pre-training used in previous VOS approaches.
    摘要 当前流行的视频对象分割(VOS)方法通常通过多个手工设计的模块分别实现特征提取和特征匹配。然而,这些手工设计在实践中会导致目标交互不足,从而限制VOS中动态的目标感知特征学习。为了解决这些限制,本文提出了一个可扩展的简化VOS(SimVOS)框架,利用单一的Transformer主干同时执行特征提取和匹配。具体来说,SimVOS使用可扩展的ViT主干来同时提取并匹配查询特征与参考特征。这种设计使得SimVOS能够学习更好的目标感知特征,以提高掩码预测的准确性。此外,SimVOS可以直接应用预训练好的ViT主干(例如MAE)来进行VOS,从而在VOS与大规模自监督预训练之间架起了桥梁。为了实现更好的性能-速度权衡,我们进一步探索了帧内注意力,并提出了一种新的token细化模块,以提高运行速度并降低计算成本。实验结果表明,我们的SimVOS在流行的视频对象分割benchmark(DAVIS-2017、DAVIS-2016和YouTube-VOS 2019)上取得了最先进的结果,且无需使用此前VOS方法中常用的合成视频或BL30K预训练。

Towards a High-Performance Object Detector: Insights from Drone Detection Using ViT and CNN-based Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.09899
  • repo_url: None
  • paper_authors: Junyang Zhang
  • for: 避免无人机和自驾车与无人机相撞、防御无人机入侵和自驾车自动降落。
  • methods: 使用 CNN 和 ViT 模型,实现单一无人机检测和多无人机检测。
  • results: 比较 CNN 和 ViT 模型的性能,发现 ViT 在单一无人机检测中表现4.6倍更好,但需要更多的训练数据、computational power 和高级设计来完全超越 CNN 检测器。
    Abstract Accurate drone detection is strongly desired in drone collision avoidance, drone defense and autonomous Unmanned Aerial Vehicle (UAV) self-landing. With the recent emergence of the Vision Transformer (ViT), this critical task is reassessed in this paper using a UAV dataset composed of 1359 drone photos. We construct various CNN and ViT-based models, demonstrating that for single-drone detection, a basic ViT can achieve performance 4.6 times more robust than our best CNN-based transfer learning models. By implementing the state-of-the-art You Only Look Once (YOLO v7, 200 epochs) and the experimental ViT-based You Only Look At One Sequence (YOLOS, 20 epochs) in multi-drone detection, we attain impressive 98% and 96% mAP values, respectively. We find that ViT outperforms CNN at the same epoch, but also requires more training data, computational power, and sophisticated, performance-oriented designs to fully surpass the capabilities of cutting-edge CNN detectors. We summarize the distinct characteristics of ViT and CNN models to aid future researchers in developing more efficient deep learning models.
    摘要 准确的无人机探测在无人机碰撞避免、无人机防御和无人机(UAV)自主降落中都非常重要。随着视觉变换器(ViT)的出现,本文使用一个包含1359张无人机照片的数据集重新评估了这一关键任务。我们构建了多种基于CNN和ViT的模型,发现在单架无人机探测任务中,一个基础ViT可以达到比我们最佳的基于CNN的迁移学习模型鲁棒4.6倍的性能。通过在多架无人机探测任务中采用最先进的You Only Look Once(YOLO v7,200个epoch)和实验性的基于ViT的You Only Look At One Sequence(YOLOS,20个epoch),我们分别获得了令人印象深刻的98%和96%的mAP值。我们发现在相同的训练轮数(epoch)下,ViT的表现优于CNN,但它也需要更多的训练数据、计算资源以及更复杂的、面向性能的设计,才能完全超越当前顶尖的CNN探测器。我们总结了ViT和CNN模型各自的特点,以帮助未来的研究人员开发更高效的深度学习模型。

Spatial-Temporal Alignment Network for Action Recognition

  • paper_url: http://arxiv.org/abs/2308.09897
  • repo_url: None
  • paper_authors: Jinhui Ye, Junwei Liang
  • for: 本文旨在提出一种视角不变特征表示方法,用于改进现有动作识别架构。
  • methods: 该方法基于一种名为空间-时间对应网络(STAN),该网络可以学习geometry invariant的表示。
  • results: 实验结果表明,在训练从scratch的情况下,STAN模型可以在UCf101和HMDB51等广泛使用的数据集上提高动作识别任务的性能。
    Abstract This paper studies introducing viewpoint invariant feature representations in existing action recognition architecture. Despite significant progress in action recognition, efficiently handling geometric variations in large-scale datasets remains challenging. To tackle this problem, we propose a novel Spatial-Temporal Alignment Network (STAN), which explicitly learns geometric invariant representations for action recognition. Notably, the STAN model is light-weighted and generic, which could be plugged into existing action recognition models (e.g., MViTv2) with a low extra computational cost. We test our STAN model on widely-used datasets like UCF101 and HMDB51. The experimental results show that the STAN model can consistently improve the state-of-the-art models in action recognition tasks in trained-from-scratch settings.
    摘要 本文研究在现有动作识别架构中引入视角不变的特征表示。尽管动作识别已取得显著进展,但在大规模数据集中高效处理几何变化仍然具有挑战性。为了解决这一问题,我们提出了一种新颖的空间-时间对齐网络(STAN),它显式地学习几何不变的表示用于动作识别。值得注意的是,STAN模型轻量且通用,可以以较低的额外计算成本插入现有的动作识别模型(如MViTv2)中。我们在广泛使用的UCF101和HMDB51数据集上测试了STAN模型。实验结果表明,在从头训练的设置下,STAN模型能够持续提升最先进模型在动作识别任务上的性能。

Semantic-Human: Neural Rendering of Humans from Monocular Video with Human Parsing

  • paper_url: http://arxiv.org/abs/2308.09894
  • repo_url: None
  • paper_authors: Jie Zhang, Pengcheng Shi, Zaiwang Gu, Yiyang Zhou, Zhi Wang
  • for: 本研究旨在提高人体 нейрон渲染的质量,同时实现人体解析。
  • methods: 本文提出了一种名为Semantic-Human的新方法,它可以同时实现高品质的渲染和视角一致的人体解析。特别是,该方法在NeRF基础上联合编码semantics、appearance和geometry,以在噪声伪标签监督下获得高精度的2D semantic labels。此外,该方法还为运动场引入了基于SMPL表面的约束,并对恢复的三维几何进行regularization。
  • results: 在使用ZJU-MoCap数据集进行评估时,Semantic-Human方法得到了非常竞争力的结果,证明了该方法的有效性。此外,该方法还可以实现多种有趣的应用,如标签噪声除除、标签生成和图像修改等,并且经验 Validate了其优势性。
    Abstract The neural rendering of humans is a topic of great research significance. However, previous works mostly focus on achieving photorealistic details, neglecting the exploration of human parsing. Additionally, classical semantic work are all limited in their ability to efficiently represent fine results in complex motions. Human parsing is inherently related to radiance reconstruction, as similar appearance and geometry often correspond to similar semantic part. Furthermore, previous works often design a motion field that maps from the observation space to the canonical space, while it tends to exhibit either underfitting or overfitting, resulting in limited generalization. In this paper, we present Semantic-Human, a novel method that achieves both photorealistic details and viewpoint-consistent human parsing for the neural rendering of humans. Specifically, we extend neural radiance fields (NeRF) to jointly encode semantics, appearance and geometry to achieve accurate 2D semantic labels using noisy pseudo-label supervision. Leveraging the inherent consistency and smoothness properties of NeRF, Semantic-Human achieves consistent human parsing in both continuous and novel views. We also introduce constraints derived from the SMPL surface for the motion field and regularization for the recovered volumetric geometry. We have evaluated the model using the ZJU-MoCap dataset, and the obtained highly competitive results demonstrate the effectiveness of our proposed Semantic-Human. We also showcase various compelling applications, including label denoising, label synthesis and image editing, and empirically validate its advantageous properties.
    摘要 人体神经渲染是具有重要研究意义的话题。然而,以往的工作大多专注于实现照片级真实的细节,忽略了对人体解析的探索。此外,经典的语义分割工作在复杂动作下高效表示精细结果的能力都很有限。人体解析与辐射重建本质上密切相关,因为相似的外观和几何通常对应于相似的语义部位。此外,以往的工作通常设计一个从观测空间映射到规范(canonical)空间的运动场,而它往往表现出欠拟合或过拟合,导致泛化能力受限。在本文中,我们提出了 Semantic-Human,一种能够同时实现照片级真实细节和视点一致人体解析的新方法。具体来说,我们扩展了神经辐射场(NeRF),使其联合编码语义、外观和几何,从而在带噪伪标签监督下得到准确的2D语义标签。利用NeRF固有的一致性和平滑性,Semantic-Human在连续视图和新视图下都实现了一致的人体解析。我们还为运动场引入了源自SMPL表面的约束,并对恢复出的体积几何进行正则化。我们在 ZJU-MoCap 数据集上评估了模型,获得了极具竞争力的结果,证明了所提出的 Semantic-Human 的有效性。我们还展示了多种吸引人的应用,包括标签去噪、标签合成和图像编辑,并通过实验验证了其优越的性质。

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization

  • paper_url: http://arxiv.org/abs/2308.09889
  • repo_url: None
  • paper_authors: Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
  • for: This paper aims to address the issue of copyright infringement in Stable Diffusion (SD) customization approaches by proposing an invisible data-free universal adversarial watermark (DUAW) to protect a myriad of copyrighted images.
  • methods: The proposed DUAW is designed to disrupt the variational autoencoder during SD customization, and it operates in a data-free context using synthetic images produced by a Large Language Model (LLM) and a pretrained SD model.
  • results: Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier, thereby protecting copyrighted images from plagiarism.
    Abstract Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context, where it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the necessity of directly handling copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into massive copyrighted images, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier.
    摘要 Stable Diffusion(SD)定制方法使用户能够个性化SD模型的输出,极大地提升了AI艺术的灵活性和多样性。然而,这些方法也使得个人能够从受版权保护的图像中抄袭特定的风格或主体,引发了对潜在版权侵权的严重担忧。为了解决这一问题,我们提出了一种不可见的、无需数据的通用对抗水印(DUAW),旨在保护大量受版权保护的图像免受不同版本SD模型上各种定制方法的侵害。首先,DUAW被设计用于在SD定制过程中干扰变分自编码器。其次,DUAW在无数据的环境下运行:它在由大语言模型(LLM)和预训练SD模型生成的合成图像上进行训练。这种方式避免了直接处理受版权保护的图像,从而保护了其机密性。一旦制作完成,DUAW可以被不可察觉地嵌入到大量受版权保护的图像中,通过在定制SD模型生成的图像中引入显著失真来起到保护作用。实验结果表明,DUAW能够有效地扭曲微调后SD模型的输出,使其无论对人类观察者还是简单的分类器都清晰可辨,从而保护受版权保护的图像免遭抄袭。
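
A rough sketch of the kind of optimization the abstract describes: a single universal perturbation is tuned to maximize the reconstruction error of a (placeholder) autoencoder on a batch of synthetic images, while staying within an imperceptibility budget. The `vae` module, the budget, and the loss below are assumptions, not the SD VAE API or the paper's exact procedure.

```python
# Hypothetical sketch of crafting a universal adversarial watermark that disrupts
# an autoencoder; `vae` is a stand-in, not Stable Diffusion's actual VAE.
import torch
import torch.nn as nn

vae = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 3, 3, padding=1))       # placeholder encoder/decoder

def craft_duaw(images, epsilon=8 / 255, steps=100, lr=1e-2):
    delta = torch.zeros(1, 3, images.size(2), images.size(3), requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        recon = vae((images + delta).clamp(0, 1))
        loss = -nn.functional.mse_loss(recon, images)     # maximize reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)               # keep the watermark imperceptible
    return delta.detach()

watermark = craft_duaw(torch.rand(4, 3, 64, 64))
print(float(watermark.abs().max()) <= 8 / 255 + 1e-6)     # True
```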

Calibrating Uncertainty for Semi-Supervised Crowd Counting

  • paper_url: http://arxiv.org/abs/2308.09887
  • repo_url: None
  • paper_authors: Chen Li, Xiaoling Hu, Shahira Abousamra, Chao Chen
  • for: 这篇论文的目的是提出一种用于半指导人数推断的新方法,以提高这种任务的性能。
  • methods: 这篇论文使用了一种基于模型不确定性的方法,通过调教一个价值函数来训练模型。这个方法使用了一种匹配函数来更好地估计人数推断的不确定性。
  • results: 这篇论文的结果显示,使用这种方法可以生成可靠的伪标签,并且可以实现semi-supervised人数推断的state-of-the-art性能。
    Abstract Semi-supervised crowd counting is an important yet challenging task. A popular approach is to iteratively generate pseudo-labels for unlabeled data and add them to the training set. The key is to use uncertainty to select reliable pseudo-labels. In this paper, we propose a novel method to calibrate model uncertainty for crowd counting. Our method takes a supervised uncertainty estimation strategy to train the model through a surrogate function. This ensures the uncertainty is well controlled throughout the training. We propose a matching-based patch-wise surrogate function to better approximate uncertainty for crowd counting tasks. The proposed method pays a sufficient amount of attention to details, while maintaining a proper granularity. Altogether our method is able to generate reliable uncertainty estimation, high quality pseudolabels, and achieve state-of-the-art performance in semisupervised crowd counting.
    摘要 半监督人群计数是一项重要而具有挑战性的任务。一种流行的做法是为无标注数据迭代地生成伪标签并将其加入训练集,其关键在于利用不确定性来选择可靠的伪标签。在这篇论文中,我们提出了一种新的方法来校准人群计数模型的不确定性。我们的方法采用有监督的不确定性估计策略,通过一个替代(surrogate)函数来训练模型,从而保证不确定性在整个训练过程中得到良好控制。我们提出了一种基于匹配的逐块(patch-wise)替代函数,以更好地逼近人群计数任务中的不确定性。所提出的方法既对细节给予了足够的关注,又保持了适当的粒度。总之,我们的方法能够生成可靠的不确定性估计和高质量的伪标签,并在半监督人群计数中实现了最先进的性能。

Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders

  • paper_url: http://arxiv.org/abs/2308.09882
  • repo_url: https://github.com/jchengai/forecast-mae
  • paper_authors: Jie Cheng, Xiaodong Mei, Ming Liu
  • for: 这个研究探索了自监学习(SSL)在动态预测任务中的应用,这是计算机视觉和自然语言处理领域中广泛成功的 SSL 方法,却尚未得到广泛研究。
  • methods: 我们引入了 Forecast-MAE,一种基于面积自适应神经网络(Transformer)块的 SSL 框架,特意设计用于自监学习动态预测任务。我们的方法包括一种新的面 másking 策略,利用agent trajectory 和路网之间强联系,通过补做agent future trajectory 和历史 trajectory的 complementary másking,以及随机 másking 路网段。
  • results: 我们在 Argoverse 2 动态预测测试集上进行了实验,显示 Forecast-MAE 在与 supervised learning 和复杂设计的方法相比,在竞争性Task 中具有竞争性的性能。此外,它还超过了之前的自监学习方法,表明 Forecast-MAE 可以充分利用 SSL 来预测动态Scene。
    Abstract This study explores the application of self-supervised learning (SSL) to the task of motion forecasting, an area that has not yet been extensively investigated despite the widespread success of SSL in computer vision and natural language processing. To address this gap, we introduce Forecast-MAE, an extension of the mask autoencoders framework that is specifically designed for self-supervised learning of the motion forecasting task. Our approach includes a novel masking strategy that leverages the strong interconnections between agents' trajectories and road networks, involving complementary masking of agents' future or history trajectories and random masking of lane segments. Our experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which utilizes standard Transformer blocks with minimal inductive bias, achieves competitive performance compared to state-of-the-art methods that rely on supervised learning and sophisticated designs. Moreover, it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae.
    摘要 本研究探讨了将自监督学习(SSL)应用于运动预测任务。尽管SSL在计算机视觉和自然语言处理领域取得了广泛的成功,这一方向尚未得到充分研究。为了填补这一空白,我们介绍了 Forecast-MAE,一种基于掩码自编码器框架的扩展,专门为运动预测任务的自监督学习而设计。我们的方法包括一种新颖的掩码策略,利用agent轨迹与路网之间的强关联,对agent的未来或历史轨迹进行互补掩码,并随机掩码车道段。我们在具有挑战性的Argoverse 2运动预测benchmark上进行了实验,结果表明,仅使用标准Transformer块、归纳偏置极少的Forecast-MAE,与依赖监督学习和复杂设计的最先进方法相比具有相当的竞争力,并且大幅超过了之前的自监督学习方法。代码可以在https://github.com/jchengai/forecast-mae中找到。
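
A small sketch of the masking strategy described above: each agent hides either its history or its future (complementary masking), and a random subset of lane segments is masked as well. The split point and ratios are illustrative assumptions.

```python
# Hypothetical sketch of Forecast-MAE-style masking.
import torch

def make_masks(num_agents, hist_len, fut_len, num_lanes, lane_mask_ratio=0.5):
    # Complementary masking: each agent hides either its history or its future.
    hide_future = torch.rand(num_agents) < 0.5
    traj_mask = torch.zeros(num_agents, hist_len + fut_len, dtype=torch.bool)
    traj_mask[hide_future, hist_len:] = True         # future tokens masked
    traj_mask[~hide_future, :hist_len] = True        # history tokens masked
    # Random masking of lane segments.
    lane_mask = torch.rand(num_lanes) < lane_mask_ratio
    return traj_mask, lane_mask                      # True = masked (to be reconstructed)

traj_mask, lane_mask = make_masks(num_agents=6, hist_len=50, fut_len=60, num_lanes=80)
print(traj_mask.shape, lane_mask.float().mean().item())
```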

DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets

  • paper_url: http://arxiv.org/abs/2308.09878
  • repo_url: https://github.com/towardsautonomy/datasetequity
  • paper_authors: Shubham Shrivastava, Xianling Zhang, Sushruth Nagesh, Armin Parchami
  • for: 本研究旨在解决机器学习中的数据不均衡问题,具体来说是针对computer vision领域中的数据偏见问题。
  • methods: 本研究使用了深度感知嵌入和聚类算法来计算图像出现的可能性,然后使用这些可能性来减轻数据不均衡的影响。另外,提出了一种新的$\textbf{广义焦点损失(Generalized Focal Loss)}$函数来调整训练过程中的样本权重。
  • results: 实验表明,该方法可以提高3D物体检测方法的性能,特别是对于少见的类别(自行车手)在KITTI数据集上的AP效果提高了超过200%。这些结果表明该方法是通用的,可以补充现有的技术,并在小数据集和少见的类别上特别有效。
    Abstract Data imbalance is a well-known issue in the field of machine learning, attributable to the cost of data collection, the difficulty of labeling, and the geographical distribution of the data. In computer vision, bias in data distribution caused by image appearance remains highly unexplored. Compared to categorical distributions using class labels, image appearance reveals complex relationships between objects beyond what class labels provide. Clustering deep perceptual features extracted from raw pixels gives a richer representation of the data. This paper presents a novel method for addressing data imbalance in machine learning. The method computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering. It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function. This loss can be easily integrated with deep learning algorithms. Experiments validate the method's effectiveness across autonomous driving vision datasets including KITTI and nuScenes. The loss function improves state-of-the-art 3D object detection methods, achieving over $200\%$ AP gains on under-represented classes (Cyclist) in the KITTI dataset. The results demonstrate the method is generalizable, complements existing techniques, and is particularly beneficial for smaller datasets and rare classes. Code is available at: https://github.com/towardsautonomy/DatasetEquity
    摘要 数据不均衡是机器学习领域的一个常见问题,主要源于数据收集成本、标注难度以及数据的地域分布。在计算机视觉领域,由图像外观导致的数据分布偏差仍然很少被探索。相比于基于类别标签的类别分布,图像外观揭示了类别标签之外对象之间更复杂的关系。对从原始像素提取的深度感知特征进行聚类,可以提供更丰富的数据表示。本文提出了一种解决数据不均衡的新方法:该方法利用深度感知嵌入和聚类,基于图像外观计算样本概率,然后使用这些概率,配合我们提出的$\textbf{广义焦点损失(Generalized Focal Loss)}$函数,在训练过程中对样本赋予不同权重。这种损失函数可以轻松地与深度学习算法结合使用。实验验证了该方法在KITTI和nuScenes等自动驾驶视觉数据集上的有效性:该损失函数改进了最先进的3D目标检测方法,在KITTI数据集中代表性不足的类别(骑车人)上取得了超过200%的AP提升。结果表明该方法具有通用性,可与现有技术互补,并且对较小的数据集和罕见类别尤其有益。代码可以在 GitHub 上找到:https://github.com/towardsautonomy/DatasetEquity。
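
A sketch of the reweighting idea described above: per-sample appearance likelihoods are estimated by clustering deep embeddings, and rarer-looking samples receive larger weight in a focal-style loss. The exact form of the paper's Generalized Focal Loss is not given in the abstract, so the weighting below is an assumption.

```python
# Hypothetical sketch of a likelihood-weighted focal-style training loss.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def sample_likelihoods(embeddings, n_clusters=8):
    """Estimate how common each image's appearance is via cluster occupancy."""
    labels = torch.as_tensor(
        KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings), dtype=torch.long)
    freq = torch.bincount(labels, minlength=n_clusters).float()
    return (freq / freq.sum())[labels]                   # (N,) likelihood per sample

def weighted_focal_loss(logits, targets, likelihood, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)
    weight = (1.0 - likelihood) ** gamma                 # rarer-looking samples weigh more
    return (weight * (1 - pt) ** gamma * ce).mean()

emb = torch.randn(32, 16).numpy()
loss = weighted_focal_loss(torch.randn(32, 5, requires_grad=True),
                           torch.randint(0, 5, (32,)), sample_likelihoods(emb))
loss.backward()
```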

A Theory of Topological Derivatives for Inverse Rendering of Geometry

  • paper_url: http://arxiv.org/abs/2308.09865
  • repo_url: None
  • paper_authors: Ishit Mehta, Manmohan Chandraker, Ravi Ramamoorthi
  • for: 这篇论文旨在提出一种可微 differentiable 表面演化理论,以便通过变分函数优化图像函数。
  • methods: 该理论使用 topological derivatives 来实现不同的拓扑结构变化,而不是先前的 silhouette gradients。
  • results: 该理论可以实现可微的形态异常,包括孔子核生成和相位异常。这些结果可以用于改进图像向量化、vector-graphics生成、单图像重建形意agram和多视图3D重建等应用。
    Abstract We introduce a theoretical framework for differentiable surface evolution that allows discrete topology changes through the use of topological derivatives for variational optimization of image functionals. While prior methods for inverse rendering of geometry rely on silhouette gradients for topology changes, such signals are sparse. In contrast, our theory derives topological derivatives that relate the introduction of vanishing holes and phases to changes in image intensity. As a result, we enable differentiable shape perturbations in the form of hole or phase nucleation. We validate the proposed theory with optimization of closed curves in 2D and surfaces in 3D to lend insights into limitations of current methods and enable improved applications such as image vectorization, vector-graphics generation from text prompts, single-image reconstruction of shape ambigrams and multi-view 3D reconstruction.
    摘要 我们提出了一种可微分表面演化的理论框架,通过使用拓扑导数对图像泛函进行变分优化,从而允许离散的拓扑变化。以往的几何逆渲染方法依赖轮廓(silhouette)梯度来实现拓扑变化,但这类信号十分稀疏。相比之下,我们的理论推导出的拓扑导数将细小孔洞与相位的引入同图像强度的变化联系起来。因此,我们能够以孔洞或相位成核的形式实现可微的形状扰动。我们通过在2D中优化闭合曲线、在3D中优化曲面来验证所提出的理论,从而揭示现有方法的局限,并改进诸如图像矢量化、由文本提示生成矢量图形、由单张图像重建形状双关图(shape ambigrams)以及多视图3D重建等应用。
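
For reference, the classical topological derivative of a shape functional $J$ at an interior point $x$ of a domain $\Omega$ is commonly defined (up to the choice of normalizing term) as $D_T J(x) = \lim_{r \to 0^+} \big( J(\Omega \setminus \overline{B_r(x)}) - J(\Omega) \big) / |B_r(x)|$, i.e., the first-order sensitivity of $J$ to nucleating an infinitesimal hole $B_r(x)$. The paper derives analogous derivatives that relate hole/phase nucleation to changes in rendered image intensity; the exact expressions used there are not reproduced here.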

Microscopy Image Segmentation via Point and Shape Regularized Data Synthesis

  • paper_url: http://arxiv.org/abs/2308.09835
  • repo_url: None
  • paper_authors: Shijie Li, Mengwei Ren, Thomas Ach, Guido Gerig
  • for: 这篇论文主要针对微scopic图像分类问题提出了一个新的方法,它可以使用简单的点标注来进行训练,而不需要大量的实际标注数据。
  • methods: 这篇论文提出了一个三阶段的框架,包括:1)将点标注转换为伪稠密分类面组件,并受限于物体形状假设;2)使用对称的图像生成模型,将伪稠密分类面组件转换为真实的微scopic图像;3)使用伪稠密分类面组件和生成的图像,进行训练专门的分类模型。
  • results: 这篇论文的实验结果显示,使用这个新的方法可以在公共的 MoNuSeg 数据集上生成更多的多标的图像,并且保持高度的标注与生成图像之间的协调性。此外,这个方法可以与使用pseudo-labels或基准生成的图像进行比较,实现更高的分类精度。
    Abstract Current deep learning-based approaches for the segmentation of microscopy images heavily rely on large amount of training data with dense annotation, which is highly costly and laborious in practice. Compared to full annotation where the complete contour of objects is depicted, point annotations, specifically object centroids, are much easier to acquire and still provide crucial information about the objects for subsequent segmentation. In this paper, we assume access to point annotations only during training and develop a unified pipeline for microscopy image segmentation using synthetically generated training data. Our framework includes three stages: (1) it takes point annotations and samples a pseudo dense segmentation mask constrained with shape priors; (2) with an image generative model trained in an unpaired manner, it translates the mask to a realistic microscopy image regularized by object level consistency; (3) the pseudo masks along with the synthetic images then constitute a pairwise dataset for training an ad-hoc segmentation model. On the public MoNuSeg dataset, our synthesis pipeline produces more diverse and realistic images than baseline models while maintaining high coherence between input masks and generated images. When using the identical segmentation backbones, the models trained on our synthetic dataset significantly outperform those trained with pseudo-labels or baseline-generated images. Moreover, our framework achieves comparable results to models trained on authentic microscopy images with dense labels, demonstrating its potential as a reliable and highly efficient alternative to labor-intensive manual pixel-wise annotations in microscopy image segmentation. The code is available.
    摘要 当前的深度学习基于方法 для微scopic影像分割强调大量的训练数据,包括密集的标注。在实际应用中,这种标注是非常成本高昂和劳动密集的。相比拥有完整的标注,其中包含对象的完整边界,点标注更加容易获得,并且仍然提供了对对象的重要信息。在这篇论文中,我们假设在训练时有点标注可用。我们提出了一个简化的框架,包括以下三个阶段:1. 使用点标注,生成一个 Pseudo density 分割面,受限于形态约束。2. 使用一种没有对应关系的图像生成模型,将分割面翻译成一个真实的微scopic影像,并对其进行对象水平的准确性补做。3. 使用生成的 Pseudo 分割面和实际图像组成一个对应的数据集,用于训练适应性强的分割模型。在公共的 MoNuSeg 数据集上,我们的生成框架生成了更加多样化和真实的图像,同时保持了输入权重的高准确性。当使用同一个分割后端时,我们在我们的生成数据集上训练的模型比使用 pseudo-标签 或基eline-生成的图像训练得更好。此外,我们的框架可以与密集标注的模型相比,在微scopic影像分割任务中达到相同的性能水平。代码可以获得。

Cross-modality Attention-based Multimodal Fusion for Non-small Cell Lung Cancer (NSCLC) Patient Survival Prediction

  • paper_url: http://arxiv.org/abs/2308.09831
  • repo_url: None
  • paper_authors: Ruining Deng, Nazim Shaikh, Gareth Shannon, Yao Nie
  • for: 预测非小细胞肺癌患者存活result, 即computer-aided diagnosis和prognosis在医学应用中的提高。
  • methods: 跨模态注意力基本的多模态融合管道,该方法不仅将不同模式的特征简单 concatenate或sum,而是通过跨模态关系 gauges each modality’s importance for feature fusion。
  • results: 在实验中,提议的融合方法在NSCLC患者存活预测中实现了c-index 0.6587,较单模式(使用 solely tissue image data或RNA-seq data)的c-index 0.5772和0.5885高出2.3%和1.6%。
    Abstract Cancer prognosis and survival outcome predictions are crucial for therapeutic response estimation and for stratifying patients into various treatment groups. Medical domains concerned with cancer prognosis are abundant with multiple modalities, including pathological image data and non-image data such as genomic information. To date, multimodal learning has shown potential to enhance clinical prediction model performance by extracting and aggregating information from different modalities of the same subject. This approach could outperform single modality learning, thus improving computer-aided diagnosis and prognosis in numerous medical applications. In this work, we propose a cross-modality attention-based multimodal fusion pipeline designed to integrate modality-specific knowledge for patient survival prediction in non-small cell lung cancer (NSCLC). Instead of merely concatenating or summing up the features from different modalities, our method gauges the importance of each modality for feature fusion with cross-modality relationship when infusing the multimodal features. Compared with single modality, which achieved c-index of 0.5772 and 0.5885 using solely tissue image data or RNA-seq data, respectively, the proposed fusion approach achieved c-index 0.6587 in our experiment, showcasing the capability of assimilating modality-specific knowledge from varied modalities.
    摘要 cancer 诊断和生存结果预测是临床应用中的关键任务,可以用于评估治疗效果和将患者分配到不同的治疗组。医疗领域中关于 cancer 诊断的数据非常丰富,包括生物pathological 图像数据和非图像数据,如基因信息。迄今为止,多Modal learning 已经展现出能够提高诊断模型性能,通过抽取和汇集不同模式的信息来提高计算机助成诊断和预测的能力。在这个工作中,我们提议一种跨模式关注机制的多模式融合管道,用于将不同模式的特征融合,以提高患者存活预测的准确性。相比单模式学习,我们的方法可以评估不同模式之间的关系,从而更好地汇集模式特征。在我们的实验中,我们的方法实现了c-index 0.6587,比单模式学习的c-index 0.5772和0.5885更高,这显示了我们的方法可以充分利用不同模式之间的关系,以提高诊断和预测的准确性。
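
A rough sketch of the cross-modality fusion idea described above: modality-specific embeddings (e.g. a tissue-image embedding and an RNA-seq embedding) are weighted by learned importance scores before being fused for risk prediction. Dimensions and the gating form are assumptions.

```python
# Hypothetical sketch of attention-weighted multimodal fusion for survival prediction.
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)       # importance score per modality embedding
        self.risk_head = nn.Linear(dim, 1)   # survival risk output

    def forward(self, modality_feats):
        # modality_feats: (B, M, dim), e.g. [WSI embedding, RNA-seq embedding]
        attn = torch.softmax(self.score(modality_feats).squeeze(-1), dim=1)   # (B, M)
        fused = (attn.unsqueeze(-1) * modality_feats).sum(dim=1)              # (B, dim)
        return self.risk_head(fused), attn

risk, attn = CrossModalityFusion()(torch.randn(4, 2, 256))
print(risk.shape, attn.shape)  # torch.Size([4, 1]) torch.Size([4, 2])
```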

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.09779
  • repo_url: None
  • paper_authors: Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu
  • for: This paper is written for the task of image segmentation using natural language references.
  • methods: The paper proposes a new method called Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL), which explicitly aligns vision and language features in the segmentation stage using a series of unfixed convolution kernels generated based on the input language expression.
  • results: The paper achieves state-of-the-art performance on three benchmark datasets (RefCOCO, RefCOCO+, and G-Ref) by effectively fusing vision and language features and exploiting their potential in the segmentation stage, while also achieving language-related localization.
    Abstract Referring image segmentation aims to segment an object mentioned in natural language from an image. A main challenge is language-related localization, which means locating the object with the relevant language. Previous approaches mainly focus on the fusion of vision and language features without fully addressing language-related localization. In previous approaches, fused vision-language features are directly fed into a decoder and pass through a convolution with a fixed kernel to obtain the result, which follows a similar pattern as traditional image segmentation. This approach does not explicitly align language and vision features in the segmentation stage, resulting in a suboptimal language-related localization. Different from previous methods, we propose Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL). Instead of using a fixed convolution kernel, we propose an Aligner which explicitly aligns the vision and language features in the segmentation stage. Specifically, a series of unfixed convolution kernels are generated based on the input l, and then are use to explicitly align the vision and language features. To achieve this, We generate multiple queries that represent different emphases of the language expression. These queries are transformed into a series of query-based convolution kernels. Then, we utilize these kernels to do convolutions in the segmentation stage and obtain a series of segmentation masks. The final result is obtained through the aggregation of all masks. Our method can not only fuse vision and language features effectively but also exploit their potential in the segmentation stage. And most importantly, we explicitly align language features of different emphases with the image features to achieve language-related localization. Our method surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
    摘要 Previous referring image segmentation approaches mainly focus on fusing vision and language features without fully addressing the challenge of language-related localization, i.e., locating the object referred to by the language expression. They typically use a fixed convolution kernel to fuse the features, which does not explicitly align the language and vision features in the segmentation stage, leading to suboptimal localization. In this paper, we propose a novel method called Explicitly Align the Vision and Language for Referring Image Segmentation (EAVL). Our approach uses an Aligner to explicitly align the vision and language features in the segmentation stage, rather than using a fixed convolution kernel. We generate multiple queries that represent different emphases of the language expression and transform them into a series of query-based convolution kernels. These kernels are then used to do convolutions in the segmentation stage, resulting in a series of segmentation masks. The final result is obtained through the aggregation of all masks. Our method not only effectively fuses vision and language features but also exploits their potential in the segmentation stage. Moreover, we explicitly align language features of different emphases with the image features, achieving language-related localization. Our method outperforms previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
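
A small sketch of the query-based kernel idea described above: each language-derived query is turned into an (unfixed) 1x1 convolution kernel that is convolved with the visual features, and the per-query masks are aggregated. Shapes, the kernel generator, and the aggregation rule are assumptions.

```python
# Hypothetical sketch of language-conditioned dynamic 1x1 kernels for referring segmentation.
import torch
import torch.nn as nn

class QueryKernelHead(nn.Module):
    def __init__(self, feat_dim=256, lang_dim=512):
        super().__init__()
        # Turn each language query into a 1x1 conv kernel over the visual features.
        self.kernel_gen = nn.Linear(lang_dim, feat_dim)

    def forward(self, vis_feat, lang_queries):
        # vis_feat: (B, C, H, W); lang_queries: (B, Q, lang_dim)
        kernels = self.kernel_gen(lang_queries)                    # (B, Q, C)
        masks = torch.einsum("bqc,bchw->bqhw", kernels, vis_feat)  # one mask per query
        return masks.mean(dim=1, keepdim=True)                     # aggregate query masks

mask = QueryKernelHead()(torch.randn(2, 256, 30, 30), torch.randn(2, 4, 512))
print(mask.shape)  # torch.Size([2, 1, 30, 30])
```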

Long-range Multimodal Pretraining for Movie Understanding

  • paper_url: http://arxiv.org/abs/2308.09775
  • repo_url: https://github.com/dawitmureja/LMP
  • paper_authors: Dawit Mureja Argaw, Joon-Young Lee, Markus Woodson, In So Kweon, Fabian Caba Heilbron
  • for: 这个论文的目的是提出一种基于电影数据的多模态预训练策略,以便在电影理解任务中实现更好的表现。
  • methods: 这个论文使用了长距离多模态预训练策略,通过观察和提取电影中各种modalities之间的关系,从而学习多模态和交叉模态编码器。
  • results: 这个论文在LVU测试集上进行了缺失学习和模型选择研究,并证明了其在多个benchmark任务上的传送性。其中,模型在LVU任务上达到了状态略于前者,并且在五个不同的benchmark任务中设置了新的状态略。
    Abstract Learning computer vision models from (and for) movies has a long-standing history. While great progress has been attained, there is still a need for a pretrained multimodal model that can perform well in the ever-growing set of movie understanding tasks the community has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy, and a model that leverages movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long-range. After pretraining, we run ablation studies on the LVU benchmark and validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art on several LVU tasks while being much more data efficient than previous works. Finally, we evaluate our model's transferability by setting a new state-of-the-art in five different benchmarks.
    摘要 学习电影中的计算机视觉模型有很长的历史。虽然已经取得了很大的进步,但还有一些需求,例如需要一个预训练的多modal模型,可以在电影理解任务中表现出色。在这项工作中,我们介绍了远程多modal预训练策略和模型,该模型利用电影数据来训练可转移的多modal和跨modal编码器。我们的关键思想是从电影中所有modalities中学习和提取关系,并且在远程时间范围内做出关系。在预训练后,我们进行了ablation研究, validate我们的模型设计和学习长时间范围的重要性。我们的模型在LVU标准准则上实现了多个任务的state-of-the-art,并且比前一些工作更加数据效率。最后,我们测试了我们的模型的传输性,并在五个不同的标准准则上设置了新的state-of-the-art。

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

  • paper_url: http://arxiv.org/abs/2308.09718
  • repo_url: https://github.com/Pointcept/Pointcept
  • paper_authors: Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao
  • for: 提高3D深度学习模型的性能和通用性,即使只使用有限的大规模3D数据。
  • methods: 提出Point Prompt Training(PPT)框架,支持多种预训练方法,包括Prompt-driven Normalization和Language-guided Categorical Alignment等技术。
  • results: 经验表明,PPT可以缓解多 dataset 学习中的负转移现象,并生成高质量的表示。在多种不同的3D下世界任务上,PPT在单个模型下实现了最佳性能,并在多种预训练方法中占据了主导地位。
    Abstract The rapid advancement of deep learning models often attributes to their ability to leverage massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning, mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision could adversely affect the model's performance and lead to degenerated performance (i.e., negative transfer) compared to single-dataset training. In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts and Language-guided Categorical Alignment that decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when served as a pre-training framework, it outperforms other pre-training approaches regarding representation quality and attains remarkable state-of-the-art performance across over ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
    摘要 深度学习模型的快速进步常常归功于它们能够利用海量训练数据。相比之下,三维深度学习尚未充分享受到这一红利,主要原因是大规模三维数据集的匮乏。将多个可用的数据源合并、让它们协同训练单个模型是一个潜在的解决方案。然而,由于三维点云数据集之间的域差距较大,这种混合监督可能会对模型性能产生不利影响,导致相比单数据集训练性能退化(即负迁移)。针对这一挑战,我们介绍 Point Prompt Training(PPT),一种面向三维表示学习的多数据集协同学习框架,支持多种预训练范式。基于这个框架,我们提出 Prompt-driven Normalization,通过域特定的提示使模型适应不同数据集;以及 Language-guided Categorical Alignment,利用标签文本之间的关系,妥善地统一多个数据集的标签空间。广泛的实验表明,PPT 可以克服协同学习带来的负迁移,并生成可泛化的表示。值得注意的是,它使用单个权重共享模型进行有监督的多数据集训练,即可在每个数据集上达到最优性能。此外,作为预训练框架,PPT 在表示质量上优于其他预训练方法,并在十多个涵盖室内和室外3D场景的下游任务中取得了卓越的最先进性能。
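
A minimal sketch of Prompt-driven Normalization as described above: the affine parameters of a normalization layer are generated from a learnable, dataset-specific prompt, so a single shared backbone can adapt to each source dataset. The dataset names, prompt size, and the use of LayerNorm are assumptions.

```python
# Hypothetical sketch of prompt-driven normalization for multi-dataset training.
import torch
import torch.nn as nn

class PromptDrivenNorm(nn.Module):
    def __init__(self, dim, datasets=("dataset_a", "dataset_b", "dataset_c"), prompt_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.prompts = nn.ParameterDict(
            {d: nn.Parameter(torch.zeros(prompt_dim)) for d in datasets})
        self.to_affine = nn.Linear(prompt_dim, 2 * dim)   # -> per-domain scale & shift

    def forward(self, x, dataset: str):
        scale, shift = self.to_affine(self.prompts[dataset]).chunk(2)
        return self.norm(x) * (1.0 + scale) + shift

layer = PromptDrivenNorm(dim=96)
out = layer(torch.randn(1024, 96), dataset="dataset_a")   # point features from one dataset
print(out.shape)
```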

Smoothness Similarity Regularization for Few-Shot GAN Adaptation

  • paper_url: http://arxiv.org/abs/2308.09717
  • repo_url: None
  • paper_authors: Vadim Sushko, Ruyu Wang, Juergen Gall
  • for: 这项研究的目的是在仅有极少量训练图像的情况下提升 GAN 的适应效果,尤其是当预训练数据集与目标数据集的物体结构差异很大时。
  • methods: 提出了一种新的平滑相似性正则化,将预训练 GAN 中固有学习到的平滑性迁移到少样本目标领域,即使两个领域的物体结构非常不同。
  • results: 在源-目标领域结构差异很大的设置下,该方法显著优于先前的少样本 GAN 适应方法;在结构相似的设置下,则与最先进方法表现相当。
    Abstract The task of few-shot GAN adaptation aims to adapt a pre-trained GAN model to a small dataset with very few training images. While existing methods perform well when the dataset for pre-training is structurally similar to the target dataset, the approaches suffer from training instabilities or memorization issues when the objects in the two domains have a very different structure. To mitigate this limitation, we propose a new smoothness similarity regularization that transfers the inherently learned smoothness of the pre-trained GAN to the few-shot target domain even if the two domains are very different. We evaluate our approach by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains. Our proposed method significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains.
    摘要 少样本(few-shot)GAN 适应的任务是将预训练的 GAN 模型适应到只有极少量训练图像的小数据集。现有方法在预训练数据集与目标数据集结构相似时表现良好,但当两个领域中的物体结构差异很大时,这些方法会出现训练不稳定或记忆化问题。为缓解这一限制,我们提出一种新的平滑相似性正则化,即使两个领域差异很大,也能将预训练 GAN 中固有学习到的平滑性迁移到少样本目标领域。我们通过将无条件 GAN 和类别条件 GAN 适应到多种少样本目标领域来评估该方法。在源-目标领域结构不相似这一具有挑战性的情况下,我们的方法显著优于先前的少样本 GAN 适应方法;在源-目标领域相似的情况下,则与最先进方法表现相当。
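As a rough illustration of what a "smoothness similarity" term could look like, the hedged sketch below perturbs the latent code, measures how much the frozen pretrained generator and the adapting generator change their outputs, and encourages the adapted model to preserve the pretrained smoothness pattern. The function names, the normalization choice, and the stand-in generators are assumptions, not the paper's implementation.

```python
# Illustrative smoothness-similarity regularizer (not the authors' code).
import torch
import torch.nn.functional as F


def smoothness(generator, z, eps=1e-2):
    # Per-sample output change under a small latent perturbation.
    delta = eps * torch.randn_like(z)
    diff = generator(z + delta) - generator(z)
    return diff.flatten(1).norm(dim=1)


def smoothness_similarity_loss(g_adapted, g_pretrained, z):
    with torch.no_grad():
        s_src = smoothness(g_pretrained, z)   # frozen source-domain smoothness
    s_tgt = smoothness(g_adapted, z)          # smoothness of the adapting model
    # Match normalized smoothness profiles across the batch (one possible choice).
    return F.mse_loss(F.normalize(s_tgt, dim=0), F.normalize(s_src, dim=0))


# Toy usage with stand-in "generators".
g_pre = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.Tanh())
g_new = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.Tanh())
loss = smoothness_similarity_loss(g_new, g_pre, torch.randn(16, 64))
print(loss.item())
```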

Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis

  • paper_url: http://arxiv.org/abs/2308.09713
  • repo_url: None
  • paper_authors: Jonathon Luiten, Georgios Kopanas, Bastian Leibe, Deva Ramanan
  • for: 该 paper 旨在同时解决动态场景新视图合成与所有稠密场景元素的六自由度(6-DOF)跟踪问题。
  • methods: 作者采用分析-合成(analysis-by-synthesis)框架,借鉴近期将场景建模为一组 3D 高斯、并通过可微渲染优化以重建输入图像的工作。
  • results: 该方法无需任何对应关系或光流作为输入,即可从持续的动态视图合成中自然地获得稠密 6-DOF 跟踪与动态重建,捕捉并跟踪场景中的所有稠密元素,包括该区域空间的旋转。
    Abstract We present a method that simultaneously addresses the tasks of dynamic scene novel-view synthesis and six degree-of-freedom (6-DOF) tracking of all dense scene elements. We follow an analysis-by-synthesis framework, inspired by recent work that models scenes as a collection of 3D Gaussians which are optimized to reconstruct input images via differentiable rendering. To model dynamic scenes, we allow Gaussians to move and rotate over time while enforcing that they have persistent color, opacity, and size. By regularizing Gaussians' motion and rotation with local-rigidity constraints, we show that our Dynamic 3D Gaussians correctly model the same area of physical space over time, including the rotation of that space. Dense 6-DOF tracking and dynamic reconstruction emerges naturally from persistent dynamic view synthesis, without requiring any correspondence or flow as input. We demonstrate a large number of downstream applications enabled by our representation, including first-person view synthesis, dynamic compositional scene synthesis, and 4D video editing.
    摘要 我们提出了一种方法,同时解决动态场景新视图合成和所有稠密场景元素的六自由度(6-DOF)跟踪问题。我们采用分析-合成框架,受近期工作的启发,将场景建模为一组 3D 高斯,并通过可微渲染优化它们以重建输入图像。为了建模动态场景,我们允许高斯随时间移动和旋转,同时要求它们保持颜色、不透明度和尺寸不变。通过用局部刚性约束对高斯的运动和旋转进行正则化,我们证明了我们的动态 3D 高斯能够在时间上正确地建模物理空间中的同一区域,包括该空间的旋转。稠密的 6-DOF 跟踪和动态重建自然地从持续的动态视图合成中涌现,无需任何对应关系或光流作为输入。我们展示了该表示所支持的大量下游应用,包括第一人称视角合成、动态组合场景合成和 4D 视频编辑。
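The sketch below is one plausible way to write a local-rigidity regularizer for moving Gaussians: neighboring Gaussians should keep their relative geometry over time, which implicitly also constrains the local rotation of that region. This is an assumption-level illustration (here a simple neighbor-distance isometry penalty), not the paper's exact loss.

```python
# Illustrative local-rigidity regularizer for dynamic 3D Gaussians.
import torch


def local_rigidity_loss(centers_t, centers_t1, knn_idx):
    # centers_t, centers_t1: (N, 3) Gaussian means at consecutive timesteps.
    # knn_idx: (N, K) indices of each Gaussian's nearest neighbors at time t.
    offsets_t = centers_t[knn_idx] - centers_t[:, None, :]     # (N, K, 3)
    offsets_t1 = centers_t1[knn_idx] - centers_t1[:, None, :]  # (N, K, 3)
    # Penalize change in neighbor distances (isometry of the local neighborhood).
    d_t = offsets_t.norm(dim=-1)
    d_t1 = offsets_t1.norm(dim=-1)
    return (d_t1 - d_t).abs().mean()


N, K = 1000, 8
centers = torch.randn(N, 3)
knn = torch.randint(0, N, (N, K))              # toy neighbor indices
moved = centers + 0.01 * torch.randn(N, 3)     # Gaussians after one timestep
print(local_rigidity_loss(centers, moved, knn).item())
```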

HumanLiff: Layer-wise 3D Human Generation with Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.09712
  • repo_url: None
  • paper_authors: Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu
  • for: 本研究旨在提出一种层 wise 3D 人体生成模型,即 HumanLiff,该模型可以具有高精度和控制性,并能够生成层 wise 3D 人体。
  • methods: HumanLiff 模型使用了 diffusion-based 3D 条件生成方法,首先生成 minimal-clothed 人体,然后逐层生成衣物。另外,为了提高 3D 生成的精度,该模型还提出了 tri-plane shift 操作和层 wise 特征融合方法。
  • results: 在 SynBody(合成)和 TightCap(真实)两个层 wise 3D 人体数据集上的实验表明,HumanLiff 在层 wise 3D 人体生成方面显著优于现有的最先进方法,能够生成精度更高、可控性更强的 3D 人体。
    Abstract 3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an undetectable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as underwear, outerwear, trousers, shoes, etc. In this work, we propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process. Specifically, HumanLiff firstly generates minimal-clothed humans, represented by tri-plane features, in a canonical space, and then progressively generates clothes in a layer-wise manner. In this way, the 3D human generation is thus formulated as a sequence of diffusion-based 3D conditional generation. To reconstruct more fine-grained 3D humans with tri-plane representation, we propose a tri-plane shift operation that splits each tri-plane into three sub-planes and shifts these sub-planes to enable feature grid subdivision. To further enhance the controllability of 3D generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane features and 3D layered conditions to facilitate the 3D diffusion model learning. Extensive experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), validate that HumanLiff significantly outperforms state-of-the-art methods in layer-wise 3D human generation. Our code will be available at https://skhu101.github.io/HumanLiff.
    摘要 借助神经渲染与生成模型的协同使用,从二维图像生成三维人体已取得显著进展。现有的三维人体生成模型大多在单次生成中得到一个穿着衣物的整体三维人体,很少考虑穿衣人体的层次结构:人体通常由躯体和内衣、外衣、裤子、鞋等多层衣物组成。在这项工作中,我们提出 HumanLiff,首个采用统一扩散过程的层次化三维人体生成模型。具体而言,HumanLiff 首先在规范空间中生成以三平面(tri-plane)特征表示的最少着装人体,然后逐层生成衣物,从而将三维人体生成表述为一系列基于扩散的三维条件生成。为了用三平面表示重建更精细的三维人体,我们提出三平面移位(tri-plane shift)操作,将每个平面划分为三个子平面并加以移位,以实现特征网格细分。为进一步增强带有三维分层条件的生成可控性,HumanLiff 分层融合三平面特征与三维分层条件,以促进三维扩散模型的学习。在 SynBody(合成)和 TightCap(真实)两个层次化三维人体数据集上的大量实验验证了 HumanLiff 显著优于当前最先进方法。我们的代码将在 https://skhu101.github.io/HumanLiff 公开。
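Below is a rough, hypothetical reading of a "tri-plane shift" style operation: split a plane's channels into three sub-planes and sample each with a small sub-pixel offset so that, taken together, they describe a finer feature grid. The exact offsets and combination used by HumanLiff may differ; this is only a sketch under those assumptions.

```python
# Hypothetical tri-plane shift: channel-split sub-planes sampled at offsets.
import torch
import torch.nn.functional as F


def shifted_subplanes(plane: torch.Tensor) -> torch.Tensor:
    # plane: (B, 3*C, H, W) features of one tri-plane; returns (B, 3, C, H, W).
    b, c3, h, w = plane.shape
    subs = plane.view(b, 3, c3 // 3, h, w)
    # Fractional shifts (in pixels) applied to the 2nd and 3rd sub-planes.
    shifts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    out = []
    for i, (dx, dy) in enumerate(shifts):
        grid = base + torch.tensor([2 * dx / w, 2 * dy / h])
        out.append(F.grid_sample(subs[:, i], grid, align_corners=True))
    return torch.stack(out, dim=1)


planes = torch.randn(2, 3 * 32, 64, 64)
print(shifted_subplanes(planes).shape)  # torch.Size([2, 3, 32, 64, 64])
```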

Robust Monocular Depth Estimation under Challenging Conditions

  • paper_url: http://arxiv.org/abs/2308.09711
  • repo_url: https://github.com/md4all/md4all
  • paper_authors: Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari
  • for: 提高单目深度估计的可靠性,特别是在不良环境和天气条件下。
  • methods: 利用现有方法在理想条件下的有效性:先为常规训练样本生成对应的复杂(不良条件)样本,再将生成样本输入模型,并在对应的原始图像上计算标准损失,以此引导模型的自监督或全监督训练,从而无论输入如何都能提供有效的训练信号。
    Abstract While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ideal conditions, as well as for different types of learning supervision. We achieve this by exploiting the efficacy of existing methods under perfect settings. Therefore, we provide valid training signals independently of what is in the input. First, we generate a set of complex samples corresponding to the normal training ones. Then, we train the model by guiding its self- or full-supervision by feeding the generated samples and computing the standard losses on the corresponding original images. Doing so enables a single model to recover information across diverse conditions without modifications at inference time. Extensive experiments on two challenging public datasets, namely nuScenes and Oxford RobotCar, demonstrate the effectiveness of our techniques, outperforming prior works by a large margin in both standard and challenging conditions. Source code and data are available at: https://md4all.github.io.
    摘要 当前最先进的单目深度估计方法在理想条件下可以取得令人印象深刻的结果,但在夜间或雨天等具有挑战性的光照和天气条件下,其可靠性非常低。在这篇论文中,我们揭示了这些安全关键问题,并用 md4all 加以解决:这是一种简单而有效的方案,在不良和理想条件下都能可靠工作,并适用于不同类型的学习监督。我们通过利用现有方法在理想条件下的有效性来实现这一点,从而无论输入内容如何,都能提供有效的训练信号。首先,我们生成一组与常规训练样本相对应的复杂样本;然后,将生成样本输入模型,并在对应的原始图像上计算标准损失,以此引导其自监督或全监督训练。这样,单个模型无需在推理时做任何修改,就能在多种条件下恢复信息。我们在 nuScenes 和 Oxford RobotCar 这两个具有挑战性的公开数据集上进行了广泛实验,证明了我们技术的有效性:无论在标准还是挑战性条件下,都大幅超越了先前工作。源代码和数据可在以下网址获取:https://md4all.github.io。
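A hedged sketch of the training idea follows: the model is sometimes fed a generated "adverse" version of each sample (e.g. a night or rain rendition of a day image), but the usual depth losses are computed against the corresponding original image, so valid supervision is obtained regardless of what the input looks like. `depth_net`, `translate_to_adverse`, and `standard_depth_loss` are stand-ins, not the md4all code.

```python
# Toy training loop illustrating "train on adverse inputs, supervise on originals".
import random
import torch
import torch.nn as nn

depth_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 1, 3, padding=1))       # toy depth model
optimizer = torch.optim.Adam(depth_net.parameters(), lr=1e-4)


def translate_to_adverse(img):
    # Placeholder for a day-to-night / clear-to-rain image translation model.
    return (img * 0.3 + 0.05 * torch.randn_like(img)).clamp(0, 1)


def standard_depth_loss(pred, target_img):
    # Placeholder for the usual self- or fully-supervised depth objective,
    # always evaluated against the *original* (ideal-condition) image/labels.
    return (pred - target_img.mean(dim=1, keepdim=True)).abs().mean()


for step in range(3):                                   # toy loop
    easy = torch.rand(4, 3, 64, 64)                     # ideal-condition batch
    inp = translate_to_adverse(easy) if random.random() < 0.5 else easy
    loss = standard_depth_loss(depth_net(inp), easy)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```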

Training with Product Digital Twins for AutoRetail Checkout

  • paper_url: http://arxiv.org/abs/2308.09708
  • repo_url: https://github.com/yorkeyao/automated-retail-checkout
  • paper_authors: Yue Yao, Xinyu Tian, Zheng Tang, Sujit Biswas, Huan Lei, Tom Gedeon, Liang Zheng
  • for: 这 paper 的目的是为了自动化商业 checkout 过程,提高用户体验和效率。
  • methods: 该 paper 使用了产品 3D 模型,通过图形引擎渲染来生成快速、灵活、大规模的训练数据。它还提出了一种训练数据优化框架,通过使用产品 3D 模型来生成“数字双胞胎”,以增强训练数据的可靠性和效果。
  • results: 该 paper 的实验表明,使用“数字双胞胎”训练集可以提高产品检测和跟踪模型的准确率,并且可以与 Pseudo-labeled 的实际检查数据组合使用,进一步提高模型的性能。
    Abstract Automating the checkout process is important in smart retail, where users effortlessly pass products by hand through a camera, triggering automatic product detection, tracking, and counting. In this emerging area, due to the lack of annotated training data, we introduce a dataset comprised of product 3D models, which allows for fast, flexible, and large-scale training data generation through graphic engine rendering. Within this context, we discern an intriguing facet, because of the user "hands-on" approach, bias in user behavior leads to distinct patterns in the real checkout process. The existence of such patterns would compromise training effectiveness if training data fail to reflect the same. To address this user bias problem, we propose a training data optimization framework, i.e., training with digital twins (DtTrain). Specifically, we leverage the product 3D models and optimize their rendering viewpoint and illumination to generate "digital twins" that visually resemble representative user images. These digital twins, inherit product labels and, when augmented, form the Digital Twin training set (DT set). Because the digital twins individually mimic user bias, the resulting DT training set better reflects the characteristics of the target scenario and allows us to train more effective product detection and tracking models. In our experiment, we show that DT set outperforms training sets created by existing dataset synthesis methods in terms of counting accuracy. Moreover, by combining DT set with pseudo-labeled real checkout data, further improvement is observed. The code is available at https://github.com/yorkeyao/Automated-Retail-Checkout.
    摘要 自动化结账过程是智能零售的重要环节:用户只需将商品用手从摄像头前经过,即可触发自动的商品检测、跟踪和计数。在这一新兴领域中,由于缺乏带标注的训练数据,我们提出了一个由商品 3D 模型组成的数据集,借助图形引擎渲染即可快速、灵活、大规模地生成训练数据。在此背景下,我们发现了一个有趣的现象:由于用户"亲手操作"的方式,用户行为中的偏差会在真实结账过程中形成独特的模式;如果训练数据不能反映这些模式,将会损害训练效果。为解决这一用户偏差问题,我们提出了一个训练数据优化框架,即数字双胞胎训练(DtTrain)。具体来说,我们利用商品 3D 模型,优化其渲染视角和光照,生成在视觉上与代表性用户图像相似的"数字双胞胎"。这些数字双胞胎继承商品标签,经过数据增强后构成数字双胞胎训练集(DT 集)。由于每个数字双胞胎分别模拟了用户偏差,所得的 DT 训练集能更好地反映目标场景的特点,从而训练出更有效的商品检测与跟踪模型。实验表明,DT 集在计数准确率上优于现有数据合成方法生成的训练集;进一步将 DT 集与伪标注的真实结账数据结合,还能观察到额外的提升。代码见 https://github.com/yorkeyao/Automated-Retail-Checkout。
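As a toy illustration of the digital-twin idea, the sketch below searches over rendering viewpoint and illumination so that rendered product images resemble a representative user image in some feature space. `render` and `feature_extractor` are stubs standing in for a graphics engine and a pretrained CNN; the real pipeline and its optimization strategy are not specified here.

```python
# Toy search over rendering parameters to mimic user-captured imagery.
import torch


def render(viewpoint, illumination):
    # Stub: a graphics engine would rasterize the product 3D model here.
    torch.manual_seed(int(1000 * (viewpoint + illumination)) % 2**31)
    return torch.rand(3, 64, 64)


def feature_extractor(img):
    return img.mean(dim=(1, 2))            # stand-in for CNN features


user_reference = torch.rand(3, 64, 64)     # representative user image
target_feat = feature_extractor(user_reference)

best = None
for vp in torch.linspace(0, 1, 8):         # coarse grid over viewpoints
    for light in torch.linspace(0, 1, 8):  # and illumination settings
        feat = feature_extractor(render(vp.item(), light.item()))
        dist = (feat - target_feat).norm().item()
        if best is None or dist < best[0]:
            best = (dist, vp.item(), light.item())
print("best viewpoint/illumination:", best[1:], "distance:", round(best[0], 4))
```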

Guide3D: Create 3D Avatars from Text and Image Guidance

  • paper_url: http://arxiv.org/abs/2308.09705
  • repo_url: https://github.com/yukangcao/Guide3D
  • paper_authors: Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
  • for: 本研究旨在开发一种高效的文本和图像引导的三维生成模型,用于生成高分辨率的纹理网格。
  • methods: 我们提出了一种基于扩散模型的零样本文本与图像引导生成模型:首先利用扩散模型生成与文本一致角色的稀疏视角图像,然后联合优化多分辨率可微分行进四面体(marching tetrahedra)网格与像素对齐的图像特征。我们还提出了一种相似性感知的特征融合策略,以有效整合不同视角的特征,并引入两个新的训练目标以替代 SDS 的计算。
  • results: 我们的框架在生成拓扑与结构正确的三维几何和高分辨率纹理方面达到了最先进水平,并且可以将二维生成的图像直接迁移到三维空间。我们的代码将会公开发布。
    Abstract Recently, text-to-image generation has exhibited remarkable advancements, with the ability to produce visually impressive results. In contrast, text-to-3D generation has not yet reached a comparable level of quality. Existing methods primarily rely on text-guided score distillation sampling (SDS), and they encounter difficulties in transferring 2D attributes of the generated images to 3D content. In this work, we aim to develop an effective 3D generative model capable of synthesizing high-resolution textured meshes by leveraging both textual and image information. To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models. Our model involves (1) generating sparse-view images of a text-consistent character using diffusion models, and (2) jointly optimizing multi-resolution differentiable marching tetrahedral grids with pixel-aligned image features. We further propose a similarity-aware feature fusion strategy for efficiently integrating features from different views. Moreover, we introduce two novel training objectives as an alternative to calculating SDS, significantly enhancing the optimization process. We thoroughly evaluate the performance and components of our framework, which outperforms the current state-of-the-art in producing topologically and structurally correct geometry and high-resolution textures. Guide3D enables the direct transfer of 2D-generated images to the 3D space. Our code will be made publicly available.
    摘要 最近,文本到图像生成技术取得了显著进展,能够生成视觉效果出色的结果。相比之下,文本到 3D 生成尚未达到相当的质量水平。现有方法主要依赖文本引导的分数蒸馏采样(SDS),并且在将生成图像的二维属性迁移到三维内容时遇到困难。在这项工作中,我们的目标是开发一种高效的三维生成模型,能够同时利用文本和图像信息合成高分辨率的带纹理网格。为此,我们提出 Guide3D,一种基于扩散模型的零样本文本与图像引导的 3D 虚拟人生成模型。我们的模型包括:(1)利用扩散模型生成与文本一致角色的稀疏视角图像;(2)联合优化多分辨率可微分行进四面体网格与像素对齐的图像特征。我们还提出了一种相似性感知的特征融合策略,以高效整合不同视角的特征。此外,我们引入两个新的训练目标作为计算 SDS 的替代方案,显著改善了优化过程。我们对框架的性能和各组件进行了全面评估,其在生成拓扑与结构正确的几何以及高分辨率纹理方面超越了当前最先进方法。Guide3D 能够将二维生成的图像直接迁移到三维空间。我们的代码将公开发布。
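The snippet below is one plausible reading of "similarity-aware" fusion of per-view features, offered only as an assumption: each view's pixel-aligned feature for a 3D query point is weighted by its cosine similarity to the mean feature, so inconsistent views contribute less. Guide3D's actual fusion may be defined differently.

```python
# Hypothetical similarity-aware fusion of multi-view pixel-aligned features.
import torch
import torch.nn.functional as F


def similarity_aware_fusion(view_feats: torch.Tensor) -> torch.Tensor:
    # view_feats: (V, P, C) features for P query points from V generated views.
    mean_feat = view_feats.mean(dim=0, keepdim=True)            # (1, P, C)
    sim = F.cosine_similarity(view_feats, mean_feat, dim=-1)    # (V, P)
    weights = sim.softmax(dim=0).unsqueeze(-1)                  # (V, P, 1)
    return (weights * view_feats).sum(dim=0)                    # (P, C)


feats = torch.randn(4, 1024, 32)    # 4 sparse views, 1024 sampled points
fused = similarity_aware_fusion(feats)
print(fused.shape)  # torch.Size([1024, 32])
```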

Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition

  • paper_url: http://arxiv.org/abs/2308.09694
  • repo_url: None
  • paper_authors: Xuanyu Yi, Jiajun Deng, Qianru Sun, Xian-Sheng Hua, Joo-Hwee Lim, Hanwang Zhang
  • for: 解决少样本(few-shot)点云识别任务中的数据稀缺问题,方法是对一个常规 3D 模型和一个训练良好的 2D 模型进行联合预测(ensemble)。
  • methods: 提出一种新的不变性训练策略 InvJoint,不仅加强对"联合困难样本"的训练,还寻求 2D 与 3D 相互矛盾的模糊预测之间的不变性。
  • results: 在 ModelNet10/40、ScanObjectNN 和 Toys4K 上的 3D 形状分类实验以及 ShapeNet-Core 上的形状检索实验表明,InvJoint 能学习到更协同的 2D 和 3D 表示,从而提升 ensemble 的性能。
    Abstract We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model. Surprisingly, such an ensemble, though seems trivial, has hardly been shown effective in recent 2D-3D models. We find out the crux is the less effective training for the ''joint hard samples'', which have high confidence prediction on different wrong labels, implying that the 2D and 3D models do not collaborate well. To this end, our proposed invariant training strategy, called InvJoint, does not only emphasize the training more on the hard samples, but also seeks the invariance between the conflicting 2D and 3D ambiguous predictions. InvJoint can learn more collaborative 2D and 3D representations for better ensemble. Extensive experiments on 3D shape classification with widely adopted ModelNet10/40, ScanObjectNN and Toys4K, and shape retrieval with ShapeNet-Core validate the superiority of our InvJoint.
    摘要 我们通过将一个常规三维模型与一个训练良好的二维模型的预测进行联合,来解决少样本三维物体点云识别中的数据稀缺挑战。令人意外的是,这种看似简单的集成(ensemble)在近期的 2D-3D 模型中几乎未被证明有效。我们发现,问题的关键在于对"联合困难样本"的训练不够有效:这些样本在不同的错误标签上都有很高的预测置信度,表明二维与三维模型之间协作不佳。为此,我们提出的不变性训练策略 InvJoint,不仅加强对困难样本的训练,还寻求相互矛盾的 2D 与 3D 模糊预测之间的不变性。InvJoint 能够学习到更协同的 2D 和 3D 表示,从而获得更好的集成效果。我们在广泛采用的 ModelNet10/40、ScanObjectNN 和 Toys4K 上进行 3D 形状分类实验,并在 ShapeNet-Core 上进行形状检索实验,均验证了 InvJoint 的优越性。
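The hedged sketch below shows how "joint hard samples" might be mined and regularized: samples where the 2D and 3D branches are each confident yet disagree are selected, and a symmetric KL term pushes their predictions toward agreement. This is an illustration of the idea under those assumptions, not the InvJoint implementation.

```python
# Illustrative joint-hard-sample mining plus 2D/3D invariance term.
import torch
import torch.nn.functional as F


def invariance_loss(logits_2d, logits_3d, conf_thresh=0.7):
    p2d, p3d = logits_2d.softmax(-1), logits_3d.softmax(-1)
    conf2d, pred2d = p2d.max(-1)
    conf3d, pred3d = p3d.max(-1)
    # "Joint hard" = both branches confident, yet predicting different labels.
    hard = (conf2d > conf_thresh) & (conf3d > conf_thresh) & (pred2d != pred3d)
    if not hard.any():
        return logits_2d.new_zeros(())
    kl = F.kl_div(p2d[hard].log(), p3d[hard], reduction="batchmean") + \
         F.kl_div(p3d[hard].log(), p2d[hard], reduction="batchmean")
    return 0.5 * kl


logits_img = torch.randn(32, 40) * 3   # toy 2D-branch logits (40 classes)
logits_pts = torch.randn(32, 40) * 3   # toy 3D-branch logits
print(invariance_loss(logits_img, logits_pts).item())
```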

A Lightweight Transformer for Faster and Robust EBSD Data Collection

  • paper_url: http://arxiv.org/abs/2308.09693
  • repo_url: https://github.com/hdong920/ebsd_slice_recovery
  • paper_authors: Harry Dong, Sean Donegan, Megna Shah, Yuejie Chi
  • for: 提高3D EBSD数据质量和收集效率
  • methods: 使用变换器模型和投影算法进行数据处理和恢复
  • results: 仅使用合成 3D EBSD 数据进行自监督训练,在真实 3D EBSD 数据上获得了比现有方法更高的切片恢复精度。
    Abstract Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two step method that recovers missing slices in an 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer's outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods.
    摘要 三维电子背散射衍射(EBSD)显微技术是材料科学诸多应用中的关键工具,但在繁琐的采集过程中(尤其是通过连续切片采集时),其数据质量可能出现很大波动。幸运的是,3D EBSD 数据本质上是序列化的,这为使用 transformer(在众多领域取得突破的最先进深度学习架构)进行数据处理与恢复提供了机会。为了更鲁棒地应对误差并加速 3D EBSD 数据采集,我们提出了一种两步方法:先用高效的 transformer 模型预测 3D EBSD 体数据中缺失的切片,再用投影算法处理 transformer 的输出。为克服在高维数据稀缺情况下应用深度学习的计算与实践障碍,我们仅使用合成 3D EBSD 数据以自监督方式训练该模型,并在真实 3D EBSD 数据上获得了优于现有方法的恢复精度。
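Below is a minimal, hypothetical sketch of the self-supervised training signal only: random slices of a 3D volume are masked out, a small transformer encoder predicts them from the remaining slices, and an L1 loss drives the reconstruction. The tokenization, model size, and the projection post-processing step are not reproduced here.

```python
# Toy masked-slice recovery with a small transformer encoder.
import torch
import torch.nn as nn

S, H, W = 16, 24, 24                      # slices and in-plane resolution
embed = nn.Linear(H * W, 128)             # per-slice tokenization
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(128, H * W)              # predict the slice content back
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

volume = torch.rand(2, S, H * W)          # toy (synthetic) EBSD-like volumes
mask = torch.zeros(2, S, dtype=torch.bool)
mask[:, torch.randperm(S)[:4]] = True     # hide 4 slices per volume

tokens = embed(volume.masked_fill(mask[..., None], 0.0))
recon = head(encoder(tokens))
loss = (recon[mask] - volume[mask]).abs().mean()
loss.backward(); opt.step()
print("masked-slice L1:", loss.item())
```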

Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

  • paper_url: http://arxiv.org/abs/2308.09685
  • repo_url: https://github.com/mjoannou/audiovisual-moments-in-time
  • paper_authors: Michael Joannou, Pia Rotshtein, Uta Noppeney
  • for: 这个论文主要是为了提供一个大规模的audiovisual动作事件数据集(AVMIT),以便用于计算机模型和人类参与者之间的研究。
  • methods: 这篇论文使用了一个大规模的注释任务,采集了3秒的audiovisual视频,并由11名参与者进行了分类。每个试验都需要参与者确定 audiovisual动作事件是否存在,以及这个事件是否是视频中最显著的特征。
  • results: 论文表明,使用 AVMIT 标注和特征嵌入可以提升 audiovisual 事件识别性能,尤其是在 audiovisual 对应关系至关重要的研究问题上。在 6 个循环神经网络(RNN)中,仅在 audiovisual 事件上训练(而非模态无关的事件)均提高了测试集的 top-1 准确率。
    Abstract We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset includes the annotation of 57,177 audiovisual videos, each independently evaluated by 3 of 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes, with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YamNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of AVMIT annotations and feature embeddings to improve performance on audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top 1 accuracy was increased by 2.71-5.94\% by training exclusively on audiovisual events, even outweighing a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
    摘要 我们介绍 Audiovisual Moments in Time(AVMIT)数据集,这是一个大规模的 audiovisual 动作事件数据集。在一项大规模标注任务中,11 名参与者对 Moments in Time(MIT)数据集中 3 秒 audiovisual 视频的一个子集进行了标注。在每个试验中,参与者判断所标注的 audiovisual 动作事件是否存在,以及它是否是视频中最显著的特征。该数据集包含 57,177 个 audiovisual 视频的标注,每个视频由 11 名受训参与者中的 3 名独立评估。在此初始集合的基础上,我们构建了一个精选测试集,包含 16 个不同的动作类别,每类 60 个视频(共 960 个视频)。我们还提供了两组预计算的 audiovisual 特征嵌入(音频采用 VGGish/YamNet,视觉采用 VGG16/EfficientNetB0),从而降低 audiovisual 深度网络研究的门槛。我们探究了 AVMIT 标注和特征嵌入在提升 audiovisual 事件识别性能方面的优势:6 个循环神经网络(RNN)分别在 AVMIT 筛选出的 audiovisual 事件或 MIT 中模态无关的事件上训练,然后在我们的 audiovisual 测试集上测试。在所有 RNN 中,仅在 audiovisual 事件上训练使 top-1 准确率提高了 2.71-5.94%,其收益甚至超过了将训练数据扩大三倍的效果。我们预计,新标注的 AVMIT 数据集将成为涉及计算模型与人类参与者的研究和对比实验的宝贵资源,特别是在 audiovisual 对应关系至关重要的研究问题上。
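The hedged sketch below shows how the released feature embeddings could be consumed: per-timestep audio (e.g. VGGish, 128-d) and visual (e.g. VGG16, 512-d) embeddings are concatenated and fed to a small GRU classifier over the 16 curated action classes. The dimensions, loader, and model are illustrative assumptions rather than the paper's exact RNNs.

```python
# Toy audiovisual GRU classifier over precomputed embeddings.
import torch
import torch.nn as nn

T, D_AUDIO, D_VISUAL, N_CLASSES = 3, 128, 512, 16


class AVClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D_AUDIO + D_VISUAL, 256, batch_first=True)
        self.fc = nn.Linear(256, N_CLASSES)

    def forward(self, audio_emb, visual_emb):
        x = torch.cat([audio_emb, visual_emb], dim=-1)   # (B, T, D_a + D_v)
        _, h = self.rnn(x)                               # final hidden state
        return self.fc(h[-1])                            # (B, N_CLASSES)


model = AVClassifier()
audio = torch.randn(8, T, D_AUDIO)       # stand-in for precomputed VGGish features
visual = torch.randn(8, T, D_VISUAL)     # stand-in for precomputed VGG16 features
labels = torch.randint(0, N_CLASSES, (8,))
loss = nn.functional.cross_entropy(model(audio, visual), labels)
print(loss.item())
```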