2023-12-03

cs.CV

cs.CV - 2023-12-03

Robust Computer Vision in an Ever-Changing World: A Survey of Techniques for Tackling Distribution Shifts

paper_url: http://arxiv.org/abs/2312.01540
repo_url: None
paper_authors: Eashan Adhikarla, Kai Zhang, Jun Yu, Lichao Sun, John Nicholson, Brian D. Davison
for: This paper aims to address the challenging problem of distribution shift in computer vision models, which can lead to a gap between theoretical assumptions and real-world performance.
methods: The paper explores various data-centric techniques to address distribution shift, including data augmentation strategies and training mechanisms such as transfer learning and zero-shot learning.
results: The paper provides an in-depth overview of distribution shifts, their distinctions, and techniques to address them, with a focus on the robustness of machine learning models for computer vision applications.

Abstract
AI applications are becoming increasingly visible to the general public. There is a notable gap between the theoretical assumptions researchers make about computer vision models and the reality those models face when deployed in the real world. One of the critical reasons for this gap is a challenging problem known as distribution shift. Distribution shifts tend to vary with the complexity of the data, dataset size, and application type. In our paper, we discuss the identification of such a prominent gap, exploring the concept of distribution shift and its critical significance. We provide an in-depth overview of various types of distribution shifts, elucidate their distinctions, and explore techniques within the realm of the data-centric domain employed to address them. Distribution shifts can occur during every phase of the machine learning pipeline, from the data collection stage to the stage of training a machine learning model to the stage of final model deployment. This raises concerns about the overall robustness of the machine learning techniques for computer vision applications that are deployed publicly for consumers. Data-centric methods include deep learning models tailored to specific types of data and tasks; architectural pipelines, where variations in data preprocessing and feature extraction can impact robustness; data augmentation strategies (e.g., geometric, synthetic, and learning-based), which enhance model generalization; and training mechanisms (e.g., transfer learning, zero-shot learning). Each of these components forms an integral part of the neural networks we analyze, contributing uniquely to strengthening model robustness against distribution shifts. We compare and contrast numerous AI models that are built for mitigating shifts in hidden stratification and spurious correlations, ...
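
The abstract lists geometric data augmentation among the data-centric techniques. As a minimal sketch of that idea only (the function names and the 50% flip probability are assumptions for illustration, not taken from the survey), an image stored as a list of rows can be randomly mirrored and rotated by quarter turns:

```python
import random

def hflip(img):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img, rng):
    """Apply a random flip, then a random number of quarter turns (assumed policy)."""
    out = [list(row) for row in img]
    if rng.random() < 0.5:
        out = hflip(out)
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    return out

img = [[1, 2, 3],
       [4, 5, 6]]
print(hflip(img))   # [[3, 2, 1], [6, 5, 4]]
print(rot90(img))   # [[4, 1], [5, 2], [6, 3]]
print(augment(img, random.Random(0)))
```

Each call to `augment` yields a geometrically transformed copy with the same pixel values, so a model trained on such copies sees more pose variation than the raw dataset contains.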

CalliPaint: Chinese Calligraphy Inpainting with Diffusion Model

paper_url: http://arxiv.org/abs/2312.01536
repo_url: None
paper_authors: Qisheng Liao, Zhinuo Wang, Muhammad Abdul-Mageed, Gus Xia
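for: This study presents a model for Chinese calligraphy inpainting, intended for use in art and education.
methods: The proposed model, CalliPaint, builds on recent advances in both Chinese calligraphy generation and image inpainting.
results: The authors demonstrate that CalliPaint can produce convincing Chinese calligraphy.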

Abstract
Chinese calligraphy can be viewed as a unique form of visual art. Recent advancements in computer vision hold significant potential for the future development of generative models in the realm of Chinese calligraphy. Nevertheless, methods of Chinese calligraphy inpainting, which can be effectively used in the art and education fields, remain relatively unexplored. In this paper, we introduce a new model that harnesses recent advancements in both Chinese calligraphy generation and image inpainting. We demonstrate that our proposed model CalliPaint can produce convincing Chinese calligraphy.

SANeRF-HQ: Segment Anything for NeRF in High Quality

paper_url: http://arxiv.org/abs/2312.01531
repo_url: None
paper_authors: Yichen Liu, Benran Hu, Chi-Keung Tang, Yu-Wing Tai
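for: The paper aims to achieve high-quality 3D segmentation of any object in a given scene.
methods: The method uses SAM for open-world object segmentation guided by user-supplied prompts, and uses NeRF to aggregate information from different viewpoints.
results: On multiple NeRF datasets the method shows a significant quality improvement over previous methods, offers greater flexibility for object localization, and yields more consistent object segmentation across multiple views.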

Abstract
Recently, the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there exist initial attempts to incorporate these two methods into 3D segmentation, they face the challenge of accurately and consistently segmenting objects in complex scenarios. In this paper, we introduce the Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high quality 3D segmentation of any object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts, while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges, we employ density field and RGB similarity to enhance the accuracy of segmentation boundary during the aggregation. Emphasizing on segmentation accuracy, we evaluate our method quantitatively on multiple NeRF datasets where high-quality ground-truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over previous state-of-the-art methods in NeRF object segmentation, provides higher flexibility for object localization, and enables more consistent object segmentation across multiple views. Additional information can be found at https://lyclyc52.github.io/SANeRF-HQ/.

G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training

paper_url: http://arxiv.org/abs/2312.01522
repo_url: None
paper_authors: Che Liu, Cheng Ouyang, Sibo Cheng, Anand Shah, Wenjia Bai, Rossella Arcucci
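for: The paper aims to improve fine-grained visual feature learning in medical vision-language pre-training (VLP).
methods: It proposes a new VLP framework, G2D, which learns dense features through a pseudo segmentation task run in parallel with global vision-language alignment.
results: G2D performs strongly across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which requires fine-grained features. In that task it surpasses peer models even when fine-tuned with only 1% of the training data.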

Abstract
Recently, medical vision-language pre-training (VLP) has made substantial progress in learning global visual representations from medical images and their paired radiology reports. However, medical imaging tasks in the real world usually require finer granularity in visual features. These tasks include visual localization tasks (e.g., semantic segmentation, object detection) and visual grounding tasks. Yet, current medical VLP methods face challenges in learning these fine-grained features, as they primarily focus on brute-force alignment between image patches and individual text tokens for local visual feature learning, which is suboptimal for downstream dense prediction tasks. In this work, we propose a new VLP framework, named Global to Dense level representation learning (G2D), that achieves significantly improved granularity and more accurate grounding for the learned features, compared to existing medical VLP approaches. In particular, G2D learns dense and semantically-grounded image representations via a pseudo segmentation task parallel with the global vision-language alignment. Notably, generating pseudo segmentation targets does not incur extra trainable parameters: they are obtained on the fly during VLP with a parameter-free processor. G2D achieves superior performance across 6 medical imaging tasks and 25 diseases, particularly in semantic segmentation, which necessitates fine-grained, semantically-grounded image features. In this task, G2D surpasses peer models even when fine-tuned with just 1% of the training data, compared to the 100% used by these models. The code will be released upon acceptance.

Tracing Hyperparameter Dependencies for Model Parsing via Learnable Graph Pooling Network

paper_url: http://arxiv.org/abs/2312.02224
repo_url: None
paper_authors: Xiao Guo, Vishal Asnani, Sijia Liu, Xiaoming Liu
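for: The study aims to predict the hyperparameters of a generative model (GM), given a generated image as input.
methods: The authors propose a model parsing method, the Learnable Graph Pooling Network (LGPN), which casts model parsing as a graph node classification task in which nodes represent hyperparameters and edges represent the dependencies among them.
results: Experiments show state-of-the-art results in model parsing and its extended applications. The authors state that their source code is available.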

Abstract
Model Parsing defines the research task of predicting hyperparameters of the generative model (GM), given a generated image as input. Since a diverse set of hyperparameters is jointly employed by the generative model, and dependencies often exist among them, it is crucial to learn these hyperparameter dependencies to improve model parsing performance. To explore such important dependencies, we propose a novel model parsing method called Learnable Graph Pooling Network (LGPN). Specifically, we transform model parsing into a graph node classification task, using graph nodes and edges to represent hyperparameters and their dependencies, respectively. Furthermore, LGPN incorporates a learnable pooling-unpooling mechanism tailored to model parsing, which adaptively learns hyperparameter dependencies of GMs used to generate the input image. We also extend our proposed method to CNN-generated image detection and coordinate attacks detection. Empirically, we achieve state-of-the-art results in model parsing and its extended applications, showing the effectiveness of our method. Our source code is available.
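
To make the "graph node classification" framing concrete, here is a minimal sketch of one generic graph-convolution step followed by a per-node class decision. It is not LGPN itself: the learnable pooling-unpooling mechanism is omitted, and the node meanings, features, and weights below are hypothetical values chosen for illustration.

```python
def normalized_adjacency(n, edges):
    """Row-normalised adjacency with self-loops: each node averages itself and its neighbours."""
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def classify_nodes(features, edges, weights):
    """One propagation step, a linear map, then an argmax over class scores per node."""
    a_hat = normalized_adjacency(len(features), edges)
    scores = matmul(matmul(a_hat, features), weights)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Hypothetical nodes: 0 = "uses L1 loss", 1 = "uses batch norm", 2 = "layer-count bucket".
features = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # per-node evidence for class 0 / class 1
weights = [[1.0, 0.0], [0.0, 1.0]]                # identity, standing in for learned weights
print(classify_nodes(features, [], weights))        # [0, 1, 1]
print(classify_nodes(features, [(0, 1)], weights))  # [0, 0, 1]
```

Adding the edge between nodes 0 and 1 changes the prediction for node 1, which is the sense in which a dependency between two hyperparameters can influence how each one is classified.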

CityGen: Infinite and Controllable 3D City Layout Generation

paper_url: http://arxiv.org/abs/2312.01508
repo_url: None
paper_authors: Jie Deng, Wenhao Chai, Jianshu Guo, Qixuan Huang, Wenhao Hu, Jenq-Neng Hwang, Gaoang Wang
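for: The study proposes a method for infinite, diverse, and controllable 3D city layout generation, with applications such as smart cities, urban planning, and digital simulation.
methods: An outpainting pipeline extends a local layout into an infinite city layout, and a multi-scale diffusion model generates diverse and controllable local semantic layout patches.
results: CityGen achieves state-of-the-art performance under FID and KID in generating infinite and controllable 3D city layouts.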

Abstract
City layout generation has recently gained significant attention. The goal of this task is to automatically generate the layout of a city scene, including elements such as roads, buildings, vegetation, as well as other urban infrastructures. Previous methods using VAEs or GANs for 3D city layout generation offer limited diversity and constrained interactivity, only allowing users to selectively regenerate parts of the layout, which greatly limits customization. In this paper, we propose CityGen, a novel end-to-end framework for infinite, diverse and controllable 3D city layout generation. First, we propose an outpainting pipeline to extend the local layout to an infinite city layout. Then, we utilize a multi-scale diffusion model to generate diverse and controllable local semantic layout patches. The extensive experiments show that CityGen achieves state-of-the-art (SOTA) performance under FID and KID in generating an infinite and controllable 3D city layout. CityGen demonstrates promising applicability in fields like smart cities, urban planning, and digital simulation.
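
The outpainting step can be pictured as repeatedly generating a new patch conditioned on the edge of the layout built so far. The sketch below shows only that control flow; `generate_patch` is a hypothetical stand-in that copies the boundary labels, where the paper would use its diffusion model.

```python
def generate_patch(context_column, width):
    """Hypothetical stand-in for the generative model: continue each row
    with the label found in the overlapping context column."""
    return [[label] * width for label in context_column]

def outpaint_right(layout, steps, width):
    """Extend a 2D semantic layout to the right, one patch per step."""
    rows = [list(row) for row in layout]
    for _ in range(steps):
        context = [row[-1] for row in rows]      # boundary shared with the existing layout
        patch = generate_patch(context, width)
        for row, new_cells in zip(rows, patch):
            row.extend(new_cells)
    return rows

layout = [["road", "building"],
          ["road", "park"]]
for row in outpaint_right(layout, steps=2, width=2):
    print(row)
```

Because each step only needs the current boundary as context, the loop can in principle run for as many steps as desired, which is what makes the layout unbounded in extent.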

GAPS: Geometry-Aware, Physics-Based, Self-Supervised Neural Garment Draping

paper_url: http://arxiv.org/abs/2312.01490
repo_url: https://github.com/Simonhfls/GAPS
paper_authors: Ruochen Chen, Liming Chen, Shaifali Parashar
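for: The study targets neural, physics-based garment draping that avoids physically implausible stretching and garments being pushed inside the body.
methods: It adds a collision-aware geometrical constraint to the existing formulation, imposing garment inextensibility wherever possible, and proposes a geometry-aware garment skinning method based on a body-garment closeness measure.
results: Draped clothes stretch only where they cover bigger body regions, and the skinning method works for all garment types, especially loose ones.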

Abstract
Recent neural, physics-based modeling of garment deformations allows faster and visually aesthetic results as opposed to the existing methods. Material-specific parameters are used by the formulation to control the garment inextensibility. This delivers unrealistic results with physically implausible stretching. Oftentimes, the draped garment is pushed inside the body which is either corrected by an expensive post-processing, thus adding to further inconsistent stretching; or by deploying a separate training regime for each body type, restricting its scalability. Additionally, the flawed skinning process deployed by existing methods produces incorrect results on loose garments. In this paper, we introduce a geometrical constraint to the existing formulation that is collision-aware and imposes garment inextensibility wherever possible. Thus, we obtain realistic results where draped clothes stretch only while covering bigger body regions. Furthermore, we propose a geometry-aware garment skinning method by defining a body-garment closeness measure which works for all garment types, especially the loose ones.

Computer Vision for Increased Operative Efficiency via Identification of Instruments in the Neurosurgical Operating Room: A Proof-of-Concept Study

paper_url: http://arxiv.org/abs/2312.03001
repo_url: None
paper_authors: Tanner J. Zachem, Sully F. Chen, Vishal Venkatraman, David AW Sykes, Ravi Prakash, Samantha Spellicy, Alexander D Suarez, Weston Ross, Patrick J. Codd
for: The paper aims to develop a computer vision algorithm that identifies surgical instruments in the neurosurgical operating room, with the goal of improving instrument tracking and management, reducing waste, and optimizing surgical tray packing.
methods: The authors collected 1660 images of 27 commonly used neurosurgical instruments and trained a U-Net Convolutional Neural Network using 5-fold cross validation.
results: The U-Net achieved an accuracy of 80-100% in identifying 25 classes of instruments, with 19/25 classes having an accuracy of over 90%. However, the model was less accurate at sub-classifying certain types of forceps.

Abstract
Objectives: Computer vision (CV) is a field of artificial intelligence that enables machines to interpret and understand images and videos. CV has the potential to be of assistance in the operating room (OR) to track surgical instruments. We built a CV algorithm for identifying surgical instruments in the neurosurgical operating room as a potential solution for surgical instrument tracking and management to decrease surgical waste and opening of unnecessary tools. Methods: We collected 1660 images of 27 commonly used neurosurgical instruments. Images were labeled using the VGG Image Annotator and split into 80% training and 20% testing sets in order to train a U-Net Convolutional Neural Network using 5-fold cross validation. Results: Our U-Net achieved a tool identification accuracy of 80-100% when distinguishing 25 classes of instruments, with 19/25 classes having accuracy over 90%. The model performance was not adequate for sub-classifying Adson, Gerald, and Debakey forceps, which had accuracies of 60-80%. Conclusions: We demonstrated the viability of using machine learning to accurately identify surgical instruments. Instrument identification could help optimize surgical tray packing, decrease tool usage and waste, decrease incidence of instrument misplacement events, and assist in timing of routine instrument maintenance. More training data will be needed to increase accuracy across all surgical instruments that would appear in a neurosurgical operating room. Such technology has the potential to be used as a method for proving what tools are truly needed in each type of operation, allowing surgeons across the world to do more with less.
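
The data protocol in the Methods (an 80/20 split plus 5-fold cross validation) can be sketched as follows. The abstract does not say how the two were combined; this sketch assumes the folds were drawn from the 80% training portion, and the seed and integer image IDs are placeholders.

```python
import random

def train_test_split(items, test_fraction, rng):
    """Shuffle a copy of the items and hold out the given fraction for testing."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(items, k):
    """Yield (train, validation) pairs; every item is used for validation exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

image_ids = list(range(1660))                      # placeholder IDs for the 1660 images
train_ids, test_ids = train_test_split(image_ids, 0.2, random.Random(0))
print(len(train_ids), len(test_ids))               # 1328 332
for fold_train, fold_val in k_fold(train_ids, 5):
    print(len(fold_train), len(fold_val))
```

Under this assumption the 332 held-out images are never seen during any of the five training runs, so they give an estimate of accuracy that is independent of fold selection.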

InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

paper_url: http://arxiv.org/abs/2312.02222
repo_url: None
paper_authors: Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Yebin Liu
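for: The work aims to improve the fidelity and efficiency of digital head avatar creation and to address the shape distortion, expression inaccuracy, and identity flickering seen in methods based on 2D or 3D generative models.
methods: It proposes Incremental 3D GAN Inversion, a framework that increases fidelity from multiple frames, together with an animatable 3D GAN prior for better expression controllability, a neural texture encoder based on UV parameterization, and ConvGRU-based recurrent networks for aggregating multiple frames.
results: The method demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.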

Abstract
While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, Incremental 3D GAN Inversion, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.

Efficient Incremental Potential Contact for Actuated Face Simulation

paper_url: http://arxiv.org/abs/2312.02999
repo_url: None
paper_authors: Bo Li, Lingchen Yang, Barbara Solenthaler
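for: A quasi-static finite element simulator for human face animation.
methods: The face is modeled as an actuated soft body simulated with Projective Dynamics (PD), and Incremental Potential Contact (IPC) is adopted to handle self-intersection.
results: With the proposed optimization of collision handling, the simulator achieves high visual fidelity at a relatively low performance overhead.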

Abstract
We present a quasi-static finite element simulator for human face animation. We model the face as an actuated soft body, which can be efficiently simulated using Projective Dynamics (PD). We adopt Incremental Potential Contact (IPC) to handle self-intersection. However, directly integrating IPC into the simulation would impede the high efficiency of the PD solver, since the stiffness matrix in the global step is no longer constant and cannot be pre-factorized. We notice that the actual number of vertices affected by the collision is only a small fraction of the whole model, and by utilizing this fact we effectively decrease the scale of the linear system to be solved. With the proposed optimization method for collision, we achieve high visual fidelity at a relatively low performance overhead.

Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction

paper_url: http://arxiv.org/abs/2312.02221
repo_url: https://github.com/yizhiwang96/Slice3D
paper_authors: Yizhi Wang, Wallace Lira, Wenqi Wang, Ali Mahdavi-Amiri, Hao Zhang
for: Single-view 3D reconstruction.
methods: Multi-slice reasoning, a coordinate-based transformer network, and a U-Net based network.
results: Superior recovery of complex and severely occluded shape structures, amid ambiguities, with an inference time of less than 20 seconds.

Abstract
We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel through any occluders without obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB image and then integrates the slices into a 3D model using a coordinate-based transformer network for signed distance prediction. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate superiority of our method, especially in recovering complex and severely occluded shape structures, amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time less than 20 seconds.
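
The claim that slicing reveals occluded structure can be illustrated on a toy labelled voxel grid. This is a sketch under assumed conventions (volume indexed as `[depth][row][column]` with depth 0 nearest the camera, and a depth divisible by the slice count); it illustrates the observation only and is not the Slice3D network.

```python
def render_front(volume):
    """Each pixel shows the label of the nearest non-empty voxel (0 = background)."""
    height, width = len(volume[0]), len(volume[0][0])
    image = [[0] * width for _ in range(height)]
    for layer in reversed(volume):               # paint far to near so near voxels win
        for y in range(height):
            for x in range(width):
                if layer[y][x]:
                    image[y][x] = layer[y][x]
    return image

def render_slices(volume, num_slices):
    """Cut the volume into equal slabs along depth and render each slab on its own."""
    step = len(volume) // num_slices
    return [render_front(volume[i * step:(i + 1) * step]) for i in range(num_slices)]

# Part 1 sits in front and hides part 2, which lies directly behind it.
volume = [[[1, 1]],
          [[2, 0]]]
print(render_front(volume))        # [[1, 1]]  (part 2 is invisible)
print(render_slices(volume, 2))    # [[[1, 1]], [[2, 0]]]  (part 2 appears in the second slice)
```

No choice of camera in front of part 1 would show part 2 here, whereas the second slice exposes it directly, which mirrors the abstract's argument for slicing over view changes.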

QuantAttack: Exploiting Dynamic Quantization to Attack Vision Transformers

paper_url: http://arxiv.org/abs/2312.02220
repo_url: None
paper_authors: Amit Baras, Alon Zolfi, Yuval Elovici, Asaf Shabtai
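for: The paper presents a new attack that targets the availability of quantized models, slowing down inference and increasing memory usage and energy consumption.
methods: The attack, QuantAttack, crafts adversarial examples that trigger worst-case performance in dynamically quantized models at inference time.
results: Experiments demonstrate the attack's effectiveness on vision transformers across a wide range of uni-modal and multi-modal tasks, and examine different attack variants (e.g., a universal perturbation) and transferability between models.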

Abstract
In recent years, there has been a significant trend in deep neural networks (DNNs), particularly transformer-based models, of developing ever-larger and more capable models. While they demonstrate state-of-the-art performance, their growing scale requires increased computational resources (e.g., GPUs with greater memory capacity). To address this problem, quantization techniques (i.e., low-bit-precision representation and matrix multiplication) have been proposed. Most quantization techniques employ a static strategy in which the model parameters are quantized, either during training or inference, without considering the test-time sample. In contrast, dynamic quantization techniques, which have become increasingly popular, adapt during inference based on the input provided, while maintaining full-precision performance. However, their dynamic behavior and average-case performance assumption make them vulnerable to a novel threat vector -- adversarial attacks that target the model's efficiency and availability. In this paper, we present QuantAttack, a novel attack that targets the availability of quantized models, slowing down the inference, and increasing memory usage and energy consumption. We show that carefully crafted adversarial examples, which are designed to exhaust the resources of the operating system, can trigger worst-case performance. In our experiments, we demonstrate the effectiveness of our attack on vision transformers on a wide range of tasks, both uni-modal and multi-modal. We also examine the effect of different attack variants (e.g., a universal perturbation) and the transferability between different models.
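
To show what "adapting during inference based on the input" can look like, here is a minimal sketch of symmetric int8 quantization whose scale is computed from each input at run time. The second function assumes, for illustration only, a scheme that routes large-magnitude values to a higher-precision path; the abstract does not name a specific scheme, and the threshold of 6.0 is a hypothetical value.

```python
def dynamic_quantize(values):
    """Symmetric int8 quantization with a scale derived from the input itself."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate real values."""
    return [q * scale for q in quantized]

def split_by_magnitude(values, threshold):
    """Indices handled in int8 versus indices sent to the assumed high-precision path."""
    regular = [i for i, v in enumerate(values) if abs(v) < threshold]
    outliers = [i for i, v in enumerate(values) if abs(v) >= threshold]
    return regular, outliers

benign = [0.5, -1.2, 3.0, 0.1]
perturbed = [0.5, -7.5, 9.0, 0.1]          # hypothetical adversarially shifted activations
print(split_by_magnitude(benign, 6.0))     # ([0, 1, 2, 3], [])
print(split_by_magnitude(perturbed, 6.0))  # ([0, 3], [1, 2])
```

Under that assumption, the amount of expensive high-precision work depends on the input, so an input crafted to produce many large-magnitude values pushes the model toward its worst-case cost.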

Diffusion Posterior Sampling for Nonlinear CT Reconstruction

paper_url: http://arxiv.org/abs/2312.01464
repo_url: None
paper_authors: Shudong Li, Matthew Tivnan, Yuan Shen, J. Webster Stayman
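for: CT image reconstruction and restoration.
methods: Diffusion posterior sampling, in which a score-based diffusion prior is combined with a measurement likelihood model derived from the nonlinear physical model.
results: The method is demonstrated on fully sampled low-dose data and sparse-view geometries, and it can be applied to multiple CT system designs with different forward models without additional training.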

Abstract
Diffusion models have been demonstrated as powerful deep learning tools for image generation in CT reconstruction and restoration. Recently, diffusion posterior sampling, where a score-based diffusion prior is combined with a likelihood model, has been used to produce high quality CT images given low-quality measurements. This technique is attractive since it permits a one-time, unsupervised training of a CT prior; which can then be incorporated with an arbitrary data model. However, current methods only rely on a linear model of x-ray CT physics to reconstruct or restore images. While it is common to linearize the transmission tomography reconstruction problem, this is an approximation to the true and inherently nonlinear forward model. We propose a new method that solves the inverse problem of nonlinear CT image reconstruction via diffusion posterior sampling. We implement a traditional unconditional diffusion model by training a prior score function estimator, and apply Bayes rule to combine this prior with a measurement likelihood score function derived from the nonlinear physical model to arrive at a posterior score function that can be used to sample the reverse-time diffusion process. This plug-and-play method allows incorporation of a diffusion-based prior with generalized nonlinear CT image reconstruction into multiple CT system designs with different forward models, without the need for any additional training. We develop the algorithm that performs this reconstruction, including an ordered-subsets variant for accelerated processing and demonstrate the technique in both fully sampled low dose data and sparse-view geometries using a single unsupervised training of the prior.
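
The Bayes-rule step can be written in score form as grad log p(x|y) = grad log p(x) + grad log p(y|x). The sketch below computes the likelihood term for a nonlinear transmission model of the form I0 * exp(-Ax). Several things here are assumptions made for illustration: the Gaussian noise model, the tiny dense matrix `A`, the omission of the diffusion time variable, and the prior score passed in as a plain function in place of a trained network.

```python
import math

def forward(A, x, i0):
    """Nonlinear transmission model: expected signal i0 * exp(-(A x)_i) for each ray i."""
    return [i0 * math.exp(-sum(a * xj for a, xj in zip(row, x))) for row in A]

def log_likelihood(A, x, y, i0, sigma):
    """Gaussian log-likelihood of measurements y, up to an additive constant (assumed noise model)."""
    return -sum((yi - fi) ** 2 for yi, fi in zip(y, forward(A, x, i0))) / (2 * sigma ** 2)

def likelihood_score(A, x, y, i0, sigma):
    """Gradient of log_likelihood with respect to x."""
    f = forward(A, x, i0)
    # d f_i / d x_j = -f_i * A[i][j], so the chain rule gives the sum below.
    return [
        -sum(A[i][j] * f[i] * (y[i] - f[i]) for i in range(len(A))) / sigma ** 2
        for j in range(len(x))
    ]

def posterior_score(prior_score, A, x, y, i0, sigma):
    """Bayes' rule in score form: prior score plus likelihood score."""
    return [p + l for p, l in zip(prior_score(x), likelihood_score(A, x, y, i0, sigma))]

A = [[1.0, 0.5], [0.2, 1.0]]               # toy projection matrix (2 rays, 2 pixels)
x = [0.3, 0.4]                             # current image estimate
y = [50.0, 60.0]                           # toy measurements
toy_prior = lambda v: [-t for t in v]      # score of a standard Gaussian, standing in for a network
print(posterior_score(toy_prior, A, x, y, i0=100.0, sigma=2.0))
```

Because only `forward` and its derivative encode the scanner physics, swapping in a different forward model changes the likelihood term while the prior stays untouched, which is the plug-and-play property the abstract describes.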

Towards an accurate and generalizable multiple sclerosis lesion segmentation model using self-ensembled lesion fusion

paper_url: http://arxiv.org/abs/2312.01460
repo_url: None
paper_authors: Jinwei Zhang, Lianrui Zuo, Blake E. Dewey, Samuel W. Remedios, Dzung L. Pham, Aaron Carass, Jerry L. Prince
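for: To build an accurate and generalizable automatic multiple sclerosis (MS) lesion segmentation model, improving efficiency and reproducibility over manual delineation.
methods: The well-known U-Net architecture is used without further modification, together with a novel test-time self-ensembled lesion fusion strategy.
results: The model achieved the best performance on the ISBI 2015 MS segmentation challenge data and generalized well to clinical test datasets from different scanners.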

Abstract
Automatic multiple sclerosis (MS) lesion segmentation using multi-contrast magnetic resonance (MR) images provides improved efficiency and reproducibility compared to manual delineation. Current state-of-the-art automatic MS lesion segmentation methods utilize modified U-Net-like architectures. However, in the literature, dedicated architecture modifications were always required to maximize their performance. In addition, the best-performing methods have not proven to be generalizable to diverse test datasets with contrast variations and image artifacts. In this work, we developed an accurate and generalizable MS lesion segmentation model using the well-known U-Net architecture without further modification. A novel test-time self-ensembled lesion fusion strategy is proposed that not only achieved the best performance using the ISBI 2015 MS segmentation challenge data but also demonstrated robustness across various self-ensemble parameter choices. Moreover, equipped with instance normalization rather than batch normalization widely used in literature, the model trained on the ISBI challenge data generalized well on clinical test datasets from different scanners.
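
The abstract does not spell out the fusion rule, so the following is only a generic test-time self-ensemble sketch: the same model is run on the image and on a mirrored copy, the mirrored output is flipped back, and the two probability maps are averaged. `toy_predict` is a hypothetical stand-in for a trained segmentation network, given a deliberate left-edge bias so the effect is visible.

```python
def hflip(img):
    """Mirror a 2D map (list of rows) left to right."""
    return [row[::-1] for row in img]

def self_ensemble(predict, image):
    """Average lesion probabilities over the image and its mirrored copy."""
    direct = predict(image)
    mirrored = hflip(predict(hflip(image)))      # undo the flip on the output
    return [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(direct, mirrored)]

def toy_predict(image):
    """Hypothetical model: intensity as probability, halved in the leftmost column."""
    return [[v * (0.5 if x == 0 else 1.0) for x, v in enumerate(row)] for row in image]

image = [[0.5, 0.25, 0.5]]
print(toy_predict(image))                    # [[0.25, 0.25, 0.5]]
print(self_ensemble(toy_predict, image))     # [[0.375, 0.25, 0.375]]
```

The input is symmetric but the single-pass prediction is not; averaging over the flip restores the symmetry, which shows how a self-ensemble can cancel orientation-dependent errors without retraining.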

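Summary
Automatic multiple sclerosis (MS) lesion segmentation from multi-contrast magnetic resonance (MR) images is more efficient and reproducible than manual delineation. Current state-of-the-art automatic MS lesion segmentation methods typically use modified U-Net-like architectures, and in the literature dedicated architecture modifications have always been needed to maximize performance. In addition, the best-performing methods have not been shown to generalize to diverse test datasets. In this work, we develop an accurate and generalizable MS lesion segmentation model using the well-known U-Net architecture without modification. We also propose a novel test-time self-ensembled lesion fusion strategy that not only achieves the best performance on the ISBI 2015 MS segmentation challenge data but also remains stable across different self-ensemble parameter choices. Finally, by using instance normalization rather than the batch normalization widely used in the literature, the model trained on the ISBI challenge data generalizes well to clinical test datasets from different scanners.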
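
To make the idea of test-time self-ensembling concrete, here is a minimal sketch of one common variant: predict on flipped copies of the input, map each prediction back, average, and threshold. The listing does not describe the paper's actual lesion fusion rule, so the flip set, the averaging, the `threshold` parameter, and the `toy_model` stand-in are all assumptions for illustration.

```python
from typing import Callable, List

Image = List[List[float]]


def flip_lr(img: Image) -> Image:
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_ud(img: Image) -> Image:
    """Reverse the row order (up-down flip)."""
    return img[::-1]


def self_ensemble(model: Callable[[Image], Image], img: Image,
                  threshold: float = 0.5) -> List[List[int]]:
    """Average predictions over flipped copies, then threshold to a mask."""
    # Each flip is its own inverse, so the same function undoes it.
    transforms = [lambda x: x, flip_lr, flip_ud]
    probs = []
    for t in transforms:
        pred = model(t(img))   # predict on the transformed image
        probs.append(t(pred))  # map the prediction back to the original frame
    h, w = len(img), len(img[0])
    mean = [[sum(p[r][c] for p in probs) / len(probs) for c in range(w)]
            for r in range(h)]
    return [[1 if v >= threshold else 0 for v in row] for row in mean]


def toy_model(img: Image) -> Image:
    # Hypothetical stand-in for a trained U-Net: clamps intensities to [0, 1].
    return [[min(1.0, max(0.0, v)) for v in row] for row in img]
```

Calling `self_ensemble(toy_model, image)` returns a binary mask with the same shape as `image`. A real pipeline would replace `toy_model` with the trained network and could use a richer set of transforms.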

Looking Inside Out: Anticipating Driver Intent From Videos

paper_url: http://arxiv.org/abs/2312.01444
repo_url: None
paper_authors: Yung-chi Kung, Arthur Zhang, Junmin Wang, Joydeep Biswas
for: Improve the prediction of future driver actions by using in-cabin and external camera data, advancing state-of-the-art (SOTA) performance.
methods: The method explicitly extracts object and road-level features from external camera data, which are important for predicting driver intention. These handcrafted features are used as inputs for both a transformer and an LSTM-based architecture.
results: The models predict driver maneuvers more accurately and earlier than existing approaches, with an accuracy of 87.5% and an average prediction time of 4.35 seconds before the maneuver takes place.

Abstract
Anticipating driver intention is an important task when vehicles of mixed and varying levels of human/machine autonomy share roadways. Driver intention can be leveraged to improve road safety, such as warning surrounding vehicles in the event the driver is attempting a dangerous maneuver. In this work, we propose a novel method of utilizing in-cabin and external camera data to improve state-of-the-art (SOTA) performance in predicting future driver actions. Compared to existing methods, our approach explicitly extracts object and road-level features from external camera data, which we demonstrate are important features for predicting driver intention. Using our handcrafted features as inputs for both a transformer and an LSTM-based architecture, we empirically show that jointly utilizing in-cabin and external features improves performance compared to using in-cabin features alone. Furthermore, our models predict driver maneuvers more accurately and earlier than existing approaches, with an accuracy of 87.5% and an average prediction time of 4.35 seconds before the maneuver takes place. We release our model configurations and training scripts on https://github.com/ykung83/Driver-Intent-Prediction


Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT

paper_url: http://arxiv.org/abs/2312.01435
repo_url: https://github.com/ssen7/histo_cap_transformers
paper_authors: Saurav Sengupta, Donald E. Brown
for: Automatic report generation for histopathology images
methods: Uses a pre-trained Vision Transformer and a pre-trained BERT model for report generation.
results: Achieved 79.98% accuracy in tissue type classification, 66.36% accuracy in classifying the sex of the patient, and a BLEU-4 score of 0.5818 on the caption generation task.

Abstract
Deep learning for histopathology has been successfully used for disease classification, image segmentation and more. However, combining image and text modalities using current state-of-the-art methods has been a challenge due to the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that using an existing pre-trained Vision Transformer in a two-step process of first using it to encode 4096x4096 sized patches of the Whole Slide Image (WSI) and then using it as the encoder and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for language modeling-based decoder for report generation, we can build a fairly performant and portable report generation mechanism that takes into account the whole of the high resolution image, instead of just the patches. Our method allows us to not only generate and evaluate captions that describe the image, but also helps us classify the image into tissue types and the gender of the patient as well. Our best performing model achieves a 79.98% accuracy in Tissue Type classification and 66.36% accuracy in classifying the sex of the patient the tissue came from, with a BLEU-4 score of 0.5818 in our caption generation task.

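Summary
Deep learning for histopathology has been successfully applied to disease classification, image segmentation, and other tasks. However, combining image and text modalities with existing methods is challenging because of the high resolution of histopathology images. Automatic report generation for histopathology images is one such challenge. In this work, we show that an existing pre-trained Vision Transformer can be used in a two-step process: first to encode 4096x4096 patches of the Whole Slide Image (WSI), and then as the encoder, paired with a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, to generate reports. This yields a fairly performant and portable report generation mechanism that takes the whole high-resolution image into account rather than only the patches. The method not only generates and evaluates captions that describe the image but also helps classify the image by tissue type and by the sex of the patient. Our best model achieves 79.98% accuracy in tissue type classification, 66.36% accuracy in classifying the sex of the patient, and a BLEU-4 score of 0.5818 on the caption generation task.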

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

paper_url: http://arxiv.org/abs/2312.01431
repo_url: https://github.com/qizhongtan/d2st-adapter
paper_authors: Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian
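for: This paper proposes a few-shot action recognition method based on adapting large pre-trained image models, in order to learn robust feature extractors, which are essential for few-shot learning.
methods: The method uses a new adapter tuning framework called D$^2$ST-Adapter, which has two pathways that encode spatial and temporal features separately. It also introduces a Deformable Spatio-Temporal Attention module that can model spatial and temporal features in the corresponding pathways, allowing D$^2$ST-Adapter to encode features with a global view in 3D spatio-temporal space while keeping a lightweight design.
results: Instantiated on pre-trained ResNet and ViT, the method outperforms state-of-the-art methods, particularly in scenarios with rich temporal dynamics.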

Abstract
Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. Typical fine-tuning based adaptation paradigm is prone to overfitting in the few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), a novel adapter tuning framework for few-shot action recognition, which is designed in a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. Furthermore, we devise the Deformable Spatio-Temporal Attention module as the core component of D$^2$ST-Adapter, which can be tailored to model both spatial and temporal features in corresponding pathways, allowing our D$^2$ST-Adapter to encode features in a global view in 3D spatio-temporal space while maintaining a lightweight design. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods for few-shot action recognition. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition.

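Summary
This paper presents a new adapter architecture, the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), for few-shot action recognition. The adapter has a dual-pathway architecture that separates spatial and temporal features. In addition, we develop a Deformable Spatio-Temporal Attention module for modeling spatial and temporal features. The module can be tailored to each pathway so that features are encoded with a global view in 3D spatio-temporal space. Experiments demonstrate the superiority of our method for few-shot action recognition, especially when temporal dynamics play an important role.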

WavePlanes: A compact Wavelet representation for Dynamic Neural Radiance Fields

paper_url: http://arxiv.org/abs/2312.02218
repo_url: https://github.com/azzarelli/waveplanes
paper_authors: Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull
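for: This paper aims to improve NeRF for dynamic scenes, where existing dynamic NeRF models are resource intensive and hard to compress. It proposes WavePlanes, a fast and more compact explicit model.
methods: The paper proposes a multi-scale space and space-time feature plane representation based on N-level 2-D wavelet coefficients, and uses the inverse discrete wavelet transform to reconstruct N feature signals. These feature signals are used to estimate color and density in a 4-D grid.
results: Compared with previous plane-based models, WavePlanes is up to 15x smaller, is less computationally demanding, and achieves comparable results in as little as one hour of training, without requiring custom CUDA code or high-performance computing resources. The paper also proposes new feature fusion schemes that offer greater interpretability.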

Abstract
Dynamic Neural Radiance Fields (Dynamic NeRF) enhance NeRF technology to model moving scenes. However, they are resource intensive and challenging to compress. To address this issue, this paper presents WavePlanes, a fast and more compact explicit model. We propose a multi-scale space and space-time feature plane representation using N-level 2-D wavelet coefficients. The inverse discrete wavelet transform reconstructs N feature signals at varying detail, which are linearly decoded to approximate the color and density of volumes in a 4-D grid. Exploiting the sparsity of wavelet coefficients, we compress a Hash Map containing only non-zero coefficients and their locations on each plane. This results in a compressed model size of ~12 MB. Compared with state-of-the-art plane-based models, WavePlanes is up to 15x smaller, less computationally demanding and achieves comparable results in as little as one hour of training - without requiring custom CUDA code or high performance computing resources. Additionally, we propose new feature fusion schemes that work as well as previously proposed schemes while providing greater interpretability. Our code is available at: https://github.com/azzarelli/waveplanes/

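Summary
Dynamic Neural Radiance Fields (Dynamic NeRF) can model moving scenes, but they are resource intensive and hard to compress. This paper presents WavePlanes, a fast and more compact explicit model. We propose a multi-scale space and space-time feature plane representation using N-level 2-D wavelet coefficients. The inverse discrete wavelet transform reconstructs N feature signals at varying levels of detail, which are linearly decoded to approximate the color and density of volumes in a 4-D grid. Exploiting the sparsity of the wavelet coefficients, we compress a hash map containing only the non-zero coefficients and their locations on each plane, which brings the model size down to about 12 MB. Compared with existing plane-based models, WavePlanes is up to 15x smaller, needs less computation, and reaches comparable results in as little as one hour of training. We also propose new feature fusion schemes that work as well as earlier schemes while offering greater interpretability. Our code is available at https://github.com/azzarelli/waveplanes/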
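
The storage idea in this entry (keep only non-zero wavelet coefficients and their locations, then invert the transform to recover the plane) can be illustrated with a small sketch. This is not the WavePlanes implementation: it uses a single-level, unnormalized Haar wavelet on one tiny plane, and every function name below is hypothetical. The paper works with N-level coefficients and its choice of wavelet is not stated in this listing.

```python
def haar_1d(v):
    """One Haar level: pairwise averages followed by pairwise differences."""
    half = len(v) // 2
    avg = [(v[2 * i] + v[2 * i + 1]) / 2 for i in range(half)]
    diff = [(v[2 * i] - v[2 * i + 1]) / 2 for i in range(half)]
    return avg + diff


def ihaar_1d(c):
    """Invert haar_1d: rebuild each pair from its average and difference."""
    half = len(c) // 2
    out = []
    for a, d in zip(c[:half], c[half:]):
        out += [a + d, a - d]
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def haar_2d(plane):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [haar_1d(r) for r in plane]
    return transpose([haar_1d(c) for c in transpose(rows)])


def ihaar_2d(coeffs):
    """Undo the column transform first, then the row transform."""
    cols = transpose([ihaar_1d(c) for c in transpose(coeffs)])
    return [ihaar_1d(r) for r in cols]


def to_sparse(coeffs, eps=1e-9):
    """Keep only non-zero coefficients, keyed by (row, col) location."""
    return {(r, c): v
            for r, row in enumerate(coeffs)
            for c, v in enumerate(row) if abs(v) > eps}


def from_sparse(sparse, shape):
    """Rebuild the dense coefficient grid, filling missing entries with zero."""
    h, w = shape
    return [[sparse.get((r, c), 0.0) for c in range(w)] for r in range(h)]
```

For a piecewise-constant 4x4 plane, `to_sparse(haar_2d(plane))` stores far fewer than 16 entries, and `ihaar_2d(from_sparse(sparse, (4, 4)))` reproduces the plane. Smooth feature planes behave similarly after small coefficients are thresholded, at the cost of some reconstruction error.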

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

paper_url: http://arxiv.org/abs/2312.01408
repo_url: None
paper_authors: Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You, Hongxia Yang, Mingyuan Zhou
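for: This paper studies in-context learning in the vision domain, where relevant visual demonstrations help a visual foundation model such as Stable Diffusion understand and carry out new image tasks.
methods: The paper proposes a diffusion-based visual foundation model called improved Prompt Diffusion (iPromptDiff). It converts the visual context into an embedding vector, which is then used to modulate the token embeddings of the text prompt. The approach is a response to the expensive pretraining and limiting frameworks that hamper existing methods.
results: Experiments show that iPromptDiff, combined with a standard ControlNet structure, is versatile and robust across a variety of training tasks and excels at in-context learning for novel vision tasks such as normal-to-image and image-to-line transformations.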

Abstract
In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest. Existing approaches in visual in-context learning frequently face hurdles such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limited adaptability to new tasks. In response to these challenges, we introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. This vector is subsequently used to modulate the token embeddings of text prompts. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks and excels in in-context learning for novel vision tasks, such as normal-to-image or image-to-line transformations. The effectiveness of these capabilities relies heavily on a deep visual understanding, which is achieved through relevant visual demonstrations processed by our proposed in-context learning architecture.

Summary
Following the success of in-context learning in large language models, its extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has attracted wide attention. Existing visual in-context learning methods frequently face challenges such as expensive pretraining, limiting frameworks, inadequate visual comprehension, and limited adaptability to new tasks. In response to these challenges, we introduce improved Prompt Diffusion (iPromptDiff) in this study. iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector. This vector is subsequently used to modulate the token embeddings of text prompts. We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks and excels in in-context learning for novel vision tasks, such as normal-to-image or image-to-line transformations. The effectiveness of these capabilities relies heavily on a deep visual understanding, which is achieved through relevant visual demonstrations processed by our proposed in-context learning architecture.

VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams

paper_url: http://arxiv.org/abs/2312.01407
repo_url: None
paper_authors: Liao Wang, Kaixin Yao, Chengcheng Guo, Zhirui Zhang, Qiang Hu, Jingyi Yu, Lan Xu, Minye Wu
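for: This paper aims to enable real-time rendering of dynamic scenes on commonly available devices.
methods: The authors represent the 4D radiance field as a 2D feature image stream and apply a tailored training scheme so that the stream can be compressed with 2D video codecs and decoded in real time using video hardware accelerators.
results: The authors propose a new rendering pipeline that renders dynamic scenes efficiently, and provide a real-time interactive player that streams and renders dynamic scenes on a range of devices, from desktops to mobile phones.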

Abstract
Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous devices remains challenging, due to data storage and computational constraints. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy, we show that the feature image stream can be efficiently compressed by 2D video codecs, which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand, based on the feature image stream, we propose a novel rendering pipeline for VideoRF, which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model, VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes, offering a seamless and immersive free-viewpoint experience across a range of devices, from desktops to mobile phones.

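Summary
NeRFs excel at rendering static scenes, but real-time rendering of long-duration dynamic scenes on ubiquitous devices remains limited by data storage and computation. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic scenes on mobile platforms. At its core is a serialized 2D feature image stream that represents the entire 4D radiance field. We propose a tailored training scheme, applied directly in this 2D domain, that imposes temporal and spatial redundancy on the feature image stream. By leveraging this redundancy, the feature image stream can be compressed efficiently with 2D video codecs, which lets us use video hardware accelerators for real-time decoding. In addition, based on the feature image stream, we propose a dedicated rendering pipeline that queries radiance properties efficiently. Paired with a deferred shading model, VideoRF achieves real-time rendering on mobile devices. We have developed a real-time interactive player that streams and renders dynamic scenes online on a range of devices, from desktops to mobile phones, offering a seamless and immersive free-viewpoint experience.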

Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective

paper_url: http://arxiv.org/abs/2312.01397
repo_url: https://github.com/unites-lab/vpns
paper_authors: Can Jin, Tianjin Huang, Yihua Zhang, Mykola Pechenizkiy, Sijia Liu, Shiwei Liu, Tianlong Chen
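for: This work examines the hardware cost of large-scale deep learning models and proposes a data-model co-design perspective to improve the sparsification of vision models.
methods: The work uses customized visual prompts to upgrade neural network sparsification and proposes a joint optimization approach that tunes the model and the input data together.
results: Experiments show that the VPNs framework delivers substantial performance improvements, and that the subnetworks it finds from pre-trained models transfer better.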

Abstract
The rapid development of large-scale deep learning models questions the affordability of hardware platforms, which necessitates the pruning to reduce their computational and memory footprints. Sparse neural networks, as the product, have demonstrated numerous favorable benefits like low complexity, undamaged generalization, etc. Most of the prominent pruning strategies are invented from a model-centric perspective, focusing on searching and preserving crucial weights by analyzing network topologies. However, the role of data and its interplay with model-centric pruning has remained relatively unexplored. In this research, we introduce a novel data-model co-design perspective: to promote superior weight sparsity by learning important model topology and adequate input data in a synergetic manner. Specifically, customized Visual Prompts are mounted to upgrade neural Network sparsification in our proposed VPNs framework. As a pioneering effort, this paper conducts systematic investigations about the impact of different visual prompts on model pruning and suggests an effective joint optimization approach. Extensive experiments with 3 network architectures and 8 datasets evidence the substantial performance improvements from VPNs over existing state-of-the-art pruning algorithms. Furthermore, we find that subnetworks discovered by VPNs from pre-trained models enjoy better transferability across diverse downstream scenarios. These insights shed light on new promising possibilities of data-model co-designs for vision model sparsification.

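Summary
The rapid development of large-scale deep learning models raises questions about the affordability of hardware platforms, which calls for pruning to reduce their computational and memory footprints. The resulting sparse neural networks have shown many benefits, such as low complexity and undamaged generalization. Most prominent pruning strategies take a model-centric perspective, focusing on searching for and preserving crucial weights, while the interplay between model and data has remained largely unexplored. In this research, we introduce a new data-model co-design perspective: promoting superior weight sparsity by learning the important model topology and suitable input data in a synergetic manner. Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework. As a pioneering effort, this paper systematically investigates how different visual prompts affect model pruning and proposes an effective joint optimization approach. Experiments with 3 network architectures and 8 datasets show significant performance improvements from VPNs, and subnetworks discovered by VPNs from pre-trained models transfer better. These findings point to new possibilities for data-model co-design in vision model sparsification.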
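
The two ingredients that this entry combines, a weight mask on the model side and a visual prompt on the data side, can each be sketched in a few lines. Both functions are illustrative assumptions: magnitude-based masking is a generic model-centric baseline, and a padded border is one common form of visual prompt. The listing does not say which prompt design or mask-learning rule VPNs actually uses, and the joint optimization itself is not shown.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)  # number of weights to prune
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def apply_pad_prompt(image, prompt, pad):
    """Place `image` inside `prompt`, a larger canvas whose border holds
    the (learnable) prompt values; `pad` is the border width in pixels."""
    out = [row[:] for row in prompt]  # copy so the prompt itself is untouched
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            out[r + pad][c + pad] = v
    return out
```

In a co-design setting, the prompt values and the mask would be updated together during training rather than fixed up front as they are here.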

Language-driven All-in-one Adverse Weather Removal

paper_url: http://arxiv.org/abs/2312.01381
repo_url: None
paper_authors: Hao Yang, Liyuan Pan, Yan Yang, Wei Liang
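for: This paper proposes a framework that handles multiple adverse weather conditions in a unified way, so that varied weather can be handled better in real-world applications.
methods: The paper uses a Language-driven Restoration framework (LDR) that leverages pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge, and selects restoration experts from a candidate list to adapt to different weather conditions.
results: Experiments show superior restoration performance across extensive restoration scenarios (see Fig. 1).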

Abstract
All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks jointly. To handle various weather conditions, an AiO framework is expected to adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However, existing methods: 1) rely on extra supervision signals, which are usually unknown in real-world applications; 2) employ fixed network structures, which restrict the diversity of weather-specific knowledge. In this paper, we propose a Language-driven Restoration framework (LDR) to alleviate the aforementioned issues. First, we leverage the power of pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence, type, and severity of degradation, generating description-based degradation priors. Then, with the guidance of degradation prior, we sparsely select restoration experts from a candidate list dynamically based on a Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the weather-specific and shared knowledge to handle various weather conditions (e.g., unknown or mixed weather). Experiments on extensive restoration scenarios show our superior performance (see Fig. 1). The source code will be made available.

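Summary
All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks. To handle different weather conditions, an AiO framework should adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However, existing methods 1) rely on extra supervision signals, which are usually unknown in real-world applications, and 2) use fixed network structures, which limits the diversity of weather-specific knowledge. In this paper, we propose a Language-driven Restoration framework (LDR) to address these issues. First, we use pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence, type, and severity of degradation and generating degradation priors. Then, guided by the degradation prior, we dynamically select restoration experts from a candidate list using a Mixture-of-Experts (MoE) structure. This lets us adaptively learn weather-specific and shared knowledge to handle various weather conditions (e.g., unknown or mixed weather). Experiments show superior performance (see Fig. 1). The source code will be made available.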
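
Sparse expert selection in a Mixture-of-Experts structure is often implemented as top-k routing. The sketch below shows that generic mechanism. It assumes the gate scores are already available as `gate_logits`, whereas in LDR they would be derived from the degradation prior, and the value of `k` and the toy experts are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(x, gate_logits, experts, k=2):
    """Route x through the k highest-scoring experts and mix their outputs.

    Returns the mixed output and the indices of the selected experts.
    """
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i],
                 reverse=True)[:k]
    # Renormalise the gate scores over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top
```

Only the selected experts are evaluated, so the cost per input stays roughly constant as the candidate list grows.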

A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling

paper_url: http://arxiv.org/abs/2312.02719
repo_url: None
paper_authors: Wentao Qu, Yuantian Shao, Lingwu Meng, Xiaoshui Huang, Liang Xiao
for: Enrich the representation produced by point cloud upsampling in order to improve the performance of downstream tasks such as classification and reconstruction.
methods: Directly models the gradient of the data distribution of dense point clouds, using a conditional denoising diffusion probability model (DDPM) for point cloud upsampling that aligns with a dual mapping paradigm.
results: In quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperformed existing methods and achieved state-of-the-art (SOTA) performance in terms of Chamfer Distance (CD) and Hausdorff Distance (HD).

Abstract
Point cloud upsampling (PCU) enriches the representation of raw point clouds, significantly improving the performance in downstream tasks such as classification and reconstruction. Most of the existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. In a different way, we dive deeper into directly modelling the gradient of data distribution from dense point clouds. In this paper, we proposed a conditional denoising diffusion probability model (DDPM) for point cloud upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a condition, and iteratively learns the transformation relationship between the dense point cloud and the noise. Simultaneously, PUDM aligns with a dual mapping paradigm to further improve the discernment of point features. In this context, PUDM enables learning complex geometry details in the ground truth through the dominant features, while avoiding an additional upsampling module design. Furthermore, to generate high-quality arbitrary-scale point clouds during inference, PUDM exploits the prior knowledge of the scale between sparse point clouds and dense point clouds during training by parameterizing a rate factor. Moreover, PUDM exhibits strong noise robustness in experimental results. In the quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperformed existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD), achieving state of the art (SOTA) performance.

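Summary
Point cloud upsampling (PCU) enriches the representation of raw point clouds and substantially improves performance on downstream tasks such as classification and reconstruction. Most existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. Taking a different route, we directly model the gradient of the data distribution from dense point clouds. In this paper, we propose a conditional denoising diffusion probability model (DDPM) for point cloud upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a condition and iteratively learns the transformation between the dense point cloud and the noise. PUDM also follows a dual mapping paradigm to improve the discernment of point features. In this setting, PUDM learns complex geometric details of the ground truth without requiring an additional upsampling module. Furthermore, PUDM generates high-quality point clouds at arbitrary scales during inference by parameterizing a rate factor during training. PUDM also shows strong noise robustness in experiments. In quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperformed existing methods, achieving state-of-the-art (SOTA) performance.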
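
Since the results are reported in Chamfer Distance (CD) and Hausdorff Distance (HD), a brute-force reference implementation of both metrics may help. Conventions differ between papers (squared versus unsquared distances, sums versus means), and the listing does not state which one PUDM uses; the version below assumes unsquared Euclidean distances and means.

```python
import math


def _nearest(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance from
    a to b plus the same quantity from b to a."""
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour
    distance over both directions."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))
```

Both functions are O(|a| * |b|), which is fine for small clouds. Evaluation code for large clouds typically relies on a spatial index or GPU batching instead.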

MoEC: Mixture of Experts Implicit Neural Compression

paper_url: http://arxiv.org/abs/2312.01361
repo_url: None
paper_authors: Jianchen Zhao, Cheng-Ching Tseng, Ming Lu, Ruichuan An, Xiaobao Wei, He Sun, Shanghang Zhang
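for: Proposes a new data compression technique based on implicit neural representation (INR) for compressing data from complex scenes.
methods: Uses a gating network to automatically assign a specific INR to each 3D point in the scene. Compared with earlier block-wise and tree-structured partitions, the learnable partition adaptively finds the optimal partition in an end-to-end manner.
results: Detailed experiments on massive and diverse biomedical data show the advantages of MoEC over existing methods. In particular, at extreme compression ratios such as 6000x, the method maintains a PSNR of 48.16.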

Abstract
Emerging Implicit Neural Representation (INR) is a promising data compression technique, which represents the data using the parameters of a Deep Neural Network (DNN). Existing methods manually partition a complex scene into local regions and overfit the INRs into those regions. However, manually designing the partition scheme for a complex scene is very challenging and fails to jointly learn the partition and INRs. To solve the problem, we propose MoEC, a novel implicit neural compression method based on the theory of mixture of experts. Specifically, we use a gating network to automatically assign a specific INR to a 3D point in the scene. The gating network is trained jointly with the INRs of different local regions. Compared with block-wise and tree-structured partitions, our learnable partition can adaptively find the optimal partition in an end-to-end manner. We conduct detailed experiments on massive and diverse biomedical data to demonstrate the advantages of MoEC against existing approaches. In most of experiment settings, we have achieved state-of-the-art results. Especially in cases of extreme compression ratios, such as 6000x, we are able to uphold the PSNR of 48.16.

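Summary
Emerging Implicit Neural Representation (INR) is a promising data compression technique that represents data using the parameters of a Deep Neural Network (DNN). Existing methods manually partition a complex scene into local regions and overfit INRs to those regions. However, manually designing the partition scheme for a complex scene is very difficult and does not allow the partition and the INRs to be learned jointly. To solve this problem, we propose MoEC, a novel implicit neural compression method based on the theory of mixture of experts. Specifically, we use a gating network to automatically assign a specific INR to each 3D point. The gating network is trained jointly with the INRs of the different local regions. Compared with block-wise and tree-structured partitions, our learnable partition adaptively finds the optimal partition. We conduct detailed experiments on massive and diverse biomedical data to demonstrate the advantages of MoEC over existing methods. We achieve state-of-the-art results in most experimental settings. In particular, at extreme compression ratios such as 6000x, we maintain a PSNR of 48.16.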
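
The two figures quoted here, a compression ratio and a PSNR, follow from simple formulas, sketched below. The `max_value` default assumes data normalized to [0, 1]; the listing does not state the dynamic range MoEC uses, and the function names are illustrative.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)


def compression_ratio(original_bytes, compressed_bytes):
    """How many times smaller the compressed representation is."""
    return original_bytes / compressed_bytes
```

For an INR-based codec, `compressed_bytes` is essentially the size of the stored network parameters, so a higher compression ratio means fewer parameters are available to fit the same data.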

Deep learning and traditional-based CAD schemes for the pulmonary embolism diagnosis: A survey

paper_url: http://arxiv.org/abs/2312.01351
repo_url: None
paper_authors: Seyed Hesamoddin Hosseini, Amir Hossein Taherinia, Mahdi Saadatmand
for: This paper aims to review, evaluate, and compare the performance of deep learning and traditional-based CAD systems for the diagnosis of Pulmonary Embolism (PE).
methods: The authors use a systematic search of articles available in databases such as IEEE, ScienceDirect, Wiley, Springer, Nature, and Wolters Kluwer to identify studies that have used either deep learning or traditional-based CAD systems for PE diagnosis. They evaluate the performance of each system using criteria such as sensitivity, False Positives (FP), and the number of datasets.
results: The authors review and evaluate 23 papers that have been published on PE diagnosis using either deep learning or traditional-based CAD systems from 2002 to 2023. They provide a comprehensive overview of the recent studies and state-of-the-art research works in this field, and compare the performance of deep learning and traditional-based CAD systems.

Abstract
Nowadays, pulmonary Computed Tomography Angiography (CTA) is the main tool for detecting Pulmonary Embolism (PE). However, manual interpretation of CTA volume requires a radiologist, which is time-consuming and error-prone due to the specific conditions of lung tissue, large volume of data, lack of experience, and eye fatigue. Therefore, Computer-Aided Design (CAD) systems are used as a second opinion for the diagnosis of PE. The purpose of this article is to review, evaluate, and compare the performance of deep learning and traditional-based CAD system for diagnosis PE and to help physicians and researchers in this field. In this study, all articles available in databases such as IEEE, ScienceDirect, Wiley, Springer, Nature, and Wolters Kluwer in the field of PE diagnosis were examined using traditional and deep learning methods. From 2002 to 2023, 23 papers were studied to extract the articles with the considered limitations. Each paper presents an automatic PE detection system that we evaluate using criteria such as sensitivity, False Positives (FP), and the number of datasets. This research work includes recent studies, state-of-the-art research works, and a more comprehensive overview compared to previously published review articles in this research area.

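Summary
Pulmonary Computed Tomography Angiography (CTA) is currently the main tool for diagnosing Pulmonary Embolism (PE). However, manual interpretation of a CTA volume requires a radiologist and is time-consuming and error-prone because of the specific conditions of lung tissue, the large volume of data, lack of experience, and eye fatigue. Therefore, Computer-Aided Design (CAD) systems are used as a second opinion for PE diagnosis. The purpose of this article is to review, evaluate, and compare the performance of deep learning and traditional-based CAD systems for PE diagnosis, to help physicians and researchers. In this study, we searched and analyzed the relevant literature from 2002 to 2023 in databases such as IEEE, ScienceDirect, Wiley, Springer, Nature, and Wolters Kluwer. We selected 23 papers and evaluated the automatic PE detection system each presents, using criteria such as sensitivity, False Positives (FP), and the number of datasets. This work covers recent studies and state-of-the-art research, and offers a more comprehensive overview than previously published review articles in this area.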
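
The evaluation criteria named in this survey reduce to simple ratios. The sketch below shows the usual definitions. Reporting false positives per scan is a common convention in detection papers and is assumed here, since the listing only says "False Positives (FP)".

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real emboli that the system detects."""
    return true_positives / (true_positives + false_negatives)


def false_positives_per_scan(false_positives, num_scans):
    """Average number of spurious detections per CTA scan."""
    return false_positives / num_scans
```

As a made-up example, a system that finds 45 of 50 emboli with 30 spurious detections over 10 scans has a sensitivity of 0.9 at 3.0 false positives per scan.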

DragVideo: Interactive Drag-style Video Editing

paper_url: http://arxiv.org/abs/2312.02216
repo_url: https://github.com/rickyskywalker/dragvideo-official
paper_authors: Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, Chi-Keung Tang
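for: Provide a DragGAN-inspired video editing technique that offers direct, easy user control and natural editing results.
methods: Proposes the Drag-on-Video U-Net (DoVe) editing method, built on diffusion models as in DragDiffusion, which optimizes diffused video latents to achieve the desired editing effect.
results: Across a range of challenging editing tasks, such as motion editing and skeleton editing, DragVideo shows versatility and generality.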

Abstract
Editing visual content on videos remains a formidable challenge with two main issues: 1) direct and easy user control to produce 2) natural editing results without unsightly distortion and artifacts after changing shape, expression and layout. Inspired by DragGAN, a recent image-based drag-style editing technique, we address above issues by proposing DragVideo, where a similar drag-style user interaction is adopted to edit video content while maintaining temporal consistency. Empowered by recent diffusion models as in DragDiffusion, DragVideo contains the novel Drag-on-Video U-Net (DoVe) editing method, which optimizes diffused video latents generated by video U-Net to achieve the desired control. Specifically, we use Sample-specific LoRA fine-tuning and Mutual Self-Attention control to ensure faithful reconstruction of video from the DoVe method. We also present a series of testing examples for drag-style video editing and conduct extensive experiments across a wide array of challenging editing tasks, such as motion editing, skeleton editing, etc, underscoring DragVideo's versatility and generality. Our codes including the DragVideo web user interface will be released.

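Summary
Editing visual content in videos remains a major challenge, with two main issues: 1) direct and easy user control, to produce 2) natural editing results without unsightly distortion and artifacts. Inspired by DragGAN, an image-based drag-style editing technique, we address these issues by proposing DragVideo, which adopts a similar drag-style user interaction to edit video content while maintaining temporal consistency. Building on recent diffusion models as in DragDiffusion, DragVideo contains the novel Drag-on-Video U-Net (DoVe) editing method, which optimizes the diffused video latents generated by a video U-Net to achieve the desired control. We also use Sample-specific LoRA fine-tuning and Mutual Self-Attention control to ensure faithful reconstruction of the video from the DoVe method. We present a series of test examples for drag-style video editing and conduct extensive experiments across a wide range of challenging editing tasks, such as motion editing and skeleton editing, which demonstrate DragVideo's versatility and generality. Our code, including the DragVideo web user interface, will be released.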
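
The LoRA fine-tuning mentioned in this entry rests on a low-rank weight update, which the following sketch spells out with plain lists. It shows the standard LoRA formula W + (alpha / r) * B A and not DragVideo's training code; the shapes, the `alpha` scaling, and the function names are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    r = len(a)             # rank of the update
    delta = matmul(b, a)   # low-rank update, shape (out, in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]
```

Only `a` and `b` are trained, so a sample-specific adaptation adds r * (in + out) parameters per layer instead of in * out. With `b` initialized to zeros, the adapted weight starts out equal to the frozen one.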

Enhancing and Adapting in the Clinic: Source-free Unsupervised Domain Adaptation for Medical Image Enhancement

paper_url: http://arxiv.org/abs/2312.01338
repo_url: https://github.com/liamheng/annotation-free-medical-image-enhancement
paper_authors: Heng Li, Ziqin Lin, Zhongxi Qiu, Zinan Li, Huazhu Fu, Yan Hu, Jiang Liu
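for: This paper proposes an algorithm for source-free unsupervised domain adaptive medical image enhancement (SAME), which adapts and optimizes enhancement models using test data in the inference phase.
methods: A structure-preserving enhancement network first learns a robust source model. A teacher-student model then performs source-free unsupervised domain adaptation (SFUDA) with the test data, and a pseudo-label picker is developed to boost knowledge distillation for enhancement tasks.
results: Experiments on ten datasets from three medical image modalities show that SAME delivers strong enhancement performance and benefits for downstream tasks.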

Abstract
Medical imaging provides many valuable clues involving anatomical structure and pathological characteristics. However, image degradation is a common issue in clinical practice, which can adversely impact the observation and diagnosis by physicians and algorithms. Although extensive enhancement models have been developed, these models require a well pre-training before deployment, while failing to take advantage of the potential value of inference data after deployment. In this paper, we raise an algorithm for source-free unsupervised domain adaptive medical image enhancement (SAME), which adapts and optimizes enhancement models using test data in the inference phase. A structure-preserving enhancement network is first constructed to learn a robust source model from synthesized training data. Then a teacher-student model is initialized with the source model and conducts source-free unsupervised domain adaptation (SFUDA) by knowledge distillation with the test data. Additionally, a pseudo-label picker is developed to boost the knowledge distillation of enhancement tasks. Experiments were implemented on ten datasets from three medical image modalities to validate the advantage of the proposed algorithm, and setting analysis and ablation studies were also carried out to interpret the effectiveness of SAME. The remarkable enhancement performance and benefits for downstream tasks demonstrate the potential and generalizability of SAME. The code is available at https://github.com/liamheng/Annotation-free-Medical-Image-Enhancement.

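Summary
Medical imaging provides valuable information about anatomical structure and pathological characteristics. However, image degradation is common in clinical practice and can impair observation and diagnosis by both physicians and algorithms. Although many enhancement models have been developed, they need thorough pre-training before deployment and make no use of the potentially valuable inference data. This paper proposes a source-free unsupervised domain adaptive medical image enhancement algorithm (SAME), which adapts and optimizes the enhancement model with test data in the inference phase. A robust source model is first learned from synthesized training data. Source-free unsupervised domain adaptation (SFUDA) is then carried out on the test data through knowledge distillation, and a pseudo-label picker is developed to boost the distillation for enhancement tasks. Experiments on ten datasets from three medical image modalities, together with setting analysis and ablation studies, explain the effectiveness of SAME. The enhancement performance and the benefits for downstream tasks indicate the potential and generalizability of SAME. The code is available at https://github.com/liamheng/Annotation-free-Medical-Image-Enhancement.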

MABViT – Modified Attention Block Enhances Vision Transformers

paper_url: http://arxiv.org/abs/2312.01324
repo_url: None
paper_authors: Mahesh Ramesh, Aswinkumar Ramkumar
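for: This paper aims to improve Vision Transformers for image classification, motivated by findings that Gated Linear Units (GLU) and parallel Transformer blocks benefit Large Language Models (LLMs).
methods: After observing that running the MLP and attention blocks in parallel degrades image classification performance, the authors integrate non-linearity within the attention block by applying a GLU-based activation function to the Value tensor.
results: On ImageNet-1K, the new variant surpasses the state-of-the-art S/16 Vision Transformer by 0.6% with fewer parameters and outperforms the B/16 variant with only half the parameters. Results with a GELU activation variant support the authors' assertions, and the MABViT variants show greater potential in deep transformers.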

Abstract
Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, utilizing a parallel configuration within each Transformer block rather than the conventional serialized method has been revealed to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also supersedes the B/16 variant while using only half the parameters. Furthermore, we provide results with the GELU activation function variant to confirm our assertions. Lastly, we showcase that the MABViT variants exhibit greater potential when utilized in deep transformers compared to the standard architecture.

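Summary
Recent studies show that Gated Linear Units (GLU) improve transformer models, particularly Large Language Models (LLMs). In addition, using a parallel configuration within each Transformer block instead of the conventional serialized one accelerates LLM training without significantly affecting performance. However, when the MLP and attention blocks were run in parallel for image classification, we observed a noticeable drop in performance. To address this, we propose a new transformer variant that incorporates non-linearity within the attention block. We apply a GLU-based activation function to the Value tensor, and the new technique surpasses the current state-of-the-art S/16 Vision Transformer variant on ImageNet-1K while using fewer parameters. It also surpasses the B/16 variant with only half the parameters. We further test a GELU activation variant to confirm our findings. Finally, we show that MABViT variants perform better in deep transformers.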
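The idea of placing a GLU on the Value tensor can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the authors' implementation: the gating form `V = (x W_v) * sigmoid(x W_gate)` and all names (`glu_value_attention`, `w_gate`) are assumptions, since the abstract does not give the exact formulation.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def glu_value_attention(x, w_q, w_k, w_v, w_gate):
    """Single-head attention whose Value tensor passes through a GLU.

    Assumed form (hypothetical): V = (x @ w_v) * sigmoid(x @ w_gate).
    """
    q, k = matmul(x, w_q), matmul(x, w_k)
    lin, gate = matmul(x, w_v), matmul(x, w_gate)
    # Element-wise gating adds the non-linearity inside the attention block.
    v = [[a * sigmoid(g) for a, g in zip(ra, rg)] for ra, rg in zip(lin, gate)]
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(attn, v)

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = glu_value_attention(identity, zeros, zeros, identity, zeros)
print(out)
```

With zero query, key, and gate weights the attention is uniform and the gate is 0.5 everywhere, which makes the toy case easy to check by hand.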

Few-shot Shape Recognition by Learning Deep Shape-aware Features

paper_url: http://arxiv.org/abs/2312.01315
repo_url: None
paper_authors: Wenlong Shi, Changsheng Lu, Ming Shao, Yinjie Zhang, Siyu Xia, Piotr Koniusz
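for: This paper aims to recognize object shapes from only one or a few samples, with shape features that generalize to unseen shapes.
methods: An embedding module extracts transformation-invariant shape features, and a dual attention mechanism decomposes and reconstructs those features through learnable shape primitives.
results: Experiments on five datasets show a significant improvement in shape classification over state-of-the-art methods under the few-shot setting.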

Abstract
Traditional shape descriptors have been gradually replaced by convolutional neural networks due to their superior performance in feature extraction and classification. The state-of-the-art methods recognize object shapes via image reconstruction or pixel classification. However, these methods are biased toward texture information and overlook the essential shape descriptions, and thus fail to generalize to unseen shapes. We are the first to propose a few-shot shape descriptor (FSSD) to recognize object shapes given only one or a few samples. First, we employ an embedding module for FSSD to extract transformation-invariant shape features. Secondly, we develop a dual attention mechanism to decompose and reconstruct the shape features via learnable shape primitives. In this way, any shape can be formed from a finite set of basis primitives, and the learned representation model is highly interpretable and extendable to unseen shapes. Thirdly, we propose a decoding module to include the supervision of shape masks and edges and align the original and reconstructed shape features, enforcing the learned features to be more shape-aware. Lastly, all the proposed modules are assembled into a few-shot shape recognition scheme. Experiments on five datasets show that our FSSD significantly improves the shape classification compared to the state-of-the-art under the few-shot setting.

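Summary
Traditional shape descriptors have gradually been replaced by deep learning models because of their superior feature extraction and classification. State-of-the-art methods recognize object shapes through image reconstruction or pixel classification. However, these methods are biased toward texture information and overlook essential shape descriptions, so they fail to generalize to unseen shapes. We are the first to propose a few-shot shape descriptor (FSSD), which recognizes object shapes from only one or a few samples. We use an embedding module to extract transformation-invariant shape features. Next, we develop a dual attention mechanism that decomposes and reconstructs the shape features through learnable shape primitives. In this way any shape can be composed from a finite set of primitives, and the learned representation is highly interpretable and extendable. We then propose a decoding module that adds supervision from shape masks and edges and aligns the original and reconstructed shape features, making the learned features more shape-aware. Finally, all modules are assembled into a few-shot shape recognition scheme. Experiments on five datasets show that FSSD significantly improves on the state of the art.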

FlashAvatar: High-Fidelity Digital Avatar Rendering at 300FPS

paper_url: http://arxiv.org/abs/2312.02214
repo_url: None
paper_authors: Jun Xiang, Xuan Gao, Yudong Guo, Juyong Zhang
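for: This work proposes a new animatable 3D avatar representation that reconstructs a high-quality digital avatar from a short video sequence in minutes and renders at 300 FPS on a consumer-grade GPU.
methods: The method builds on a parametric face model and a 3D Gaussian field, learning extra spatial offsets to model non-surface regions and subtle facial details. Full use of geometric priors captures high-frequency facial details and preserves exaggerated expressions.
results: Experiments show that FlashAvatar outperforms existing methods in visual quality and personalized details while rendering at 300 FPS. Project page: https://ustc3dv.github.io/FlashAvatar/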

Abstract
We propose FlashAvatar, a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions, proper initialization can help reduce the number of Gaussians, thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/

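Summary
We propose FlashAvatar, a new, lightweight, and fast animatable 3D avatar representation. It reconstructs a digital avatar from a short monocular video sequence within minutes and renders high-fidelity images at 300 FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D Gaussian field on the surface of a parametric face model and learn extra spatial offsets to model non-surface regions and fine details. Full use of geometric priors captures high-frequency facial details and preserves exaggerated expressions. Proper initialization reduces the number of Gaussians, which enables very fast rendering. Experiments show that FlashAvatar surpasses existing methods in visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/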

SAGE: Bridging Semantic and Actionable Parts for GEneralizable Articulated-Object Manipulation under Language Instructions

paper_url: http://arxiv.org/abs/2312.01307
repo_url: None
paper_authors: Haoran Geng, Songlin Wei, Congyue Deng, Bokui Shen, He Wang, Leonidas Guibas
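for: This paper provides a framework for generalizable manipulation of articulated objects under language instructions, targeting the diverse tasks found in real-world scenarios.
methods: The framework combines Large Language Models (LLMs), Visual-Language Models (VLMs), and domain-specialist part perception models to understand an object's semantic parts and ground them into actionable parts.
results: Experiments in simulation and on real robots show that the framework handles a large variety of articulated objects with diverse language-instructed goals. The paper also provides a new benchmark for language-guided articulated-object manipulation.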

Abstract
Generalizable manipulation of articulated objects remains a challenging problem in many real-world scenarios, given the diverse object structures, functionalities, and goals. In these tasks, both semantic interpretations and physical plausibilities are crucial for a policy to succeed. To address this problem, we propose SAGE, a novel framework that bridges the understanding of semantic and actionable parts of articulated objects to achieve generalizable manipulation under language instructions. Given a manipulation goal specified by natural language, an instruction interpreter with Large Language Models (LLMs) first translates them into programmatic actions on the object's semantic parts. This process also involves a scene context parser for understanding the visual inputs, which is designed to generate scene descriptions with both rich information and accurate interaction-related facts by joining the forces of generalist Visual-Language Models (VLMs) and domain-specialist part perception models. To further convert the action programs into executable policies, a part grounding module then maps the object semantic parts suggested by the instruction interpreter into so-called Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module is incorporated to respond to failures, which greatly increases the robustness of the overall framework. Experiments both in simulation environments and on real robots show that our framework can handle a large variety of articulated objects with diverse language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.

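Summary
Generalizable manipulation of articulated objects remains challenging in many real-world scenarios because object structures, functionalities, and goals are so diverse. To address this, we propose SAGE, a framework that bridges the semantic understanding of language instructions and actionable object parts to achieve generalizable manipulation. Given a manipulation goal specified in natural language, an instruction interpreter built on Large Language Models (LLMs) translates it into programmatic actions on the object's semantic parts. A scene context parser interprets the visual input; it combines generalist Visual-Language Models (VLMs) with domain-specialist part perception models to produce scene descriptions that are both rich in detail and accurate about interaction-related facts. A part grounding module then maps the semantic parts suggested by the instruction interpreter to Generalizable Actionable Parts (GAParts). Finally, an interactive feedback module responds to failures, which greatly increases the robustness of the whole framework. Experiments in simulation and on real robots show that the framework handles a wide variety of articulated objects and language-instructed goals. We also provide a new benchmark for language-guided articulated-object manipulation in realistic scenarios.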

Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

paper_url: http://arxiv.org/abs/2312.02212
repo_url: https://github.com/liujin112/portraitdiffusion
paper_authors: Jin Liu, Huaibo Huang, Chao Jin, Ran He
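for: This paper targets training-free face stylization, that is, transforming a face into a specific portrait style.
methods: The paper proposes a training-free face stylization framework named Portrait Diffusion, which leverages off-the-shelf text-to-image diffusion models and needs no example-specific fine-tuning.
results: The framework blends content and style features in attention space through a modified self-attention operation called Style Attention Control, and a Chain-of-Painting method gradually redraws unsatisfactory areas. Experiments validate the effectiveness of Portrait Diffusion and the advantage of Chain-of-Painting for face stylization.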

Abstract
Face stylization refers to the transformation of a face into a specific portrait style. However, current methods require the use of example-based adaptation approaches to fine-tune pre-trained generative models so that they demand lots of time and storage space and fail to achieve detailed style transformation. This paper proposes a training-free face stylization framework, named Portrait Diffusion. This framework leverages off-the-shelf text-to-image diffusion models, eliminating the need for fine-tuning specific examples. Specifically, the content and style images are first inverted into latent codes. Then, during image reconstruction using the corresponding latent code, the content and style features in the attention space are delicately blended through a modified self-attention operation called Style Attention Control. Additionally, a Chain-of-Painting method is proposed for the gradual redrawing of unsatisfactory areas from rough adjustments to fine-tuning. Extensive experiments validate the effectiveness of our Portrait Diffusion method and demonstrate the superiority of Chain-of-Painting in achieving precise face stylization. Code will be released at https://github.com/liujin112/PortraitDiffusion.

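Summary
Face stylization is the transformation of a face into a specific portrait style. Existing methods usually rely on example-based adaptation to fine-tune pre-trained generative models, which costs considerable time and storage and still fails to achieve detailed style transformation. This paper proposes a training-free face stylization framework named Portrait Diffusion. It leverages off-the-shelf text-to-image diffusion models and removes the need for example-specific fine-tuning. Specifically, the content and style images are first inverted into latent codes. Then, during image reconstruction from the corresponding latent codes, the content and style features are delicately blended in attention space through a modified self-attention operation called Style Attention Control. In addition, a Chain-of-Painting method gradually redraws unsatisfactory areas, moving from rough adjustments to fine-tuning. Extensive experiments validate the effectiveness of Portrait Diffusion and the advantage of Chain-of-Painting for precise face stylization. Code will be released at https://github.com/liujin112/PortraitDiffusion.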

Stable Messenger: Steganography for Message-Concealed Image Generation

paper_url: http://arxiv.org/abs/2312.01284
repo_url: None
paper_authors: Quang Nguyen, Truong Vu, Cuong Pham, Anh Tran, Khoi Nguyen
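for: This paper addresses digital protection, specifically steganography for message-concealed image generation.
methods: The paper introduces "message accuracy", a new metric that evaluates decoded messages in their entirety, and proposes an adaptive universal loss, the Log-Sum-Exponential (LSE) loss, to improve it. It also proposes a new latent-aware encoding technique.
results: Experiments show that both the LSE loss and the latent-aware encoding technique significantly improve message accuracy. The combined approach advances evaluation metrics, loss functions, and image concealment techniques, aiming for more robust and dependable information protection.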

Abstract
In the ever-expanding digital landscape, safeguarding sensitive information remains paramount. This paper delves deep into digital protection, specifically focusing on steganography. While prior research predominantly fixated on individual bit decoding, we address this limitation by introducing "message accuracy", a novel metric evaluating the entirety of decoded messages for a more holistic evaluation. In addition, we propose an adaptive universal loss tailored to enhance message accuracy, named Log-Sum-Exponential (LSE) loss, thereby significantly improving the message accuracy of recent approaches. Furthermore, we also introduce a new latent-aware encoding technique in our framework named Stable Messenger, harnessing pretrained Stable Diffusion for advanced steganographic image generation, giving rise to a better trade-off between image quality and message recovery. Throughout experimental results, we have demonstrated the superior performance of the new LSE loss and latent-aware encoding technique. This comprehensive approach marks a significant step in evolving evaluation metrics, refining loss functions, and innovating image concealment techniques, aiming for more robust and dependable information protection.

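Summary
In the ever-expanding digital landscape, protecting sensitive information remains paramount. This paper examines digital protection in depth, focusing on steganography. Prior research concentrated mainly on decoding individual bits; we introduce "message accuracy", a new metric that evaluates decoded messages as a whole for a more complete assessment. We also propose an adaptive universal loss, the Log-Sum-Exponential (LSE) loss, to improve message accuracy, and it yields significant gains in experiments. Furthermore, we introduce a new latent-aware encoding technique in our framework, named Stable Messenger, which uses pretrained Stable Diffusion to reach a better trade-off between image quality and message recovery. Experimental results show the superior performance of the approach. Together, the new evaluation metric, loss function, and image concealment technique mark a step forward for information protection.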
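The difference between per-bit decoding accuracy and the message-level metric can be made concrete with a short sketch. The two accuracy functions follow directly from the description above; `lse_loss` is only an assumed generic form (log-sum-exp over per-bit losses, which acts as a smooth maximum), because the abstract does not give the exact loss used in the paper.

```python
import math

def bit_accuracy(decoded, target):
    """Fraction of individual bits decoded correctly across all messages."""
    total = sum(len(t) for t in target)
    correct = sum(d == t for dm, tm in zip(decoded, target) for d, t in zip(dm, tm))
    return correct / total

def message_accuracy(decoded, target):
    """Fraction of messages in which every bit is decoded correctly."""
    return sum(dm == tm for dm, tm in zip(decoded, target)) / len(target)

def lse_loss(bit_losses):
    """Log-sum-exp of per-bit losses (assumed form): a smooth upper bound
    on the worst bit, so a single badly decoded bit dominates the loss."""
    m = max(bit_losses)
    return m + math.log(sum(math.exp(l - m) for l in bit_losses))

target = [[1, 0, 1, 1], [0, 0, 1, 0]]
decoded = [[1, 0, 1, 1], [0, 1, 1, 0]]  # one wrong bit in the second message
print(bit_accuracy(decoded, target))      # 0.875
print(message_accuracy(decoded, target))  # 0.5
```

One flipped bit barely moves bit accuracy but halves message accuracy in this toy case, which is the gap the message-level metric is meant to expose.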

Deeper into Self-Supervised Monocular Indoor Depth Estimation

paper_url: http://arxiv.org/abs/2312.01283
repo_url: https://github.com/fcntes/indoordepth
paper_authors: Chao Fan, Zhenyu Yin, Yue Li, Feiqing Zhang
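for: This study aims to improve the accuracy of self-supervised monocular indoor depth estimation with Convolutional Neural Networks (CNNs).
methods: The method has two innovations. First, a novel photometric loss with an improved structural similarity (SSIM) function addresses low-texture regions. Second, multiple photometric losses at different stages train a deeper pose network for better ego-motion prediction.
results: Experiments show that the method outperforms previous state-of-the-art methods on the NYUv2 benchmark, and its generalization is validated on ScanNet. Code is available at https://github.com/fcntes/IndoorDepth.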

Abstract
Monocular depth estimation using Convolutional Neural Networks (CNNs) has shown impressive performance in outdoor driving scenes. However, self-supervised learning of indoor depth from monocular sequences is quite challenging for researchers because of the following two main reasons. One is the large areas of low-texture regions and the other is the complex ego-motion on indoor training datasets. In this work, our proposed method, named IndoorDepth, consists of two innovations. In particular, we first propose a novel photometric loss with improved structural similarity (SSIM) function to tackle the challenge from low-texture regions. Moreover, in order to further mitigate the issue of inaccurate ego-motion prediction, multiple photometric losses at different stages are used to train a deeper pose network with two residual pose blocks. A subsequent ablation study validates the effectiveness of each new idea. Experiments on the NYUv2 benchmark demonstrate that our IndoorDepth outperforms the previous state-of-the-art methods by a large margin. In addition, we also validate the generalization ability of our method on the ScanNet dataset. Code is available at https://github.com/fcntes/IndoorDepth.

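Summary
Monocular depth estimation with Convolutional Neural Networks (CNNs) performs impressively in outdoor driving scenes. Self-supervised learning of indoor depth from monocular sequences, however, is quite challenging, for two main reasons: large low-texture regions and complex ego-motion in indoor training datasets. Our method, IndoorDepth, contributes two innovations. First, we propose a new photometric loss with an improved structural similarity (SSIM) function to handle low-texture regions. Second, to further reduce inaccurate ego-motion prediction, we use multiple photometric losses at different stages to train a deeper pose network with two residual pose blocks. An ablation study validates each new idea. Experiments show that IndoorDepth outperforms previous methods on the NYUv2 benchmark by a large margin, and we also validate its generalization on the ScanNet dataset. Code is available at https://github.com/fcntes/IndoorDepth.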
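For orientation, the photometric loss that self-supervised depth methods commonly start from mixes an SSIM term with an L1 term. The sketch below shows that standard baseline on flat patches, not the improved SSIM function proposed in the paper (the abstract does not specify it); the weight `alpha=0.85` and the constants `c1`, `c2` are conventional choices assumed here.

```python
def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM between two equally sized patches (flat lists, intensities in [0, 1])."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def photometric_loss(target, warped, alpha=0.85):
    """Weighted sum of an SSIM dissimilarity term and a mean absolute error term."""
    l1 = sum(abs(x - y) for x, y in zip(target, warped)) / len(target)
    return alpha * (1.0 - ssim(target, warped)) / 2.0 + (1.0 - alpha) * l1

textured = [0.1, 0.5, 0.9, 0.3]
misaligned = [0.9, 0.5, 0.1, 0.7]
print(photometric_loss(textured, textured))
print(photometric_loss(textured, misaligned))
```

In a constant (textureless) patch both variances are zero, so the structural part of SSIM carries almost no signal; this is one way to see why low-texture indoor regions are hard for this family of losses.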

Cycle-consistent Generative Adversarial Network Synthetic CT for MR-only Adaptive Radiation Therapy on MR-Linac

paper_url: http://arxiv.org/abs/2312.02211
repo_url: None
paper_authors: Gabriel L. Asher, Bassem I. Zaki, Gregory A. Russo, Gobind S. Gill, Charles R. Thomas, Temiloluwa O. Prioleau, Rongxiao Zhang, Brady Hunt
for: This study aims to assess the effectiveness of Deep Learning (DL) for creating synthetic Computed Tomography (CT) images in MR-guided adaptive radiation therapy (MRgART).
methods: The study uses a Cycle-GAN model trained with MRI and CT scan slices from MR-LINAC treatments to generate sCT volumes. The model was tested on 357 sCT frames from 17 patients.
results: The study found that the sCTs generated by the model were comparable to dCTs in electron density and structural similarity with MRI scans. The dosimetric evaluations indicated minimal differences between sCTs and dCTs, with sCTs showing better air-bubble reconstruction.

Abstract
Purpose: This study assesses the effectiveness of Deep Learning (DL) for creating synthetic CT (sCT) images in MR-guided adaptive radiation therapy (MRgART). Methods: A Cycle-GAN model was trained with MRI and CT scan slices from MR-LINAC treatments, generating sCT volumes. The analysis involved retrospective treatment plan data from patients with various tumors. sCT images were compared with standard CT scans using mean absolute error in Hounsfield Units (HU) and image similarity metrics (SSIM, PSNR, NCC). sCT volumes were integrated into a clinical treatment system for dosimetric re-evaluation. Results: The model, trained on 8405 frames from 57 patients and tested on 357 sCT frames from 17 patients, showed sCTs comparable to dCTs in electron density and structural similarity with MRI scans. The MAE between sCT and dCT was 49.2 +/- 13.2 HU, with sCT NCC exceeding dCT by 0.06, and SSIM and PSNR at 0.97 +/- 0.01 and 19.9 +/- 1.6 respectively. Dosimetric evaluations indicated minimal differences between sCTs and dCTs, with sCTs showing better air-bubble reconstruction. Conclusions: DL-based sCT generation on MR-Linacs is accurate for dose calculation and optimization in MRgART. This could facilitate MR-only treatment planning, enhancing simulation and adaptive planning efficiency on MR-Linacs.

Summary
The results showed that the sCT images generated by the DL model were comparable to the dCT images in terms of electron density and structural similarity with MRI scans. The mean absolute error between sCT and dCT was 49.2 ± 13.2 HU, with sCT NCC exceeding dCT by 0.06. The SSIM and PSNR values were 0.97 ± 0.01 and 19.9 ± 1.6, respectively. The dosimetric evaluations showed minimal differences between sCTs and dCTs, with sCTs demonstrating better air-bubble reconstruction. The study concluded that DL-based sCT generation on MR-Linacs is accurate for dose calculation and optimization in MRgART, which could facilitate MR-only treatment planning and enhance simulation and adaptive planning efficiency on MR-Linacs.
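Three of the image-comparison metrics named above (MAE in HU, PSNR, and NCC) are simple enough to sketch on flat lists of voxel values. The definitions below are the textbook ones and are assumptions about the study's exact conventions: NCC is computed in its zero-mean form, and the PSNR `data_range` (here 2000 HU) is a hypothetical choice.

```python
import math

def mae(a, b):
    """Mean absolute error, in the units of the inputs (HU for CT)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def psnr(a, b, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10.0 * math.log10(data_range ** 2 / mse)

def ncc(a, b):
    """Zero-mean normalized cross-correlation, in [-1, 1]."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    num = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mu_a) ** 2 for x in a) * sum((y - mu_b) ** 2 for y in b))
    return num / den

# Toy voxel values in HU: air, water, soft tissue, dense bone (made-up numbers).
sct = [-1000.0, 0.0, 40.0, 1000.0]
ct = [-990.0, 10.0, 30.0, 1010.0]
print(mae(sct, ct), psnr(sct, ct, 2000.0), ncc(sct, ct))
```

Reported PSNR values depend strongly on the chosen data range, so numbers from different studies are only comparable when that convention matches.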

Brain Decodes Deep Nets

paper_url: http://arxiv.org/abs/2312.01280
repo_url: https://github.com/huzeyann/braindecodesdeepnets
paper_authors: Huzheng Yang, James Gee, Jianbo Shi
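for: This study develops a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, exposing their hidden internal structure.
methods: The study relies on brain encoding, that is, predicting brain fMRI measurements in response to images. It finds that an explicit mapping between the brain and deep-network features across space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep network and can paint the network onto the brain (literally!).
results: Different training methods lead to remarkable differences in hierarchical organization and scaling behavior, which grow with more data or network capacity. The visualization also gives insight into fine-tuning: how pre-trained models change when adapting to small datasets.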

Abstract
We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, thus exposing their hidden inside. Our innovation arises from a surprising usage of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, explicit mapping between the brain and deep-network features across dimensions of space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep-network; with it, one can paint a picture of the network onto the brain (literally!). Second, our visualization shows how different training methods matter: they lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity. It also provides insight into finetuning: how pre-trained models change when adapting to small datasets. Our method is practical: only 3K images are enough to learn a network-to-brain mapping.

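Summary
We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, revealing their hidden internal structure. Our innovation comes from an unexpected use of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, an explicit mapping between the brain and deep-network features across space, layers, scales, and channels is crucial. We call this method FactorTopy; it is a plug-and-play mapping for any deep network and can literally map the network onto the brain. Second, our visualization shows that different training methods strongly affect hierarchical organization and scaling behavior, with effects that grow with more data or network capacity. The visualization also gives insight into fine-tuning: how pre-trained models change when adapting to small datasets. The method is practical: only 3K images are enough to learn a network-to-brain mapping.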

Learning to Compose SuperWeights for Neural Parameter Allocation Search

paper_url: http://arxiv.org/abs/2312.01274
repo_url: https://github.com/piotr-teterwak/SuperWeights
paper_authors: Piotr Teterwak, Soren Nelson, Nikoli Dryden, Dina Bashkirova, Kate Saenko, Bryan A. Plummer
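for: This study aims to automate neural network parameter allocation in order to achieve parameter sharing.
methods: SuperWeight Networks generate layer weights by learning to compose sets of SuperWeights, and gradient information is used to measure the degree of conflict between layers that share weights.
results: SuperWeight Networks improve performance on the ImageNet and CIFAR datasets and can generate parameters for many network architectures, supporting efficient ensembling and anytime prediction.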

Abstract
Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not during training, resulting in reduced performance. To address this, we generate layer weights by learning to compose sets of SuperWeights, which represent a group of trainable parameters. These SuperWeights are created to be large enough so they can be used to represent any layer in the network, but small enough that they are computationally efficient. The second drawback we address is the method of measuring similarity between shared parameters. Whereas prior work compared the weights themselves, we argue this does not take into account the amount of conflict between the shared weights. Instead, we use gradient information to identify layers with shared weights that wish to diverge from each other. We demonstrate that our SuperWeight Networks consistently boost performance over the state-of-the-art on the ImageNet and CIFAR datasets in the NPAS setting. We further show that our approach can generate parameters for many network architectures using the same set of weights. This enables us to support tasks like efficient ensembling and anytime prediction, outperforming fully-parameterized ensembles with 17% fewer parameters.
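The gradient-based notion of conflict can be illustrated with a small sketch. Cosine similarity between per-layer gradients is an assumption made here for illustration; the abstract only states that gradient information identifies layers with shared weights that wish to diverge. The layer names and the `conflicting_pairs` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def conflicting_pairs(layer_grads, threshold=0.0):
    """Return layer-name pairs whose gradients on the shared weights disagree.

    layer_grads maps a layer name to the gradient that layer's loss
    contribution induces on the same shared parameter vector.
    """
    names = sorted(layer_grads)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(layer_grads[a], layer_grads[b]) < threshold]

grads = {
    "conv1": [1.0, 0.0],
    "conv2": [0.9, 0.1],
    "conv3": [-1.0, 0.2],  # pulls the shared weights the opposite way
}
print(conflicting_pairs(grads))
```

Two layers with nearly identical weights can still have opposing gradients, which is the case a weight-similarity comparison would miss.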

AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing

paper_url: http://arxiv.org/abs/2312.02209
repo_url: None
paper_authors: Fan Yang, Tianyi Chen, Xiaosheng He, Zhongang Cai, Lei Yang, Si Wu, Guosheng Lin
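for: This study proposes an interactively editable 3D human generation model that addresses the low local-editing accuracy and high computational cost of existing editable 3D GANs.
methods: The model uses attribute decomposition and indexing: all attributes (e.g., human body, hair, clothes) are generated in an overall attribute space with six feature planes, which are then decomposed and manipulated with different attribute indexes.
results: The model lets users interactively edit selected attributes while keeping the others fixed, and generates high-quality 3D human avatars.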

Abstract
Editable 3D-aware generation, which supports user-interacted editing, has witnessed rapid development recently. However, existing editable 3D GANs either fail to achieve high-accuracy local editing or suffer from huge computational costs. We propose AttriHuman-3D, an editable 3D human generation model, which addresses the aforementioned problems with attribute decomposition and indexing. The core idea of the proposed model is to generate all attributes (e.g. human body, hair, clothes and so on) in an overall attribute space with six feature planes, which are then decomposed and manipulated with different attribute indexes. To precisely extract features of different attributes from the generated feature planes, we propose a novel attribute indexing method as well as an orthogonal projection regularization to enhance the disentanglement. We also introduce a hyper-latent training strategy and an attribute-specific sampling strategy to avoid style entanglement and misleading punishment from the discriminator. Our method allows users to interactively edit selected attributes in the generated 3D human avatars while keeping others fixed. Both qualitative and quantitative experiments demonstrate that our model provides a strong disentanglement between different attributes, allows fine-grained image editing and generates high-quality 3D human avatars.

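Summary
Editable 3D-aware generation has developed rapidly in recent years. However, existing editable 3D GANs (generative adversarial networks) either fail to achieve high-accuracy local editing or suffer from huge computational costs. We propose AttriHuman-3D, an editable 3D human generation model, which addresses these problems with attribute decomposition and indexing. Its core idea is to generate all attributes (e.g., human body, hair, clothes) in an overall attribute space with six feature planes, which are then decomposed and manipulated with different attribute indexes. To extract the features of different attributes precisely from the generated feature planes, we propose a new attribute indexing method and an orthogonal projection regularization that enhances disentanglement. We also introduce a hyper-latent training strategy and an attribute-specific sampling strategy to avoid style entanglement and misleading penalties from the discriminator. Our method lets users interactively edit selected attributes of the generated 3D human avatars while keeping the others fixed. Experiments show that it disentangles different attributes accurately, supports fine-grained image editing, and generates high-quality 3D human avatars.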

A Review and A Robust Framework of Data-Efficient 3D Scene Parsing with Traditional/Learned 3D Descriptors

paper_url: http://arxiv.org/abs/2312.01262
repo_url: None
paper_authors: Kangcheng Liu
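for: This paper proposes a general and simple framework for point cloud understanding when labels are limited.
methods: The framework combines adapted traditional PFH-based 3D descriptors with learned semantics in a region merging strategy that considers both low-level geometric features and high-level semantics.
results: The framework performs best on three important weakly supervised point cloud understanding tasks: semantic segmentation, instance segmentation, and object detection. On the ScanNet data-efficient learning online benchmarks and four other large-scale 3D understanding benchmarks, it outperforms the current state of the art under various experimental settings.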

Abstract
Existing state-of-the-art 3D point cloud understanding methods perform well only in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point cloud understanding tasks including semantic segmentation, instance segmentation, and object detection even when a very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on ScanNet data-efficient learning online benchmarks and other four large-scale 3D understanding benchmarks under various experimental settings, outperforming the current state of the art by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.

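Summary
Existing state-of-the-art 3D point cloud understanding methods perform well only under full supervision. To the best of our knowledge, there is no unified framework that simultaneously solves downstream high-level understanding tasks, including segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework for point cloud understanding with limited labels. Our first contribution is an extensive comparison of traditional and learned 3D descriptors, which shows that our adapted traditional PFH-based 3D descriptors generalize very well across domains. Our second contribution is a learning-based region merging strategy built on the traditional/learned 3D descriptors and learned semantics. The merging process considers both low-level geometric and high-level semantic feature correlations. Experiments show that the framework performs best on the three major weakly supervised point cloud understanding tasks (semantic segmentation, instance segmentation, and object detection) even when labels are very limited. Our method, Region Merging 3D (RM3D), performs strongly on the ScanNet data-efficient learning online benchmarks and four other large-scale 3D understanding benchmarks, outperforming the current state of the art by a margin.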

A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing

paper_url: http://arxiv.org/abs/2312.02208
repo_url: None
paper_authors: Kangcheng Liu
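for: This study proposes a general and simple framework for 3D point cloud understanding when labels are limited.
methods: The authors propose a novel unsupervised region-expansion-based clustering method to generate clusters. They also propose learning to merge the over-divided clusters based on local low-level geometric similarities and learned high-level feature similarities, guided by weak labels.
results: The framework achieves the best performance on the three most important weakly supervised 3D point cloud understanding tasks: semantic segmentation, instance segmentation, and object detection. Under limited labels it also performs best in large-scale 3D semantic scene parsing. The techniques have potential for downstream tasks. Code and models are available at https://github.com/KangchengLiu.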

Abstract
Existing state-of-the-art 3D point cloud understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo labels merging taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point cloud understanding tasks including semantic segmentation, instance segmentation, and object detection even when limited points are labeled, under the data-efficient settings for the large-scale 3D semantic scene parsing. The developed techniques have the potential to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: https://github.com/KangchengLiu.

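Summary
Existing state-of-the-art 3D point cloud understanding methods perform well only under full supervision. To the best of our knowledge, there is no unified framework that simultaneously solves downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework for point cloud understanding with limited labels. We propose a new unsupervised region-expansion-based clustering method for generating clusters. More importantly, we propose learning to merge the over-divided clusters based on local low-level geometric similarities and learned high-level feature similarities, supervised by weak labels. The true weak labels thus guide the merging of pseudo labels while taking both geometric and semantic feature correlations into account. Finally, self-supervised reconstruction and data augmentation optimization modules guide label propagation among semantically similar points within a scene. Experiments show that the framework performs best on the three most important weakly supervised point cloud understanding tasks (semantic segmentation, instance segmentation, and object detection) when only limited points are labeled, under data-efficient settings. The techniques have potential for downstream tasks such as robotic manipulation and autonomous navigation. Code and models are available at https://github.com/KangchengLiu.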
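Merging over-divided clusters by affinity can be sketched with a union-find structure. This is an illustration under stated assumptions, not the paper's method: the paper learns the merging under weak-label supervision, whereas the sketch uses a fixed equal weighting of a geometric and a semantic similarity score and a hypothetical threshold.

```python
def merge_regions(n_regions, affinities, threshold=0.5):
    """Merge regions whose combined affinity exceeds a threshold.

    affinities maps a region pair (i, j) to (geometric_sim, semantic_sim),
    both assumed to lie in [0, 1]. Returns one representative id per region.
    """
    parent = list(range(n_regions))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for (i, j), (geo, sem) in sorted(affinities.items()):
        # Equal weighting of the two cues is an assumption for illustration.
        if 0.5 * geo + 0.5 * sem > threshold:
            parent[find(j)] = find(i)
    return [find(i) for i in range(n_regions)]

affinities = {
    (0, 1): (0.9, 0.8),  # similar geometry and semantics: merge
    (1, 2): (0.2, 0.3),  # dissimilar: keep apart
    (2, 3): (0.7, 0.9),  # merge
}
labels = merge_regions(4, affinities)
print(labels)
```

Regions that end up with the same representative id belong to one merged segment, so the four toy clusters collapse into two.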

TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

paper_url: http://arxiv.org/abs/2312.01261
repo_url: None
paper_authors: Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, Matthew Turk
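for: This paper studies and evaluates a broad spectrum of biases in Text-to-Image (TTI) generative models, including societal biases such as gender and ethnicity.
methods: The paper proposes a general approach based on counterfactual reasoning that automatically identifies and measures the biases potentially relevant to a given prompt, and explains them in terms of semantic concepts.
results: The method explains complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for a given prompt. User studies show that its results are consistent with human judgements.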

Abstract
Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such model's ability to generate more diverse imagery. In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. Unlike other works that evaluate generated images on a predefined set of bias axes, our approach automatically identifies potential biases that might be relevant to the given prompt, and measures those biases. In addition, our paper extends quantitative scores with post-hoc explanations in terms of semantic concepts in the images generated. We show that our method is uniquely capable of explaining complex multi-dimensional biases through semantic concepts, as well as the intersectionality between different biases for any given prompt. We perform extensive user studies to illustrate that the results of our method and analysis are consistent with human judgements.

Meta ControlNet: Enhancing Task Adaptation via Meta Learning

paper_url: http://arxiv.org/abs/2312.01255
repo_url: https://github.com/junjieyang97/meta-controlnet
paper_authors: Junjie Yang, Jinze Zhao, Peihao Wang, Zhangyang Wang, Yingbin Liang
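for: This paper aims to make ControlNet, a diffusion-based image synthesis method, adapt to new control tasks faster.
methods: The method adopts a task-agnostic meta learning technique and features a new layer freezing design.
results: Meta ControlNet reduces the learning steps needed to attain control ability from 5000 to 1000, adapts zero-shot to edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps on more complex non-edge tasks such as Human Pose.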

Abstract
Diffusion-based image synthesis has attracted extensive attention recently. In particular, ControlNet, which uses image-based prompts, exhibits powerful capability in image tasks such as canny edge detection and generates images well aligned with these prompts. However, vanilla ControlNet generally requires extensive training of around 5000 steps to achieve desirable control for a single task. Recent context-learning approaches have improved its adaptability, but mainly for edge-based tasks, and rely on paired examples. Thus, two important open issues are yet to be addressed to reach the full potential of ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet method, which adopts the task-agnostic meta learning technique and features a new layer freezing design. Meta ControlNet significantly reduces learning steps to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits direct zero-shot adaptability in edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps in more complex non-edge tasks such as Human Pose, outperforming all existing methods. The code is available at https://github.com/JunjieYang97/Meta-ControlNet.
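
To make the combination of meta learning and layer freezing concrete, here is a toy sketch. It assumes a Reptile-style first-order meta update on a two-parameter linear model, which is a stand-in chosen for brevity and is not necessarily the algorithm the paper uses. The names `inner_adapt` and `meta_train`, the `frozen` set, and all hyperparameters are hypothetical.

```python
def inner_adapt(params, frozen, task, steps, lr):
    """Run a few gradient steps on one task, updating only non-frozen parameters."""
    adapted = dict(params)
    for _ in range(steps):
        grads = {"w": 0.0, "b": 0.0}
        for x, y in task:
            err = adapted["w"] * x + adapted["b"] - y  # prediction error
            grads["w"] += 2 * err * x / len(task)      # d(MSE)/dw
            grads["b"] += 2 * err / len(task)          # d(MSE)/db
        for name in adapted:
            if name not in frozen:                     # frozen parameters stay fixed
                adapted[name] -= lr * grads[name]
    return adapted


def meta_train(params, frozen, tasks, meta_steps=50, inner_steps=5,
               inner_lr=0.1, meta_lr=0.5):
    """Reptile-style meta training: move the shared initialization toward
    each task's adapted parameters, skipping frozen parameters."""
    params = dict(params)
    for step in range(meta_steps):
        task = tasks[step % len(tasks)]
        adapted = inner_adapt(params, frozen, task, inner_steps, inner_lr)
        for name in params:
            if name not in frozen:
                params[name] += meta_lr * (adapted[name] - params[name])
    return params


# Two toy tasks that share the slope w = 2 but differ in the offset b.
tasks = [[(0.0, 1.0), (1.0, 3.0)], [(0.0, 3.0), (1.0, 5.0)]]
init = {"w": 2.0, "b": 0.0}
meta_params = meta_train(init, frozen={"w"}, tasks=tasks)
```

In this sketch the frozen slope is never modified, while the offset settles between the two task optima, so a handful of task-specific steps from `meta_params` is enough to fit either task. Which layers Meta ControlNet actually freezes is not stated in the abstract.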

Summary
Diffusion-based image synthesis has attracted extensive attention in recent years. In particular, ControlNet, which uses image-based prompts, shows powerful capability in image tasks such as canny edge detection and generates images that align well with these prompts. However, vanilla ControlNet usually requires around 5000 training steps to reach a desirable level of control. Recent context-learning approaches have improved its adaptability, but mainly for edge-based tasks, and they rely on paired examples. Two important open issues therefore need to be addressed to fully exploit the potential of ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. This paper proposes a novel Meta ControlNet method, which adopts the task-agnostic meta learning technique and features a new layer freezing design. Meta ControlNet reduces the learning steps needed to attain control ability from 5000 to 1000. Furthermore, it exhibits direct zero-shot adaptability in edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps in more complex non-edge tasks such as Human Pose, outperforming all existing methods. The code is available at https://github.com/JunjieYang97/Meta-ControlNet.

TranSegPGD: Improving Transferability of Adversarial Examples on Semantic Segmentation

paper_url: http://arxiv.org/abs/2312.02207
repo_url: None
paper_authors: Xiaojun Jia, Jindong Gu, Yihao Huang, Simeng Qin, Qing Guo, Yang Liu, Xiaochun Cao
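for: This paper studies the transferability of adversarial examples on semantic segmentation and proposes an effective two-stage adversarial attack strategy, TranSegPGD.
methods: In the first stage, every pixel of the input image is assigned to a branch based on its adversarial property, and the branches receive different optimization weights to improve the adversarial performance on all pixels. In the second stage, pixels are assigned to branches based on their transferability, which depends on Kullback-Leibler divergence, and the branches again receive different weights to improve the transferability of the adversarial examples.
results: Experiments with various segmentation models on the PASCAL VOC 2012 and Cityscapes datasets show that the proposed attack achieves state-of-the-art performance.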

Abstract
Transferability of adversarial examples on image classification has been systematically explored, which generates adversarial examples in black-box mode. However, the transferability of adversarial examples on semantic segmentation has been largely overlooked. In this paper, we propose an effective two-stage adversarial attack strategy to improve the transferability of adversarial examples on semantic segmentation, dubbed TranSegPGD. Specifically, at the first stage, every pixel in an input image is divided into different branches based on its adversarial property. Different branches are assigned different weights for optimization to improve the adversarial performance of all pixels. We assign high weights to the loss of the hard-to-attack pixels to misclassify all pixels. At the second stage, the pixels are divided into different branches based on their transferable property, which is dependent on Kullback-Leibler divergence. Different branches are assigned different weights for optimization to improve the transferability of the adversarial examples. We assign high weights to the loss of the high-transferability pixels to improve the transferability of adversarial examples. Extensive experiments with various segmentation models are conducted on PASCAL VOC 2012 and Cityscapes datasets to demonstrate the effectiveness of the proposed method. The proposed adversarial attack method can achieve state-of-the-art performance.
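
The two weighted losses described above can be sketched as follows. This is a minimal illustration under several assumptions that the abstract does not confirm: a pixel counts as hard to attack if it is still correctly classified under the current perturbation, transferability is scored by the KL divergence between the clean and adversarial class distributions of a pixel, branches are split by a fixed threshold, and the weight values are arbitrary. All function and parameter names are hypothetical. An attack such as PGD would maximize these losses with respect to the perturbation, which is omitted here.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one pixel's class distribution against its label."""
    return -math.log(max(probs[label], 1e-12))


def kl_divergence(p, q):
    """KL(p || q) for two discrete class distributions."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)


def stage1_loss(adv_probs, labels, hard_weight=2.0, easy_weight=1.0):
    """Stage 1 (assumed form): up-weight pixels that are still classified correctly."""
    total = 0.0
    for probs, label in zip(adv_probs, labels):
        hard_to_attack = argmax(probs) == label
        weight = hard_weight if hard_to_attack else easy_weight
        total += weight * cross_entropy(probs, label)
    return total / len(labels)


def stage2_loss(adv_probs, clean_probs, labels, threshold,
                high_weight=2.0, low_weight=1.0):
    """Stage 2 (assumed form): up-weight pixels whose KL score passes the threshold."""
    total = 0.0
    for adv, clean, label in zip(adv_probs, clean_probs, labels):
        transferable = kl_divergence(clean, adv) >= threshold
        weight = high_weight if transferable else low_weight
        total += weight * cross_entropy(adv, label)
    return total / len(labels)


# Two pixels, two classes, both with ground-truth label 0 (made-up values).
labels = [0, 0]
clean_probs = [[0.9, 0.1], [0.9, 0.1]]
adv_probs = [[0.9, 0.1], [0.2, 0.8]]  # pixel 0 resists the attack, pixel 1 is fooled
loss1 = stage1_loss(adv_probs, labels)
loss2 = stage2_loss(adv_probs, clean_probs, labels, threshold=0.5)
```

In this toy case stage 1 doubles the weight of pixel 0, which is still correct, while stage 2 doubles the weight of pixel 1, whose prediction moved furthest from the clean one.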

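Summary
The transferability of adversarial examples on image classification has been systematically explored, but the transferability of adversarial examples on semantic segmentation has been largely overlooked. This paper proposes an effective two-stage attack strategy, called TranSegPGD, to improve the transferability of adversarial examples on semantic segmentation. In the first stage, every pixel in an image is assigned to a branch according to its adversarial property, and different branches are assigned different weights to optimize adversarial performance. High weights are assigned to the loss of hard-to-attack pixels so that all pixels are misclassified. In the second stage, pixels are assigned to branches according to their transferability, which is based on Kullback-Leibler divergence, and different branches are assigned different weights to optimize transferability. High weights are assigned to the loss of high-transferability pixels to improve the transferability of the adversarial examples. Extensive experiments on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the effectiveness of the proposed attack, which can achieve state-of-the-art performance.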