cs.CV - 2023-09-14

Morphologically-Aware Consensus Computation via Heuristics-based IterATive Optimization (MACCHIatO)

  • paper_url: http://arxiv.org/abs/2309.08066
  • repo_url: None
  • paper_authors: Dimitri Hamzaoui, Sarah Montagne, Raphaële Renard-Penna, Nicholas Ayache, Hervé Delingette
  • for: To compute a consensus segmentation from multiple raters in a way that is independent of the image background size and of the choice of prior.
  • methods: A binary or probabilistic consensus segmentation is built from the Fréchet means of carefully chosen distances, optimized with a heuristic approach (a sketch follows the abstract below).
  • results: Extensive comparisons on several datasets show binary consensus masks of intermediate size between Majority Voting and STAPLE, and posterior probabilities that differ from those of Mask Averaging and STAPLE.
    Abstract The extraction of consensus segmentations from several binary or probabilistic masks is important to solve various tasks such as the analysis of inter-rater variability or the fusion of several neural network outputs. One of the most widely used methods to obtain such a consensus segmentation is the STAPLE algorithm. In this paper, we first demonstrate that the output of that algorithm is heavily impacted by the background size of images and the choice of the prior. We then propose a new method to construct a binary or a probabilistic consensus segmentation based on the Fr\'{e}chet means of carefully chosen distances which makes it totally independent of the image background size. We provide a heuristic approach to optimize this criterion such that a voxel's class is fully determined by its voxel-wise distance to the different masks, the connected component it belongs to and the group of raters who segmented it. We compared extensively our method on several datasets with the STAPLE method and the naive segmentation averaging method, showing that it leads to binary consensus masks of intermediate size between Majority Voting and STAPLE and to different posterior probabilities than Mask Averaging and STAPLE methods. Our code is available at https://gitlab.inria.fr/dhamzaou/jaccardmap .
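To make the Fréchet-mean view of consensus concrete, here is a minimal sketch assuming plain voxel-wise distances, under which the criterion reduces to mask averaging (probabilistic) or majority voting (binary). The paper's actual criterion uses carefully chosen set distances and a heuristic over connected components and rater groups, so the function below is illustrative only.

```python
import numpy as np

def frechet_consensus(masks, probabilistic=False):
    """Toy Fréchet-mean consensus: argmin_C sum_r d(C, M_r)^2 over rater masks.

    masks: array of shape (R, H, W) with binary rater segmentations.
    With a voxel-wise squared (Hamming) distance, the minimizer is the mean
    mask (probabilistic case) or the majority vote (binary case). MACCHIatO
    instead optimizes distances chosen so the result is background-independent.
    """
    masks = np.asarray(masks, dtype=float)
    mean = masks.mean(axis=0)                  # voxel-wise average over raters
    if probabilistic:
        return mean                            # probabilistic consensus
    return (mean >= 0.5).astype(np.uint8)      # binary consensus (majority vote)

# Usage: three toy 4x4 rater masks
raters = np.random.randint(0, 2, size=(3, 4, 4))
print(frechet_consensus(raters))
```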

Interpretability-Aware Vision Transformer

  • paper_url: http://arxiv.org/abs/2309.08035
  • repo_url: https://github.com/qiangyao1988/ia-vit
  • paper_authors: Yao Qiang, Chengyin Li, Prashant Khanduri, Dongxiao Zhu
  • for: Making Vision Transformers (ViTs) interpretable without sacrificing classification performance.
  • methods: A new training procedure that inherently enhances interpretability: IA-ViT jointly trains a feature extractor, a predictor, and an interpreter, and the interpreter provides faithful explanations through its single-head self-attention mechanism (a sketch follows the abstract below).
  • results: Strong performance on several image classification tasks, with both qualitative and quantitative evaluations of model performance and interpretability.
    Abstract Vision Transformers (ViTs) have become prominent models for solving various vision tasks. However, the interpretability of ViTs has not kept pace with their promising performance. While there has been a surge of interest in developing {\it post hoc} solutions to explain ViTs' outputs, these methods do not generalize to different downstream tasks and various transformer architectures. Furthermore, if ViTs are not properly trained with the given data and do not prioritize the region of interest, the {\it post hoc} methods would be less effective. Instead of developing another {\it post hoc} approach, we introduce a novel training procedure that inherently enhances model interpretability. Our interpretability-aware ViT (IA-ViT) draws inspiration from a fresh insight: both the class patch and image patches consistently generate predicted distributions and attention maps. IA-ViT is composed of a feature extractor, a predictor, and an interpreter, which are trained jointly with an interpretability-aware training objective. Consequently, the interpreter simulates the behavior of the predictor and provides a faithful explanation through its single-head self-attention mechanism. Our comprehensive experimental results demonstrate the effectiveness of IA-ViT in several image classification tasks, with both qualitative and quantitative evaluations of model performance and interpretability. Source code is available from: https://github.com/qiangyao1988/IA-ViT.
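A hedged sketch of the three components named in the abstract (feature extractor, predictor, interpreter with single-head self-attention). The layer sizes, the dummy backbone, and the loss suggested in the final comment are assumptions for illustration, not the authors' implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn

class IAViTSketch(nn.Module):
    """Feature extractor + predictor + interpreter, per the abstract."""
    def __init__(self, backbone: nn.Module, dim: int = 32, num_classes: int = 10):
        super().__init__()
        self.backbone = backbone                      # feature extractor -> (B, N+1, dim) tokens
        self.predictor = nn.Linear(dim, num_classes)  # classifies the class token
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.interp_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.backbone(x)                     # (B, N+1, dim), index 0 = class token
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        logits = self.predictor(cls_tok.squeeze(1))   # predictor branch
        # interpreter: the class token attends over patch tokens with a single
        # head; the attention map doubles as the explanation
        interp_tok, attn_map = self.attn(cls_tok, patches, patches)
        interp_logits = self.interp_head(interp_tok.squeeze(1))
        return logits, interp_logits, attn_map

class DummyBackbone(nn.Module):
    def forward(self, x):                             # pretend x is already tokenized
        return x

model = IAViTSketch(DummyBackbone())
logits, interp_logits, attn = model(torch.randn(2, 5, 32))   # 1 class token + 4 patches
# Assumed joint objective: task loss plus a term making the interpreter
# simulate the predictor, e.g. CE(logits, y) + KL(interp_logits || logits).
```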

Depth Estimation from a Single Optical Encoded Image using a Learned Colored-Coded Aperture

  • paper_url: http://arxiv.org/abs/2309.08033
  • repo_url: None
  • paper_authors: Jhon Lopez, Edwin Vargas, Henry Arguello
  • for: Improving depth estimation from a single snapshot by placing multiple color filters in the lens aperture, so that different depths produce distinguishable coded patterns.
  • methods: A color-coded aperture (CCA) with a larger number of color filters and richer spectral information is learned jointly with a convolutional neural network (CNN) that retrieves depth, using end-to-end optimization.
  • results: Experiments on three datasets show better depth estimates than state-of-the-art approaches, and a low-cost photographic-film prototype validates the method in real scenarios.
    Abstract Depth estimation from a single image of a conventional camera is a challenging task since depth cues are lost during the acquisition process. State-of-the-art approaches improve the discrimination between different depths by introducing a binary-coded aperture (CA) in the lens aperture that generates different coded blur patterns at different depths. Color-coded apertures (CCA) can also produce color misalignment in the captured image which can be utilized to estimate disparity. Leveraging advances in deep learning, more recent works have explored the data-driven design of a diffractive optical element (DOE) for encoding depth information through chromatic aberrations. However, compared with binary CA or CCA, DOEs are more expensive to fabricate and require high-precision devices. Different from previous CCA-based approaches that employ few basic colors, in this work we propose a CCA with a greater number of color filters and richer spectral information to optically encode relevant depth information in a single snapshot. Furthermore, we propose to jointly learn the color-coded aperture (CCA) pattern and a convolutional neural network (CNN) to retrieve depth information by using an end-to-end optimization approach. We demonstrate through different experiments on three different data sets that the designed color-encoding has the potential to remove depth ambiguities and provides better depth estimates compared to state-of-the-art approaches. Additionally, we build a low-cost prototype of our CCA using a photographic film and validate the proposed approach in real scenarios.

Empowering Visually Impaired Individuals: A Novel Use of Apple Live Photos and Android Motion Photos

  • paper_url: http://arxiv.org/abs/2309.08022
  • repo_url: None
  • paper_authors: Seyedalireza Khoshsirat, Chandra Kambhamettu
  • for: Investigating whether Apple Live Photos and Android Motion Photos can improve machine-learning-based visual assistance for visually impaired users, whose captured images are often of sub-optimal quality.
  • methods: A straightforward methodology to evaluate and compare Live/Motion Photos against traditional single-image inputs.
  • results: Live Photos and Motion Photos outperform single-frame images on common visual assisting tasks, specifically object classification and VideoQA; results are validated on the ORBIT dataset, with ablation studies on deblurring and longer temporal crops.
    Abstract Numerous applications have been developed to assist visually impaired individuals that employ a machine learning unit to process visual input. However, a critical challenge with these applications is the sub-optimal quality of images captured by the users. Given the complexity of operating a camera for visually impaired individuals, we advocate for the use of Apple Live Photos and Android Motion Photos technologies. In this study, we introduce a straightforward methodology to evaluate and contrast the efficacy of Live/Motion Photos against traditional image-based approaches. Our findings reveal that both Live Photos and Motion Photos outperform single-frame images in common visual assisting tasks, specifically in object classification and VideoQA. We validate our results through extensive experiments on the ORBIT dataset, which consists of videos collected by visually impaired individuals. Furthermore, we conduct a series of ablation studies to delve deeper into the impact of deblurring and longer temporal crops.

Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.08020
  • repo_url: https://github.com/zhaochongan/the-mask
  • paper_authors: Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, Luc Van Gool
  • for: Improving video semantic segmentation (VSS) by letting more object queries receive meaningful gradient updates, increasing their expressive power.
  • methods: Temporal-aware hierarchical object queries trained with a simple two-round matching mechanism and a hierarchical loss, plus a temporal aggregation decoder to capture information across frames (a sketch of the matching follows the abstract below).
  • results: State-of-the-art performance on the latest challenging VSS benchmark VSPW, without bells and whistles.
    Abstract Modern approaches have proved the huge potential of addressing semantic segmentation as a mask classification task which is widely used in instance-level segmentation. This paradigm trains models by assigning part of object queries to ground truths via conventional one-to-one matching. However, we observe that the popular video semantic segmentation (VSS) dataset has limited categories per video, meaning less than 10% of queries could be matched to receive meaningful gradient updates during VSS training. This inefficiency limits the full expressive potential of all queries.Thus, we present a novel solution THE-Mask for VSS, which introduces temporal-aware hierarchical object queries for the first time. Specifically, we propose to use a simple two-round matching mechanism to involve more queries matched with minimal cost during training while without any extra cost during inference. To support our more-to-one assignment, in terms of the matching results, we further design a hierarchical loss to train queries with their corresponding hierarchy of primary or secondary. Moreover, to effectively capture temporal information across frames, we propose a temporal aggregation decoder that fits seamlessly into the mask-classification paradigm for VSS. Utilizing temporal-sensitive multi-level queries, our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
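A hedged sketch of a two-round, more-to-one matching step of the kind the abstract describes, using the standard Hungarian solver: a second round assigns otherwise-unmatched queries as "secondary" so they still receive gradients. The cost matrix and the primary/secondary weighting are assumptions, not the exact THE-Mask procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def two_round_matching(cost):
    """cost: (num_queries, num_gt) matching cost matrix (assumed given)."""
    q_idx, gt_idx = linear_sum_assignment(cost)          # round 1: primary matches
    primary = list(zip(q_idx.tolist(), gt_idx.tolist()))
    remaining = np.setdiff1d(np.arange(cost.shape[0]), q_idx)
    if remaining.size == 0:
        return primary, []
    q2, gt2 = linear_sum_assignment(cost[remaining])     # round 2: secondary matches
    secondary = list(zip(remaining[q2].tolist(), gt2.tolist()))
    return primary, secondary

cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.3, 0.4],
                 [0.7, 0.6]])
primary, secondary = two_round_matching(cost)
print(primary, secondary)
# A hierarchical loss (assumed form) would then down-weight secondary matches:
# loss = L(primary) + lambda_secondary * L(secondary)
```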

Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset

  • paper_url: http://arxiv.org/abs/2309.08009
  • repo_url: None
  • paper_authors: Iya Chivileva, Philip Lynch, Tomas E. Ward, Alan F. Smeaton
  • for: Evaluating the quality of videos generated by text-to-video (T2V) models, so that their outputs are plausible enough to convince a viewer of their authenticity.
  • methods: An analysis of commonly used quality metrics and their limitations, applied to a new dataset of more than 1,000 videos generated by 5 very recent T2V models, together with extensive human quality evaluations.
  • results: Naturalness and semantic matching with the text prompt are important factors in assessing T2V output quality, but no single metric captures these subtleties.
    Abstract Evaluating the quality of videos generated from text-to-video (T2V) models is important if they are to produce plausible outputs that convince a viewer of their authenticity. We examine some of the metrics used in this area and highlight their limitations. The paper presents a dataset of more than 1,000 generated videos from 5 very recent T2V models on which some of those commonly used quality metrics are applied. We also include extensive human quality evaluations on those videos, allowing the relative strengths and weaknesses of metrics, including human assessment, to be compared. The contribution is an assessment of commonly used quality metrics, and a comparison of their performances and the performance of human evaluations on an open dataset of T2V videos. Our conclusion is that naturalness and semantic matching with the text prompt used to generate the T2V output are important but there is no single measure to capture these subtleties in assessing T2V model output.

Kinship Verification from rPPG using 1DCNN Attention networks

  • paper_url: http://arxiv.org/abs/2309.08006
  • repo_url: None
  • paper_authors: Xiaoting Wu, Xiaoyi Feng, Lili Liu, Constantino Álvarez Casado, Miguel Bordallo López
  • for: Verifying kinship between two subjects from remote photoplethysmography (rPPG) signals extracted from facial videos.
  • methods: A one-dimensional convolutional neural network (1DCNN) with a 1DCNN-Attention module and a contrastive loss learns kinship similarity from rPPG signals taken from multiple facial regions of interest (a sketch follows the abstract below).
  • results: Evaluation on the UvA-NEMO Smile Database across different kin relations shows that rPPG signals are useful for kinship verification.
    Abstract Facial kinship verification aims at automatically determining whether two subjects have a kinship relation. It has been widely studied from different modalities, such as faces, voices, gait, and smiling expressions. However, the potential of bio-signals, such as remote Photoplethysmography (rPPG) extracted from facial videos, remains largely unexplored in the kinship verification problem. In this paper, we investigate for the first time the usage of the rPPG signal for kinship verification. Specifically, we proposed a one-dimensional Convolutional Neural Network (1DCNN) with a 1DCNN-Attention module and contrastive loss to learn the kinship similarity from rPPGs. The network takes multiple rPPG signals extracted from various facial Regions of Interest (ROIs) as inputs. Additionally, the 1DCNN attention module is designed to learn and capture the discriminative kin features from feature embeddings. Finally, the proposed method is evaluated on the UvANEMO Smile Database from different kin relations, showing the usefulness of rPPG signals in verifying kinship.
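A hedged sketch of a 1DCNN with an attention module and a contrastive pair loss over rPPG signals from several facial ROIs, the ingredients listed in the abstract; channel counts, kernel sizes, and the attention form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPPGEncoder(nn.Module):
    """1DCNN over multi-ROI rPPG signals with temporal attention pooling."""
    def __init__(self, in_ch=4, dim=64):           # in_ch = number of facial ROIs
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, dim, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.attn = nn.Conv1d(dim, 1, kernel_size=1)   # per-timestep attention scores

    def forward(self, x):                           # x: (B, in_ch, T)
        h = self.conv(x)                            # (B, dim, T)
        w = torch.softmax(self.attn(h), dim=-1)
        return (h * w).sum(dim=-1)                  # attention-pooled embedding (B, dim)

def contrastive_loss(z1, z2, kin, margin=1.0):
    """Kin pairs (kin=1) are pulled together, non-kin pairs pushed apart."""
    d = F.pairwise_distance(z1, z2)
    return (kin * d.pow(2) + (1 - kin) * F.relu(margin - d).pow(2)).mean()

enc = RPPGEncoder()
z1, z2 = enc(torch.randn(8, 4, 300)), enc(torch.randn(8, 4, 300))
print(contrastive_loss(z1, z2, kin=torch.randint(0, 2, (8,)).float()))
```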

TCGF: A unified tensorized consensus graph framework for multi-view representation learning

  • paper_url: http://arxiv.org/abs/2309.09987
  • repo_url: None
  • paper_authors: Xiangzhu Meng, Wei Wei, Qiang Liu, Shu Wu, Liang Wang
  • for: A unified multi-view representation learning framework, the Tensorized Consensus Graph Framework (TCGF), addressing the lack of a scalable and robust framework that generalizes existing multi-view methods.
  • methods: (1) a unified framework that exploits per-view representations under arbitrary assumptions and at different scales; (2) the per-view representations are stacked into a tensor under alignment bases as a high-order representation, allowing smooth propagation of consistency and complementary information across views; (3) a consensus embedding shared by all views is learned, regularized by a view-consensus grouping effect, to uncover the essential structure of the multi-view data.
  • results: Experiments on seven datasets of different scales show clear advantages over state-of-the-art multi-view learning methods.
    Abstract Multi-view learning techniques have recently gained significant attention in the machine learning domain for their ability to leverage consistency and complementary information across multiple views. However, there remains a lack of sufficient research on generalized multi-view frameworks that unify existing works into a scalable and robust learning framework, as most current works focus on specific styles of multi-view models. Additionally, most multi-view learning works rely heavily on specific-scale scenarios and fail to effectively comprehend multiple scales holistically. These limitations hinder the effective fusion of essential information from multiple views, resulting in poor generalization. To address these limitations, this paper proposes a universal multi-view representation learning framework named Tensorized Consensus Graph Framework (TCGF). Specifically, it first provides a unified framework for existing multi-view works to exploit the representations for individual view, which aims to be suitable for arbitrary assumptions and different-scales datasets. Then, stacks them into a tensor under alignment basics as a high-order representation, allowing for the smooth propagation of consistency and complementary information across all views. Moreover, TCGF proposes learning a consensus embedding shared by adaptively collaborating all views to uncover the essential structure of the multi-view data, which utilizes view-consensus grouping effect to regularize the view-consensus representation. To further facilitate related research, we provide a specific implementation of TCGF for large-scale datasets, which can be efficiently solved by applying the alternating optimization strategy. Experimental results conducted on seven different-scales datasets indicate the superiority of the proposed TCGF against existing state-of-the-art multi-view learning methods.

M3Dsynth: A dataset of medical 3D images with AI-generated local manipulations

  • paper_url: http://arxiv.org/abs/2309.07973
  • repo_url: None
  • paper_authors: Giada Zingarini, Davide Cozzolino, Riccardo Corvi, Giovanni Poggi, Luisa Verdoliva
  • for: Detecting manipulated medical images, where local alterations could change the resulting diagnosis.
  • methods: Manipulated Computed Tomography (CT) lung images are created by injecting or removing lung cancer nodules in real scans, using three methods based on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), for a total of 8,577 manipulated samples.
  • results: The manipulated images easily fool automated diagnostic tools, while state-of-the-art forensic detectors trained on the dataset accurately detect and localize the manipulated synthetic content, generalizing well even when training and test sets are not aligned.
    Abstract The ability to detect manipulated visual content is becoming increasingly important in many application fields, given the rapid advances in image synthesis methods. Of particular concern is the possibility of modifying the content of medical images, altering the resulting diagnoses. Despite its relevance, this issue has received limited attention from the research community. One reason is the lack of large and curated datasets to use for development and benchmarking purposes. Here, we investigate this issue and propose M3Dsynth, a large dataset of manipulated Computed Tomography (CT) lung images. We create manipulated images by injecting or removing lung cancer nodules in real CT scans, using three different methods based on Generative Adversarial Networks (GAN) or Diffusion Models (DM), for a total of 8,577 manipulated samples. Experiments show that these images easily fool automated diagnostic tools. We also tested several state-of-the-art forensic detectors and demonstrated that, once trained on the proposed dataset, they are able to accurately detect and localize manipulated synthetic content, including when training and test sets are not aligned, showing good generalization ability. Dataset and code will be publicly available at https://grip-unina.github.io/M3Dsynth/.

Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping

  • paper_url: http://arxiv.org/abs/2309.07970
  • repo_url: None
  • paper_authors: Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Chen, Angjoo Kanazawa, Ken Goldberg
  • for: Enabling task-oriented grasping of objects by specific parts from natural language queries, which learning-based grasp planners struggle with unless trained on part-level data.
  • methods: LERF-TOGO reconstructs a LERF of the scene (distilling CLIP embeddings into a multi-scale 3D language field queryable with text), extracts a 3D object mask via DINO features, conditionally queries LERF on this mask to obtain a semantic distribution over the object, and uses that distribution to rank grasps from an off-the-shelf grasp planner (a sketch of the ranking step follows the abstract below).
  • results: On 31 different physical objects, LERF-TOGO selects grasps on the correct part in 81% of all trials and grasps successfully in 69%.
    Abstract Grasping objects by a specific part is often crucial for safety and for executing downstream tasks. Yet, learning-based grasp planners lack this behavior unless they are trained on specific object part data, making it a significant challenge to scale object diversity. Instead, we propose LERF-TOGO, Language Embedded Radiance Fields for Task-Oriented Grasping of Objects, which uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. To accomplish this, we first reconstruct a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field queryable with text. However, LERF has no sense of objectness, meaning its relevancy outputs often return incomplete activations over an object which are insufficient for subsequent part queries. LERF-TOGO mitigates this lack of spatial grouping by extracting a 3D object mask via DINO features and then conditionally querying LERF on this mask to obtain a semantic distribution over the object with which to rank grasps from an off-the-shelf grasp planner. We evaluate LERF-TOGO's ability to grasp task-oriented object parts on 31 different physical objects, and find it selects grasps on the correct part in 81% of all trials and grasps successfully in 69%. See the project website at: lerftogo.github.io
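A hedged sketch of the final ranking step, assuming the reconstructed scene has already been reduced to per-point CLIP relevancies and DINO-based object ids; the real system queries a full LERF and an off-the-shelf grasp planner, so every input here is a stand-in.

```python
import numpy as np

def rank_grasps(clip_sim, dino_labels, object_id, grasp_candidates):
    """Mask-conditioned semantic distribution and grasp re-ranking.

    clip_sim:    (N,) LERF/CLIP relevancy of the language part query per point
    dino_labels: (N,) object ids from grouping DINO features (the 3D object mask)
    grasp_candidates: list of (grasp_pose, contact_point_index) from a planner
    """
    mask = dino_labels == object_id                     # 3D object mask via DINO
    part_dist = np.where(mask, clip_sim, 0.0)           # conditional query: zero outside the object
    part_dist = part_dist / (part_dist.sum() + 1e-8)    # semantic distribution over the object
    scored = [(float(part_dist[idx]), pose) for pose, idx in grasp_candidates]
    return sorted(scored, key=lambda t: -t[0])          # grasps on the queried part first

# Toy usage with random per-point values and two candidate grasps
sims = np.random.rand(100)
labels = np.random.randint(0, 3, 100)
grasps = [("pose_a", 5), ("pose_b", 42)]
print(rank_grasps(sims, labels, object_id=1, grasp_candidates=grasps))
```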

Large-Vocabulary 3D Diffusion Model with Transformer

  • paper_url: http://arxiv.org/abs/2309.07920
  • repo_url: None
  • paper_authors: Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu
  • for: Generating realistic 3D objects across a large number of real-world categories with a single generative model.
  • methods: A diffusion-based feed-forward framework, DiffTF, that combines three strategies: a revised triplane representation for efficiency and robustness; a 3D-aware transformer with shared cross-plane attention that aggregates generalized 3D knowledge with specialized per-category 3D features; and a 3D-aware encoder/decoder that strengthens the generalized knowledge for categories with complex appearances.
  • results: Experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) show that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation with large diversity, rich semantics, and high quality.
    Abstract Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing massive categories of real-world 3D objects with a single generative model. Notably, there are three major challenges for this large-vocabulary 3D generation: a) the need for expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, for handling challenges via three aspects. 1) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the cross-plane relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality.

OpenIllumination: A Multi-Illumination Dataset for Inverse Rendering Evaluation on Real Objects

  • paper_url: http://arxiv.org/abs/2309.07921
  • repo_url: None
  • paper_authors: Isabella Liu, Linghao Chen, Ziyang Fu, Liwen Wu, Haian Jin, Zhong Li, Chin Ming Ryan Wong, Yi Xu, Ravi Ramamoorthi, Zexiang Xu, Hao Su
  • for: Providing a large dataset of real captured images for evaluating inverse rendering and material decomposition algorithms on real objects.
  • methods: Over 108K images of 64 objects with diverse materials, captured under 72 camera views and a large number of different illuminations, with accurate camera parameters, illumination ground truth, and foreground segmentation masks; several state-of-the-art inverse rendering methods are evaluated on the dataset.
  • results: The evaluation exposes the limitations of existing inverse rendering methods and enables their quantitative comparison on real objects.
    Abstract We introduce OpenIllumination, a real-world dataset containing over 108K images of 64 objects with diverse materials, captured under 72 camera views and a large number of different illuminations. For each image in the dataset, we provide accurate camera parameters, illumination ground truth, and foreground segmentation masks. Our dataset enables the quantitative evaluation of most inverse rendering and material decomposition methods for real objects. We examine several state-of-the-art inverse rendering methods on our dataset and compare their performances. The dataset and code can be found on the project page: https://oppo-us-research.github.io/OpenIllumination.

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

  • paper_url: http://arxiv.org/abs/2309.07918
  • repo_url: https://github.com/openrobotlab/unihsi
  • paper_authors: Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang
  • for: A unified Human-Scene Interaction (HSI) framework with versatile interaction control and a user-friendly, language-based interface, for embodied AI and virtual reality.
  • methods: Interaction is defined as a Chain of Contacts (CoC): steps of human joint-object part pairs. A Large Language Model (LLM) planner translates language prompts into CoC task plans, and a unified controller turns the plans into uniform task execution (an illustrative CoC plan follows the abstract below).
  • results: Comprehensive experiments demonstrate versatile task execution and generalizability to real scanned scenes.
    Abstract Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and the development of a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as Chain of Contacts (CoC): steps of human joint-object part pairs, which is inspired by the strong correlation between interaction types and human-object contact regions. Based on the definition, UniHSI constitutes a Large Language Model (LLM) Planner to translate language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .
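A hedged illustration of a Chain-of-Contacts task plan as a data structure: interaction expressed as steps of human-joint / object-part contact pairs. The field names and the example plan are assumptions, not the actual UniHSI schema; the linked repository defines the real one.

```python
# Each step lists (human joint, object part, desired contact state).
chain_of_contacts = [
    {"step": 1, "contacts": [("pelvis", "chair:seat", "contact")]},
    {"step": 2, "contacts": [("right_hand", "chair:armrest", "contact"),
                             ("left_hand", "chair:armrest", "not contact")]},
]

# A unified controller would consume the plan step by step; here we only
# print the targets each step asks the low-level policy to satisfy.
for step in chain_of_contacts:
    for joint, part, state in step["contacts"]:
        print(f"step {step['step']}: drive {joint} -> {part} ({state})")
```

An LLM planner would emit such a plan from a prompt like "sit on the chair and rest your right arm"; the execution logic itself is not sketched here.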

Looking at words and points with attention: a benchmark for text-to-shape coherence

  • paper_url: http://arxiv.org/abs/2309.07917
  • repo_url: None
  • paper_authors: Andrea Amaduzzi, Giuseppe Lisanti, Samuele Salti, Luigi Di Stefano
  • for: Providing a comprehensive benchmark for assessing the coherence between textual descriptions and generated 3D shapes.
  • methods: Large language models automatically refine the textual descriptions associated with shapes, and a new quantitative metric based on cross-attention mechanisms assesses text-to-shape coherence.
  • results: A user study and quantitative comparisons show the new metric is more effective than existing ones; the refined dataset, the metric, and the validated text-shape pairs are publicly released as a fine-grained benchmark for text-conditioned 3D generative models.
    Abstract While text-conditional 3D object generation and manipulation have seen rapid progress, the evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark. The reason is twofold: a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs; b) the limited effectiveness of the metrics used to quantitatively assess such coherence. In this paper, we propose a comprehensive solution that addresses both weaknesses. Firstly, we employ large language models to automatically refine textual descriptions associated with shapes. Secondly, we propose a quantitative metric to assess text-to-shape coherence, through cross-attention mechanisms. To validate our approach, we conduct a user study and compare quantitatively our metric with existing ones. The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark that we publicly release to foster research on text-to-shape coherence of text-conditioned 3D generative models. Benchmark available at https://cvlab-unibo.github.io/CrossCoherence-Web/.

ALWOD: Active Learning for Weakly-Supervised Object Detection

  • paper_url: http://arxiv.org/abs/2309.07914
  • repo_url: None
  • paper_authors: Yuting Wang, Velibor Ilic, Jiatong Li, Branislav Kisacanin, Vladimir Pavlovic
  • for: Addressing the lack of large training datasets with precise object localization labels for object detection (OD).
  • methods: ALWOD fuses active learning (AL) with weakly and semi-supervised object detection: an auxiliary image generator warm-starts AL from an extremely small labeled set plus a large weakly tagged set; a new AL acquisition function leverages student-teacher detector disagreement and uncertainty to propose the most informative images (a sketch follows the abstract below); and human annotators label them rapidly by selecting and correcting model-proposed detections.
  • results: Across several challenging benchmarks, ALWOD significantly narrows the gap between detectors trained on a few strategically selected, partially labeled images and those trained on fully labeled data.
    Abstract Object detection (OD), a crucial vision task, remains challenged by the lack of large training datasets with precise object localization labels. In this work, we propose ALWOD, a new framework that addresses this problem by fusing active learning (AL) with weakly and semi-supervised object detection paradigms. Because the performance of AL critically depends on the model initialization, we propose a new auxiliary image generator strategy that utilizes an extremely small labeled set, coupled with a large weakly tagged set of images, as a warm-start for AL. We then propose a new AL acquisition function, another critical factor in AL success, that leverages the student-teacher OD pair disagreement and uncertainty to effectively propose the most informative images to annotate. Finally, to complete the AL loop, we introduce a new labeling task delegated to human annotators, based on selection and correction of model-proposed detections, which is both rapid and effective in labeling the informative images. We demonstrate, across several challenging benchmarks, that ALWOD significantly narrows the gap between the ODs trained on few partially labeled but strategically selected image instances and those that rely on the fully-labeled data. Our code is publicly available on https://github.com/seqam-lab/ALWOD.
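A hedged sketch of an acquisition score built from student-teacher disagreement and uncertainty, the two signals the abstract names; the exact formula below is an assumption, not the paper's acquisition function.

```python
import numpy as np

def acquisition_score(student_scores, teacher_scores, ious):
    """Score an unlabeled image for annotation priority.

    student_scores/teacher_scores: confidences of matched detections from the
    student and teacher detectors; ious: IoUs between their matched boxes.
    """
    disagreement = 1.0 - np.mean(ious)                       # localization disagreement
    confs = np.concatenate([student_scores, teacher_scores])
    uncertainty = 1.0 - 2.0 * np.abs(confs - 0.5).mean()     # confidences near 0.5 = uncertain
    return disagreement + uncertainty

# Rank unlabeled images and send the top ones to annotators, who only select
# and correct model-proposed detections rather than drawing boxes from scratch.
images = {"img_001": acquisition_score(np.array([0.55]), np.array([0.90]), np.array([0.4])),
          "img_002": acquisition_score(np.array([0.95]), np.array([0.97]), np.array([0.9]))}
print(sorted(images, key=images.get, reverse=True))
```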

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

  • paper_url: http://arxiv.org/abs/2309.07911
  • repo_url: https://github.com/alibaba-mmai-research/dist
  • paper_authors: Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
  • for: Addressing the limited temporal modeling capability of large pre-trained language-image models (such as CLIP) when naively transferred to video recognition.
  • methods: DiST uses a dual-encoder design: the pre-trained foundation model serves as the spatial encoder, a lightweight network serves as the temporal encoder, and an integration branch between them fuses spatio-temporal information (a sketch follows the abstract below).
  • results: Disentangled learning avoids back-propagating through the massive pre-trained parameters and benefits both spatial and temporal understanding; DiST outperforms state-of-the-art methods on five benchmarks and, when pre-trained on the large-scale Kinetics-710, reaches 89.7% on Kinetics-400 with a frozen ViT-L.
    Abstract Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial contents, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modeling capabilities. Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder, and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The disentangled spatial and temporal learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters. Meanwhile, we empirically show that disentangled learning with an extra network for integration benefits both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Codes and models can be found in https://github.com/alibaba-mmai-research/DiST.
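A hedged sketch of the dual-encoder layout described in the abstract: a frozen spatial encoder, a lightweight temporal encoder, and an integration branch. The GRU temporal encoder, the fusion form, and the pooling stub in the usage example are assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class DiSTSketch(nn.Module):
    """Frozen spatial encoder + lightweight temporal encoder + integration branch."""
    def __init__(self, spatial_encoder: nn.Module, dim=512, num_classes=400):
        super().__init__()
        self.spatial = spatial_encoder.eval()              # frozen foundation model
        for p in self.spatial.parameters():
            p.requires_grad_(False)                        # no back-prop through it
        self.temporal = nn.GRU(dim, dim, batch_first=True) # lightweight temporal encoder
        self.integrate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                             # frames: (B, T, C, H, W)
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.spatial(frames.flatten(0, 1))     # per-frame spatial features
        feats = feats.view(B, T, -1)
        temp, _ = self.temporal(feats)                     # temporal features
        fused = self.integrate(torch.cat([feats, temp], dim=-1)).mean(dim=1)
        return self.head(fused)

# Toy usage: a pooling stub stands in for the frozen CLIP/ViT image encoder
stub = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 512))
model = DiSTSketch(stub)
print(model(torch.randn(2, 8, 3, 32, 32)).shape)           # torch.Size([2, 400])
```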

TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting

  • paper_url: http://arxiv.org/abs/2309.07910
  • repo_url: None
  • paper_authors: Rohan Choudhury, Kris Kitani, Laszlo A. Jeni
  • for: Efficient multi-view 3D human pose estimation, tracking, and forecasting.
  • methods: Per-person 2D pose features are computed recurrently and fused with spatial and temporal information into a single spatiotemporal representation, improving pose accuracy while greatly reducing computation; the same representation is used to track poses over time and forecast future poses.
  • results: On the CMU Panoptic Studio dataset, TEMPO achieves 10% better MPJPE than TesseTrack with a 33x improvement in FPS, and generalizes across datasets without scene-specific fine-tuning.
    Abstract Existing volumetric methods for predicting 3D human pose estimation are accurate, but computationally expensive and optimized for single time-step prediction. We present TEMPO, an efficient multi-view pose estimation model that learns a robust spatiotemporal representation, improving pose accuracy while also tracking and forecasting human pose. We significantly reduce computation compared to the state-of-the-art by recurrently computing per-person 2D pose features, fusing both spatial and temporal information into a single representation. In doing so, our model is able to use spatiotemporal context to predict more accurate human poses without sacrificing efficiency. We further use this representation to track human poses over time as well as predict future poses. Finally, we demonstrate that our model is able to generalize across datasets without scene-specific fine-tuning. TEMPO achieves 10$\%$ better MPJPE with a 33$\times$ improvement in FPS compared to TesseTrack on the challenging CMU Panoptic Studio dataset.

Physically Plausible Full-Body Hand-Object Interaction Synthesis

  • paper_url: http://arxiv.org/abs/2309.07907
  • repo_url: https://github.com/eth-ait/phys-fullbody-grasp
  • paper_authors: Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, Otmar Hilliges
  • for: A physics-based method for synthesizing dexterous hand-object interactions in a full-body setting.
  • methods: Reinforcement learning (RL) and physics simulation mitigate the limitations of data-driven approaches: skill priors for body and hand motion are first learned in a decoupled setting, and a high-level policy then controls hand-object interaction in these pretrained latent spaces, guided by grasping and 3D target-trajectory objectives and trained with a reward that combines an adversarial style term with a task reward.
  • results: The method completes the full interaction task, from approaching an object to grasping and subsequent manipulation, and produces more physically plausible motions than kinematics-based baselines.
    Abstract We propose a physics-based method for synthesizing dexterous hand-object interactions in a full-body setting. While recent advancements have addressed specific facets of human-object interactions, a comprehensive physics-based approach remains a challenge. Existing methods often focus on isolated segments of the interaction process and rely on data-driven techniques that may result in artifacts. In contrast, our proposed method embraces reinforcement learning (RL) and physics simulation to mitigate the limitations of data-driven approaches. Through a hierarchical framework, we first learn skill priors for both body and hand movements in a decoupled setting. The generic skill priors learn to decode a latent skill embedding into the motion of the underlying part. A high-level policy then controls hand-object interactions in these pretrained latent spaces, guided by task objectives of grasping and 3D target trajectory following. It is trained using a novel reward function that combines an adversarial style term with a task reward, encouraging natural motions while fulfilling the task incentives. Our method successfully accomplishes the complete interaction task, from approaching an object to grasping and subsequent manipulation. We compare our approach against kinematics-based baselines and show that it leads to more physically plausible motions.

Generative Image Dynamics

  • paper_url: http://arxiv.org/abs/2309.07906
  • repo_url: https://github.com/generative-dynamics/generative-dynamics.github.io
  • paper_authors: Zhengqi Li, Richard Tucker, Noah Snavely, Aleksander Holynski
  • for: Modeling an image-space prior on scene dynamics.
  • methods: The prior is learned from motion trajectories extracted from real videos of natural oscillating motion (trees, flowers, candles, clothes in the wind). Given a single image, a frequency-coordinated diffusion sampling process predicts a per-pixel long-term motion representation in the Fourier domain, called a neural stochastic motion texture (a sketch of converting it into trajectories follows the abstract below).
  • results: The representation can be converted into dense motion trajectories spanning an entire video and, together with an image-based rendering module, supports applications such as turning still images into seamlessly looping dynamic videos or letting users realistically interact with objects in real pictures.
    Abstract We present an approach to modeling an image-space prior on scene dynamics. Our prior is learned from a collection of motion trajectories extracted from real video sequences containing natural, oscillating motion such as trees, flowers, candles, and clothes blowing in the wind. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a per-pixel long-term motion representation in the Fourier domain, which we call a neural stochastic motion texture. This representation can be converted into dense motion trajectories that span an entire video. Along with an image-based rendering module, these trajectories can be used for a number of downstream applications, such as turning still images into seamlessly looping dynamic videos, or allowing users to realistically interact with objects in real pictures.
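A hedged sketch of turning a per-pixel Fourier-domain motion texture into dense trajectories by inverse FFT. The coefficients are assumed given (the paper predicts them with a frequency-coordinated diffusion model), and the one-sided toy spectrum ignores conjugate symmetry, so treat this only as an illustration of the representation.

```python
import numpy as np

def motion_texture_to_trajectories(coeffs, num_frames):
    """coeffs: (K, H, W, 2) complex array -- K frequency terms, (x, y) displacement.

    Places the predicted low-frequency terms into a full spectrum and applies a
    plain inverse FFT over time to obtain per-pixel displacement trajectories.
    """
    K, H, W, _ = coeffs.shape
    spectrum = np.zeros((num_frames, H, W, 2), dtype=complex)
    spectrum[:K] = coeffs                       # predicted low-frequency terms
    traj = np.fft.ifft(spectrum, axis=0).real   # (T, H, W, 2) displacements over time
    return traj

# Toy usage: a single horizontal sway frequency shared by all pixels
coeffs = np.zeros((4, 8, 8, 2), dtype=complex)
coeffs[1, ..., 0] = 2.0                         # energy in frequency bin 1, x-direction
print(motion_texture_to_trajectories(coeffs, num_frames=16).shape)   # (16, 8, 8, 2)
```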

HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

  • paper_url: http://arxiv.org/abs/2309.07891
  • repo_url: None
  • paper_authors: Hongsuk Choi, Nikhil Chavan-Dafle, Jiacheng Yuan, Volkan Isler, Hyunsoo Park
  • for: Learning a hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image.
  • methods: HandNeRF, a generalizable implicit function that explicitly encodes the correlation between 3D hand shape features and 2D object features to predict hand and object scene geometry, using the hand shape to constrain the possible relative configuration of hand and object.
  • results: On real-world datasets, HandNeRF reconstructs hand-object scenes with novel grasp configurations more accurately than comparable methods, and the reconstructed objects support more accurate downstream tasks such as grasping for robotic hand-over.
    Abstract This paper presents a method to learn hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. The inference as well as training-data generation for 3D hand-object scene reconstruction is challenging due to the depth ambiguity of a single image and occlusions by the hand and object. We turn this challenge into an opportunity by utilizing the hand shape to constrain the possible relative configuration of the hand and object geometry. We design a generalizable implicit function, HandNeRF, that explicitly encodes the correlation of the 3D hand shape features and 2D object features to predict the hand and object scene geometry. With experiments on real-world datasets, we show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods. Moreover, we demonstrate that object reconstruction from HandNeRF ensures more accurate execution of a downstream task, such as grasping for robotic hand-over.

A Novel Local-Global Feature Fusion Framework for Body-weight Exercise Recognition with Pressure Mapping Sensors

  • paper_url: http://arxiv.org/abs/2309.07888
  • repo_url: None
  • paper_authors: Davinder Pal Singh, Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
  • for: A new local-global feature fusion framework for body-weight exercise recognition from floor-based dynamic pressure maps.
  • methods: Image processing techniques and YOLO object detection localize pressure profiles from different body parts under physical constraints, yielding high-level local features (cropped pressure maps plus numerical features such as angular orientation, location on the mat, and pressure area) that are fused with global features; knowledge distillation is used for regularization to preserve the global feature extractor's knowledge.
  • results: Experiments show a notable 11% improvement in F1 score for exercise recognition while preserving label-specific features.
    Abstract We present a novel local-global feature fusion framework for body-weight exercise recognition with floor-based dynamic pressure maps. One step further from the existing studies using deep neural networks mainly focusing on global feature extraction, the proposed framework aims to combine local and global features using image processing techniques and the YOLO object detection to localize pressure profiles from different body parts and consider physical constraints. The proposed local feature extraction method generates two sets of high-level local features consisting of cropped pressure mapping and numerical features such as angular orientation, location on the mat, and pressure area. In addition, we adopt a knowledge distillation for regularization to preserve the knowledge of the global feature extraction and improve the performance of the exercise recognition. Our experimental results demonstrate a notable 11 percent improvement in F1 score for exercise recognition while preserving label-specific features.

  • paper_url: http://arxiv.org/abs/2309.07880
  • repo_url: None
  • paper_authors: Roberto Daza, Aythami Morales, Julian Fierrez, Ruben Tolosana, Ruben Vera-Rodriguez
  • for: A new multispectral database (mEBAL2) and eyeblink detection approaches for RGB and Near-Infrared (NIR) images, to improve data-driven multispectral blink detection and related applications such as attention level estimation and presentation attack detection in face biometrics.
  • methods: 21,100 image sequences from 180 students (more than 2 million labeled images) were recorded with two NIR cameras, one RGB camera, and an electroencephalogram (EEG) band capturing cognitive activity and blink events, while the students performed e-learning tasks of varying difficulty or took a real HTML initiation course on the edX MOOC platform.
  • results: A Convolutional Neural Network (CNN) benchmark for blink detection reaches up to 97% (a sketch follows the abstract below); combining NIR and RGB images during training improves RGB-only eyeblink detectors, and the detectors generalize to wilder, more challenging data such as the HUST-LEBW dataset.
    Abstract This work introduces a new multispectral database and novel approaches for eyeblink detection in RGB and Near-Infrared (NIR) individual images. Our contributed dataset (mEBAL2, multimodal Eye Blink and Attention Level estimation, Version 2) is the largest existing eyeblink database, representing a great opportunity to improve data-driven multispectral approaches for blink detection and related applications (e.g., attention level estimation and presentation attack detection in face biometrics). mEBAL2 includes 21,100 image sequences from 180 different students (more than 2 million labeled images in total) while conducting a number of e-learning tasks of varying difficulty or taking a real course on HTML initiation through the edX MOOC platform. mEBAL2 uses multiple sensors, including two Near-Infrared (NIR) and one RGB camera to capture facial gestures during the execution of the tasks, as well as an Electroencephalogram (EEG) band to get the cognitive activity of the user and blinking events. Furthermore, this work proposes a Convolutional Neural Network architecture as benchmark for blink detection on mEBAL2 with performances up to 97%. Different training methodologies are implemented using the RGB spectrum, NIR spectrum, and the combination of both to enhance the performance on existing eyeblink detectors. We demonstrate that combining NIR and RGB images during training improves the performance of RGB eyeblink detectors (i.e., detection based only on a RGB image). Finally, the generalization capacity of the proposed eyeblink detectors is validated in wilder and more challenging environments like the HUST-LEBW dataset to show the usefulness of mEBAL2 to train a new generation of data-driven approaches for eyeblink detection.
    摘要 本研究引入了一个新的多光谱数据库，以及在 RGB 和近红外（NIR）单幅图像上进行眨眼检测的新方法。我们贡献的数据集 mEBAL2（multimodal Eye Blink and Attention Level estimation, Version 2）是现有规模最大的眨眼数据库，为改进数据驱动的多光谱眨眼检测及相关应用（如注意力水平估计、人脸生物特征中的呈现攻击检测）提供了重要机会。mEBAL2 包含来自 180 名学生的 21,100 段图像序列（总计超过 200 万张标注图像），这些学生完成了不同难度的电子学习任务，或在 edX MOOC 平台上学习 HTML 入门课程。mEBAL2 使用多种传感器，包括两个近红外（NIR）摄像头和一个 RGB 摄像头来捕捉任务执行过程中的面部动作，并使用脑电（EEG）头带获取用户的认知活动和眨眼事件。此外，本研究提出了一种卷积神经网络架构作为 mEBAL2 上眨眼检测的基准，性能最高可达 97%。我们实现了基于 RGB 光谱、NIR 光谱及两者结合的不同训练方案，以提升现有眨眼检测器的性能，并证明在训练时结合 NIR 与 RGB 图像可以提升仅基于 RGB 图像的眨眼检测器。最后，我们在更加开放且更具挑战性的 HUST-LEBW 数据集上验证了所提眨眼检测器的泛化能力，表明 mEBAL2 可用于训练新一代数据驱动的眨眼检测方法。
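The reported benchmark is a CNN blink detector trained on RGB, NIR, or mixed-spectrum data. The sketch below is a minimal PyTorch illustration of that idea, assuming eye-region crops resized to 64x64 and a binary open/closed label; the layer sizes and the simple strategy of replicating NIR crops to three channels so one network sees both spectra are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BlinkCNN(nn.Module):
    """Small binary blink classifier over 64x64 eye crops (hypothetical layout)."""
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, 2)  # open vs. closed (blink)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_step(model, batch_rgb, batch_nir, labels_rgb, labels_nir, optimizer):
    """One mixed-spectrum step: NIR crops are replicated to 3 channels so a single
    RGB-shaped network is trained jointly on both spectra."""
    nir_as_rgb = batch_nir.repeat(1, 3, 1, 1)          # (B,1,H,W) -> (B,3,H,W)
    x = torch.cat([batch_rgb, nir_as_rgb], dim=0)
    y = torch.cat([labels_rgb, labels_nir], dim=0)
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = BlinkCNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    rgb = torch.randn(8, 3, 64, 64)
    nir = torch.randn(8, 1, 64, 64)
    y_rgb, y_nir = torch.randint(0, 2, (8,)), torch.randint(0, 2, (8,))
    print(train_step(model, rgb, nir, y_rgb, y_nir, opt))
```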

Using network metrics to explore the community structure that underlies movement patterns

  • paper_url: http://arxiv.org/abs/2309.07878
  • repo_url: None
  • paper_authors: Anh Pham Thi Minh, Abhishek Kumar Singh, Soumya Snigdha Kundu
  • for: 研究圣地亚哥的社区结构,通过分析居民的运动趋势。
  • methods: 使用匿名居民的家庭和工作地点数据构建了运动网络,并使用模块化优化算法和聚类技术确定社区。
  • results: 结果显示,结合社区探测算法和分化工具可以提供新的洞察,深入了解劳动时间的复杂地理 segregation。
    Abstract This work aims to explore the community structure of Santiago de Chile by analyzing the movement patterns of its residents. We use a dataset containing the approximate locations of home and work places for a subset of anonymized residents to construct a network that represents the movement patterns within the city. Through the analysis of this network, we aim to identify the communities or sub-cities that exist within Santiago de Chile and gain insights into the factors that drive the spatial organization of the city. We employ modularity optimization algorithms and clustering techniques to identify the communities within the network. Our results present that the novelty of combining community detection algorithms with segregation tools provides new insights to further the understanding of the complex geography of segregation during working hours.
    摘要 这项工作的目的是探索 Santiago de Chile 的社区结构，通过分析居民的运动模式来描述城市内部的流动情况。我们使用一个包含部分匿名居民的家庭和工作地点近似位置的数据集，构建一个表示城市内部运动模式的网络。通过网络分析，我们希望发现城市中存在的社区或子城市，并了解驱动城市空间组织的因素。我们使用模块度优化算法和聚类技术来确定网络中的社区。我们的结果表明，结合社区探测算法和隔离分析工具可以提供新的视角，加深我们对工作时段复杂隔离地理格局的理解。
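A minimal sketch of the modularity-based pipeline described above, assuming the anonymized records have already been reduced to (home_zone, work_zone) pairs; the zone identifiers and the simple count-based edge weighting are illustrative.

```python
import networkx as nx
from networkx.algorithms import community
from collections import Counter

# Hypothetical home -> work records aggregated into weighted origin-destination edges.
records = [("zone_A", "zone_B"), ("zone_A", "zone_B"), ("zone_B", "zone_C"),
           ("zone_C", "zone_A"), ("zone_D", "zone_E"), ("zone_E", "zone_D")]
edge_weights = Counter(records)

G = nx.Graph()
for (home, work), w in edge_weights.items():
    if G.has_edge(home, work):
        G[home][work]["weight"] += w
    else:
        G.add_edge(home, work, weight=w)

# Greedy modularity optimization; each community is a candidate "sub-city".
communities = community.greedy_modularity_communities(G, weight="weight")
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")
print("modularity:", community.modularity(G, communities, weight="weight"))
```

On real data the nodes would be census zones or grid cells and the weights the number of residents commuting between them, but the detection step itself is unchanged.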

Gradient constrained sharpness-aware prompt learning for vision-language models

  • paper_url: http://arxiv.org/abs/2309.07866
  • repo_url: None
  • paper_authors: Liangchen Liu, Nannan Wang, Dawei Zhou, Xinbo Gao, Decheng Liu, Xi Yang, Tongliang Liu
  • for: 提高视语模型在未经见过的类型上的性能,同时保持已经见过的类型上的性能。
  • methods: 基于权衡采用极性感知(SAM)方法,通过控制优化器的梯度来实现在性能和极性之间的权衡。
  • results: 实验证明GCSCoOp方法可以在视语模型中实现在性能和极性之间的权衡,并且在不同的预测任务上具有显著的优势。
    Abstract This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM), i.e., improving the performance on unseen classes while maintaining the performance on seen classes. Comparing with existing generalizable methods that neglect the seen classes degradation, the setting of this problem is more strict and fits more closely with practical applications. To solve this problem, we start from the optimization perspective, and leverage the relationship between loss landscape geometry and model generalization ability. By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness, while each of them is indispensable. However, we find the optimizing gradient of existing methods cannot maintain high relevance to both loss value and loss sharpness during optimization, which severely affects their trade-off performance. To this end, we propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), to dynamically constrain the optimizing gradient, thus achieving above two-fold optimization objective simultaneously. Extensive experiments verify the effectiveness of GCSCoOp in the trade-off problem.
    摘要 本文针对视觉语言模型（VLM）可泛化提示学习中的一个新权衡问题：在提升未见类别性能的同时保持已见类别的性能。与忽略已见类别性能退化的现有可泛化方法相比，该设定更严格，也更贴近实际应用。我们从优化的角度出发，利用损失景观几何与模型泛化能力之间的关系，分析了当前最优方法与原始 Sharpness-aware Minimization（SAM）方法的损失景观，发现权衡性能与损失值和损失锐度均相关，且二者缺一不可。然而，现有方法的优化梯度在优化过程中难以同时与损失值和损失锐度保持高度相关，严重影响其权衡性能。为此，我们提出了一种新的基于 SAM 的提示学习方法，称为 Gradient Constrained Sharpness-aware Context Optimization（GCSCoOp），通过动态约束优化梯度，同时实现上述两个优化目标。大量实验验证了 GCSCoOp 在该权衡问题上的有效性。
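For orientation, the sharpness-aware step that GCSCoOp builds on can be sketched as follows: only the learnable prompt/context vectors are perturbed toward the worst-case neighbour before the descent step. This is plain SAM restricted to the prompt, written under assumed names; the paper's additional gradient constraint is deliberately omitted here.

```python
import torch

def sam_prompt_step(ctx, loss_fn, base_opt, rho=0.05):
    """One sharpness-aware update of learnable prompt/context vectors `ctx`
    (a single tensor with requires_grad=True). `loss_fn(ctx)` must return a scalar.
    Vanilla SAM on the prompt only; GCSCoOp additionally constrains the gradient."""
    # 1) gradient at the current point
    loss = loss_fn(ctx)
    grad = torch.autograd.grad(loss, ctx)[0]

    # 2) ascend to the worst-case neighbour within an L2 ball of radius rho
    eps = rho * grad / (grad.norm() + 1e-12)
    with torch.no_grad():
        ctx.add_(eps)

    # 3) gradient at the perturbed point, then undo the perturbation
    loss_perturbed = loss_fn(ctx)
    grad_sharp = torch.autograd.grad(loss_perturbed, ctx)[0]
    with torch.no_grad():
        ctx.sub_(eps)

    # 4) descend using the sharpness-aware gradient
    ctx.grad = grad_sharp
    base_opt.step()
    base_opt.zero_grad()
    return loss.item(), loss_perturbed.item()

if __name__ == "__main__":
    ctx = torch.randn(4, 512, requires_grad=True)   # hypothetical context tokens
    target = torch.randn(4, 512)
    opt = torch.optim.SGD([ctx], lr=0.1)
    loss_fn = lambda c: ((c - target) ** 2).mean()
    for _ in range(3):
        print(sam_prompt_step(ctx, loss_fn, opt))
```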

TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.07849
  • repo_url: None
  • paper_authors: Rong Li, ShiJie Li, Xieyuanli Chen, Teli Ma, Wang Hao, Juergen Gall, Junwei Liang
  • for: 这篇论文旨在提高 LiDAR semantic segmentation 的精度和可靠性,以便自动驾驶和机器人更好地理解它们周围环境。
  • methods: 这篇论文提出了一种基于范围图像的 LiDAR semantic segmentation 方法,使用时间信息来解决“多到一”问题。特别是,我们在抽象层中添加了时间混合层,将前一批扫描结果与当前扫描结果进行混合,以提高精度。
  • results: 我们在两个标准评测数据集上进行了评测，证明所提出的后处理技术具有通用性，可应用于不同的网络。该方法能够有效纠正由“多对一”问题（约 20% 的 3D 点被遮挡）引起的错误预测，提高语义分割的精度和鲁棒性。
    Abstract LiDAR semantic segmentation plays a crucial role in enabling autonomous driving and robots to understand their surroundings accurately and robustly. There are different types of methods, such as point-based, range-image-based, polar-based, and hybrid methods. Among these, range-image-based methods are widely used due to their efficiency. However, they face a significant challenge known as the ``many-to-one'' problem caused by the range image's limited horizontal and vertical angular resolution. As a result, around 20\% of the 3D points can be occluded. In this paper, we present TFNet, a range-image-based LiDAR semantic segmentation method that utilizes temporal information to address this issue. Specifically, we incorporate a temporal fusion layer to extract useful information from previous scans and integrate it with the current scan. We then design a max-voting-based post-processing technique to correct false predictions, particularly those caused by the ``many-to-one'' issue. We evaluated the approach on two benchmarks and demonstrate that the post-processing technique is generic and can be applied to various networks. We will release our code and models.
    摘要 LiDAR 语义分割在自动驾驶和机器人准确、稳健地理解周围环境方面扮演着关键角色。现有方法包括基于点、基于距离图像、基于极坐标的方法以及混合方法。其中，基于距离图像的方法因其高效性而被广泛使用，但它们面临着“多对一”问题：距离图像的水平和垂直角分辨率有限，导致约 20% 的 3D 点被遮挡。在本文中，我们提出了 TFNet，一种基于距离图像的 LiDAR 语义分割方法，利用时间信息来解决这一问题。具体来说，我们加入了时间融合层，从之前的扫描中提取有用信息，并与当前扫描结合。随后，我们设计了基于最大投票的后处理技术，以纠正错误预测，特别是由“多对一”问题引起的错误。我们在两个基准上评估了该方法，并证明该后处理技术具有通用性，可应用于多种网络。我们将发布代码和模型。
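The "many-to-one" issue means several 3D points project to the same range-image pixel and inherit a single prediction. The NumPy/SciPy sketch below shows a simplified voting-style correction in which each point's label is overruled by the majority of its 3D neighbourhood; the k-NN grouping rule is an assumption standing in for the paper's exact max-voting scheme.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_majority_vote(points: np.ndarray, labels: np.ndarray, k: int = 9) -> np.ndarray:
    """Relabel each 3D point with the majority label among its k nearest neighbours
    (itself included), so isolated wrong predictions caused by the many-to-one
    projection are corrected by their spatial neighbourhood."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)          # (N, k) neighbour indices
    neigh_labels = labels[idx]                # (N, k)
    refined = np.empty_like(labels)
    for i, row in enumerate(neigh_labels):
        vals, counts = np.unique(row, return_counts=True)
        refined[i] = vals[np.argmax(counts)]
    return refined

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(1000, 3))
    labs = (pts[:, 0] > 0).astype(np.int64)
    labs[rng.choice(1000, 50, replace=False)] ^= 1   # inject isolated errors
    fixed = knn_majority_vote(pts, labs)
    print("labels changed by voting:", int((fixed != labs).sum()))
```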

MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image Acquisition Systems

  • paper_url: http://arxiv.org/abs/2309.07846
  • repo_url: https://github.com/IN2-ViAUn/MC-NeRF
  • paper_authors: Yu Gao, Lutong Su, Hao Liang, Yufeng Yue, Yi Yang, Mengyin Fu
  • for: 用于3D场景表示和NeRF的多视图图像处理
  • methods: 提出了一种能够同时优化内部和外部参数的MC-NeRF方法,包括对叠加问题的理论分析和高效的投影网络
  • results: 实验表明MC-NeRF方法能够在不提供初始姿态的情况下,使用110张图像和110个不同的内部和外部参数来获得3D场景表示
    Abstract Neural Radiance Fields (NeRF) employ multi-view images for 3D scene representation and have shown remarkable performance. As one of the primary sources of multi-view images, multi-camera systems encounter challenges such as varying intrinsic parameters and frequent pose changes. Most previous NeRF-based methods often assume a global unique camera and seldom consider scenarios with multiple cameras. Besides, some pose-robust methods still remain susceptible to suboptimal solutions when poses are poor initialized. In this paper, we propose MC-NeRF, a method can jointly optimize both intrinsic and extrinsic parameters for bundle-adjusting Neural Radiance Fields. Firstly, we conduct a theoretical analysis to tackle the degenerate case and coupling issue that arise from the joint optimization between intrinsic and extrinsic parameters. Secondly, based on the proposed solutions, we introduce an efficient calibration image acquisition scheme for multi-camera systems, including the design of calibration object. Lastly, we present a global end-to-end network with training sequence that enables the regression of intrinsic and extrinsic parameters, along with the rendering network. Moreover, most existing datasets are designed for unique camera, we create a new dataset that includes four different styles of multi-camera acquisition systems, allowing readers to generate custom datasets. Experiments confirm the effectiveness of our method when each image corresponds to different camera parameters. Specifically, we adopt up to 110 images with 110 different intrinsic and extrinsic parameters, to achieve 3D scene representation without providing initial poses. The Code and supplementary materials are available at https://in2-viaun.github.io/MC-NeRF.
    摘要 neural radiance fields (NeRF) 使用多视图图像表示3D场景,并显示出杰出的性能。多摄像头系统面临多个挑战,包括不同的内参参数和频繁的pose变化。大多数前一代NeRF基于方法通常假设全局唯一的摄像头,并rarely考虑多摄像头场景。此外,一些pose鲁棒的方法仍然容易受到差初始化的pose的影响。在这篇论文中,我们提出MC-NeRF方法,可以同时优化内参和外参参数,为Neural Radiance Fields进行束致调整。首先,我们进行理论分析,解决由共同优化内参和外参参数而产生的协调问题和缺失问题。其次,我们提出了一种高效的准备图像采集方案,包括设计准备对象。最后,我们介绍了一个全球端到端网络,可以对内参和外参参数进行回归,同时实现图像渲染。此外,大多数现有的数据集都是为唯一摄像头设计的,我们创建了一个新的数据集,包括四种不同的多摄像头采集系统风格,让读者可以生成自定义数据集。实验证明我们的方法在每个图像对应不同的摄像头参数时显示出效果。具体来说,我们采用了110张图像,每张图像对应110个内参和外参参数,以实现3D场景表示,无需提供初始pose。代码和补充材料可以在https://in2-viaun.github.io/MC-NeRF上获取。

Decomposition of linear tensor transformations

  • paper_url: http://arxiv.org/abs/2309.07819
  • repo_url: None
  • paper_authors: Claudio Turchetti
  • for: 这篇论文的目的是开发一种数学框架,用于精确地 decomposes a tensor as the sum of a finite number of low-rank tensors。
  • methods: 该论文使用的方法包括: solving an optimization problem to find a low-dimensional subspace, and assuming the number of components is fixed。
  • results: 该论文的结果包括: deriving three different problems to decompose a non-negative self-adjoint tensor operator, a linear tensor transformation, and a generic tensor。
    Abstract One of the main issues in computing a tensor decomposition is how to choose the number of rank-one components, since there is no finite algorithms for determining the rank of a tensor. A commonly used approach for this purpose is to find a low-dimensional subspace by solving an optimization problem and assuming the number of components is fixed. However, even though this algorithm is efficient and easy to implement, it often converges to poor local minima and suffers from outliers and noise. The aim of this paper is to develop a mathematical framework for exact tensor decomposition that is able to represent a tensor as the sum of a finite number of low-rank tensors. In the paper three different problems will be carried out to derive: i) the decomposition of a non-negative self-adjoint tensor operator; ii) the decomposition of a linear tensor transformation; iii) the decomposition of a generic tensor.
    摘要 计算张量分解的一个主要问题是如何选择秩一分量的数量，因为不存在可以确定张量秩的有限算法。一种常用的方法是通过求解优化问题来寻找低维子空间，并假设分量数量固定。然而，尽管该算法高效且易于实现，它经常收敛到较差的局部极小值，并且容易受到离群点和噪声的影响。本文的目标是建立一个精确张量分解的数学框架，能够将张量表示为有限个低秩张量之和。文中将推导以下三个问题：1. 非负自伴张量算子的分解；2. 线性张量变换的分解；3. 一般张量的分解。
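For reference, the decomposition the abstract refers to writes a (third-order, say) tensor as a finite sum of rank-one outer products; the CP-style notation below is standard and is added here for clarity rather than quoted from the paper.

```latex
% A third-order tensor written as a finite sum of R rank-one (outer-product) terms
\mathcal{T} = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \otimes \mathbf{b}_r \otimes \mathbf{c}_r,
\qquad
\mathcal{T}_{ijk} = \sum_{r=1}^{R} \lambda_r \, a_{ir} \, b_{jr} \, c_{kr}.
```

The smallest R for which such an expansion exists is the tensor rank, which, as the abstract notes, cannot in general be determined by a finite algorithm.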

For A More Comprehensive Evaluation of 6DoF Object Pose Tracking

  • paper_url: http://arxiv.org/abs/2309.07796
  • repo_url: None
  • paper_authors: Yang Li, Fan Zhong, Xin Wang, Shuangbing Song, Jiachen Li, Xueying Qin, Changhe Tu
  • for: 本研究旨在提供一个统一的对比方法,以解决6DoF物体pose追踪的评估问题。
  • methods: 本研究提出了一种多视图多物体全pose精确调整方法,可以同时精确调整所有物体和摄像头的 pose,实现sub-pixel sub-millimeter的对齐误差。
  • results: 本研究透过实验验证了提案的全pose精确调整方法的精度和可靠性,并在YCBV和BCOT两个基本数据集上进行了统一的评估。 results show that the proposed method outperforms previous methods in terms of accuracy and robustness.
    Abstract Previous evaluations on 6DoF object pose tracking have presented obvious limitations along with the development of this area. In particular, the evaluation protocols are not unified for different methods, the widely-used YCBV dataset contains significant annotation error, and the error metrics also may be biased. As a result, it is hard to fairly compare the methods, which has became a big obstacle for developing new algorithms. In this paper we contribute a unified benchmark to address the above problems. For more accurate annotation of YCBV, we propose a multi-view multi-object global pose refinement method, which can jointly refine the poses of all objects and view cameras, resulting in sub-pixel sub-millimeter alignment errors. The limitations of previous scoring methods and error metrics are analyzed, based on which we introduce our improved evaluation methods. The unified benchmark takes both YCBV and BCOT as base datasets, which are shown to be complementary in scene categories. In experiments, we validate the precision and reliability of the proposed global pose refinement method with a realistic semi-synthesized dataset particularly for YCBV, and then present the benchmark results unifying learning&non-learning and RGB&RGBD methods, with some finds not discovered in previous studies.
    摘要 随着 6DoF 物体位姿跟踪领域的发展，以往的评估暴露出明显的局限性：不同方法的评估协议不统一，广泛使用的 YCBV 数据集存在显著的标注误差，误差指标本身也可能带有偏差。这使得难以公平地比较各种方法，成为开发新算法的一大障碍。在本文中，我们提出了一个统一的基准来解决上述问题。为了更准确地标注 YCBV，我们提出一种多视图多物体全局位姿精修方法，可以同时精修所有物体和相机视角的位姿，实现亚像素、亚毫米级的对齐误差。我们分析了以往评分方法和误差指标的局限性，并在此基础上引入改进的评估方法。统一基准以 YCBV 和 BCOT 为基础数据集，二者在场景类别上互补。在实验中，我们利用特别为 YCBV 构建的真实感半合成数据集验证了所提全局位姿精修方法的精度与可靠性，随后给出了统一学习与非学习、RGB 与 RGBD 方法的基准结果，并发现了一些以往研究未曾揭示的新结论。

Virchow: A Million-Slide Digital Pathology Foundation Model

  • paper_url: http://arxiv.org/abs/2309.07778
  • repo_url: None
  • paper_authors: Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Millar, Matthew Hanna, Juan Retamero, William A. Moye, Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Rothrock, Thomas J. Fuchs
  • for: 这篇论文旨在推进计算病理学中的人工智能应用，通过分析全切片图像来支持精准医疗和决策支持系统，尤其是癌症的诊断与治疗。
  • methods: 论文提出了名为 Virchow 的 6.32 亿参数深度神经网络基础模型，采用自监督学习，在来自多种组织类型的 150 万张苏木精-伊红（H&E）染色全切片图像上进行训练，数据规模远超以往工作。
  • results: 在下游任务（包括切片块级泛癌检测与分型、切片级生物标志物预测）上，Virchow 的表现优于当前最先进系统：泛癌切片块分类的平衡准确率达 93%，结肠微卫星不稳定状态预测的 AUC 为 0.983，乳腺 CDH1 状态预测的 AUC 为 0.967。这些结果表明在大规模病理图像数据集上进行预训练的重要性，并且在更大的数据集上预训练有望继续提升诸如药物疗效预测等训练数据有限的高影响应用的性能。
    Abstract Computational pathology uses artificial intelligence to enable precision medicine and decision support systems through the analysis of whole slide images. It has the potential to revolutionize the diagnosis and treatment of cancer. However, a major challenge to this objective is that for many specific computational pathology tasks the amount of data is inadequate for development. To address this challenge, we created Virchow, a 632 million parameter deep neural network foundation model for computational pathology. Using self-supervised learning, Virchow is trained on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue groups, which is orders of magnitude more data than previous works. When evaluated on downstream tasks including tile-level pan-cancer detection and subtyping and slide-level biomarker prediction, Virchow outperforms state-of-the-art systems both on internal datasets drawn from the same population as the pretraining data as well as external public datasets. Virchow achieves 93% balanced accuracy for pancancer tile classification, and AUCs of 0.983 for colon microsatellite instability status prediction and 0.967 for breast CDH1 status prediction. The gains in performance highlight the importance of pretraining on massive pathology image datasets, suggesting pretraining on even larger datasets could continue improving performance for many high-impact applications where limited amounts of training data are available, such as drug outcome prediction.
    摘要 计算病理学利用人工智能，通过分析全切片图像来实现精准医疗与决策支持系统，有望革新癌症的诊断与治疗。然而，实现这一目标的主要挑战在于许多具体的计算病理学任务缺乏足够的数据。为此，我们构建了 Virchow——一个拥有 6.32 亿参数的计算病理学深度神经网络基础模型。Virchow 采用自监督学习，在来自多种组织类型的 150 万张苏木精-伊红染色全切片图像上训练，数据量比以往工作高出数个数量级。在下游任务（包括切片块级泛癌检测与分型、切片级生物标志物预测）的评估中，无论是与预训练数据同源的内部数据集，还是外部公开数据集，Virchow 均优于当前最先进系统：泛癌切片块分类的平衡准确率达 93%，结肠微卫星不稳定状态预测的 AUC 为 0.983，乳腺 CDH1 状态预测的 AUC 为 0.967。性能的提升凸显了在海量病理图像数据集上预训练的重要性，也表明在更大的数据集上预训练有望继续提升诸如药物疗效预测等训练数据有限的高影响应用的性能。
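Foundation models of this kind are typically evaluated by freezing the encoder and fitting a lightweight head on its tile embeddings. Below is a minimal scikit-learn sketch of such a linear probe, assuming pre-computed embeddings and binary biomarker labels; the file names and the logistic-regression probe are illustrative choices, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical pre-computed tile embeddings from a frozen pathology encoder.
X = np.load("embeddings.npy")   # shape (n_tiles, embed_dim)
y = np.load("labels.npy")       # shape (n_tiles,), e.g. biomarker status 0/1

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Linear probe: the encoder stays frozen, only this classifier is fit.
probe = LogisticRegression(max_iter=2000, C=1.0)
probe.fit(X_tr, y_tr)

pred = probe.predict(X_te)
score = probe.predict_proba(X_te)[:, 1]
print("balanced accuracy:", balanced_accuracy_score(y_te, pred))
print("AUC:", roc_auc_score(y_te, score))
```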

Co-Salient Object Detection with Semantic-Level Consensus Extraction and Dispersion

  • paper_url: http://arxiv.org/abs/2309.07753
  • repo_url: None
  • paper_authors: Peiran Xu, Yadong Mu
  • for: 协同显著目标检测（CoSOD）任务的目的是在每幅图像中突出显示共同的显著目标。
  • methods: 我们使用层次结构的Transformer模块来提取 semantic-level的共识,从而更好地捕捉共同对象类别的全面表示,并 exclude 其他对象的本地相似性干扰。 我们还提出了基于Transformer的分布模块,该模块考虑了不同场景中共同对象的变化。它在图像特征地图上分布了共识,并且利用图像之间的交互来全面利用图像特征。
  • results: 我们在三个常用的CoSOD数据集上进行评估,并达到了当前最佳性能。
    Abstract Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. There are two factors closely related to the success of this task, namely consensus extraction, and the dispersion of consensus to each image. Most previous works represent the group consensus using local features, while we instead utilize a hierarchical Transformer module for extracting semantic-level consensus. Therefore, it can obtain a more comprehensive representation of the common object category, and exclude interference from other objects that share local similarities with the target object. In addition, we propose a Transformer-based dispersion module that takes into account the variation of the co-salient object in different scenes. It distributes the consensus to the image feature maps in an image-specific way while making full use of interactions within the group. These two modules are integrated with a ViT encoder and an FPN-like decoder to form an end-to-end trainable network, without additional branch and auxiliary loss. The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance.
    摘要 给定一组图像，协同显著目标检测（CoSOD）的目标是在每幅图像中突出显示共同的显著目标。该任务的成功取决于两个因素：一是共识的提取，二是将共识分发到每幅图像。以往的方法大多使用局部特征来表示组内共识，而我们使用分层 Transformer 模块来提取语义级共识，从而获得对共同目标类别更全面的表示，并排除与目标物体局部相似的其他物体的干扰。此外，我们提出了基于 Transformer 的分发模块，该模块考虑了共同显著目标在不同场景中的变化，以图像特定的方式将共识分发到图像特征图上，并充分利用组内图像之间的交互。这两个模块与 ViT 编码器和类 FPN 解码器集成，构成一个端到端可训练的网络，无需额外分支和辅助损失。我们的方法在三个常用的 CoSOD 数据集上进行评估，达到了当前最优性能。

DT-NeRF: Decomposed Triplane-Hash Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

  • paper_url: http://arxiv.org/abs/2309.07752
  • repo_url: None
  • paper_authors: Yaoyu Su, Shaohui Wang, Haoqian Wang
  • for: 提升说话人像（talking face）合成的真实感渲染效果
  • methods: 使用分解的三平面（triplane）表示与神经辐射场（NeRF）技术，并以音频特征作为残差项驱动
  • results: 在说话人像合成上取得当前最优（state-of-the-art）结果
    Abstract In this paper, we present the decomposed triplane-hash neural radiance fields (DT-NeRF), a framework that significantly improves the photorealistic rendering of talking faces and achieves state-of-the-art results on key evaluation datasets. Our architecture decomposes the facial region into two specialized triplanes: one specialized for representing the mouth, and the other for the broader facial features. We introduce audio features as residual terms and integrate them as query vectors into our model through an audio-mouth-face transformer. Additionally, our method leverages the capabilities of Neural Radiance Fields (NeRF) to enrich the volumetric representation of the entire face through additive volumetric rendering techniques. Comprehensive experimental evaluations corroborate the effectiveness and superiority of our proposed approach.
    摘要 在这篇论文中,我们介绍了分解三平面哈希神经辐射场(DT-NeRF),一种框架,可以大幅提高描述讲话脸部的真实渲染效果,并在关键评估数据集上达到状态之最的结果。我们的架构将脸部分成两个特殊的三平面:一个专门用于表示嘴,另一个用于更广泛的脸部特征。我们引入音频特征作为剩余项,并通过音频-口型-脸部变换器将其纳入我们的模型。此外,我们的方法利用神经辐射场(NeRF)的能力,通过添加卷积渲染技术,进一步丰富脸部的材料表示。经过全面的实验评估,我们的提议方法的效果和优势得到了证明。

OmnimatteRF: Robust Omnimatte with 3D Background Modeling

  • paper_url: http://arxiv.org/abs/2309.07749
  • repo_url: None
  • paper_authors: Geng Lin, Chen Gao, Jia-Bin Huang, Changil Kim, Yipeng Wang, Matthias Zwicker, Ayush Saraf
  • for: 这 paper 是为了提出一种新的视频分割方法,以提高现实世界视频中的背景和前景分割精度。
  • methods: 该 paper 使用了 OmnimatteRF 方法,该方法结合了动态 2D 前景层和 3D 背景模型,以保留视频中主题的细节,并在实际视频中坚定重建场景。
  • results: 广泛的实验表明,OmnimatteRF 方法能够在各种视频中提供更高质量的场景重建。
    Abstract Video matting has broad applications, from adding interesting effects to casually captured movies to assisting video production professionals. Matting with associated effects such as shadows and reflections has also attracted increasing research activity, and methods like Omnimatte have been proposed to separate dynamic foreground objects of interest into their own layers. However, prior works represent video backgrounds as 2D image layers, limiting their capacity to express more complicated scenes, thus hindering application to real-world videos. In this paper, we propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos. Extensive experiments demonstrate that our method reconstructs scenes with better quality on various videos.
    摘要 视频抠像（video matting）有着广泛的应用，从为随手拍摄的影片添加有趣的效果，到辅助专业视频制作。带有阴影、反射等关联效果的抠像也吸引了越来越多的研究，Omnimatte 等方法被提出，用于将感兴趣的动态前景对象分离到各自的图层中。然而，以往的工作将视频背景表示为二维图像图层，限制了其表达更复杂场景的能力，从而妨碍了在真实视频中的应用。在本文中，我们提出了一种新的视频抠像方法 OmnimatteRF，它结合动态二维前景图层和三维背景模型。二维图层保留了主体的细节，而三维背景能够稳健地重建真实视频中的场景。大量实验表明，我们的方法能够在各种视频上以更高的质量重建场景。

Dataset Condensation via Generative Model

  • paper_url: http://arxiv.org/abs/2309.07698
  • repo_url: None
  • paper_authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou
  • for: 本研究旨在适应大规模 dataset 的 condensation,以减少训练样本数量,提高模型的训练速度和性能。
  • methods: 本研究提出了一种基于生成模型的 condensation 方法,通过将 dataset 转换为生成模型的形式,使得 condensation 可以适应大规模 dataset 和多种类别。该方法还提出了 intra-class 和 inter-class 损失函数,以保证 condensation 后的样本具有多样性和抗预测性。
  • results: Comparative studies with state-of-the-art methods and ablation studies confirm the effectiveness of the proposed method and its individual components. 本研究成功地应用于 ImageNet-1k dataset,并显示了与传统方法相比的性能提升。
    Abstract Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation methods from scaling up to large datasets with diverse classes. Moreover, the relations among condensed samples have been neglected and hence the feature distribution of condensed samples is often not diverse. To solve these problems, we propose to condense the dataset into another format, a generative model. Such a novel format allows for the condensation of large datasets because the size of the generative model remains relatively stable as the number of classes or image resolution increases. Furthermore, an intra-class and an inter-class loss are proposed to model the relation of condensed samples. Intra-class loss aims to create more diverse samples for each class by pushing each sample away from the others of the same class. Meanwhile, inter-class loss increases the discriminability of samples by widening the gap between the centers of different classes. Extensive comparisons with state-of-the-art methods and our ablation studies confirm the effectiveness of our method and its individual component. To our best knowledge, we are the first to successfully conduct condensation on ImageNet-1k.
    摘要 数据集浓缩（dataset condensation）旨在将包含大量训练样本的数据集压缩为一个小规模集合。以往的方法通常将数据集浓缩为像素形式，但这种方式优化速度慢、需要优化的参数量大：随着图像分辨率和类别数的增加，可学习参数的数量相应增长，使得浓缩方法难以扩展到类别多样的大规模数据集。此外，浓缩样本之间的关系被忽视，导致浓缩样本的特征分布往往缺乏多样性。为了解决这些问题，我们提出将数据集浓缩为另一种形式——生成模型。这种新形式使得大规模数据集的浓缩成为可能，因为生成模型的规模随类别数或图像分辨率的增加基本保持稳定。此外，我们提出了类内损失和类间损失来建模浓缩样本之间的关系：类内损失通过使同类样本彼此远离来增加每个类别样本的多样性，类间损失通过拉大不同类别中心之间的距离来提升样本的可区分性。与当前最优方法的大量对比实验和消融实验验证了我们方法及其各个组成部分的有效性。据我们所知，我们是首个在 ImageNet-1k 上成功完成数据集浓缩的工作。
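One plausible reading of the intra-class and inter-class terms is sketched below in PyTorch: same-class synthetic features are pushed apart to encourage diversity, and class centroids are pushed away from each other to keep classes discriminable. The margin values, feature shapes, and hinge formulation are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def intra_class_loss(feats: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """feats: (n, d) features of synthetic samples from ONE class.
    Penalize pairs closer than `margin`, pushing same-class samples apart."""
    dists = torch.cdist(feats, feats)                                  # (n, n)
    mask = ~torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    return F.relu(margin - dists[mask]).mean()

def inter_class_loss(feats: torch.Tensor, labels: torch.Tensor,
                     margin: float = 5.0) -> torch.Tensor:
    """Widen the gap between class centers: penalize centroid pairs closer than `margin`."""
    centers = torch.stack([feats[labels == c].mean(0) for c in labels.unique()])
    d = torch.cdist(centers, centers)
    mask = ~torch.eye(len(centers), dtype=torch.bool, device=feats.device)
    return F.relu(margin - d[mask]).mean()

if __name__ == "__main__":
    feats = torch.randn(32, 128, requires_grad=True)     # features of condensed samples
    labels = torch.arange(4).repeat_interleave(8)        # 8 synthetic samples per class
    loss = sum(intra_class_loss(feats[labels == c]) for c in labels.unique())
    loss = loss + inter_class_loss(feats, labels)
    loss.backward()
    print(float(loss))
```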

CoRF : Colorizing Radiance Fields using Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.07668
  • repo_url: None
  • paper_authors: Ankit Dhiman, R Srinath, Srinjay Sarkar, Lokesh R Boregowda, R Venkatesh Babu
  • for: Synthesizing colorized novel views from input grey-scale multi-view images.
  • methods: 使用NeRF基于方法实现高质量的多视图图像新观 synthesis.
  • results: 比基eline提供更高质量的彩色化新视图图像,同时保持cross-view consistency.
    Abstract Neural radiance field (NeRF) based methods enable high-quality novel-view synthesis for multi-view images. This work presents a method for synthesizing colorized novel views from input grey-scale multi-view images. When we apply image or video-based colorization methods on the generated grey-scale novel views, we observe artifacts due to inconsistency across views. Training a radiance field network on the colorized grey-scale image sequence also does not solve the 3D consistency issue. We propose a distillation based method to transfer color knowledge from the colorization networks trained on natural images to the radiance field network. Specifically, our method uses the radiance field network as a 3D representation and transfers knowledge from existing 2D colorization methods. The experimental results demonstrate that the proposed method produces superior colorized novel views for indoor and outdoor scenes while maintaining cross-view consistency than baselines. Further, we show the efficacy of our method on applications like colorization of radiance field network trained from 1.) Infra-Red (IR) multi-view images and 2.) Old grey-scale multi-view image sequences.
    摘要 基于神经辐射场（NeRF）的方法能够从多视图图像合成高质量的新视角。本文提出了一种从输入的灰度多视图图像合成彩色新视角的方法。如果直接对生成的灰度新视角应用图像或视频上色方法，会由于视角间的不一致而产生伪影；而在上色后的灰度图像序列上训练辐射场网络同样无法解决三维一致性问题。我们提出了一种基于蒸馏的方法，将自然图像上训练的上色网络中的颜色知识迁移到辐射场网络：具体而言，我们以辐射场网络作为三维表示，并从现有的二维上色方法中迁移知识。实验结果表明，与基线方法相比，所提方法在室内外场景中都能生成质量更高的彩色新视角，同时保持跨视角一致性。此外，我们还展示了该方法在以下应用中的有效性：1）对由红外（IR）多视图图像训练的辐射场网络进行上色；2）对旧的灰度多视图图像序列进行上色。

Towards Robust and Unconstrained Full Range of Rotation Head Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.07654
  • repo_url: https://github.com/thohemp/6drepnet360
  • paper_authors: Thorsten Hempel, Ahmed A. Abdelrahman, Ayoub Al-Hamadi
  • for: 这篇论文是用于解决人体头部pose预测问题的,这是许多应用场景中的一个关键问题,但目前主要被视为前置 pose 预测的一个子任务。
  • methods: 我们提出了一种新的无约束端到端头部pose预测方法,用于解决极具挑战性的全范围orientation head pose预测问题。我们在ground truth数据中引入了旋转矩阵 formalism,并提议了一种连续的6D旋转矩阵表示方式,以便高效地学习整个旋转表现和超越当前状态艺。
  • results: 我们的方法在公共数据集上进行了广泛的评估,并证明了与其他状态艺方法相比,它在高效和可靠的情况下显著超越了其他方法,而且其先进的预测范围允许扩展应用领域。我们将训练和测试代码以及我们训练过的模型公开在 GitHub:https://github.com/thohemp/6DRepNet360。
    Abstract Estimating the head pose of a person is a crucial problem for numerous applications that is yet mainly addressed as a subtask of frontal pose prediction. We present a novel method for unconstrained end-to-end head pose estimation to tackle the challenging task of full range of orientation head pose prediction. We address the issue of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This allows to efficiently learn full rotation appearance and to overcome the limitations of the current state-of-the-art. Together with new accumulated training data that provides full head pose rotation data and a geodesic loss approach for stable learning, we design an advanced model that is able to predict an extended range of head orientations. An extensive evaluation on public datasets demonstrates that our method significantly outperforms other state-of-the-art methods in an efficient and robust manner, while its advanced prediction range allows the expansion of the application area. We open-source our training and testing code along with our trained models: https://github.com/thohemp/6DRepNet360.
    摘要 估计人的头部姿态是许多应用中的关键问题，但目前主要被当作正面姿态预测的子任务来处理。我们提出了一种新的无约束端到端头部姿态估计方法，以解决全角度范围头部姿态预测这一具有挑战性的任务。针对旋转标签的歧义问题，我们为真值数据引入旋转矩阵形式，并提出连续的 6D 旋转矩阵表示，以实现高效且稳健的直接回归。这使得模型能够高效地学习完整的旋转表现，并克服现有最优方法的局限。结合新累积的、覆盖完整头部旋转范围的训练数据以及用于稳定学习的测地（geodesic）损失，我们设计了一个能够预测更大头部朝向范围的先进模型。在公开数据集上的大量评估表明，我们的方法以高效且稳健的方式显著优于其他最先进方法，而其更大的预测范围也拓展了应用领域。我们开源了训练与测试代码以及训练好的模型：https://github.com/thohemp/6DRepNet360。
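The continuous 6D rotation parameterization and the geodesic loss mentioned above can be written compactly. The sketch below follows the widely used Gram-Schmidt formulation of this representation and is given as an illustrative implementation, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """Map a 6D representation (B, 6) to rotation matrices (B, 3, 3) via Gram-Schmidt."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-2)

def geodesic_loss(R_pred: torch.Tensor, R_gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Mean geodesic angle between predicted and ground-truth rotation matrices."""
    m = torch.bmm(R_pred, R_gt.transpose(1, 2))
    cos = (m.diagonal(dim1=-2, dim2=-1).sum(-1) - 1.0) / 2.0
    return torch.acos(torch.clamp(cos, -1.0 + eps, 1.0 - eps)).mean()

if __name__ == "__main__":
    d6 = torch.randn(4, 6, requires_grad=True)       # raw 6D network output
    R_pred = rotation_6d_to_matrix(d6)
    R_gt = torch.eye(3).repeat(4, 1, 1)              # dummy ground-truth rotations
    loss = geodesic_loss(R_pred, R_gt)
    loss.backward()
    print(float(loss))
```

Unlike Euler angles or quaternions, this 6D mapping is continuous, which is what makes direct regression over the full rotation range well behaved.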

Indoor Scene Reconstruction with Fine-Grained Details Using Hybrid Representation and Normal Prior Enhancement

  • paper_url: http://arxiv.org/abs/2309.07640
  • repo_url: None
  • paper_authors: Sheng Ye, Yubin Hu, Matthieu Lin, Yu-Hui Wen, Wang Zhao, Wenping Wang, Yong-Jin Liu
  • for: 这 paper 的目的是 reconstruction indoor scene 从多视图 RGB 图像中。
  • methods: 这 paper 使用 neural radiance fields 和 predicted surface normal priors 来恢复场景的geometry。
  • results: 这 paper 的方法可以生成完整和平滑的结果,但是它们在复杂表面上受到限制,因为 implicit representation 不够表示高频结构。
    Abstract The reconstruction of indoor scenes from multi-view RGB images is challenging due to the coexistence of flat and texture-less regions alongside delicate and fine-grained regions. Recent methods leverage neural radiance fields aided by predicted surface normal priors to recover the scene geometry. These methods excel in producing complete and smooth results for floor and wall areas. However, they struggle to capture complex surfaces with high-frequency structures due to the inadequate neural representation and the inaccurately predicted normal priors. To improve the capacity of the implicit representation, we propose a hybrid architecture to represent low-frequency and high-frequency regions separately. To enhance the normal priors, we introduce a simple yet effective image sharpening and denoising technique, coupled with a network that estimates the pixel-wise uncertainty of the predicted surface normal vectors. Identifying such uncertainty can prevent our model from being misled by unreliable surface normal supervisions that hinder the accurate reconstruction of intricate geometries. Experiments on the benchmark datasets show that our method significantly outperforms existing methods in terms of reconstruction quality.
    摘要 重建室内场景从多视图RGB图像中是一项挑战,因为场景中存在平滑无 texture 的区域以及细节rich和细腻的区域。现有方法利用神经辐射场 aid by predicted surface normal priors来恢复场景几何。这些方法在 floor 和墙面上 producing complete and smooth results 表现出色。然而,它们在复杂的表面上 Capture high-frequency structures 困难,因为神经表示不够强大和预测的normal priors不准确。为了提高隐式表示能力,我们提议将低频和高频区域分别表示。为了提高normal priors,我们引入了一种简单 yet effective的图像锐化和减噪技术,并 coupling 一个网络来估算图像中每个像素的surface normal vector的uncertainty。这种uncertainty可以让我们的模型不会被不可靠的表面normal supervision mislead,从而精准地重建复杂的几何结构。实验表明,我们的方法与现有方法相比,在重建质量方面具有显著的优势。

SwitchGPT: Adapting Large Language Models for Non-Text Outputs

  • paper_url: http://arxiv.org/abs/2309.07623
  • repo_url: None
  • paper_authors: Xinyu Wang, Bohan Zhuang, Qi Wu
  • for: This paper aims to bridge the gap between language models (LLMs) and modality conversion models, such as text-to-image, by proposing a novel approach that can adapt LLMs to comprehend requests for non-text responses.
  • methods: The proposed approach, called SwitchGPT, employs a minimal dataset to instruct LLMs to recognize the intended output modality as directed by the instructions. The adapted LLM can then effectively summon various off-the-shelf modality conversion models from the model zoos to generate non-text responses.
  • results: The experiment results show that, with minimal training, LLMs can be conveniently adapted to comprehend requests for non-text responses, thus achieving higher flexibility in multi-modal scenarios.
    Abstract Large Language Models (LLMs), primarily trained on text-based datasets, exhibit exceptional proficiencies in understanding and executing complex linguistic instructions via text outputs. However, they falter when requests to generate non-text ones. Concurrently, modality conversion models, such as text-to-image, despite generating high-quality images, suffer from a lack of extensive textual pretraining. As a result, these models are only capable of accommodating specific image descriptions rather than comprehending more complex instructions. To bridge this gap, we propose a novel approach, \methodname, from a modality conversion perspective that evolves a text-based LLM into a multi-modal one. We specifically employ a minimal dataset to instruct LLMs to recognize the intended output modality as directed by the instructions. Consequently, the adapted LLM can effectively summon various off-the-shelf modality conversion models from the model zoos to generate non-text responses. This circumvents the necessity for complicated pretraining that typically requires immense quantities of paired multi-modal data, while simultaneously inheriting the extensive knowledge of LLMs and the ability of high-quality generative models. To evaluate and compare the adapted multi-modal LLM with its traditional counterparts, we have constructed a multi-modal instruction benchmark that solicits diverse modality outputs. The experiment results reveal that, with minimal training, LLMs can be conveniently adapted to comprehend requests for non-text responses, thus achieving higher flexibility in multi-modal scenarios. Code and data will be made available at https://github.com/xinke-wang/SwitchGPT.
    摘要 大型语言模型(LLM),主要基于文本数据集进行训练,在理解和执行复杂语言指令方面表现出色,但在产生非文本回应方面则有问题。同时,模式转换模型,如文本转像,优秀地生成高品质的图像,但由于缺乏广泛的文本预训,因此只能适应特定的图像描述,而不能理解更复杂的指令。为了跨越这个差距,我们提出了一个新的方法,\methodname,从模式转换角度来进行。我们specifically使用了一个最小的数据集,将LLMs教育到识别所需的输出模式,以实现对不同模式的认知。因此,已经适应了LLMs的改进后,可以从模型zoo中选择多种高品质的生成模型,以生成非文本回应。这样可以避免需要大量的对整合多个模式的预训,同时继承了LLMs的广泛知识和高品质生成模型的能力。为了评估和比较改进后的多个模式LMM,我们建立了一个多模式指令库, solicits多种模式的回应。实验结果显示,仅需要最小的训练,LLMs可以轻松地适应产生非文本回应,因此在多模式enario中得到更高的灵活性。代码和数据将会在https://github.com/xinke-wang/SwitchGPT中公开。

Road Disease Detection based on Latent Domain Background Feature Separation and Suppression

  • paper_url: http://arxiv.org/abs/2309.07616
  • repo_url: None
  • paper_authors: Juwu Zheng, Jiangtao Ren
  • for: 这篇论文目的是提出一种新的 Latent Domain Background Feature Separation and Suppression(LDBFSS)网络,用于减少背景信息的影响,提高道路疾病检测的准确率。
  • methods: 该论文提出了一种新的 LDBFSS 网络，包括潜在领域发现模块、领域对抗学习模块和对比学习模块。通过无需领域监督的背景信息分离与抑制，并结合对比学习增强目标特征表示，LDBFSS 网络可以减少背景信息的影响，提高道路病害检测的准确率。
  • results: 实验结果表明,与最佳模型相比,LDBFSS 网络在 GRDDC 数据集上提高了约4%,在 CNRDD 数据集上提高了4.6%。这些结果证明了 LDBFSS 网络的有效性和优越性。
    Abstract Road disease detection is challenging due to the the small proportion of road damage in target region and the diverse background,which introduce lots of domain information.Besides, disease categories have high similarity,makes the detection more difficult. In this paper, we propose a new LDBFSS(Latent Domain Background Feature Separation and Suppression) network which could perform background information separation and suppression without domain supervision and contrastive enhancement of object features.We combine our LDBFSS network with YOLOv5 model to enhance disease features for better road disease detection. As the components of LDBFSS network, we first design a latent domain discovery module and a domain adversarial learning module to obtain pseudo domain labels through unsupervised method, guiding domain discriminator and model to train adversarially to suppress background information. In addition, we introduce a contrastive learning module and design k-instance contrastive loss, optimize the disease feature representation by increasing the inter-class distance and reducing the intra-class distance for object features. We conducted experiments on two road disease detection datasets, GRDDC and CNRDD, and compared with other models,which show an increase of nearly 4% on GRDDC dataset compared with optimal model, and an increase of 4.6% on CNRDD dataset. Experimental results prove the effectiveness and superiority of our model.
    摘要 道路病害检测具有病害在目标区域中占比小、背景多样的特点，这会引入大量领域信息；同时各病害类别之间相似度高，使检测更加困难。在本文中，我们提出了一种新的 LDBFSS（潜在领域背景特征分离与抑制）网络，可以在无需领域监督的情况下实现背景信息的分离与抑制，并对目标特征进行对比增强。我们将 LDBFSS 网络与 YOLOv5 模型结合，以增强病害特征，从而更好地进行道路病害检测。作为 LDBFSS 网络的组成部分，我们首先设计了潜在领域发现模块和领域对抗学习模块，通过无监督方法获得伪领域标签，引导领域判别器与模型进行对抗训练，以抑制背景信息。此外，我们引入对比学习模块并设计 k-instance 对比损失，通过增大类间距离、减小类内距离来优化病害特征表示。我们在 GRDDC 和 CNRDD 两个道路病害检测数据集上进行了实验，并与其他模型进行比较：相较最优模型，在 GRDDC 数据集上提升了约 4%，在 CNRDD 数据集上提升了 4.6%。实验结果证明了我们模型的有效性和优越性。
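Domain-adversarial learning of the kind described above is commonly implemented with a gradient reversal layer: the domain discriminator is trained normally, while the reversed gradient pushes the feature extractor to suppress domain (background) cues. The sketch below shows that standard component with stand-in modules; it is a generic illustration, not the LDBFSS code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backwards."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

if __name__ == "__main__":
    feature_extractor = nn.Linear(64, 32)      # stand-in for the detection backbone
    domain_head = nn.Linear(32, 2)             # pseudo-domain discriminator
    x = torch.randn(8, 64)
    pseudo_domain = torch.randint(0, 2, (8,))  # e.g. labels from latent-domain discovery

    feats = feature_extractor(x)
    logits = domain_head(grad_reverse(feats, lam=0.5))
    loss = nn.functional.cross_entropy(logits, pseudo_domain)
    loss.backward()
    # The discriminator receives the normal gradient; the backbone receives the
    # reversed one, which suppresses domain-specific (background) information.
    print(float(loss))
```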

Learning Quasi-Static 3D Models of Markerless Deformable Linear Objects for Bimanual Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2309.07609
  • repo_url: https://github.com/ppi-put/neural_dlo_model
  • paper_authors: Piotr Kicki, Michał Bidziński, Krzysztof Walas
  • for: This paper focuses on the robotic manipulation of deformable linear objects (DLOs) and proposes a new learning-based 3D model based on the Transformer architecture to achieve superior accuracy.
  • methods: The paper uses several learning-based 3D models of DLOs and proposes a new one based on the Transformer architecture, as well as introduces a data augmentation technique to improve the prediction performance of the models.
  • results: The proposed model achieves superior accuracy on several challenging datasets, even on DLOs of different lengths, and demonstrates its applicability in the task of shaping a DLO.
    Abstract The robotic manipulation of Deformable Linear Objects (DLOs) is a vital and challenging task that is important in many practical applications. Classical model-based approaches to this problem require an accurate model to capture how robot motions affect the deformation of the DLO. Nowadays, data-driven models offer the best tradeoff between quality and computation time. This paper analyzes several learning-based 3D models of the DLO and proposes a new one based on the Transformer architecture that achieves superior accuracy, even on the DLOs of different lengths, thanks to the proposed scaling method. Moreover, we introduce a data augmentation technique, which improves the prediction performance of almost all considered DLO data-driven models. Thanks to this technique, even a simple Multilayer Perceptron (MLP) achieves close to state-of-the-art performance while being significantly faster to evaluate. In the experiments, we compare the performance of the learning-based 3D models of the DLO on several challenging datasets quantitatively and demonstrate their applicability in the task of shaping a DLO.
    摘要 机器人对可变形线性物体（DLO）的操作是一项重要且具有挑战性的任务，在许多实际应用中具有重要意义。经典的基于模型的方法需要一个准确的模型来刻画机器人运动对 DLO 形变的影响。如今，数据驱动模型能够在质量与计算时间之间提供最佳的折中。本文分析了多种基于学习的 DLO 三维模型，并提出了基于 Transformer 架构的新模型，得益于所提出的缩放方法，即使 DLO 长度不同也能实现更高的精度。此外，我们还引入了一种数据增强技术，该技术可以提升几乎所有所考察的 DLO 数据驱动模型的预测性能。借助这一技术，即使是简单的多层感知机（MLP）也能达到接近最先进的性能，而且评估速度快得多。在实验中，我们在多个具有挑战性的数据集上对基于学习的 DLO 三维模型的性能进行了定量比较，并展示了它们在 DLO 塑形任务中的适用性。

Universality of underlying mechanism for successful deep learning

  • paper_url: http://arxiv.org/abs/2309.07537
  • repo_url: None
  • paper_authors: Yuval Meir, Yarden Tzach, Shiri Hodassman, Ofek Tevet, Ido Kanter
  • for: 提高深度学习模型的准确率和计算复杂度
  • methods: 使用单个滤波器的质量量化方法,找到小集合的可能输出标签,并通过层次进行进一步加工,提高信噪比和准确率
  • results: 验证了一种通用机制,可以提高VGG-16和EfficientNet-B0模型在CIFAR-100和ImageNet datasets上的准确率,并且显示了隐藏层数量和输出标签数量之间的关系。
    Abstract An underlying mechanism for successful deep learning (DL) with a limited deep architecture and dataset, namely VGG-16 on CIFAR-10, was recently presented based on a quantitative method to measure the quality of a single filter in each layer. In this method, each filter identifies small clusters of possible output labels, with additional noise selected as labels out of the clusters. This feature is progressively sharpened with the layers, resulting in an enhanced signal-to-noise ratio (SNR) and higher accuracy. In this study, the suggested universal mechanism is verified for VGG-16 and EfficientNet-B0 trained on the CIFAR-100 and ImageNet datasets with the following main results. First, the accuracy progressively increases with the layers, whereas the noise per filter typically progressively decreases. Second, for a given deep architecture, the maximal error rate increases approximately linearly with the number of output labels. Third, the average filter cluster size and the number of clusters per filter at the last convolutional layer adjacent to the output layer are almost independent of the number of dataset labels in the range [3, 1,000], while a high SNR is preserved. The presented DL mechanism suggests several techniques, such as applying filter's cluster connections (AFCC), to improve the computational complexity and accuracy of deep architectures and furthermore pinpoints the simplification of pre-existing structures while maintaining their accuracies.
    摘要 深度学习(DL)中的一种基本机制,即VGG-16在CIFAR-10上的某些研究,提出了一种量化方法来衡量每个滤波器的质量。这种方法是,每个滤波器可以找到小 clusters of possible output labels,并且选择这些 clusters 中的陌生标签作为输出。这种特征逐渐增强,导致输出signal-to-noise ratio(SNR)的提高和更高的准确率。在这个研究中,这种建议的通用机制得到了VGG-16和EfficientNet-B0在CIFAR-100和ImageNet数据集上的验证,以下是主要的结果:1. 准确度逐渐增加,而噪音每个滤波器通常逐渐减少。2. 对于给定的深度架构,最大错误率随输出标签的数量 approximately 线性增加。3. 最后一层的卷积层附近的滤波器集群大小和每个滤波器的集群数量在[3, 1,000]个标签范围内基本不变,而保持高SNR。这种深度学习机制建议了一些技术,如应用滤波器集群连接(AFCC),以提高深度架构的计算复杂性和准确率,同时简化现有结构而保持其准确性。

Text-to-Image Models for Counterfactual Explanations: a Black-Box Approach

  • paper_url: http://arxiv.org/abs/2309.07944
  • repo_url: None
  • paper_authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie
  • for: 本文目的是生成Counterfactual Explanations(CEs),即通过修改最少必需的特征来改变分类器对给定图像的预测。
  • methods: 本文提出的方法是基于Distillation的黑盒Counterfactual技术,无需分类器结构、参数或梯度。首先,TIME引入了两种不同的偏见到Stable Diffusion中:context bias和class bias。context bias是图像结构相关的偏见,class bias是通过目标分类器学习的类特征偏见。然后,通过学习这两种偏见,找到最佳的latent code,并使用目标类token来重新生成图像,以生成counterfactual解释。
  • results: 对比 précédente方法,TIME可以在黑盒 Setting中生成相当有效的counterfactual解释。
    Abstract This paper addresses the challenge of generating Counterfactual Explanations (CEs), involving the identification and modification of the fewest necessary features to alter a classifier's prediction for a given image. Our proposed method, Text-to-Image Models for Counterfactual Explanations (TIME), is a black-box counterfactual technique based on distillation. Unlike previous methods, this approach requires solely the image and its prediction, omitting the need for the classifier's structure, parameters, or gradients. Before generating the counterfactuals, TIME introduces two distinct biases into Stable Diffusion in the form of textual embeddings: the context bias, associated with the image's structure, and the class bias, linked to class-specific features learned by the target classifier. After learning these biases, we find the optimal latent code applying the classifier's predicted class token and regenerate the image using the target embedding as conditioning, producing the counterfactual explanation. Extensive empirical studies validate that TIME can generate explanations of comparable effectiveness even when operating within a black-box setting.
    摘要 本文研究反事实解释（CEs）的生成问题，即找出并修改改变分类器对给定图像预测所需的最少特征。我们提出的方法——用于反事实解释的文本生成图像模型（TIME）——是一种基于蒸馏的黑盒反事实技术。与以往方法不同，该方法只需要图像及其预测结果，而不需要分类器的结构、参数或梯度。在生成反事实样本之前，TIME 以文本嵌入的形式向 Stable Diffusion 引入两种不同的偏置：与图像结构相关的上下文偏置，以及与目标分类器所学类别特征相关的类别偏置。学习这些偏置之后，我们利用分类器预测的类别 token 找到最优的潜变量编码，并以目标嵌入为条件重新生成图像，从而得到反事实解释。大量实证研究验证了 TIME 即使在黑盒设定下也能生成具有可比有效性的解释。

A Multi-scale Generalized Shrinkage Threshold Network for Image Blind Deblurring in Remote Sensing

  • paper_url: http://arxiv.org/abs/2309.07524
  • repo_url: None
  • paper_authors: Yujie Feng, Yin Yang, Xiaohong Fan, Zhengpeng Zhang, Jianping Zhang
  • for: 这项研究旨在提升遥感图像质量，应对传感器技术限制和复杂成像环境导致的遥感图像退化问题。
  • methods: 提出了一种新的盲去模糊学习框架，交替迭代更新收缩阈值、模糊核和图像，并为网络设计提供了理论基础。此外，还提出了可学习的模糊核近端映射模块以及图像域的深度近端映射模块。
  • results: 实验结果表明，我们的 MGSTNet 框架在遥感图像数据集上优于现有的去模糊方法。
    Abstract Remote sensing images are essential for many earth science applications, but their quality can be degraded due to limitations in sensor technology and complex imaging environments. To address this, various remote sensing image deblurring methods have been developed to restore sharp, high-quality images from degraded observational data. However, most traditional model-based deblurring methods usually require predefined hand-craft prior assumptions, which are difficult to handle in complex applications, and most deep learning-based deblurring methods are designed as a black box, lacking transparency and interpretability. In this work, we propose a novel blind deblurring learning framework based on alternating iterations of shrinkage thresholds, alternately updating blurring kernels and images, with the theoretical foundation of network design. Additionally, we propose a learnable blur kernel proximal mapping module to improve the blur kernel evaluation in the kernel domain. Then, we proposed a deep proximal mapping module in the image domain, which combines a generalized shrinkage threshold operator and a multi-scale prior feature extraction block. This module also introduces an attention mechanism to adaptively adjust the prior importance, thus avoiding the drawbacks of hand-crafted image prior terms. Thus, a novel multi-scale generalized shrinkage threshold network (MGSTNet) is designed to specifically focus on learning deep geometric prior features to enhance image restoration. Experiments demonstrate the superiority of our MGSTNet framework on remote sensing image datasets compared to existing deblurring methods.
    摘要 遥感图像是许多地球科学应用中必不可少的，但由于传感器技术的限制和复杂的成像环境，其质量可能会退化。为此，人们提出了各种遥感图像去模糊方法，以从退化的观测数据中恢复清晰、高质量的图像。然而，大多数传统的基于模型的去模糊方法通常需要预先设定的手工先验假设，难以应对复杂的应用场景；而大多数基于深度学习的去模糊方法被设计成黑盒，缺乏透明性和可解释性。在这项工作中，我们提出了一种新的盲去模糊学习框架，以收缩阈值、模糊核和图像的交替迭代为基础，并具有网络设计的理论依据。此外，我们提出了可学习的模糊核近端映射模块，以改进核域中模糊核的估计；随后在图像域提出了深度近端映射模块，将广义收缩阈值算子与多尺度先验特征提取块相结合，并引入注意力机制来自适应地调整先验的重要性，从而避免手工图像先验项的缺陷。由此设计出的多尺度广义收缩阈值网络（MGSTNet）专注于学习深层几何先验特征，以增强图像复原。实验表明，我们的 MGSTNet 框架在遥感图像数据集上优于现有的去模糊方法。
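The shrinkage-threshold building block at the heart of such unrolled deblurring networks is the soft-thresholding (proximal) operator. A minimal learnable version is sketched below; the per-channel threshold parameterization is an assumption, not the paper's exact generalized operator.

```python
import torch
import torch.nn as nn

def soft_threshold(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Elementwise soft-thresholding: sign(x) * max(|x| - theta, 0)."""
    return torch.sign(x) * torch.relu(torch.abs(x) - theta)

class LearnableShrinkage(nn.Module):
    """Per-channel learnable thresholds, usable as one stage of an unrolled network."""
    def __init__(self, channels: int, init: float = 0.1):
        super().__init__()
        self.theta = nn.Parameter(torch.full((1, channels, 1, 1), init))

    def forward(self, x):
        # Softplus keeps the effective threshold non-negative during training.
        return soft_threshold(x, nn.functional.softplus(self.theta))

if __name__ == "__main__":
    layer = LearnableShrinkage(channels=16)
    feats = torch.randn(2, 16, 32, 32)
    out = layer(feats)
    print(out.shape, float((out == 0).float().mean()))  # sparsity induced by shrinkage
```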

Dhan-Shomadhan: A Dataset of Rice Leaf Disease Classification for Bangladeshi Local Rice

  • paper_url: http://arxiv.org/abs/2309.07515
  • repo_url: None
  • paper_authors: Md. Fahad Hossain
  • for: 这个论文是为了提供一个大量的rice叶病病例图像集,用于Computer Vision和图像识别技术的研究和应用。
  • methods: 本论文使用了两种背景变化,包括场景背景图像和白色背景图像,收集了1106张五种危害rice的病病例图像。
  • results: 本论文通过对这些图像进行分类和识别,得到了对rice叶病的识别和检测的结果。
    Abstract This dataset represents almost all of the harmful rice diseases found in Bangladesh. It consists of 1,106 images of five harmful diseases, namely Brown Spot, Leaf Scald, Rice Blast, Rice Tungro, and Sheath Blight, captured with two background variations: field-background pictures and white-background pictures. The two background variations help models trained on the dataset perform more accurately, so that users can apply the data in the field as well as use the white-background images for decision making. The data were collected from rice fields in the Dhaka Division. This dataset can be used for rice leaf disease classification and disease detection with computer vision and pattern recognition across different rice leaf diseases.
    摘要 这个数据集涵盖了孟加拉国水稻的绝大多数危害性病害。数据集包含五种病害的 1,106 张图像，分别是褐斑病（Brown Spot）、叶灼病（Leaf Scald）、稻瘟病（Rice Blast）、东格鲁病（Rice Tungro）和纹枯病（Sheath Blight），并采用两种不同的背景：田间背景图像和白色背景图像。两种背景有助于模型更准确地工作，使用户既可以在田间使用这些数据，也可以利用白色背景图像进行决策。数据采集自达卡专区的稻田。该数据集可用于基于计算机视觉与模式识别的水稻叶部病害分类与检测。

RecycleNet: Latent Feature Recycling Leads to Iterative Decision Refinement

  • paper_url: http://arxiv.org/abs/2309.07513
  • repo_url: None
  • paper_authors: Gregor Koehler, Tassilo Wald, Constantin Ulrich, David Zimmerer, Paul F. Jaeger, Jörg K. H. Franke, Simon Kohl, Fabian Isensee, Klaus H. Maier-Hein
  • for: 提高神经网络的决策精度,允许神经网络在不同角度重新考虑初始决策,从而提高决策质量。
  • methods: 提出了一种名为 RecycleNet 的潜在特征回收方法，通过在多个回收步骤中将输出反馈到较早的网络层，使神经网络能够从不同角度重新审视初始决策，从而提高决策质量。
  • results: 在医学图像分割任务上进行评估,显示了 RecycleNet 可以在不同的分割benchmark上提高决策精度,并且与当前最佳分割方法相比,也可以获得更好的性能。
    Abstract Despite the remarkable success of deep learning systems over the last decade, a key difference still remains between neural network and human decision-making: As humans, we cannot only form a decision on the spot, but also ponder, revisiting an initial guess from different angles, distilling relevant information, arriving at a better decision. Here, we propose RecycleNet, a latent feature recycling method, instilling the pondering capability for neural networks to refine initial decisions over a number of recycling steps, where outputs are fed back into earlier network layers in an iterative fashion. This approach makes minimal assumptions about the neural network architecture and thus can be implemented in a wide variety of contexts. Using medical image segmentation as the evaluation environment, we show that latent feature recycling enables the network to iteratively refine initial predictions even beyond the iterations seen during training, converging towards an improved decision. We evaluate this across a variety of segmentation benchmarks and show consistent improvements even compared with top-performing segmentation methods. This allows trading increased computation time for improved performance, which can be beneficial, especially for safety-critical applications.
    摘要 尽管深度学习系统在过去十年内取得了非常出色的成绩,但是人类决策和神经网络决策之间仍然存在一定的区别:人类可以不仅立即做出决策,还可以思考、重新考虑、从不同的角度来到达更好的决策。为了让神经网络具备这种“pondering”能力,我们提出了RecycleNet方法,即 latent feature recycling 方法,允许神经网络在多个循环步骤中反复利用初始决策,以提高决策的准确性。这种方法对神经网络 Architecture 做出了最少的假设,因此可以在各种上下文中实现。使用医学影像 segmentation 作为评估环境,我们表明了 latent feature recycling 可以让神经网络在训练过程中没有看到的多个循环步骤中反复更新初始预测,并且在多个 segmentation 标准 benchmark 上达到了更高的性能。这意味着可以通过增加计算时间来换取更好的性能,这在安全关键应用中可能是有利的。

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

  • paper_url: http://arxiv.org/abs/2309.07509
  • repo_url: None
  • paper_authors: Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
  • for: 这篇论文的目的是生成真实的讲话表情,具有各种应用。
  • methods: DiffTalker 模型使用了音频和特征点共同驱动,以解决直接将扩散模型应用到音频控制的挑战。DiffTalker 包括两个代理网络:一个基于 transformer 的特征点完成网络,以确保几何精度,以及一个基于扩散的脸部生成网络,以捕捉具有纹理详细的讲话表情。
  • results: DiffTalker 能够生成具有清晰度和几何精度的讲话表情,不需要额外对 audio 和图像特征进行对齐。实验结果表明,DiffTalker 在生成讲话表情方面表现出色,无需额外对 audio 和图像特征进行对齐。
    Abstract Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.
    摘要 通过音频和标点驱动,DiffTalker模型可生成真实的说话脸。DiffTalker解决了直接应用扩散模型到音频控制的挑战,传统上是通过文本图像对对应来训练。DiffTalker包括两个代理网络:一个基于变换器的标点完成网络以确保几何准确,以及一个基于扩散的面Generated network дляTexture详细。标点在将音频和图像领域连接起来的过程中扮演着关键的角色,使得可以借助预训练的扩散模型来 incorporate知识。这种创新的方法可以高效地生成敏捷说话的脸。实验结果表明DiffTalker可以生成清晰和几何准确的说话脸,无需额外对音频和图像特征进行对齐。

Efficiently Robustify Pre-trained Models

  • paper_url: http://arxiv.org/abs/2309.07499
  • repo_url: None
  • paper_authors: Nishant Jain, Harkirat Behl, Yogesh Singh Rawat, Vibhav Vineet
  • for: 本研究旨在探讨大规模深度学习模型在真实世界中的稳定性,以及现有robustification方法是否可 scalable。
  • methods: 我们首先在不同类型的扰动和数据集下对这些大规模模型进行基准测试，以展示它们在此类真实世界分布偏移下的性能下降。随后，我们讨论了基于完整模型微调的现有鲁棒化方案的缺点，包括计算开销巨大以及可能使模型遗忘部分期望的特性。最后，我们受知识迁移文献启发，提出了一种简单且低成本的方法：先以较低的计算代价鲁棒化小模型，再将其作为教师来微调大规模模型中的一小部分参数，从而显著降低整体计算开销，同时保留模型的迁移学习和零样本评估能力。
  • results: 我们在多种视觉扰动数据集（包括 ImageNet-C、R、S、A）以及迁移学习、零样本评估设定下对所提方法进行了评估。结果显示，该方法能够高效地为大规模模型引入鲁棒性，所需时间远少于完整微调，并且保留了原模型的迁移学习和零样本性能，而这是现有方法都无法做到的。
    Abstract A recent trend in deep learning algorithms has been towards training large scale models, having high parameter count and trained on big dataset. However, robustness of such large scale models towards real-world settings is still a less-explored topic. In this work, we first benchmark the performance of these models under different perturbations and datasets thereby representing real-world shifts, and highlight their degrading performance under these shifts. We then discuss on how complete model fine-tuning based existing robustification schemes might not be a scalable option given very large scale networks and can also lead them to forget some of the desired characterstics. Finally, we propose a simple and cost-effective method to solve this problem, inspired by knowledge transfer literature. It involves robustifying smaller models, at a lower computation cost, and then use them as teachers to tune a fraction of these large scale networks, reducing the overall computational overhead. We evaluate our proposed method under various vision perturbations including ImageNet-C,R,S,A datasets and also for transfer learning, zero-shot evaluation setups on different datasets. Benchmark results show that our method is able to induce robustness to these large scale models efficiently, requiring significantly lower time and also preserves the transfer learning, zero-shot properties of the original model which none of the existing methods are able to achieve.
    摘要 现在的深度学习算法趋势是训练大规模模型,具有高参数计数和在大量数据上训练。然而,这些大规模模型在实际场景中的稳定性仍然是一个未经探索的话题。在这项工作中,我们首先对这些模型在不同的扰动和数据集上进行了性能测试,并发现它们在这些扰动下的性能下降。然后,我们讨论了完整模型练习基于现有的Robustification方案可能不是一个可执行的选择,因为它们可能会使模型忘记一些愿望的特征。最后,我们提出了一种简单而经济的解决方案,基于知识传递文献。它是通过强化小型模型,然后使用这些小型模型作为老师来微调一部分这些大规模网络,从而降低总计算成本。我们对各种视觉扰动,包括ImageNet-C、R、S、A数据集以及转移学习、零shot评估集成进行了评估。结果表明,我们的方法能够有效地带来这些大规模模型的稳定性,需要较低的计算成本,同时保留原始模型的转移学习、零shot特性,与现有方法不同。

EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization

  • paper_url: http://arxiv.org/abs/2309.07471
  • repo_url: https://github.com/minnjung/ep2p-loc
  • paper_authors: Minjung Kim, Junseo Koo, Gunhee Kim
  • for: This work addresses visual localization, i.e., estimating the 6-DoF camera pose of a query image from visual input.
  • methods: It proposes EP2P-Loc, a new large-scale visual localization method that relies on 2D-3D feature matching and improves pose estimation accuracy through end-to-end training.
  • results: Experiments on newly curated large-scale indoor and outdoor benchmarks show that the method achieves the best performance compared with existing visual localization and image-to-point-cloud registration approaches.
    Abstract Visual localization is the task of estimating a 6-DoF camera pose of a query image within a provided 3D reference map. Thanks to recent advances in various 3D sensors, 3D point clouds are becoming a more accurate and affordable option for building the reference map, but research to match the points of 3D point clouds with pixels in 2D images for visual localization remains challenging. Existing approaches that jointly learn 2D-3D feature matching suffer from low inliers due to representational differences between the two modalities, and the methods that bypass this problem into classification have an issue of poor refinement. In this work, we propose EP2P-Loc, a novel large-scale visual localization method that mitigates such appearance discrepancy and enables end-to-end training for pose estimation. To increase the number of inliers, we propose a simple algorithm to remove invisible 3D points in the image, and find all 2D-3D correspondences without keypoint detection. To reduce memory usage and search complexity, we take a coarse-to-fine approach where we extract patch-level features from 2D images, then perform 2D patch classification on each 3D point, and obtain the exact corresponding 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP for end-to-end training. In the experiments on newly curated large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, we show that our method achieves the state-of-the-art performance compared to existing visual localization and image-to-point cloud registration methods.
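A rough sketch of the coarse-to-fine matching step described in the abstract, under assumed shapes (this is not the authors' EP2P-Loc code): each visible 3D point's feature is classified against patch-level image features, and a soft expectation over patch centers yields differentiable pixel coordinates that a differentiable PnP solver could consume downstream.

```python
# Illustrative patch classification + soft-argmax over patch centers.
import torch
import torch.nn.functional as F

def soft_pixel_coords(point_feats, patch_feats, patch_centers, tau=0.07):
    """
    point_feats:   (N, D) features of visible 3D points
    patch_feats:   (P, D) features of image patches
    patch_centers: (P, 2) pixel coordinates of patch centers
    returns:       (N, 2) differentiable 2D coordinates per 3D point
    """
    sim = point_feats @ patch_feats.t() / tau       # (N, P) patch-classification logits
    probs = F.softmax(sim, dim=1)
    return probs @ patch_centers                    # expectation over patch centers

N, P, D = 128, 196, 256
pts = F.normalize(torch.randn(N, D), dim=1)
patches = F.normalize(torch.randn(P, D), dim=1)
centers = torch.rand(P, 2) * 224                    # a 224x224 image is assumed
uv = soft_pixel_coords(pts, patches, centers)       # (3D point, uv) pairs would feed a
                                                    # differentiable PnP solver downstream
```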

Research on self-cross transformer model of point cloud change detecter

  • paper_url: http://arxiv.org/abs/2309.07444
  • repo_url: None
  • paper_authors: Xiaoxu Ren, Haili Sun, Zhenxin Zhang
  • for: This paper focuses on detecting changes in 3D point clouds to support change detection during urban construction, helping to ensure project integrity and reduce labor costs.
  • methods: It proposes a 3D point-cloud change-detection network built around a cross-transformer module, which is validated and tested.
  • results: Test results show that the network detects changes in 3D point clouds with high accuracy and fast response.
    Abstract With the vigorous development of the urban construction industry, engineering deformation or changes often occur during construction. To combat this, changes must be detected so that construction defects can be found in time, project integrity is ensured, and labor costs are reduced. Prior work on change detection in 3D point clouds has mostly relied on traditional threshold-distance methods (C2C, M3C2, M3C2-EP), while some approaches convert the point clouds into a DSM, which discards much of the original information. Although deep learning is widely used in remote sensing, for 3D point-cloud change detection the data are usually first converted into two-dimensional patches, and neural networks are rarely applied to the points directly; we argue that the network should operate at the level of pixels or points. In this article, we therefore build a network for 3D point-cloud change detection and propose a new cross-transformer module suited to change detection. We also simulate tunneling data for change detection and run test experiments with our network.
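The cross-transformer idea, attending from one point-cloud epoch to the other so that per-point features become change-aware, can be illustrated with standard attention modules. The block below is a generic sketch with assumed shapes, not the paper's exact architecture.

```python
# Generic self/cross attention block over features of two point-cloud epochs.
import torch
import torch.nn as nn

class SelfCrossBlock(nn.Module):
    def __init__(self, d_model=128, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, feats_t1, feats_t2):
        # self-attention within epoch 1, then cross-attention to epoch 2
        h, _ = self.self_attn(feats_t1, feats_t1, feats_t1)
        h = self.norm1(feats_t1 + h)
        c, _ = self.cross_attn(h, feats_t2, feats_t2)
        return self.norm2(h + c)            # change-aware features for epoch 1

f1 = torch.randn(2, 1024, 128)   # (batch, points, channels) for the scan at time t1
f2 = torch.randn(2, 1024, 128)   # the scan at time t2
changed = SelfCrossBlock()(f1, f2)  # per-point features fed to a change-prediction head
```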

DePT: Decoupled Prompt Tuning

  • paper_url: http://arxiv.org/abs/2309.07439
  • repo_url: https://github.com/koorye/dept
  • paper_authors: Ji Zhang, Shihan Wu, Lianli Gao, Hengtao Shen, Jingkuan Song
  • for: Addresses the Base-New Tradeoff (BNT) in prompt tuning, where better generalization to the base task comes at the cost of worse generalization to new tasks, and vice versa.
  • methods: Proposes the Decoupled Prompt Tuning (DePT) framework, which during prompt tuning isolates base-specific knowledge from the feature channels into a separate feature space, preserving task-shared knowledge in the original feature space and thus achieving better zero-shot generalization on new tasks.
  • results: Extensive experiments on 11 datasets show that DePT is highly flexible and effective and can improve all existing prompt-tuning methods. Code and pretrained models are available at https://github.com/Koorye/DePT.
    Abstract This work breaks through the Base-New Tradeoff (BNT) dilemma in prompt tuning, i.e., the better the tuned model generalizes to the base (or target) task, the worse it generalizes to new tasks, and vice versa. Specifically, through an in-depth analysis of the learned features of the base and new tasks, we observe that the BNT stems from a channel bias issue: the vast majority of feature channels are occupied by base-specific knowledge, resulting in the collapse of the task-shared knowledge important to new tasks. To address this, we propose the Decoupled Prompt Tuning (DePT) framework, which decouples base-specific knowledge from feature channels into an isolated feature space during prompt tuning, so as to maximally preserve task-shared knowledge in the original feature space for achieving better zero-shot generalization on new tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods, hence it can improve all of them. Extensive experiments on 11 datasets show the strong flexibility and effectiveness of DePT. Our code and pretrained models are available at https://github.com/Koorye/DePT.
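A hedged sketch of the decoupling idea: base-specific knowledge is learned through a separate projection head while the original feature space is left intact for zero-shot use on new classes. The dimensions and head design below are my assumptions; the official implementation is at https://github.com/Koorye/DePT.

```python
# Two heads over frozen encoder features: an isolated space for the base task and
# the original space for zero-shot transfer. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHeads(nn.Module):
    def __init__(self, feat_dim=512, proj_dim=128, n_base_classes=100):
        super().__init__()
        self.base_proj = nn.Linear(feat_dim, proj_dim)      # isolated space for base knowledge
        self.base_head = nn.Linear(proj_dim, n_base_classes)

    def forward(self, image_feats, text_feats):
        # base-task branch: trained with cross-entropy on base classes
        base_logits = self.base_head(self.base_proj(image_feats))
        # task-shared branch: cosine similarity in the *original* feature space,
        # usable zero-shot for new classes whose text embeddings arrive at test time
        zs_logits = F.normalize(image_feats, dim=1) @ F.normalize(text_feats, dim=1).t()
        return base_logits, zs_logits

img = torch.randn(8, 512)       # features from a frozen image encoder
txt = torch.randn(100, 512)     # text embeddings of class prompts
base_logits, zs_logits = DecoupledHeads()(img, txt)
```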

Physical Invisible Backdoor Based on Camera Imaging

  • paper_url: http://arxiv.org/abs/2309.07428
  • repo_url: None
  • paper_authors: Yusheng Guo, Nan Zhong, Zhenxing Qian, Xinpeng Zhang
  • for: This paper proposes a physical invisible backdoor attack that compromises a model's normal behavior without modifying any image pixels.
  • methods: The attack is based on camera imaging: a camera-identification model trained on specific camera IDs extracts camera-fingerprint features, and these are combined with the properties of the CFA interpolation algorithm into a special network architecture onto which the backdoor is mounted.
  • results: Experiments show that the method effectively compromises classical models such as ResNet18 on a new dataset of 21,500 images and is robust against various backdoor defenses.
    Abstract A backdoor attack aims to compromise a model so that it returns an adversary-chosen output when a specific trigger pattern appears, yet behaves normally on clean inputs. Current backdoor attacks require changing the pixels of clean images, which results in poor stealthiness and increases the difficulty of physical implementation. This paper proposes a novel physical invisible backdoor based on camera imaging that does not change natural image pixels. Specifically, a compromised model returns a target label for images taken by a particular camera, while it returns correct results for other images. To implement and evaluate the proposed backdoor, we take shots of different objects from multiple angles using multiple smartphones to build a new dataset of 21,500 images. Conventional backdoor attacks work ineffectively with some classical models, such as ResNet18, on this dataset. Therefore, we propose a three-step training strategy to mount the backdoor attack. First, we design and train a camera identification model with the phone IDs to extract the camera fingerprint feature. Subsequently, we design a special network architecture, which is easily compromised by our backdoor attack, by leveraging the attributes of the CFA interpolation algorithm and combining it with the feature extraction block of the camera identification model. Finally, we transfer the backdoor from this special architecture to a classical architecture model via teacher-student distillation learning. Since the trigger of our method is tied to a specific phone, our attack works effectively in the physical world. Experimental results demonstrate the feasibility of our proposed approach and its robustness against various backdoor defenses.

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

  • paper_url: http://arxiv.org/abs/2309.07409
  • repo_url: https://github.com/ffzzy840304/masked-pdpp
  • paper_authors: Fen Fang, Yun Liu, Ali Koksal, Qianli Xu, Joo-Hwee Lim
  • for: Addresses procedure planning in instructional videos, where an agent must quickly discern many action types (e.g., pour milk, pour water, open lid, close lid) from brief visual observations and capture the intricate semantic relations between action types and task goals.
  • methods: Proposes a simple yet effective enhancement, a masked diffusion model, in which the mask acts as a task-oriented attention filter so that the diffusion/denoising process concentrates on a subset of action types; task classification is further strengthened by a joint visual-text embedding whose text embedding comes from prompting a pre-trained vision-language model to focus on human actions.
  • results: Evaluated on three public datasets, the method achieves state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.
    Abstract A key challenge with procedure planning in instructional videos lies in how to handle a large decision space consisting of a multitude of action types that belong to various tasks. To understand real-world video content, an AI agent must proficiently discern these action types (e.g., pour milk, pour water, open lid, close lid, etc.) based on brief visual observation. Moreover, it must adeptly capture the intricate semantic relation of the action types and task goals, along with the variable action sequences. Recently, notable progress has been made via the integration of diffusion models and visual representation learning to address the challenge. However, existing models employ rudimentary mechanisms to utilize task information to manage the decision space. To overcome this limitation, we introduce a simple yet effective enhancement - a masked diffusion model. The introduced mask acts akin to a task-oriented attention filter, enabling the diffusion/denoising process to concentrate on a subset of action types. Furthermore, to bolster the accuracy of task classification, we harness more potent visual representation learning techniques. In particular, we learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions. We evaluate the method on three public datasets and achieve state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.
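The task-oriented mask can be thought of as restricting the denoiser's per-step action distribution to the action types belonging to the predicted task. The snippet below is a toy illustration with assumed shapes, not the released Masked-PDPP code.

```python
# A task label selects the subset of action types the denoiser may place probability on.
import torch
import torch.nn.functional as F

def task_masked_action_logits(logits, task_to_actions, task_ids):
    """
    logits:          (B, T, A) raw per-step action logits from the denoiser
    task_to_actions: (K, A)    binary matrix, 1 where an action type belongs to a task
    task_ids:        (B,)      task label per sequence
    """
    mask = task_to_actions[task_ids]                 # (B, A)
    mask = mask.unsqueeze(1)                         # (B, 1, A), broadcast over steps
    return logits.masked_fill(mask == 0, float("-inf"))

B, T, A, K = 4, 6, 50, 10
logits = torch.randn(B, T, A)
task_to_actions = (torch.rand(K, A) > 0.7).long()
task_to_actions[:, 0] = 1                            # ensure every task keeps >= 1 action
task_ids = torch.randint(0, K, (B,))
masked = task_masked_action_logits(logits, task_to_actions, task_ids)
probs = F.softmax(masked, dim=-1)                    # mass concentrated on task-relevant actions
```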

Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

  • paper_url: http://arxiv.org/abs/2309.07403
  • repo_url: None
  • paper_authors: Lei Fan, Bo Liu, Haoxiang Li, Ying Wu, Gang Hua
  • for: Improves the flexibility and reliability of visual recognition systems by addressing both misclassification among known classes and undesired behavior on unknown-class images.
  • methods: Based on the theory of Subjective Logic, decision uncertainty is separately quantified as inter-class confusion and out-of-distribution ignorance, and comprehensive subjective opinions are obtained through evidence combination.
  • results: Experiments on synthetic data analysis, visual recognition, and open-set detection demonstrate that the method accurately quantifies both sources of uncertainty and supports flexible recognition.
    Abstract In real-world scenarios, typical visual recognition systems could fail under two major causes, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when they are unconfident between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, could be achieved via further evidence combinations. Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition.
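A small numeric sketch of the standard Dirichlet-based evidential quantities involved: evidence yields concentration parameters, low total evidence signals ignorance (out-of-distribution inputs), and evidence split across classes signals confusion. The confusion proxy below is a simplification of mine; the paper's exact evidence-combination rules differ.

```python
# Standard evidential quantities from network logits (illustrative).
import torch
import torch.nn.functional as F

def evidential_opinions(logits):
    """
    logits: (B, K) raw network outputs
    Returns per-class beliefs, per-sample ignorance (vacuity), and a simple confusion proxy.
    """
    evidence = F.softplus(logits)              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    S = alpha.sum(dim=1, keepdim=True)         # Dirichlet strength
    K = logits.shape[1]
    belief = evidence / S                      # per-class belief masses
    ignorance = K / S.squeeze(1)               # vacuity: little total evidence
    top2 = belief.topk(2, dim=1).values
    confusion = top2[:, 1] / (top2[:, 0] + 1e-8)   # proxy: how close the runner-up is
    return belief, ignorance, confusion

logits = torch.tensor([[5.0, 4.8, 0.1],   # plenty of evidence, split between classes 0 and 1
                       [0.1, 0.0, 0.2]])  # hardly any evidence for anything
belief, ign, conf = evidential_opinions(logits)
# row 0: low ignorance but near-tied top beliefs (confusion);
# row 1: high ignorance (vacuity) dominates.
```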

HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis

  • paper_url: http://arxiv.org/abs/2309.07400
  • repo_url: https://github.com/hku-medai/higt
  • paper_authors: Ziyu Guo, Weiqin Zhao, Shujun Wang, Lequan Yu
  • for: Studies the pyramid structure of gigapixel Whole Slide Images (WSIs) in computational pathology to capture information at multiple levels, from individual cell interactions to the tissue microenvironment.
  • methods: Proposes the Hierarchical Interaction Graph-Transformer (HIGT), which uses graph neural networks and Transformers as building blocks to learn both short-range local information and long-range global representations of WSI pyramids, and introduces a novel Bidirectional Interaction block to establish communication between the different levels.
  • results: On two public WSI datasets from TCGA (KICA and ESCA), HIGT outperforms existing hierarchical and non-hierarchical methods on both tumor subtyping and staging tasks.
    Abstract In computation pathology, the pyramid structure of gigapixel Whole Slide Images (WSIs) has recently been studied for capturing various information from individual cell interactions to tissue microenvironments. This hierarchical structure is believed to be beneficial for cancer diagnosis and prognosis tasks. However, most previous hierarchical WSI analysis works (1) only characterize local or global correlations within the WSI pyramids and (2) use only unidirectional interaction between different resolutions, leading to an incomplete picture of WSI pyramids. To this end, this paper presents a novel Hierarchical Interaction Graph-Transformer (i.e., HIGT) for WSI analysis. With Graph Neural Network and Transformer as the building commons, HIGT can learn both short-range local information and long-range global representation of the WSI pyramids. Considering that the information from different resolutions is complementary and can benefit each other during the learning process, we further design a novel Bidirectional Interaction block to establish communication between different levels within the WSI pyramids. Finally, we aggregate both coarse-grained and fine-grained features learned from different levels together for slide-level prediction. We evaluate our methods on two public WSI datasets from TCGA projects, i.e., kidney carcinoma (KICA) and esophageal carcinoma (ESCA). Experimental results show that our HIGT outperforms both hierarchical and non-hierarchical state-of-the-art methods on both tumor subtyping and staging tasks.
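The Bidirectional Interaction block can be pictured as two cross-attention passes, one letting coarse region-level tokens query fine patch-level tokens and one the reverse. The sketch below uses generic attention modules with assumed shapes and is not the HKU-MedAI release.

```python
# Bidirectional cross-attention between coarse and fine WSI features (illustrative).
import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.fine_to_coarse = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.coarse_to_fine = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, coarse, fine):
        # coarse: (B, Nc, D) region tokens; fine: (B, Nf, D) patch tokens
        c_upd, _ = self.fine_to_coarse(coarse, fine, fine)    # coarse queries fine detail
        f_upd, _ = self.coarse_to_fine(fine, coarse, coarse)  # fine queries global context
        return coarse + c_upd, fine + f_upd

coarse = torch.randn(1, 64, 256)
fine = torch.randn(1, 1024, 256)
coarse2, fine2 = BidirectionalInteraction()(coarse, fine)
# coarse- and fine-grained features are later pooled together for slide-level prediction
```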

Nucleus-aware Self-supervised Pretraining Using Unpaired Image-to-image Translation for Histopathology Images

  • paper_url: http://arxiv.org/abs/2309.07394
  • repo_url: https://github.com/zhiyuns/UNITPathSSL
  • paper_authors: Zhiyun Song, Penghui Du, Junpeng Yan, Kailu Li, Jianzhong Shou, Maode Lai, Yubo Fan, Yan Xu
  • for: Self-supervised pretraining can improve model performance by extracting effective features from unlabeled data and has proven successful on histopathology images, yet few studies focus on nucleus-level information, which is essential for pathologic analysis. This paper proposes a nucleus-aware self-supervised pretraining framework for histopathology images that captures nuclear morphology and distribution through unpaired image-to-image translation between histopathology images and pseudo mask images.
  • methods: The generation process is modulated by both conditional and stochastic style representations, ensuring that the generated histopathology images are realistic and diverse, and an instance-segmentation-guided strategy is employed to capture instance-level information.
  • results: Experiments on 7 datasets show that the pretraining method outperforms supervised pretraining on Kather classification, multiple-instance learning, and 5 dense-prediction tasks, and yields superior results to other self-supervised approaches on 8 semi-supervised tasks.
    Abstract Self-supervised pretraining attempts to enhance model performance by obtaining effective features from unlabeled data, and has demonstrated its effectiveness in the field of histopathology images. Despite its success, few works concentrate on the extraction of nucleus-level information, which is essential for pathologic analysis. In this work, we propose a novel nucleus-aware self-supervised pretraining framework for histopathology images. The framework aims to capture the nuclear morphology and distribution information through unpaired image-to-image translation between histopathology images and pseudo mask images. The generation process is modulated by both conditional and stochastic style representations, ensuring the reality and diversity of the generated histopathology images for pretraining. Further, an instance segmentation guided strategy is employed to capture instance-level information. The experiments on 7 datasets show that the proposed pretraining method outperforms supervised ones on Kather classification, multiple instance learning, and 5 dense-prediction tasks with the transfer learning protocol, and yields superior results than other self-supervised approaches on 8 semi-supervised tasks. Our project is publicly available at https://github.com/zhiyuns/UNITPathSSL.
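One plausible reading of the mask-to-image branch, sketched under my own assumptions (not the UNITPathSSL code): a generator conditioned on a pseudo nucleus mask has its feature statistics modulated by a stochastic style code, so the same mask can produce diverse yet nucleus-aligned images for pretraining.

```python
# Mask-conditioned generator with AdaIN-style stochastic style modulation (illustrative).
import torch
import torch.nn as nn

def adain(content, style_mean, style_std, eps=1e-5):
    # re-normalize content features, then apply the style statistics
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True) + eps
    return style_std * (content - mean) / std + style_mean

class MaskToImageGenerator(nn.Module):
    def __init__(self, style_dim=64, width=64):
        super().__init__()
        self.enc = nn.Conv2d(1, width, 3, padding=1)        # pseudo nucleus mask in
        self.to_style = nn.Linear(style_dim, 2 * width)     # per-channel mean/std
        self.dec = nn.Conv2d(width, 3, 3, padding=1)        # histopathology-like image out

    def forward(self, mask, style_code):
        h = torch.relu(self.enc(mask))
        s = self.to_style(style_code).view(-1, 2, h.shape[1], 1, 1)
        h = adain(h, s[:, 0], s[:, 1])
        return torch.tanh(self.dec(h))

gen = MaskToImageGenerator()
mask = torch.rand(2, 1, 128, 128)        # pseudo nucleus mask
style = torch.randn(2, 64)               # stochastic style representation
fake = gen(mask, style)                  # diverse synthetic images for pretraining
```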

Judging a video by its bitstream cover

  • paper_url: http://arxiv.org/abs/2309.07361
  • repo_url: None
  • paper_authors: Yuxing Han, Yunan Ding, Jiangtao Wen, Chen Ye Gan
  • for: Develops a video classification method that operates directly on the post-compression bitstream, improving the efficiency of multimedia understanding and retrieval.
  • methods: Features are extracted solely from a video's post-compression bitstream, with no decompression required, which lowers computational and storage demands.
  • results: Validated on a custom-built set of more than 29,000 YouTube video clips totaling 6,000 hours, the method achieves precision, accuracy, and recall above 80%; it runs roughly 15,000 times faster than real time for 30 fps video and outperforms the traditional DTW algorithm by six orders of magnitude.
    Abstract Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval, especially in an age where an immense volume of video content is constantly being generated. Traditional methods require video decompression to extract pixel-level features like color, texture, and motion, thereby increasing computational and storage demands. Moreover, these methods often suffer from performance degradation on low-quality videos. We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for decompression. We validate our approach using a custom-built dataset comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning 11 distinct categories. Our preliminary evaluations indicate precision, accuracy, and recall rates well over 80%. The algorithm operates approximately 15,000 times faster than real time for 30 fps videos, outperforming the traditional Dynamic Time Warping (DTW) algorithm by six orders of magnitude.
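Since the paper releases no code, the following is purely an illustrative guess at what a bitstream-level classifier could look like: lightweight statistics such as per-frame encoded packet sizes are fed to a small 1D convolutional network, with no pixel decoding at any point.

```python
# Classify a video from compressed-domain statistics only (assumed feature choice).
import torch
import torch.nn as nn

class BitstreamClassifier(nn.Module):
    def __init__(self, n_classes=11, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, width, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(width, width, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(width, n_classes),
        )

    def forward(self, packet_sizes):          # (B, 1, T): bytes per encoded frame
        return self.net(packet_sizes)

sizes = torch.abs(torch.randn(4, 1, 900))    # ~30 s of 30 fps video, frame sizes only
logits = BitstreamClassifier()(sizes)        # 11 categories, as in the paper's dataset
```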