cs.CV - 2023-08-25

ROAM: Robust and Object-aware Motion Generation using Neural Pose Descriptors

  • paper_url: http://arxiv.org/abs/2308.12969
  • repo_url: None
  • paper_authors: Wanyue Zhang, Rishabh Dabral, Thomas Leimkühler, Vladislav Golyanik, Marc Habermann, Christian Theobalt
  • for: This work tackles the poor robustness and generalisation of existing automatic motion-synthesis methods to novel objects, so that 3D virtual characters adapt to and interact with unseen objects more naturally.
  • methods: A motion model is trained with as few as one reference object and augmented with an implicit, SE(3)-equivariant descriptor field learned on object-only datasets; given an unseen object, the object-aware pose closest to the reference pose in this feature space is optimised for, and the l-NSM motion model blends locomotion and object interaction.
  • results: Comprehensive comparisons with state-of-the-art methods and a user study show substantial improvements in the quality and robustness of 3D virtual character motion and interaction, including high-quality motion generation for unseen objects.
    Abstract Existing automatic approaches for 3D virtual character motion synthesis supporting scene interactions do not generalise well to new objects outside training distributions, even when trained on extensive motion capture datasets with diverse objects and annotated interactions. This paper addresses this limitation and shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object. We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object. Given an unseen object and a reference pose-object pair, we optimise for the object-aware pose that is closest in the feature space to the reference pose. Finally, we use l-NSM, i.e., our motion generation model that is trained to seamlessly transition from locomotion to object interaction with the proposed bidirectional pose blending scheme. Through comprehensive numerical comparisons to state-of-the-art methods and in a user study, we demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects. Our project page is available at https://vcai.mpi-inf.mpg.de/projects/ROAM/.
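To make the pose-optimisation step concrete, below is a minimal, hypothetical PyTorch sketch: starting from an initial pose, gradient descent pulls the object-aware descriptor features of the candidate pose toward those of the reference pose-object pair. `descriptor_field`, `ref_feat`, and `init_pose` are placeholder names, and the actual ROAM objective is more involved than this single feature-distance term.

```python
import torch

def optimise_object_aware_pose(descriptor_field, ref_feat, init_pose,
                               steps=200, lr=1e-2):
    # descriptor_field: stand-in for the pretrained SE(3)-equivariant feature
    # network evaluated for a candidate pose around the unseen object.
    # ref_feat: features of the reference pose-object pair.
    pose = init_pose.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Move the candidate pose toward the reference in descriptor space.
        loss = (descriptor_field(pose) - ref_feat).pow(2).sum()
        loss.backward()
        opt.step()
    return pose.detach()
```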

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

  • paper_url: http://arxiv.org/abs/2308.12968
  • repo_url: https://github.com/yuxinn-j/scenimefy
  • paper_authors: Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy
  • for: Automatic, high-quality rendering of anime scenes from complex real-world images, which requires bridging the domain gap while preserving semantics and fine details.
  • methods: A semi-supervised image-to-image translation framework that learns from structure-consistent pseudo paired data derived from a semantic-constrained StyleGAN, with segmentation-guided data selection and a patch-wise contrastive style loss to improve stylization and fine details.
  • results: The method outperforms state-of-the-art baselines in both perceptual quality and quantitative performance.
    Abstract Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts are still incompetent in achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the pure unsupervised setting. The pseudo data are derived uniquely from a semantic-constrained StyleGAN leveraging rich model priors like CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. Besides, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance.
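The patch-wise contrastive style loss can be pictured as an InfoNCE objective over corresponding patches; the sketch below is an assumed form, not the official Scenimefy implementation, and `query_patches`/`key_patches` are hypothetical inputs.

```python
import torch
import torch.nn.functional as F

def patchwise_contrastive_style_loss(query_patches, key_patches, temperature=0.07):
    # query_patches, key_patches: (num_patches, dim) embeddings of spatially
    # corresponding patches from the translated image and a style reference.
    q = F.normalize(query_patches, dim=-1)
    k = F.normalize(key_patches, dim=-1)
    logits = q @ k.t() / temperature                 # (P, P) similarity matrix
    # The patch at the same index is the positive; all other patches are negatives.
    targets = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, targets)
```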

POCO: 3D Pose and Shape Estimation with Confidence

  • paper_url: http://arxiv.org/abs/2308.12965
  • repo_url: None
  • paper_authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas
  • for: Improving the accuracy of 3D human pose and shape (HPS) estimation from images while providing uncertainty estimates for downstream tasks.
  • methods: A novel framework, POCO, that uses a Dual Conditioning Strategy (DCS) to estimate both the 3D body pose and a per-sample variance in a single feed-forward pass.
  • results: Training the network to reason about uncertainty helps it estimate 3D pose more accurately; applying POCO to three state-of-the-art HPS regressors improves their accuracy, and the uncertainty estimates are useful for downstream tasks such as bootstrapped HPS training and video pose estimation.
    Abstract The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames. Code and models will be available for research at https://poco.is.tue.mpg.de.
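Predicting a per-sample variance alongside the pose is commonly trained with a Gaussian negative log-likelihood; the sketch below illustrates that generic recipe and is not the paper's exact Dual Conditioning Strategy.

```python
import torch

def heteroscedastic_pose_loss(pred_pose, pred_log_var, gt_pose):
    # pred_pose: (B, D) regressed pose, pred_log_var: (B,) per-sample log-variance.
    # Confident samples (small variance) are penalised more for pose error,
    # while uncertain samples instead pay a log-variance penalty.
    sq_err = (pred_pose - gt_pose).pow(2).sum(dim=-1)
    nll = 0.5 * (torch.exp(-pred_log_var) * sq_err + pred_log_var)
    return nll.mean()
```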

Dense Text-to-Image Generation with Attention Modulation

  • paper_url: http://arxiv.org/abs/2308.12964
  • repo_url: https://github.com/naver-ai/densediffusion
  • paper_authors: Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu
  • for: Synthesising realistic images from dense captions, where each text prompt provides a detailed description for a specific image region.
  • methods: A training-free method that adapts a pre-trained text-to-image diffusion model to dense captions by modulating its intermediate attention maps, providing control over the scene layout.
  • results: Without additional fine-tuning or datasets, image generation from dense captions improves on both automatic and human evaluation scores, with visual quality comparable to models trained specifically with layout conditions.
    Abstract Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions.
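A rough picture of layout-guided attention modulation: boost the cross-attention logits of each text token inside its target region and suppress them elsewhere before the softmax. This is a simplified sketch under assumed tensor shapes, not the DenseDiffusion code.

```python
import torch

def modulate_cross_attention(attn_logits, region_masks, strength=2.0):
    # attn_logits:  (batch, num_pixels, num_tokens) pre-softmax cross-attention scores.
    # region_masks: (batch, num_pixels, num_tokens), 1 where the layout says a
    #               token's object should appear, 0 elsewhere.
    modulated = attn_logits + strength * (2.0 * region_masks - 1.0)
    # Softmax over tokens now concentrates attention inside the requested regions.
    return torch.softmax(modulated, dim=-1)
```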

MapPrior: Bird’s-Eye View Map Layout Estimation with Generative Models

  • paper_url: http://arxiv.org/abs/2308.12963
  • repo_url: None
  • paper_authors: Xiyue Zhu, Vlas Zyrianov, Zhijian Liu, Shenlong Wang
  • for: Producing more accurate, realistic, and uncertainty-aware semantic map layouts for bird's-eye view (BEV) perception.
  • methods: A traditional discriminative BEV perception model is combined with a learned generative model for semantic map layouts.
  • results: On the nuScenes benchmark, MapPrior outperforms the strongest competing method, with significantly improved MMD and ECE scores for camera- and LiDAR-based BEV perception.
    Abstract Despite tremendous advancements in bird's-eye view (BEV) perception, existing models fall short in generating realistic and coherent semantic map layouts, and they fail to account for uncertainties arising from partial sensor information (such as occlusion or limited coverage). In this work, we introduce MapPrior, a novel BEV perception framework that combines a traditional discriminative BEV perception model with a learned generative model for semantic map layouts. Our MapPrior delivers predictions with better accuracy, realism, and uncertainty awareness. We evaluate our model on the large-scale nuScenes benchmark. At the time of submission, MapPrior outperforms the strongest competing method, with significantly improved MMD and ECE scores in camera- and LiDAR-based BEV perception.

Motion-Guided Masking for Spatiotemporal Representation Learning

  • paper_url: http://arxiv.org/abs/2308.12962
  • repo_url: None
  • paper_authors: David Fan, Jue Wang, Shuai Liao, Yi Zhu, Vimal Bhat, Hector Santos-Villalobos, Rohith MV, Xinyu Li
  • for: Improving spatiotemporal representation learning for video by replacing the random masking inherited from image masked autoencoders with a masking strategy that exploits video saliency.
  • methods: A motion-guided masking algorithm (MGM) that uses motion vectors, obtained directly from the compressed video format, to guide the position of each mask over time.
  • results: On Kinetics-400 and Something-Something V2, MGM improves on prior state-of-the-art video MAE methods by up to +1.3%, matches previous video MAE performance with up to 66% fewer training epochs, and generalises better to downstream transfer learning and domain adaptation on UCF101, HMDB51, and Diving48, with gains of up to +4.9% over baselines.
    Abstract Several recent works have directly extended the image masked autoencoder (MAE) with random masking into video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +$1.3\%$ improvement compared to previous state-of-the-art methods. Additionally, our MGM achieves equivalent performance to previous video MAE using up to $66\%$ fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +$4.9\%$ improvement compared to baseline methods.
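One plausible reading of motion-guided masking, sketched below: rank spatio-temporal patches by the magnitude of their motion vectors (available in the compressed stream) and mask the most dynamic ones. The real MGM tracks masks along motion trajectories over time, so treat this only as an illustration.

```python
import torch

def motion_guided_mask(motion_vectors, mask_ratio=0.75):
    # motion_vectors: (T, H, W, 2) per-patch motion vectors from the compressed video.
    # Returns a boolean mask of shape (T, H, W), True = patch is masked.
    T, H, W, _ = motion_vectors.shape
    saliency = motion_vectors.norm(dim=-1).reshape(T, -1)   # motion magnitude per patch
    num_masked = int(mask_ratio * H * W)
    # Mask the patches with the largest motion so reconstruction focuses on
    # the most dynamic regions of the video.
    idx = saliency.topk(num_masked, dim=-1).indices
    mask = torch.zeros(T, H * W, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask.reshape(T, H, W)
```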

Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks

  • paper_url: http://arxiv.org/abs/2308.12961
  • repo_url: https://github.com/yangyangyang127/tfs3d
  • paper_authors: Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Jiaming Liu, Hao Dong, Peng Gao
  • for: Efficient few-shot 3D semantic segmentation that reduces the reliance on large-scale datasets and costly pre-training.
  • methods: A training-free few-shot 3D segmentation network (TFS3D) with no learnable parameters that extracts dense representations via trigonometric positional encodings, plus a training-based variant (TFS3D-T) that only trains a lightweight query-support transferring attention (QUEST) module to enhance the interaction between few-shot query and support data.
  • results: TFS3D-T improves previous state-of-the-art methods by +6.93% and +17.96% mIoU on S3DIS and ScanNet respectively, while reducing training time by 90%, indicating superior effectiveness and efficiency.
    Abstract To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot semantic segmentation methods first pre-train the models on `seen' classes, and then evaluate their generalization performance on `unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead, but also incurs a significant domain gap on `unseen' classes. To tackle these issues, we propose an efficient Training-free Few-shot 3D Segmentation netwrok, TFS3D, and a further training-based variant, TFS3D-T. Without any learnable parameters, TFS3D extracts dense representations by trigonometric positional encodings, and achieves comparable performance to previous training-based methods. Due to the elimination of pre-training, TFS3D can alleviate the domain gap issue and save a substantial amount of time. Building upon TFS3D, TFS3D-T only requires to train a lightweight query-support transferring attention (QUEST), which enhances the interaction between the few-shot query and support data. Experiments demonstrate TFS3D-T improves previous state-of-the-art methods by +6.93% and +17.96% mIoU respectively on S3DIS and ScanNet, while reducing the training time by -90%, indicating superior effectiveness and efficiency.
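The training-free dense representation can be illustrated with a standard sin/cos encoding of point coordinates; the sketch below is one plausible form of a trigonometric positional encoding, not necessarily the exact one used by TFS3D.

```python
import torch

def trig_positional_encoding(xyz, num_freqs=6):
    # xyz: (N, 3) point coordinates, assumed roughly normalised to [-1, 1].
    # Returns (N, 3 * 2 * num_freqs) features with no learnable parameters.
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)   # 1, 2, 4, ...
    scaled = xyz.unsqueeze(-1) * freqs * torch.pi             # (N, 3, num_freqs)
    feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1)   # (N, 3, 2 * num_freqs)
    return feats.flatten(1)
```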

Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

  • paper_url: http://arxiv.org/abs/2308.12960
  • repo_url: https://github.com/sheng-eatamath/S3A
  • paper_authors: Sheng Zhang, Muzammal Naseer, Guangyi Chen, Zhiqiang Shen, Salman Khan, Kun Zhang, Fahad Khan
  • for: Realistic zero-shot classification in the open-world setting, which assumes no annotations but a broad vocabulary.
  • methods: The Self Structural Semantic Alignment (S^3A) framework extracts structural semantics from unlabeled data while self-training. Its Cluster-Vote-Prompt-Realign (CVPR) algorithm iteratively clusters images, votes within each cluster to identify initial class candidates from the vocabulary, generates discriminative prompts with large language models to discern confusing candidates, and realigns images with the vocabulary; the CLIP image encoder is then self-trained with both individual and structural semantic alignment via a teacher-student strategy.
  • results: Extensive experiments on generic and fine-grained benchmarks show that S^3A substantially outperforms existing VLM-based approaches, with a more than 15% accuracy improvement over CLIP on average.
    Abstract Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification. Despite the success, most traditional VLMs-based methods are restricted by the assumption of partial source supervision or ideal vocabularies, which rarely satisfy the open-world scenario. In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary. To address this challenge, we propose the Self Structural Semantic Alignment (S^3A) framework, which extracts the structural semantic information from unlabeled data while simultaneously self-learning. Our S^3A framework adopts a unique Cluster-Vote-Prompt-Realign (CVPR) algorithm, which iteratively groups unlabeled data to derive structural semantics for pseudo-supervision. Our CVPR process includes iterative clustering on images, voting within each cluster to identify initial class candidates from the vocabulary, generating discriminative prompts with large language models to discern confusing candidates, and realigning images and the vocabulary as structural semantic alignment. Finally, we propose to self-learn the CLIP image encoder with both individual and structural semantic alignment through a teacher-student learning strategy. Our comprehensive experiments across various generic and fine-grained benchmarks demonstrate that the S^3A method offers substantial improvements over existing VLMs-based approaches, achieving a more than 15% accuracy improvement over CLIP on average. Our codes, models, and prompts are publicly released at https://github.com/sheng-eatamath/S3A.
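The cluster-and-vote portion of the CVPR algorithm might look like the sketch below: cluster unlabeled image features, then let each cluster vote for its top vocabulary names via CLIP-style cosine similarity. The prompting and realignment steps are omitted, and all names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_vote(image_feats, text_feats, vocab, num_clusters=50, top_k=3):
    # image_feats: (N, D) CLIP image features of unlabeled data.
    # text_feats:  (|vocab|, D) CLIP text features of the vocabulary names.
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(image_feats)
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = image_feats @ text_feats.T                         # (N, |vocab|)
    candidates = {}
    for c in range(num_clusters):
        votes = sims[labels == c].mean(axis=0)                # average similarity per name
        candidates[c] = [vocab[i] for i in votes.argsort()[-top_k:][::-1]]
    return candidates                                         # cluster -> candidate class names
```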

Label Budget Allocation in Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2308.12949
  • repo_url: None
  • paper_authors: Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian
  • for: Improving multi-task learning performance when the budget available for labeling data is limited.
  • methods: The label budget allocation problem is proposed and formally defined, and a Task-Adaptive Budget Allocation algorithm is introduced that estimates and maximizes the new information obtained from the allocated budget as a proxy for multi-task performance.
  • results: Experiments on PASCAL VOC and Taskonomy show the approach outperforms other widely used heuristic labeling strategies.
    Abstract The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to propose and formally define the label budget allocation problem in multi-task learning and to empirically show that different budget allocation strategies make a big difference to its performance. We propose a Task-Adaptive Budget Allocation algorithm to robustly generate the optimal budget allocation adaptive to different multi-task learning settings. Specifically, we estimate and then maximize the extent of new information obtained from the allocated budget as a proxy for multi-task learning performance. Experiments on PASCAL VOC and Taskonomy demonstrate the efficacy of our approach over other widely used heuristic labeling strategies.
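A simple way to picture task-adaptive budget allocation is a greedy loop that repeatedly spends one labeling unit on the task with the best estimated information gain per unit cost. This is an illustrative strategy only; `info_gain` and `unit_cost` are hypothetical inputs, and the paper's algorithm differs in its details.

```python
def allocate_label_budget(tasks, total_budget, unit_cost, info_gain):
    # unit_cost[t]: positive cost of labeling one more unit for task t.
    # info_gain(t, n): estimated new information from the (n + 1)-th unit of task t.
    allocation = {t: 0 for t in tasks}
    remaining = total_budget
    while remaining >= min(unit_cost[t] for t in tasks):
        affordable = [t for t in tasks if unit_cost[t] <= remaining]
        # Greedily pick the task with the best information gain per unit cost.
        best = max(affordable, key=lambda t: info_gain(t, allocation[t]) / unit_cost[t])
        allocation[best] += 1
        remaining -= unit_cost[best]
    return allocation
```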

Perspective-aware Convolution for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.12938
  • repo_url: https://github.com/KenYu910645/perspective-aware-convolution
  • paper_authors: Jia-Quan Yu, Soo-Chang Pei
  • for: Improving the accuracy of monocular 3D object detection for autonomous driving.
  • methods: A novel perspective-aware convolutional layer that captures long-range dependencies by extracting features along the depth axis of every image pixel, incorporating perspective information into the network architecture.
  • results: Integrated into a 3D object detector and evaluated on the KITTI3D dataset, the method achieves 23.9% average precision on the easy benchmark, underscoring the importance of modeling scene structure for accurate depth inference.
    Abstract Monocular 3D object detection is a crucial and challenging task for autonomous driving vehicle, while it uses only a single camera image to infer 3D objects in the scene. To address the difficulty of predicting depth using only pictorial clue, we propose a novel perspective-aware convolutional layer that captures long-range dependencies in images. By enforcing convolutional kernels to extract features along the depth axis of every image pixel, we incorporates perspective information into network architecture. We integrate our perspective-aware convolutional layer into a 3D object detector and demonstrate improved performance on the KITTI3D dataset, achieving a 23.9\% average precision in the easy benchmark. These results underscore the importance of modeling scene clues for accurate depth inference and highlight the benefits of incorporating scene structure in network design. Our perspective-aware convolutional layer has the potential to enhance object detection accuracy by providing more precise and context-aware feature extraction.

Panoptic-Depth Color Map for Combination of Depth and Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.12937
  • repo_url: None
  • paper_authors: Jia-Quan Yu, Soo-Chang Pei
  • for: Combining image segmentation and depth estimation into a single network to improve scene understanding and safety in autonomous driving.
  • methods: Panoptic-DepthLab adds a depth estimation branch to a segmentation network so that the depth of each instance segment is predicted.
  • results: On the Cityscapes dataset, the method achieves high-quality segmentation results with depth, visualized with a color map, demonstrating that combining different tasks and networks yields a more comprehensive image recognition result.
    Abstract Image segmentation and depth estimation are crucial tasks in computer vision, especially in autonomous driving scenarios. Although these tasks are typically addressed separately, we propose an innovative approach to combine them in our novel deep learning network, Panoptic-DepthLab. By incorporating an additional depth estimation branch into the segmentation network, it can predict the depth of each instance segment. Evaluating on Cityscape dataset, we demonstrate the effectiveness of our method in achieving high-quality segmentation results with depth and visualize it with a color map. Our proposed method demonstrates a new possibility of combining different tasks and networks to generate a more comprehensive image recognition result to facilitate the safety of autonomous driving vehicles.

Towards Realistic Unsupervised Fine-tuning with CLIP

  • paper_url: http://arxiv.org/abs/2308.12919
  • repo_url: None
  • paper_authors: Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, Tieniu Tan
  • for: Unsupervised fine-tuning of the CLIP vision-language model in a realistic setting where the unlabeled data may contain out-of-distribution samples from unknown classes.
  • methods: A simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO), which uses sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident ones, optimizing both the textual prompts and channel-wise affine transformations in CLIP's visual branch.
  • results: Extensive experiments across 15 domains and 4 types of prior knowledge show that UEO surpasses baseline methods in both generalization and out-of-distribution detection.
    Abstract The emergence of vision-language models (VLMs), such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. In this paper, we delve into a realistic unsupervised fine-tuning scenario by assuming that the unlabeled data might contain out-of-distribution samples from unknown classes. Furthermore, we emphasize the importance of simultaneously enhancing out-of-distribution detection capabilities alongside the recognition of instances associated with predefined class labels. To tackle this problem, we present a simple, efficient, and effective fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompts, UEO also incorporates optimization of channel-wise affine transformations within the visual branch of CLIP. Through extensive experiments conducted across 15 domains and 4 different types of prior knowledge, we demonstrate that UEO surpasses baseline methods in terms of both generalization and out-of-distribution detection.
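The entropy objective can be sketched as confidence-weighted minimization of per-sample (conditional) entropy plus maximization of a marginal entropy computed with emphasis on the less-confident samples. The weighting scheme below is an assumption; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def ueo_style_objective(logits):
    # logits: (B, C) CLIP similarity logits for a batch of unlabeled images.
    probs = F.softmax(logits, dim=-1)
    cond_ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)    # (B,)
    conf = probs.max(dim=-1).values                                  # sample-level confidence
    weighted_cond = (conf * cond_ent).sum() / conf.sum().clamp_min(1e-8)

    # Marginal distribution dominated by the less-confident samples.
    w = (1.0 - conf) / (1.0 - conf).sum().clamp_min(1e-8)
    marginal = (w.unsqueeze(-1) * probs).sum(dim=0)
    marg_ent = -(marginal * marginal.clamp_min(1e-8).log()).sum()

    # Minimise conditional entropy of confident samples, maximise marginal entropy.
    return weighted_cond - marg_ent
```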

Robot Pose Nowcasting: Forecast the Future to Improve the Present

  • paper_url: http://arxiv.org/abs/2308.12914
  • repo_url: None
  • paper_authors: Alessandro Simoni, Francesco Marchetti, Guido Borghi, Federico Becattini, Lorenzo Seidenari, Roberto Vezzani, Alberto Del Bimbo
  • for: Accurate vision-based 3D pose estimation of robots, a prerequisite for safe and effective human-machine collaboration in Industry 4.0 scenarios.
  • methods: A vision-based system leveraging depth data that improves its current pose estimates by jointly learning to forecast future poses, a capability the authors call pose nowcasting.
  • results: Evaluation on two different datasets shows state-of-the-art, real-time performance, confirming the method's validity in both robotic and human scenarios.
    Abstract In recent years, the effective and safe collaboration between humans and machines has gained significant importance, particularly in the Industry 4.0 scenario. A critical prerequisite for realizing this collaborative paradigm is precisely understanding the robot's 3D pose within its environment. Therefore, in this paper, we introduce a novel vision-based system leveraging depth data to accurately establish the 3D locations of robotic joints. Specifically, we prove the ability of the proposed system to enhance its current pose estimation accuracy by jointly learning to forecast future poses. Indeed, we introduce the concept of Pose Nowcasting, denoting the capability of a system to exploit the learned knowledge of the future to improve the estimation of the present. The experimental evaluation is conducted on two different datasets, providing state-of-the-art and real-time performance and confirming the validity of the proposed method on both the robotic and human scenarios.

SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

  • paper_url: http://arxiv.org/abs/2308.12910
  • repo_url: None
  • paper_authors: Ziyan Yang, Kushal Kafle, Zhe Lin, Scott Cohen, Zhihong Ding, Vicente Ordonez
  • for: Predicting, for a given subject in a scene, all of its relations to other objects along with their locations.
  • methods: Subject-Conditional Relation Detection (SCoRD): an auto-regressive model that, conditioned on a subject, predicts its relations, objects, and object locations as a sequence of tokens.
  • results: On the OIv6-SCoRD benchmark built from the Open Images dataset, the method enumerates relation-object pairs far more exhaustively than a recent scene-graph detector (83.8% vs. 49.75% recall@3), and training on relation-object pairs mined automatically from textual captions further improves generalization of both relation-object and object-box predictions.
    Abstract We propose Subject-Conditional Relation Detection SCoRD, where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that given a subject, it predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. Particularly, we obtain a recall@3 of 83.8% for our relation-object predictions compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging during training relation-object pairs obtained automatically from textual captions and for which no object-box annotations are available. Particularly, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we are able to obtain a recall@3 of 42.59% for relation-object pairs and 32.27% for their box locations.

Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings

  • paper_url: http://arxiv.org/abs/2308.12894
  • repo_url: None
  • paper_authors: Yuhe Liu, Chuanjian Liu, Kai Han, Quan Tang, Zengchang Qin
  • for: Boosting semantic segmentation by making deeper use of category semantics through explicit and meaningful class embeddings.
  • methods: ECENet obtains and enhances class embeddings explicitly while they interact with multi-stage image features, revisits the decoding process with an inverted information flow between segmentation masks and class embeddings, and adds a Feature Reconstruction module that combines intrinsic and diverse branches to balance diversity and redundancy in backbone features.
  • results: ECENet outperforms its counterparts on the ADE20K dataset at much lower computational cost and achieves new state-of-the-art results on the PASCAL-Context dataset.
    Abstract Semantic segmentation is a computer vision task that associates a label with each pixel in an image. Modern approaches tend to introduce class embeddings into semantic segmentation for deeply utilizing category semantics, and regard supervised class masks as final predictions. In this paper, we explore the mechanism of class embeddings and have an insight that more explicit and meaningful class embeddings can be generated based on class masks purposely. Following this observation, we propose ECENet, a new segmentation paradigm, in which class embeddings are obtained and enhanced explicitly during interacting with multi-stage image features. Based on this, we revisit the traditional decoding process and explore inverted information flow between segmentation masks and class embeddings. Furthermore, to ensure the discriminability and informativity of features from backbone, we propose a Feature Reconstruction module, which combines intrinsic and diverse branches together to ensure the concurrence of diversity and redundancy in features. Experiments show that our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on PASCAL-Context dataset. The code will be released at https://gitee.com/mindspore/models and https://github.com/Carol-lyh/ECENet.

Multi-stage feature decorrelation constraints for improving CNN classification performance

  • paper_url: http://arxiv.org/abs/2308.12880
  • repo_url: None
  • paper_authors: Qiuyu Zhu, Xuewen Zu, Chengfei Liu
  • for: Improving the classification accuracy of convolutional neural networks (CNNs).
  • methods: A multi-stage feature decorrelation loss (MFD Loss) that constrains the correlation of features at the front stages of the network, refining effective features and eliminating information redundancy; it acts on multiple front layers and is trained jointly with the classification loss.
  • results: Experiments on several commonly used datasets and typical CNNs show that Softmax Loss + MFD Loss significantly outperforms Softmax Loss alone, and combination experiments with other typical loss functions confirm its good universality.
    Abstract For the convolutional neural network (CNN) used for pattern classification, the training loss function is usually applied to the final output of the network, except for some regularization constraints on the network parameters. However, with the increasing of the number of network layers, the influence of the loss function on the network front layers gradually decreases, and the network parameters tend to fall into local optimization. At the same time, it is found that the trained network has significant information redundancy at all stages of features, which reduces the effectiveness of feature mapping at all stages and is not conducive to the change of the subsequent parameters of the network in the direction of optimality. Therefore, it is possible to obtain a more optimized solution of the network and further improve the classification accuracy of the network by designing a loss function for restraining the front stage features and eliminating the information redundancy of the front stage features .For CNN, this article proposes a multi-stage feature decorrelation loss (MFD Loss), which refines effective features and eliminates information redundancy by constraining the correlation of features at all stages. Considering that there are many layers in CNN, through experimental comparison and analysis, MFD Loss acts on multiple front layers of CNN, constrains the output features of each layer and each channel, and performs supervision training jointly with classification loss function during network training. Compared with the single Softmax Loss supervised learning, the experiments on several commonly used datasets on several typical CNNs prove that the classification performance of Softmax Loss+MFD Loss is significantly better. Meanwhile, the comparison experiments before and after the combination of MFD Loss and some other typical loss functions verify its good universality.
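A single-stage decorrelation penalty can be written as the squared off-diagonal entries of the channel correlation matrix; summing such terms over several front stages and adding them to the Softmax loss gives the flavour of MFD Loss. The exact form below is an assumption, not the paper's definition.

```python
import torch

def feature_decorrelation_loss(features, eps=1e-5):
    # features: (batch, channels) pooled features from one front stage of the CNN.
    f = features - features.mean(dim=0, keepdim=True)
    f = f / (f.std(dim=0, keepdim=True) + eps)
    corr = (f.t() @ f) / f.shape[0]                 # (C, C) channel correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    # Penalise redundancy: push inter-channel correlations toward zero.
    return off_diag.pow(2).mean()
```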