cs.CV - 2023-09-10

Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color

  • paper_url: http://arxiv.org/abs/2309.05148
  • repo_url: None
  • paper_authors: William Thong, Przemyslaw Joniak, Alice Xiang
  • for: This paper aims to measure apparent skin color in computer vision beyond a single, unidimensional skin tone scale.
  • methods: The paper builds on the Fitzpatrick skin type scale commonly used in fairness assessments and adds a hue angle (ranging from red to yellow) as a second colour dimension for measuring skin color bias in computer vision systems (a hedged sketch of the colour measurement follows the abstract below).
  • results: A single skin tone scale (the Fitzpatrick scale) does not fully capture skin color bias in computer vision datasets and models; a multidimensional measure combining skin tone and hue angle reveals additional biases and is recommended for fairness assessments.
    Abstract This paper strives to measure apparent skin color in computer vision, beyond a unidimensional scale on skin tone. In their seminal paper Gender Shades, Buolamwini and Gebru have shown how gender classification systems can be biased against women with darker skin tones. Subsequently, fairness researchers and practitioners have adopted the Fitzpatrick skin type classification as a common measure to assess skin color bias in computer vision systems. While effective, the Fitzpatrick scale only focuses on the skin tone ranging from light to dark. Towards a more comprehensive measure of skin color, we introduce the hue angle ranging from red to yellow. When applied to images, the hue dimension reveals additional biases related to skin color in both computer vision datasets and models. We then recommend multidimensional skin color scales, relying on both skin tone and hue, for fairness assessments.
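As a concrete illustration of the two-dimensional measure, the sketch below derives a perceptual lightness (tone) value and a hue angle from a patch of skin pixels via CIELAB. It is a minimal sketch of the general idea, not the authors' exact measurement pipeline; the median aggregation is our own assumption.

```python
# Minimal sketch: estimate a two-dimensional skin colour descriptor --
# perceptual lightness L* (light-to-dark "tone") and hue angle h* (red-to-yellow)
# -- from a set of skin pixels converted to CIELAB.
import numpy as np
from skimage.color import rgb2lab

def skin_color_descriptor(skin_rgb: np.ndarray) -> tuple:
    """skin_rgb: (N, 3) float array of skin pixels with values in [0, 1]."""
    lab = rgb2lab(skin_rgb.reshape(-1, 1, 3)).reshape(-1, 3)
    L, a, b = lab[:, 0], lab[:, 1], lab[:, 2]
    lightness = float(np.median(L))                                        # skin tone axis
    hue_angle = float(np.degrees(np.arctan2(np.median(b), np.median(a))))  # red-yellow axis
    return lightness, hue_angle
```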

A Skeleton-based Approach For Rock Crack Detection Towards A Climbing Robot Application

  • paper_url: http://arxiv.org/abs/2309.05139
  • repo_url: https://github.com/josselinsomervilleroberts/reachbot-predictor
  • paper_authors: Josselin Somerville Roberts, Paul-Emile Giacomelli, Yoni Gozlan, Julia Di
  • for: This paper supports the mobility of multi-limbed climbing robots (such as ReachBot) in scientifically interesting but dangerous cave environments.
  • methods: The paper introduces a new segmentation loss, the SKeleton Intersection Loss (SKIL), for detecting rock cracks and edges on rock faces, together with a new group of metrics, LineAcc, for evaluating thin object segmentation quality (see the loss sketch below).
  • results: According to the paper, models trained with SKIL outperform previous methods on similar thin object segmentation tasks such as blood vessel segmentation, suggesting the approach can support grasp location identification on a climbing robot.
    Abstract Conventional wheeled robots are unable to traverse scientifically interesting, but dangerous, cave environments. Multi-limbed climbing robot designs, such as ReachBot, are able to grasp irregular surface features and execute climbing motions to overcome obstacles, given suitable grasp locations. To support grasp site identification, we present a method for detecting rock cracks and edges, the SKeleton Intersection Loss (SKIL). SKIL is a loss designed for thin object segmentation that leverages the skeleton of the label. A dataset of rock face images was collected, manually annotated, and augmented with generated data. A new group of metrics, LineAcc, has been proposed for thin object segmentation such that the impact of the object width on the score is minimized. In addition, the metric is less sensitive to translation which can often lead to a score of zero when computing classical metrics such as Dice on thin objects. Our fine-tuned models outperform previous methods on similar thin object segmentation tasks such as blood vessel segmentation and show promise for integration onto a robotic system.
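The core idea of a skeleton-aware loss for thin structures can be sketched as follows. This is an illustrative simplification built on the skeleton of the ground-truth label, not the paper's exact SKIL formulation.

```python
# Hedged sketch of a skeleton-aware loss for thin object segmentation: the
# prediction is rewarded for covering the 1-pixel-wide skeleton of the label,
# which makes the score largely insensitive to the object's width.
import numpy as np
from skimage.morphology import skeletonize

def skeleton_coverage_loss(pred: np.ndarray, label: np.ndarray, eps: float = 1e-6) -> float:
    """pred: (H, W) soft mask in [0, 1]; label: (H, W) binary ground truth."""
    skel = skeletonize(label.astype(bool)).astype(np.float32)  # centreline of the thin label
    covered = float((pred * skel).sum())                       # predicted mass on the skeleton
    return 1.0 - covered / (float(skel.sum()) + eps)           # 0 when the skeleton is fully covered
```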

DAD++: Improved Data-free Test Time Adversarial Defense

  • paper_url: http://arxiv.org/abs/2309.05132
  • repo_url: https://github.com/vcl-iisc/data-free-defense-at-test-time
  • paper_authors: Gaurav Kumar Nayak, Inder Khatri, Shubham Randive, Ruchit Rawal, Anirban Chakraborty
  • for: This work aims to improve the adversarial robustness of deployed deep neural networks against attackers who may use adversarial attacks to compromise them in real-world applications.
  • methods: The work proposes a test-time, data-free adversarial defense consisting of detection and correction frameworks; to further improve the correction framework when the detector is under-confident, a soft-detection scheme (dubbed "DAD++") is introduced (see the sketch below).
  • results: Extensive experiments and ablations on multiple datasets and network architectures demonstrate the effectiveness of the proposal; the method also imparts adversarial defense without training data, for example in data-free knowledge distillation and source-free unsupervised domain adaptation. Across all experiments and applications, DAD++ defends well against a variety of attacks with almost no drop in clean accuracy.
    Abstract With the increasing deployment of deep neural networks in safety-critical applications such as self-driving cars, medical imaging, anomaly detection, etc., adversarial robustness has become a crucial concern in the reliability of these networks in real-world scenarios. A plethora of works based on adversarial training and regularization-based techniques have been proposed to make these deep networks robust against adversarial attacks. However, these methods require either retraining models or training them from scratch, making them infeasible to defend pre-trained models when access to training data is restricted. To address this problem, we propose a test time Data-free Adversarial Defense (DAD) containing detection and correction frameworks. Moreover, to further improve the efficacy of the correction framework in cases when the detector is under-confident, we propose a soft-detection scheme (dubbed as "DAD++"). We conduct a wide range of experiments and ablations on several datasets and network architectures to show the efficacy of our proposed approach. Furthermore, we demonstrate the applicability of our approach in imparting adversarial defense at test time under data-free (or data-efficient) applications/setups, such as Data-free Knowledge Distillation and Source-free Unsupervised Domain Adaptation, as well as Semi-supervised classification frameworks. We observe that in all the experiments and applications, our DAD++ gives an impressive performance against various adversarial attacks with a minimal drop in clean accuracy. The source code is available at: https://github.com/vcl-iisc/Improved-Data-free-Test-Time-Adversarial-Defense
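To make the soft-detection idea concrete, the sketch below blends the prediction on the raw input with the prediction on a corrected input, weighted by the detector's confidence that the input is adversarial. It is only an illustration of the concept; the actual DAD++ detection and correction modules are more involved.

```python
# Hedged sketch of a soft-detection defense: rather than a hard adversarial/clean
# decision, weight the corrected prediction by the detector's confidence score.
# `classifier`, `detector`, and `corrector` are placeholder callables.
def soft_detect_and_correct(x, classifier, detector, corrector):
    """x: input batch; detector(x) returns p(adversarial) in [0, 1] per sample."""
    p_adv = detector(x).view(-1, 1)             # soft detection score
    logits_plain = classifier(x)                # prediction on the raw input
    logits_fixed = classifier(corrector(x))     # prediction on the corrected input
    return (1 - p_adv) * logits_plain + p_adv * logits_fixed
```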

3D Implicit Transporter for Temporally Consistent Keypoint Discovery

  • paper_url: http://arxiv.org/abs/2309.05098
  • repo_url: https://github.com/zhongcl-thu/3d-implicit-transporter
  • paper_authors: Chengliang Zhong, Yuhang Zheng, Yupeng Zheng, Hao Zhao, Li Yi, Xiaodong Mu, Ling Wang, Pengfei Li, Guyue Zhou, Chao Yang, Xinliang Zhang, Jian Zhao
  • for: This work aims to improve keypoint discovery on 3D point clouds by integrating both spatial and temporal information.
  • methods: The work proposes the first 3D version of the Transporter, built on a hybrid 3D representation, cross attention, and implicit reconstruction (the 2D transport step it extends is sketched below).
  • results: Evaluations on 3D articulated objects and nonrigid animals (humans and rodents) show that the learned keypoints are spatio-temporally consistent. A closed-loop control strategy that uses the learned keypoints for 3D object manipulation is also proposed and shown to outperform traditional methods.
    Abstract Keypoint-based representation has proven advantageous in various visual and robotic tasks. However, the existing 2D and 3D methods for detecting keypoints mainly rely on geometric consistency to achieve spatial alignment, neglecting temporal consistency. To address this issue, the Transporter method was introduced for 2D data, which reconstructs the target frame from the source frame to incorporate both spatial and temporal information. However, the direct application of the Transporter to 3D point clouds is infeasible due to their structural differences from 2D images. Thus, we propose the first 3D version of the Transporter, which leverages hybrid 3D representation, cross attention, and implicit reconstruction. We apply this new learning system on 3D articulated objects and nonrigid animals (humans and rodents) and show that learned keypoints are spatio-temporally consistent. Additionally, we propose a closed-loop control strategy that utilizes the learned keypoints for 3D object manipulation and demonstrate its superior performance. Codes are available at https://github.com/zhongcl-thu/3D-Implicit-Transporter.
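For context, the 2D Transporter reconstruction step that this paper lifts to 3D can be sketched as below: feature-map regions around source-frame keypoints are suppressed and replaced by target-frame features, so reconstructing the target frame forces temporally consistent keypoints. Tensor names and shapes are illustrative.

```python
# Sketch of the 2D Transporter feature-transport step (the 3D version in this
# paper replaces the grids with a hybrid 3D representation and implicit decoding).
def transport(f_src, f_tgt, heat_src, heat_tgt):
    """f_*: (B, C, H, W) feature maps; heat_*: (B, 1, H, W) keypoint heatmaps in [0, 1]."""
    suppressed = (1 - heat_src) * (1 - heat_tgt) * f_src  # erase features at either frame's keypoints
    return suppressed + heat_tgt * f_tgt                  # paste target features at target keypoints
```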

MaskRenderer: 3D-Infused Multi-Mask Realistic Face Reenactment

  • paper_url: http://arxiv.org/abs/2309.05095
  • repo_url: None
  • paper_authors: Tina Behrouzi, Atefeh Shahroudnejad, Payam Mousavi
  • for: This work proposes a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that generates realistic, high-fidelity frames in real time.
  • methods: MaskRenderer addresses existing face reenactment problems with three components: (i) a 3DMM to model 3D face structure, which handles pose changes, occlusion, and mouth movements better than 2D representations; (ii) a triplet loss that embeds cross-reenactment during training to preserve identity (a generic sketch follows the abstract); and (iii) multi-scale occlusion to improve inpainting and the restoration of missing regions.
  • results: Comprehensive quantitative and qualitative experiments on the VoxCeleb1 test set show that MaskRenderer outperforms state-of-the-art models on unseen faces, especially when the Source and Driving identities are very different.
    Abstract We present a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that can generate realistic, high fidelity frames in real-time. Although recent face reenactment works have shown promising results, there are still significant challenges such as identity leakage and imitating mouth movements, especially for large pose changes and occluded faces. MaskRenderer tackles these problems by using (i) a 3DMM to model 3D face structure to better handle pose changes, occlusion, and mouth movements compared to 2D representations; (ii) a triplet loss function to embed the cross-reenactment during training for better identity preservation; and (iii) multi-scale occlusion, improving inpainting and restoring missing areas. Comprehensive quantitative and qualitative experiments conducted on the VoxCeleb1 test set, demonstrate that MaskRenderer outperforms state-of-the-art models on unseen faces, especially when the Source and Driving identities are very different.
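A generic form of the identity-preserving triplet objective is sketched below: the reenacted frame's identity embedding is pulled towards the source identity and pushed away from the driving identity. The encoder, margin, and pairing are placeholders, not MaskRenderer's actual components.

```python
# Hedged illustration of identity preservation via a triplet loss on face
# identity embeddings. `id_encoder` is a placeholder identity network.
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def identity_triplet_loss(id_encoder, reenacted, source, driving):
    # anchor = reenacted frame, positive = source identity, negative = driving identity
    return triplet(id_encoder(reenacted), id_encoder(source), id_encoder(driving))
```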

Sculpting Efficiency: Pruning Medical Imaging Models for On-Device Inference

  • paper_url: http://arxiv.org/abs/2309.05090
  • repo_url: None
  • paper_authors: Sudarshan Sreeram, Bernhard Kainz
  • for: This paper looks at applying machine learning to healthcare to improve patient outcomes, focusing on deploying models for real-time, on-device inference.
  • methods: The paper applies filter pruning to segmentation models in cardiology and ophthalmology (a simple magnitude-based pruning sketch follows the abstract).
  • results: Filter pruning achieves compression rates of up to 1148x with minimal loss in quality. At high compression rates, the filter-pruned models run inference faster on a CPU than the GPU baseline, and they exhibit better robustness and generalisability than both the baseline and weight-pruned counterparts.
    Abstract Applying ML advancements to healthcare can improve patient outcomes. However, the sheer operational complexity of ML models, combined with legacy hardware and multi-modal gigapixel images, poses a severe deployment limitation for real-time, on-device inference. We consider filter pruning as a solution, exploring segmentation models in cardiology and ophthalmology. Our preliminary results show a compression rate of up to 1148x with minimal loss in quality, stressing the need to consider task complexity and architectural details when using off-the-shelf models. At high compression rates, filter-pruned models exhibit faster inference on a CPU than the GPU baseline. We also demonstrate that such models' robustness and generalisability characteristics exceed that of the baseline and weight-pruned counterparts. We uncover intriguing questions and take a step towards realising cost-effective disease diagnosis, monitoring, and preventive solutions.
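A minimal example of magnitude-based filter pruning is sketched below: convolutional filters are ranked by L1 norm and only the strongest fraction is kept. The selection criterion and keep ratio are illustrative; the paper's exact pruning recipe may differ, and in practice the following layer's input channels must be pruned to match.

```python
# Hedged sketch of L1-norm filter pruning for a single Conv2d layer.
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.25) -> nn.Conv2d:
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # L1 norm per output filter
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(scores, n_keep).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       conv.stride, conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned
```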

FreeMan: Towards Benchmarking 3D Human Pose Estimation in the Wild

  • paper_url: http://arxiv.org/abs/2309.05073
  • repo_url: https://github.com/wangjiongw/freeman_api
  • paper_authors: Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Ruimao Zhang
  • for: This paper aims to provide a large-scale, real-world dataset for 3D human pose estimation to advance research in the field.
  • methods: The dataset was captured with multiple synchronized smartphone cameras across diverse scenarios, yielding 11M frames from 8000 sequences, and an automated, precise annotation pipeline was designed for efficient large-scale processing.
  • results: The paper delivers this large-scale, real-world 3D human pose dataset together with comprehensive evaluation baselines for a range of tasks, and shows that it supports robust pose estimation and representation transfer in real, complex scenes.
    Abstract Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. This task carries great importance for fields like AIGC and human-robot interaction. In practice, 3D human pose estimation in real-world settings is a critical initial step in solving this problem. However, the current datasets, often collected under controlled laboratory conditions using complex motion capture equipment and unvarying backgrounds, are insufficient. The absence of real-world datasets is stalling the progress of this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale, real-world multi-view dataset. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences, viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established an automated, precise labeling pipeline that allows for large-scale processing efficiently. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations of standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. FreeMan is now publicly available at https://wangjiongw.github.io/freeman.

Lung Diseases Image Segmentation using Faster R-CNNs

  • paper_url: http://arxiv.org/abs/2309.06386
  • repo_url: None
  • paper_authors: Mihir Jain
  • for: This paper aims to improve the accuracy of lung disease diagnosis in children, in order to reduce child mortality in the developing world.
  • methods: The paper introduces a low-density neural network structure to mitigate topological challenges in deep networks. It incorporates parameters into a feature pyramid to enhance data extraction and minimize information loss, and uses Soft Non-Maximal Suppression to refine the region proposals generated by the Region Proposal Network (sketched after the abstract).
  • results: The model is evaluated on chest X-ray images using a confusion matrix to compute accuracy, precision, sensitivity, and specificity; the paper also analyses the trends of the region proposal and classification losses during training and classification.
    Abstract Lung diseases are a leading cause of child mortality in the developing world, with India accounting for approximately half of global pneumonia deaths (370,000) in 2016. Timely diagnosis is crucial for reducing mortality rates. This paper introduces a low-density neural network structure to mitigate topological challenges in deep networks. The network incorporates parameters into a feature pyramid, enhancing data extraction and minimizing information loss. Soft Non-Maximal Suppression optimizes regional proposals generated by the Region Proposal Network. The study evaluates the model on chest X-ray images, computing a confusion matrix to determine accuracy, precision, sensitivity, and specificity. We analyze loss functions, highlighting their trends during training. The regional proposal loss and classification loss assess model performance during training and classification phases. This paper analysis lung disease detection and neural network structures.
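The Gaussian Soft-NMS step mentioned above can be sketched as follows: instead of discarding overlapping proposals outright, their scores are decayed by a Gaussian of the IoU with the currently selected box. This is a simplified version of the generic published algorithm, not the paper's implementation.

```python
# Hedged sketch of Gaussian Soft-NMS over region proposals.
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes: np.ndarray, scores: np.ndarray, sigma: float = 0.5, score_thr: float = 1e-3):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    scores = scores.astype(float).copy()
    idxs, keep = list(range(len(scores))), []
    while idxs:
        i = max(idxs, key=lambda k: scores[k])   # highest-scoring remaining proposal
        keep.append(i)
        idxs.remove(i)
        for j in idxs:
            scores[j] *= np.exp(-iou(boxes[i], boxes[j]) ** 2 / sigma)  # Gaussian score decay
        idxs = [j for j in idxs if scores[j] > score_thr]
    return keep
```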

Super-Resolution Surface Reconstruction from Few Low-Resolution Slices

  • paper_url: http://arxiv.org/abs/2309.05071
  • repo_url: https://github.com/cyiyoo/SurfaceReconstructionFromFewSlices
  • paper_authors: Yiyao Zhang, Ke Chen, Shang-Hua Yang
  • for: Increasing the resolution of segmented feature surfaces (e.g. blood vessels) so that they are suitable for downstream numerical simulation such as finite element analysis.
  • methods: The paper proposes a new variational model based on an Euler-Elastica regulariser (its standard form is recalled after the abstract), and implements two numerical algorithms to solve it: a projected gradient descent method and the alternating direction method of multipliers.
  • results: Numerical experiments on real-life examples (including outputs of another variational model) illustrate the effectiveness of the new model; its advantages are shown quantitatively by comparing the standard deviation of Gaussian and mean curvatures from a discrete geometry viewpoint.
    Abstract In many imaging applications where segmented features (e.g. blood vessels) are further used for other numerical simulations (e.g. finite element analysis), the obtained surfaces do not have fine resolutions suitable for the task. Increasing the resolution of such surfaces becomes crucial. This paper proposes a new variational model for solving this problem, based on an Euler-Elastica-based regulariser. Further, we propose and implement two numerical algorithms for solving the model, a projected gradient descent method and the alternating direction method of multipliers. Numerical experiments using real-life examples (including two from outputs of another variational model) have been illustrated for effectiveness. The advantages of the new model are shown through quantitative comparisons by the standard deviation of Gaussian curvatures and mean curvatures from the viewpoint of discrete geometry.
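For reference, Euler-Elastica-based regularisers penalise both the length and the curvature of the level lines of the surface representation; a common form of the energy (notation ours, not necessarily the paper's exact functional) is

```latex
E(u) = \int_{\Omega} \left( a + b\,\kappa^{2} \right) \lvert \nabla u \rvert \,\mathrm{d}x,
\qquad
\kappa = \nabla \cdot \frac{\nabla u}{\lvert \nabla u \rvert},
```

where a, b > 0 balance the length and curvature terms.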

Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels

  • paper_url: http://arxiv.org/abs/2309.05069
  • repo_url: https://github.com/bobwan1995/zeroshot-hoi-with-clip
  • paper_authors: Bo Wan, Tinne Tuytelaars
  • for: This work investigates zero-shot human-object interaction (HOI) detection, a novel paradigm that requires no task-specific annotations.
  • methods: The authors employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation at multiple levels, using a multi-branch network to learn HOI representations from global images, local union regions, and individual human/object instances (a sketch of CLIP-based HOI scoring follows the abstract).
  • results: Experiments show that the multi-level CLIP knowledge integration strategy achieves strong performance on the public HICO-DET benchmark, comparable even with some fully-supervised and weakly-supervised methods.
    Abstract In this paper, we investigate the task of zero-shot human-object interaction (HOI) detection, a novel paradigm for identifying HOIs without the need for task-specific annotations. To address this challenging task, we employ CLIP, a large-scale pre-trained vision-language model (VLM), for knowledge distillation on multiple levels. Specifically, we design a multi-branch neural network that leverages CLIP for learning HOI representations at various levels, including global images, local union regions encompassing human-object pairs, and individual instances of humans or objects. To train our model, CLIP is utilized to generate HOI scores for both global images and local union regions that serve as supervision signals. The extensive experiments demonstrate the effectiveness of our novel multi-level CLIP knowledge integration strategy. Notably, the model achieves strong performance, which is even comparable with some fully-supervised and weakly-supervised methods on the public HICO-DET benchmark.
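As a rough illustration of how CLIP can generate HOI supervision signals, the sketch below scores an image (or a cropped union region) against a set of HOI text prompts with the public OpenAI CLIP API. The prompt wording, the crop strategy, and how the scores are used for distillation are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch: CLIP similarity between an image region and HOI text prompts,
# usable as a soft supervision signal for knowledge distillation.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

hoi_prompts = ["a photo of a person riding a bicycle",
               "a photo of a person holding a cup",
               "a photo of a person feeding a horse"]  # illustrative interaction classes

@torch.no_grad()
def clip_hoi_scores(region: Image.Image) -> torch.Tensor:
    image = preprocess(region).unsqueeze(0).to(device)
    text = clip.tokenize(hoi_prompts).to(device)
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (100.0 * img_f @ txt_f.T).softmax(dim=-1).squeeze(0)  # soft HOI scores
```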

Multi-view Self-supervised Disentanglement for General Image Denoising

  • paper_url: http://arxiv.org/abs/2309.05049
  • repo_url: https://github.com/chqwer2/multi-view-self-supervised-disentanglement-denoising
  • paper_authors: Hao Chen, Chenyuan Qu, Yu Zhang, Chen Chen, Jianbo Jiao
  • for: Improving modern deep-learning-based image denoisers, in particular their generalisation to unseen noise types and real noise.
  • methods: A self-supervised learning framework that never looks at clean images: given two differently corrupted versions of the same image as input, it learns to disentangle the latent clean features from the corruption and thereby recover the clean image (the cross-reconstruction idea is sketched below).
  • results: On both synthetic and real noise, the proposed method outperforms prior self-supervised approaches, especially on unseen novel noise types; on real noise it even outperforms its supervised counterparts by over 3 dB.
    Abstract With its significant performance improvements, the deep learning paradigm has become a standard tool for modern image denoisers. While promising performance has been shown on seen noise distributions, existing approaches often suffer from generalisation to unseen noise types or general and real noise. It is understandable as the model is designed to learn paired mapping (e.g. from a noisy image to its clean version). In this paper, we instead propose to learn to disentangle the noisy image, under the intuitive assumption that different corrupted versions of the same clean image share a common latent space. A self-supervised learning framework is proposed to achieve the goal, without looking at the latent clean image. By taking two different corrupted versions of the same image as input, the proposed Multi-view Self-supervised Disentanglement (MeD) approach learns to disentangle the latent clean features from the corruptions and recover the clean image consequently. Extensive experimental analysis on both synthetic and real noise shows the superiority of the proposed method over prior self-supervised approaches, especially on unseen novel noise types. On real noise, the proposed method even outperforms its supervised counterparts by over 3 dB.
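The cross-reconstruction idea can be sketched as below: two independently corrupted views of the same image should share one clean latent, so the clean codes are swapped across views and each view is reconstructed from the other view's clean code plus its own noise code. Module names, the consistency term, and the loss weighting are placeholders, not the paper's exact objective.

```python
# Hedged sketch of multi-view disentanglement for self-supervised denoising.
import torch.nn.functional as F

def disentangle_step(x1, x2, enc_clean, enc_noise, decoder):
    """x1, x2: two differently corrupted views of the same underlying clean image."""
    c1, c2 = enc_clean(x1), enc_clean(x2)      # should capture the shared clean content
    n1, n2 = enc_noise(x1), enc_noise(x2)      # view-specific corruption codes
    loss = (F.mse_loss(decoder(c2, n1), x1)    # cross-reconstruct view 1 from view 2's clean code
            + F.mse_loss(decoder(c1, n2), x2)
            + F.mse_loss(c1, c2.detach()))     # encourage a consistent shared latent
    return loss
```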

What Is Near?: Room Locality Learning for Enhanced Robot Vision-Language-Navigation in Indoor Living Environments

  • paper_url: http://arxiv.org/abs/2309.05036
  • repo_url: None
  • paper_authors: Muraleekrishna Gopinathan, Jumana Abu-Khalaf, David Suter, Sidike Paheding, Nathir A. Rawashdeh
  • for: This work aims to give Vision-Language Navigation (VLN) agents prior knowledge of common indoor layouts, helping them locate target rooms quickly in new environments.
  • methods: The authors propose WIN (What Is Near), a commonsense learning model that predicts the local neighborhood map of the current environment from the current observation, the navigation history, and commonsense knowledge of living spaces.
  • results: Experiments show that local-global planning based on locality knowledge and predicted indoor layout helps the agent select appropriate actions and generalises better to unseen environments than classical VLN agents, achieving a Success Rate of 68% and Success weighted by Path Length of 63% on standard VLN metrics.
    Abstract Humans use their knowledge of common house layouts obtained from previous experiences to predict nearby rooms while navigating in new environments. This greatly helps them navigate previously unseen environments and locate their target room. To provide layout prior knowledge to navigational agents based on common human living spaces, we propose WIN (\textit{W}hat \textit{I}s \textit{N}ear), a commonsense learning model for Vision Language Navigation (VLN) tasks. VLN requires an agent to traverse indoor environments based on descriptive navigational instructions. Unlike existing layout learning works, WIN predicts the local neighborhood map based on prior knowledge of living spaces and current observation, operating on an imagined global map of the entire environment. The model infers neighborhood regions based on visual cues of current observations, navigational history, and layout common sense. We show that local-global planning based on locality knowledge and predicting the indoor layout allows the agent to efficiently select the appropriate action. Specifically, we devised a cross-modal transformer that utilizes this locality prior for decision-making in addition to visual inputs and instructions. Experimental results show that locality learning using WIN provides better generalizability compared to classical VLN agents in unseen environments. Our model performs favorably on standard VLN metrics, with Success Rate 68\% and Success weighted by Path Length 63\% in unseen environments.

Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

  • paper_url: http://arxiv.org/abs/2309.05032
  • repo_url: None
  • paper_authors: Kyoung Ok Yang, Junho Koh, Jun Won Choi
  • for: This work aims to improve the performance of human action recognition (HAR) models by fusing data acquired from different types of sensors.
  • methods: The paper proposes a new multimodal fusion architecture, the Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs a Unified Transformer to capture the inter-dependency among embeddings across both the time and modality domains, uses Factorized Time-Modality Attention for efficient self-attention (sketched after the abstract), and incorporates contrastive learning to reduce the discrepancy between feature distributions of different modalities, producing semantically aligned features for information fusion.
  • results: Experiments on two popular datasets, UTD-MHAD and NTU RGB+D, show that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
    Abstract Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer) designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information fusion. Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
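A hedged sketch of factorized time-modality self-attention is given below: attention is applied along the time axis within each modality and then along the modality axis within each time step, which is cheaper than joint attention over all time-modality tokens. The layer is illustrative only; UCFFormer's actual block may differ.

```python
# Illustrative factorized time-modality attention over (batch, time, modality, dim) tokens.
import torch
import torch.nn as nn

class FactorizedTimeModalityAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mod_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, m, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * m, t, d)   # attend along time, per modality
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, m, t, d).permute(0, 2, 1, 3)
        xm = x.reshape(b * t, m, d)                        # attend along modality, per time step
        xm, _ = self.mod_attn(xm, xm, xm)
        return xm.reshape(b, t, m, d)
```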

SC-NeRF: Self-Correcting Neural Radiance Field with Sparse Views

  • paper_url: http://arxiv.org/abs/2309.05028
  • repo_url: None
  • paper_authors: Liang Song, Guangming Wang, Jiuming Liu, Zhenyang Fu, Yanzi Miao, Hesheng
  • for: This work extends the generalization task of neural radiance fields to outdoor scenes while training only on object-level datasets.
  • methods: The authors propose a geometric correction module and an appearance correction module, both based on multi-head attention, to address the rendering artifacts caused by the distributional shift between training and testing scenes and by viewpoint changes in outdoor scenes.
  • results: Evaluated on four datasets (Blender, DTU, LLFF, Spaces), the network outperforms previous methods on outdoor scenes; compared to MVSNeRF on Spaces outdoor scenes, average PSNR improves from 19.369 to 25.989, SSIM from 0.838 to 0.889, and LPIPS decreases from 0.265 to 0.224.
    Abstract In recent studies, the generalization of neural radiance fields for novel view synthesis task has been widely explored. However, existing methods are limited to objects and indoor scenes. In this work, we extend the generalization task to outdoor scenes, trained only on object-level datasets. This approach presents two challenges. Firstly, the significant distributional shift between training and testing scenes leads to black artifacts in rendering results. Secondly, viewpoint changes in outdoor scenes cause ghosting or missing regions in rendered images. To address these challenges, we propose a geometric correction module and an appearance correction module based on multi-head attention mechanisms. We normalize rendered depth and combine it with light direction as query in the attention mechanism. Our network effectively corrects varying scene structures and geometric features in outdoor scenes, generalizing well from object-level to unseen outdoor scenes. Additionally, we use appearance correction module to correct appearance features, preventing rendering artifacts like blank borders and ghosting due to viewpoint changes. By combining these modules, our approach successfully tackles the challenges of outdoor scene generalization, producing high-quality rendering results. When evaluated on four datasets (Blender, DTU, LLFF, Spaces), our network outperforms previous methods. Notably, compared to MVSNeRF, our network improves average PSNR from 19.369 to 25.989, SSIM from 0.838 to 0.889, and reduces LPIPS from 0.265 to 0.224 on Spaces outdoor scenes.

Boosting Unsupervised Contrastive Learning Using Diffusion-Based Data Augmentation From Scratch

  • paper_url: http://arxiv.org/abs/2309.07909
  • repo_url: None
  • paper_authors: Zelin Zang, Hao Luo, Kai Wang, Panpan Zhang, Fan Wang, Stan. Z Li, Yang You
  • for: Improving unsupervised contrastive learning through data augmentation, particularly for scientific data domains where prior knowledge and labels are scarce.
  • methods: The paper proposes DiffAug, a diffusion-based data augmentation technique whose diffusion steps ensure that the augmented and original data share a smoothed latent space. DiffAug needs no labels, external data/models, or prior knowledge, because it first mines sufficient semantic knowledge about the neighborhood to constrain the diffusion steps.
  • results: DiffAug improves image classification and clustering accuracy by 1.6%~4.5%, and improves performance on biological data by up to 10.1% (5.8% on average), performing well in both the vision and biological domains.
    Abstract Unsupervised contrastive learning methods have recently seen significant improvements, particularly through data augmentation strategies that aim to produce robust and generalizable representations. However, prevailing data augmentation methods, whether hand designed or based on foundation models, tend to rely heavily on prior knowledge or external data. This dependence often compromises their effectiveness and efficiency. Furthermore, the applicability of most existing data augmentation strategies is limited when transitioning to other research domains, especially science-related data. This limitation stems from the paucity of prior knowledge and labeled data available in these domains. To address these challenges, we introduce DiffAug-a novel and efficient Diffusion-based data Augmentation technique. DiffAug aims to ensure that the augmented and original data share a smoothed latent space, which is achieved through diffusion steps. Uniquely, unlike traditional methods, DiffAug first mines sufficient prior semantic knowledge about the neighborhood. This provides a constraint to guide the diffusion steps, eliminating the need for labels, external data/models, or prior knowledge. Designed as an architecture-agnostic framework, DiffAug provides consistent improvements. Specifically, it improves image classification and clustering accuracy by 1.6%~4.5%. When applied to biological data, DiffAug improves performance by up to 10.1%, with an average improvement of 5.8%. DiffAug shows good performance in both vision and biological domains.

DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

  • paper_url: http://arxiv.org/abs/2309.05015
  • repo_url: None
  • paper_authors: Guanyu Xu, Zhiwei Hao, Yong Luo, Han Hu, Jianping An, Shiwen Mao
  • for: The paper aims to achieve fast and energy-efficient collaborative inference for vision transformer (ViT) models on resource-constrained edge devices, while maintaining accuracy comparable with large ViTs.
  • methods: The authors propose a collaborative inference framework called DeViT, which decomposes large ViTs into multiple small models for efficient inference on edge devices, and design a decomposition-and-ensemble algorithm based on knowledge distillation, called DEKD, to fuse the small decomposed models while reducing communication overheads (a generic distillation sketch follows the abstract).
  • results: The method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. DeViTs improve end-to-end latency by 2.89x with only a 1.65% accuracy sacrifice compared to the large ViT-L/16 on the GPU server, and DeDeiTs surpass the recent efficient MobileViT-S by 3.54% in accuracy on ImageNet-1K while running 1.72x faster and requiring 55.28% lower energy consumption on the edge device.
    Abstract Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89$\times$ with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72$\times$ faster and requiring 55.28% lower energy consumption on the edge device.
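A generic sketch of the decomposition-and-ensemble distillation idea is given below: several small student models process the input, their logits are fused (a simple mean here), and the fused prediction is distilled from the large teacher ViT. The fusion rule and loss weighting are illustrative assumptions, not the actual DEKD algorithm.

```python
# Hedged sketch of ensemble knowledge distillation from a large teacher ViT
# into several small decomposed student models.
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(students, teacher, x, y, T: float = 4.0, alpha: float = 0.5):
    with torch.no_grad():
        t_logits = teacher(x)                                        # large-ViT targets
    s_logits = torch.stack([s(x) for s in students]).mean(dim=0)     # fuse the small models
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    ce = F.cross_entropy(s_logits, y)
    return alpha * kd + (1 - alpha) * ce
```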

Geometrically Consistent Partial Shape Matching

  • paper_url: http://arxiv.org/abs/2309.05013
  • repo_url: None
  • paper_authors: Viktoria Ehm, Paul Roetzer, Marvin Eisenberger, Maolin Gao, Florian Bernard, Daniel Cremers
  • for: This paper addresses geometrically consistent partial shape matching in computer vision and graphics, which is relevant for tasks such as shape interpolation, pose transfer, and texture transfer.
  • methods: The paper integrates state-of-the-art deep shape features into a novel integer linear programming formulation for partial shape matching, solving it to global optimality on low-resolution shapes and refining the solution with a coarse-to-fine scheme.
  • results: The method produces more reliable matchings on partially observed shapes than existing geometrically consistent algorithms, and its matchings are substantially smoother than those of learning-based state-of-the-art shape matching methods.
    Abstract Finding correspondences between 3D shapes is a crucial problem in computer vision and graphics, which is for example relevant for tasks like shape interpolation, pose transfer, or texture transfer. An often neglected but essential property of matchings is geometric consistency, which means that neighboring triangles in one shape are consistently matched to neighboring triangles in the other shape. Moreover, while in practice one often has only access to partial observations of a 3D shape (e.g. due to occlusion, or scanning artifacts), there do not exist any methods that directly address geometrically consistent partial shape matching. In this work we fill this gap by proposing to integrate state-of-the-art deep shape features into a novel integer linear programming partial shape matching formulation. Our optimization yields a globally optimal solution on low resolution shapes, which we then refine using a coarse-to-fine scheme. We show that our method can find more reliable results on partial shapes in comparison to existing geometrically consistent algorithms (for which one first has to fill missing parts with a dummy geometry). Moreover, our matchings are substantially smoother than learning-based state-of-the-art shape matching methods.

  • paper_url: http://arxiv.org/abs/2309.04967
  • repo_url: None
  • paper_authors: Pengcheng Zhang, Xiao Bai, Jin Zheng, Xin Ning
  • for: This paper proposes a fully decoupled approach to end-to-end person search, which jointly detects and re-identifies a target person with a unified model while resolving the conflict between the detection and re-identification sub-tasks.
  • methods: A task-incremental person search network incrementally constructs the end-to-end model for the detection and re-identification sub-tasks, decoupling the model architecture and allowing independent, task-incremental training for the two conflicting objectives.
  • results: Experimental results show that the method alleviates the conflict between the sub-tasks and improves person search performance.
    Abstract End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection task unifies all persons while the re-id task discriminates different identities, resulting in conflict optimal objectives. Existing works proposed to decouple end-to-end person search to alleviate such conflict. Yet these methods are still sub-optimal on one or two of the sub-tasks due to their partially decoupled models, which limits the overall person search performance. In this paper, we propose to fully decouple person search towards optimal person search. A task-incremental person search network is proposed to incrementally construct an end-to-end model for the detection and re-id sub-task, which decouples the model architecture for the two sub-tasks. The proposed task-incremental network allows task-incremental training for the two conflicting tasks. This enables independent learning for different objectives, thus fully decoupling the model for person search. Comprehensive experimental evaluations demonstrate the effectiveness of the proposed fully decoupled models for end-to-end person search.

Multi-modal Extreme Classification

  • paper_url: http://arxiv.org/abs/2309.04961
  • repo_url: https://github.com/extreme-classification/mufin
  • paper_authors: Anshul Mittal, Kunal Dahiya, Shreya Malani, Janani Ramaswamy, Seba Kuruvilla, Jitendra Ajmera, Keng-hao Chang, Sumeet Agarwal, Purushottam Kar, Manik Varma
  • for: This paper addresses extreme classification (XC) tasks with millions of labels, where both datapoints and labels carry visual and textual descriptors.
  • methods: The paper develops MUFIN, a multi-modal method based on cross-modal attention that provides accurate categorization over millions of labels, trained in a modular fashion using pre-training and positive and negative mining.
  • results: On the MM-AmazonTitles-300K dataset, MUFIN offers at least 3% higher accuracy than leading text-based, image-based, and multi-modal methods.
    Abstract This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels where datapoints and labels are endowed with visual and textual descriptors. Applications of MUFIN to product-to-product recommendation and bid query prediction over several millions of products are presented. Contemporary multi-modal methods frequently rely on purely embedding-based methods. On the other hand, XC methods utilize classifier architectures to offer superior accuracies than embedding only methods but mostly focus on text-based categorization tasks. MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several millions of labels. This presents the twin challenges of developing multi-modal architectures that can offer embeddings sufficiently expressive to allow accurate categorization over millions of labels; and training and inference routines that scale logarithmically in the number of labels. MUFIN develops an architecture based on cross-modal attention and trains it in a modular fashion using pre-training and positive and negative mining. A novel product-to-product recommendation dataset MM-AmazonTitles-300K containing over 300K products was curated from publicly available amazon.com listings with each product endowed with a title and multiple images. On the all datasets MUFIN offered at least 3% higher accuracy than leading text-based, image-based and multi-modal techniques. Code for MUFIN is available at https://github.com/Extreme-classification/MUFIN

SdCT-GAN: Reconstructing CT from Biplanar X-Rays with Self-driven Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2309.04960
  • repo_url: https://github.com/csqvictory/sdct-gan
  • paper_authors: Shuangqin Cheng, Qingliang Chen, Qiyi Zhang, Ming Li, Yamuhanmode Alike, Kaile Su, Pengcheng Wen
  • for: The paper aims to address the limitations of existing methods in reconstructing 3D CT images from a limited number of 2D X-rays, by proposing a new self-driven generative adversarial network model (SdCT-GAN) that prioritizes the preservation of image details.
  • methods: SdCT-GAN incorporates a novel auto-encoder structure in the discriminator and uses a Sobel Gradient Guider (SGG) to integrate edge information from the 2D X-ray input (a Sobel edge-extraction sketch follows the abstract). The LPIPS evaluation metric is adopted to quantitatively evaluate the fine contours and textures of reconstructed images.
  • results: SdCT-GAN outperforms mainstream state-of-the-art baselines both qualitatively and quantitatively; the reconstructed images show better preservation of textural details and fine contours as evaluated with LPIPS.
    Abstract Computed Tomography (CT) is a medical imaging modality that can generate more informative 3D images than 2D X-rays. However, this advantage comes at the expense of more radiation exposure, higher costs, and longer acquisition time. Hence, the reconstruction of 3D CT images using a limited number of 2D X-rays has gained significant importance as an economical alternative. Nevertheless, existing methods primarily prioritize minimizing pixel/voxel-level intensity discrepancies, often neglecting the preservation of textural details in the synthesized images. This oversight directly impacts the quality of the reconstructed images and thus affects the clinical diagnosis. To address the deficits, this paper presents a new self-driven generative adversarial network model (SdCT-GAN), which is motivated to pay more attention to image details by introducing a novel auto-encoder structure in the discriminator. In addition, a Sobel Gradient Guider (SGG) idea is applied throughout the model, where the edge information from the 2D X-ray image at the input can be integrated. Moreover, LPIPS (Learned Perceptual Image Patch Similarity) evaluation metric is adopted that can quantitatively evaluate the fine contours and textures of reconstructed images better than the existing ones. Finally, the qualitative and quantitative results of the empirical studies justify the power of the proposed model compared to mainstream state-of-the-art baselines.
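The Sobel Gradient Guider builds on standard Sobel edge extraction, which can be sketched as below; where and how the resulting edge maps are injected into the generator and discriminator is the paper's contribution and is not reproduced here.

```python
# Sobel gradient magnitude of a grayscale X-ray, usable as an edge prior.
import torch
import torch.nn.functional as F

def sobel_edges(x: torch.Tensor) -> torch.Tensor:
    """x: (B, 1, H, W) grayscale image; returns the gradient magnitude, same shape."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                   # vertical-gradient kernel
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)
```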

Semi-Supervised learning for Face Anti-Spoofing using Apex frame

  • paper_url: http://arxiv.org/abs/2309.04958
  • repo_url: None
  • paper_authors: Usman Muhammad, Mourad Oussalah, Jorma Laaksonen
  • for: Improving the performance of face anti-spoofing techniques.
  • methods: Apex frames are computed from videos using Gaussian weighting (without any convolution), and labeled plus unlabeled apex frames of various temporal lengths are used for semi-supervised learning (the weighting is sketched after the abstract).
  • results: On four face anti-spoofing databases (CASIA, REPLAY-ATTACK, OULU-NPU, and MSU-MFSD), the apex frames improve the performance of face anti-spoofing techniques.
    Abstract Conventional feature extraction techniques in the face anti-spoofing domain either analyze the entire video sequence or focus on a specific segment to improve model performance. However, identifying the optimal frames that provide the most valuable input for the face anti-spoofing remains a challenging task. In this paper, we address this challenge by employing Gaussian weighting to create apex frames for videos. Specifically, an apex frame is derived from a video by computing a weighted sum of its frames, where the weights are determined using a Gaussian distribution centered around the video's central frame. Furthermore, we explore various temporal lengths to produce multiple unlabeled apex frames using a Gaussian function, without the need for convolution. By doing so, we leverage the benefits of semi-supervised learning, which considers both labeled and unlabeled apex frames to effectively discriminate between live and spoof classes. Our key contribution emphasizes the apex frame's capacity to represent the most significant moments in the video, while unlabeled apex frames facilitate efficient semi-supervised learning, as they enable the model to learn from videos of varying temporal lengths. Experimental results using four face anti-spoofing databases: CASIA, REPLAY-ATTACK, OULU-NPU, and MSU-MFSD demonstrate the apex frame's efficacy in advancing face anti-spoofing techniques.
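The apex-frame construction is simple enough to sketch directly: the apex frame is a weighted sum of the video's frames, with Gaussian weights centred on the middle frame. The temporal standard deviation below is an illustrative choice, not the paper's setting.

```python
# Gaussian-weighted apex frame of a video clip.
import numpy as np

def apex_frame(frames: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """frames: (T, H, W, C) video; returns a single (H, W, C) apex frame."""
    t = np.arange(frames.shape[0])
    centre = (frames.shape[0] - 1) / 2.0
    w = np.exp(-((t - centre) ** 2) / (2 * sigma ** 2))   # Gaussian weights around the central frame
    w /= w.sum()
    return np.tensordot(w, frames.astype(np.float32), axes=(0, 0))
```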

Anatomy Completor: A Multi-class Completion Framework for 3D Anatomy Reconstruction

  • paper_url: http://arxiv.org/abs/2309.04956
  • repo_url: None
  • paper_authors: Jianning Li, Antonio Pepe, Gijs Luijten, Christina Schwarz-Gsaxner, Jens Kleesiek, Jan Egger
  • for: This paper targets scenarios where one or multiple anatomies are missing in the imaging data due to surgical, pathological, or traumatic factors, or simply because these anatomies are not covered by the image acquisition.
  • methods: The paper proposes two paradigms based on a 3D denoising auto-encoder (DAE) to solve the anatomy reconstruction problem: (i) the DAE learns a many-to-one mapping between incomplete and complete instances; (ii) the DAE learns directly a one-to-one residual mapping between the incomplete instances and the target anatomies.
  • results: The method produces reasonable anatomy reconstructions given instances with different levels of incompleteness (i.e., one or multiple random anatomies missing).
    Abstract In this paper, we introduce a completion framework to reconstruct the geometric shapes of various anatomies, including organs, vessels and muscles. Our work targets a scenario where one or multiple anatomies are missing in the imaging data due to surgical, pathological or traumatic factors, or simply because these anatomies are not covered by image acquisition. Automatic reconstruction of the missing anatomies benefits many applications, such as organ 3D bio-printing, whole-body segmentation, animation realism, paleoradiology and forensic imaging. We propose two paradigms based on a 3D denoising auto-encoder (DAE) to solve the anatomy reconstruction problem: (i) the DAE learns a many-to-one mapping between incomplete and complete instances; (ii) the DAE learns directly a one-to-one residual mapping between the incomplete instances and the target anatomies. We apply a loss aggregation scheme that enables the DAE to learn the many-to-one mapping more effectively and further enhances the learning of the residual mapping. On top of this, we extend the DAE to a multiclass completor by assigning a unique label to each anatomy involved. We evaluate our method using a CT dataset with whole-body segmentations. Results show that our method produces reasonable anatomy reconstructions given instances with different levels of incompleteness (i.e., one or multiple random anatomies are missing). Codes and pretrained models are publicly available at https://github.com/Jianningli/medshapenet-feedback/ tree/main/anatomy-completor

Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation

  • paper_url: http://arxiv.org/abs/2309.04946
  • repo_url: https://github.com/yuangan/eat_code
  • paper_authors: Yuan Gan, Zongxin Yang, Xihang Yue, Lingyun Sun, Yi Yang
  • for: This work aims to make audio-driven talking-head synthesis more flexible and efficient, and to provide a reliable way of controlling the expressed emotion.
  • methods: The method starts from a pretrained emotion-agnostic talking-head transformer and introduces three lightweight adaptations (Deep Emotional Prompts, an Emotional Deformation Network, and an Emotional Adaptation Module) to enable precise and realistic emotion control.
  • results: The approach achieves state-of-the-art performance on widely used benchmarks, including LRW and MEAD, and its parameter-efficient adaptations generalize remarkably well even when emotional training videos are scarce or nonexistent.
    Abstract Audio-driven talking-head synthesis is a popular research topic for virtual human-related applications. However, the inflexibility and inefficiency of existing methods, which necessitate expensive end-to-end training to transfer emotions from guidance videos to talking-head predictions, are significant limitations. In this work, we propose the Emotional Adaptation for Audio-driven Talking-head (EAT) method, which transforms emotion-agnostic talking-head models into emotion-controllable ones in a cost-effective and efficient manner through parameter-efficient adaptations. Our approach utilizes a pretrained emotion-agnostic talking-head transformer and introduces three lightweight adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and Emotional Adaptation Module) from different perspectives to enable precise and realistic emotion controls. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including LRW and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable generalization ability, even in scenarios where emotional training videos are scarce or nonexistent. Project website: https://yuangan.github.io/eat/

Text-driven Editing of 3D Scenes without Retraining

  • paper_url: http://arxiv.org/abs/2309.04917
  • repo_url: None
  • paper_authors: Shuangkang Fang, Yufeng Wang, Yi Yang, Yi-Hsuan Tsai, Wenrui Ding, Ming-Hsuan Yang, Shuchang Zhou
  • for: This paper proposes a text-driven 3D scene editing method that lets users edit a 3D scene simply by describing the desired change in text, without retraining.
  • methods: The method uses off-the-shelf text-based 2D image editing models to modify the rendered scene images, then filters out poorly edited images that would break 3D consistency. The remaining inconsistency is treated as noise perturbation to be removed by generating training data with similar perturbation characteristics, and cross-view regularization terms help the generalized NeRF model mitigate these perturbations.
  • results: The method supports multiple editing types, including appearance editing, weather transition, material changing, and style transfer, and lets users edit 3D scenes from text descriptions in a convenient, intuitive, and practical way.
    Abstract Numerous diffusion models have recently been applied to image synthesis and editing. However, editing 3D scenes is still in its early stages. It poses various challenges, such as the requirement to design specific methods for different editing types, retraining new models for various 3D scenes, and the absence of convenient human interaction during editing. To tackle these issues, we introduce a text-driven editing method, termed DN2N, which allows for the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining. Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images, followed by a filtering process to discard poorly edited images that disrupt 3D consistency. We then consider the remaining inconsistency as a problem of removing noise perturbation, which can be solved by generating training data with similar perturbation characteristics for training. We further propose cross-view regularization terms to help the generalized NeRF model mitigate these perturbations. Our text-driven method allows users to edit a 3D scene with their desired description, which is more friendly, intuitive, and practical than prior works. Empirical results show that our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer. Most importantly, our method generalizes well with editing abilities shared among a set of model parameters without requiring a customized editing model for some specific scenes, thus inferring novel views with editing effects directly from user input. The project website is available at http://sk-fun.fun/DN2N

Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art

  • paper_url: http://arxiv.org/abs/2309.04902
  • repo_url: https://github.com/arekavandi/transformer-sod
  • paper_authors: Aref Miri Rekavandi, Shima Rashidi, Farid Boussaid, Stephen Hoefs, Emre Akbas, Mohammed bennamoun
  • for: This survey explores the performance benefits that transformers offer for small object detection (SOD) and identifies potential reasons for their superiority over established CNN-based detectors.
  • methods: The survey presents a taxonomy of over 60 transformer-based SOD studies published from 2020 to 2023, covering generic, aerial, medical, active millimeter, and underwater images as well as videos, and compiles 12 large-scale datasets suitable for SOD that previous studies overlooked.
  • results: Transformers consistently outperform well-established CNN-based detectors on almost every video or image dataset examined; the survey compares the reviewed studies using popular metrics such as mean Average Precision (mAP), Frames Per Second (FPS), and the number of parameters.
    Abstract Transformers have rapidly gained popularity in computer vision, especially in the field of object recognition and detection. Upon examining the outcomes of state-of-the-art object detection methods, we noticed that transformers consistently outperformed well-established CNN-based detectors in almost every video or image dataset. While transformer-based approaches remain at the forefront of small object detection (SOD) techniques, this paper aims to explore the performance benefits offered by such extensive networks and identify potential reasons for their SOD superiority. Small objects have been identified as one of the most challenging object types in detection frameworks due to their low visibility. We aim to investigate potential strategies that could enhance transformers' performance in SOD. This survey presents a taxonomy of over 60 research studies on developed transformers for the task of SOD, spanning the years 2020 to 2023. These studies encompass a variety of detection applications, including small object detection in generic images, aerial images, medical images, active millimeter images, underwater images, and videos. We also compile and present a list of 12 large-scale datasets suitable for SOD that were overlooked in previous studies and compare the performance of the reviewed studies using popular metrics such as mean Average Precision (mAP), Frames Per Second (FPS), number of parameters, and more. Researchers can keep track of newer studies on our web page, which is available at \url{https://github.com/arekavandi/Transformer-SOD}.