cs.CV - 2023-07-26

Artifact Restoration in Histology Images with Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2307.14262
  • repo_url: https://github.com/zhenqi-he/artifusion
  • paper_authors: Zhenqi He, Junjun He, Jin Ye, Yiqing Shen
  • for: histological whole slide images (WSIs) restoration
  • methods: denoising diffusion probabilistic model (ArtiFusion) with a novel Swin-Transformer denoising architecture and time token scheme
  • results: effective restoration of artifact regions while preserving tissue structures and stain style in artifact-free areas, as demonstrated through extensive evaluations.
    Abstract Histological whole slide images (WSIs) can be usually compromised by artifacts, such as tissue folding and bubbles, which will increase the examination difficulty for both pathologists and Computer-Aided Diagnosis (CAD) systems. Existing approaches to restoring artifact images are confined to Generative Adversarial Networks (GANs), where the restoration process is formulated as an image-to-image transfer. Those methods are prone to suffer from mode collapse and unexpected mistransfer in the stain style, leading to unsatisfied and unrealistic restored images. Innovatively, we make the first attempt at a denoising diffusion probabilistic model for histological artifact restoration, namely ArtiFusion. Specifically, ArtiFusion formulates the artifact region restoration as a gradual denoising process, and its training relies solely on artifact-free images to simplify the training complexity. Furthermore, to capture local-global correlations in the regional artifact restoration, a novel Swin-Transformer denoising architecture is designed, along with a time token scheme. Our extensive evaluations demonstrate the effectiveness of ArtiFusion as a pre-processing method for histology analysis, which can successfully preserve the tissue structures and stain style in artifact-free regions during the restoration. Code is available at https://github.com/zhenqi-he/ArtiFusion.
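
The abstract frames restoration as a gradual denoising process trained only on artifact-free images. As a rough illustration of that general idea (a RePaint-style masked denoising loop, not the released ArtiFusion code), the sketch below runs a standard DDPM reverse pass while repeatedly copying the artifact-free pixels back in, so only the masked artifact region is synthesized; the `eps_model` denoiser, the mask convention, and the schedule handling are all assumptions.

```python
import torch

@torch.no_grad()
def restore_artifact_region(x_clean, mask, eps_model, betas):
    """Minimal masked-denoising sketch (RePaint-style).

    x_clean : (B, C, H, W) input image; mask == 1 marks artifact pixels.
    eps_model(x_t, t) -> predicted noise (hypothetical pre-trained denoiser).
    betas   : (T,) diffusion noise schedule.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x_t = torch.randn_like(x_clean)  # start the masked region from pure noise
    for t in reversed(range(T)):
        tt = torch.full((x_clean.shape[0],), t, device=x_clean.device, dtype=torch.long)

        # Standard DDPM reverse step on the whole image.
        eps = eps_model(x_t, tt)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x_t - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_prev = mean + torch.sqrt(betas[t]) * noise

        # Keep artifact-free pixels consistent with the input by re-noising them
        # to level t-1 and copying them in; only masked pixels are generated.
        if t > 0:
            known = torch.sqrt(alpha_bar[t - 1]) * x_clean + \
                    torch.sqrt(1.0 - alpha_bar[t - 1]) * torch.randn_like(x_clean)
        else:
            known = x_clean
        x_t = mask * x_prev + (1.0 - mask) * known

    return x_t
```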

Sparse Double Descent in Vision Transformers: real or phantom threat?

  • paper_url: http://arxiv.org/abs/2307.14253
  • repo_url: https://github.com/vgcq/sdd_vit
  • paper_authors: Victor Quétu, Marta Milovanovic, Enzo Tartaglione
  • for: Studying whether Vision Transformers (ViT) are affected by the "sparse double descent" phenomenon and finding a way to avoid it.
  • methods: ViTs are pruned to varying sparsity levels and trained with an optimally tuned ℓ2 regularization to analyze and relieve sparse double descent.
  • results: For ViTs, an optimally tuned ℓ2 regularization relieves sparse double descent, but the optimal lambda sacrifices the potential compression of the model.
    Abstract Vision transformers (ViT) have been of broad interest in recent theoretical and empirical works. They are state-of-the-art thanks to their attention-based approach, which boosts the identification of key features and patterns within images thanks to the capability of avoiding inductive bias, resulting in highly accurate image analysis. Meanwhile, neoteric studies have reported a ``sparse double descent'' phenomenon that can occur in modern deep-learning models, where extremely over-parametrized models can generalize well. This raises practical questions about the optimal size of the model and the quest over finding the best trade-off between sparsity and performance is launched: are Vision Transformers also prone to sparse double descent? Can we find a way to avoid such a phenomenon? Our work tackles the occurrence of sparse double descent on ViTs. Despite some works that have shown that traditional architectures, like Resnet, are condemned to the sparse double descent phenomenon, for ViTs we observe that an optimally-tuned $\ell_2$ regularization relieves such a phenomenon. However, everything comes at a cost: optimal lambda will sacrifice the potential compression of the ViT.
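
For readers who want to probe for the kind of behavior the abstract describes, the following sketch trains, prunes, and retrains a model at increasing sparsity levels while the ℓ2 penalty (lambda) is controlled through the optimizer's weight decay. The model factory, training loop, and evaluation function are placeholders, and the pruning scheme here (one-shot global magnitude pruning) is an assumption rather than the paper's exact protocol.

```python
import torch

def magnitude_prune_(model, sparsity):
    """Zero out the smallest-magnitude weights globally (in place)."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_w = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_w.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(all_w, k).values
    for w in weights:
        w.data.mul_((w.detach().abs() > threshold).float())

def sweep_sparse_double_descent(make_model, train_fn, eval_fn, weight_decay):
    """Train/retrain at increasing sparsity and record test error.

    A dip-rise-dip pattern in the returned errors would indicate sparse
    double descent; with a well-tuned weight_decay (lambda) the intermediate
    rise may disappear, as the paper reports.
    """
    errors = {}
    for sparsity in [0.0, 0.5, 0.8, 0.9, 0.95, 0.98, 0.99]:
        model = make_model()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                                      weight_decay=weight_decay)  # the l2 penalty
        train_fn(model, optimizer)          # placeholder dense training loop
        magnitude_prune_(model, sparsity)   # one-shot magnitude pruning
        train_fn(model, optimizer)          # retrain the pruned model
        errors[sparsity] = eval_fn(model)   # test error at this sparsity
    return errors
```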

Fluorescent Neuronal Cells v2: Multi-Task, Multi-Format Annotations for Deep Learning in Microscopy

  • paper_url: http://arxiv.org/abs/2307.14243
  • repo_url: None
  • paper_authors: Luca Clissa, Antonio Macaluso, Roberto Morelli, Alessandra Occhinegro, Emiliana Piscitiello, Ludovico Taddei, Marco Luppi, Roberto Amici, Matteo Cerri, Timna Hitrec, Lorenzo Rinaldi, Antonio Zoccoli
  • for: Advancing computer vision research in the life sciences by providing a dataset annotated for multiple learning tasks, including semantic segmentation, object detection, and counting.
  • methods: Fluorescence microscopy images of rodent neuronal cells, whose nuclei and cytoplasm are stained with diverse markers to highlight anatomical or functional characteristics, released together with ground-truth annotations in accessible formats.
  • results: A diverse, multi-task, multi-format dataset that enables extensive exploration and benchmarking, supporting methodological advances in computer vision and discoveries in the life sciences.
    Abstract Fluorescent Neuronal Cells v2 is a collection of fluorescence microscopy images and the corresponding ground-truth annotations, designed to foster innovative research in the domains of Life Sciences and Deep Learning. This dataset encompasses three image collections in which rodent neuronal cells' nuclei and cytoplasm are stained with diverse markers to highlight their anatomical or functional characteristics. Alongside the images, we provide ground-truth annotations for several learning tasks, including semantic segmentation, object detection, and counting. The contribution is two-fold. First, given the variety of annotations and their accessible formats, we envision our work facilitating methodological advancements in computer vision approaches for segmentation, detection, feature learning, unsupervised and self-supervised learning, transfer learning, and related areas. Second, by enabling extensive exploration and benchmarking, we hope Fluorescent Neuronal Cells v2 will catalyze breakthroughs in fluorescence microscopy analysis and promote cutting-edge discoveries in life sciences. The data are available at: https://amsacta.unibo.it/id/eprint/7347

Defending Adversarial Patches via Joint Region Localizing and Inpainting

  • paper_url: http://arxiv.org/abs/2307.14242
  • repo_url: None
  • paper_authors: Junwen Chen, Xingxing Wei
  • for: Defending image classification and detection tasks against various adversarial patch attacks.
  • methods: A defense based on a "localizing and inpainting" mechanism: a two-branch "localizing" sub-network detects the adversarial patch region, and an "inpainting" sub-network recovers the original content from the surrounding context; the two sub-networks are jointly trained in an iterative optimization manner.
  • results: Experiments on traffic sign classification and detection tasks show that the defense effectively counters various adversarial patch attacks while keeping the images usable for recognition and detection.
    Abstract Deep neural networks are successfully used in various applications, but show their vulnerability to adversarial examples. With the development of adversarial patches, the feasibility of attacks in physical scenes increases, and the defenses against patch attacks are urgently needed. However, defending such adversarial patch attacks is still an unsolved problem. In this paper, we analyse the properties of adversarial patches, and find that: on the one hand, adversarial patches will lead to the appearance or contextual inconsistency in the target objects; on the other hand, the patch region will show abnormal changes on the high-level feature maps of the objects extracted by a backbone network. Considering the above two points, we propose a novel defense method based on a ``localizing and inpainting" mechanism to pre-process the input examples. Specifically, we design an unified framework, where the ``localizing" sub-network utilizes a two-branch structure to represent the above two aspects to accurately detect the adversarial patch region in the image. For the ``inpainting" sub-network, it utilizes the surrounding contextual cues to recover the original content covered by the adversarial patch. The quality of inpainted images is also evaluated by measuring the appearance consistency and the effects of adversarial attacks. These two sub-networks are then jointly trained via an iterative optimization manner. In this way, the ``localizing" and ``inpainting" modules can interact closely with each other, and thus learn a better solution. A series of experiments versus traffic sign classification and detection tasks are conducted to defend against various adversarial patch attacks.

DisguisOR: Holistic Face Anonymization for the Operating Room

  • paper_url: http://arxiv.org/abs/2307.14241
  • repo_url: https://github.com/wngtn/disguisor
  • paper_authors: Lennart Bastian, Tony Danjun Wang, Tobias Czempiel, Benjamin Busam, Nassir Navab
  • for: Improving privacy protection in Surgical Data Science (SDS), in particular for video recordings captured in the operating room (OR).
  • methods: RGB and depth images from multiple camera streams are fused into a 3D point cloud; a parametric human mesh model is regressed onto detected 3D human keypoints, aligned with the point cloud, and rendered back into every camera view to replace each individual's face.
  • results: The method locates faces at a higher rate than existing approaches and yields more natural, geometrically consistent anonymization; DisguisOR addresses privacy on the scene level and can facilitate further research in SDS.
    Abstract Purpose: Recent advances in Surgical Data Science (SDS) have contributed to an increase in video recordings from hospital environments. While methods such as surgical workflow recognition show potential in increasing the quality of patient care, the quantity of video data has surpassed the scale at which images can be manually anonymized. Existing automated 2D anonymization methods under-perform in Operating Rooms (OR), due to occlusions and obstructions. We propose to anonymize multi-view OR recordings using 3D data from multiple camera streams. Methods: RGB and depth images from multiple cameras are fused into a 3D point cloud representation of the scene. We then detect each individual's face in 3D by regressing a parametric human mesh model onto detected 3D human keypoints and aligning the face mesh with the fused 3D point cloud. The mesh model is rendered into every acquired camera view, replacing each individual's face. Results: Our method shows promise in locating faces at a higher rate than existing approaches. DisguisOR produces geometrically consistent anonymizations for each camera view, enabling more realistic anonymization that is less detrimental to downstream tasks. Conclusion: Frequent obstructions and crowding in operating rooms leaves significant room for improvement for off-the-shelf anonymization methods. DisguisOR addresses privacy on a scene level and has the potential to facilitate further research in SDS.

Computational Approaches for Traditional Chinese Painting: From the “Six Principles of Painting” Perspective

  • paper_url: http://arxiv.org/abs/2307.14227
  • repo_url: None
  • paper_authors: Wei Zhang, Jian-Wei Zhang, Kam Kwai Wong, Yifang Wang, Yingchaojie Feng, Luwei Wang, Wei Chen
  • for: Surveying computational approaches to Traditional Chinese Painting (TCP) in order to help preserve and popularize this unique art style.
  • methods: An analysis of 92 publications and expert interviews from three perspectives: categorizing research on artistic elements according to the "Six Principles of Painting" theory, a four-stage framework describing the purposes of TCP applications, and a summary of the computational techniques commonly applied to TCPs.
  • results: A structured overview of how computer technologies are applied to TCP, a four-stage application framework, and insights into potential applications and future research directions.
    Abstract Traditional Chinese Painting (TCP) is an invaluable cultural heritage resource and a unique visual art style. In recent years, increasing interest has been placed on digitalizing TCPs to preserve and revive the culture. The resulting digital copies have enabled the advancement of computational methods for structured and systematic understanding of TCPs. To explore this topic, we conducted an in-depth analysis of 92 pieces of literature. We examined the current use of computer technologies on TCPs from three perspectives, based on numerous conversations with specialists. First, in light of the "Six Principles of Painting" theory, we categorized the articles according to their research focus on artistic elements. Second, we created a four-stage framework to illustrate the purposes of TCP applications. Third, we summarized the popular computational techniques applied to TCPs. The framework also provides insights into potential applications and future prospects, with professional opinion. The list of surveyed publications and related information is available online at https://ca4tcp.com.

ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation

  • paper_url: http://arxiv.org/abs/2307.14187
  • repo_url: https://github.com/KUIS-AI/adapt
  • paper_authors: Görkay Aydemir, Adil Kaan Akan, Fatma Güney
  • for: Forecasting the future trajectories of agents in complex traffic scenes, which requires predictions that are both reliable and efficient; existing methods are either inefficient or sacrifice accuracy.
  • methods: ADAPT jointly predicts the trajectories of all agents in the scene with dynamic weight learning; an adaptive head augments model capacity without increasing model size, and endpoint-conditioned prediction is reinforced by gradient stopping.
  • results: ADAPT outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets with a fraction of their computational overhead, focusing on each agent with adaptive prediction to deliver accurate forecasts efficiently.
    Abstract Forecasting future trajectories of agents in complex traffic scenes requires reliable and efficient predictions for all agents in the scene. However, existing methods for trajectory prediction are either inefficient or sacrifice accuracy. To address this challenge, we propose ADAPT, a novel approach for jointly predicting the trajectories of all agents in the scene with dynamic weight learning. Our approach outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets, with a fraction of their computational overhead. We attribute the improvement in our performance: first, to the adaptive head augmenting the model capacity without increasing the model size; second, to our design choices in the endpoint-conditioned prediction, reinforced by gradient stopping. Our analyses show that ADAPT can focus on each agent with adaptive prediction, allowing for accurate predictions efficiently. https://KUIS-AI.github.io/adapt
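
The abstract attributes part of the gain to endpoint-conditioned prediction reinforced by gradient stopping. A minimal sketch of that pattern is shown below: an endpoint head predicts a goal per agent, and the trajectory head is conditioned on a detached copy of that goal so refinement gradients do not flow back into the endpoint prediction. The module names, dimensions, and two-head layout are illustrative assumptions, not the ADAPT architecture.

```python
import torch
import torch.nn as nn

class EndpointConditionedDecoder(nn.Module):
    """Sketch: predict an endpoint per agent, then decode the full trajectory
    conditioned on that endpoint, with gradient stopping between the two heads."""

    def __init__(self, feat_dim=128, horizon=30):
        super().__init__()
        self.endpoint_head = nn.Linear(feat_dim, 2)              # (x, y) goal point
        self.traj_head = nn.Sequential(
            nn.Linear(feat_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, agent_feats):
        # agent_feats: (num_agents, feat_dim) scene-encoded agent features.
        endpoint = self.endpoint_head(agent_feats)               # (N, 2)
        # Gradient stopping: trajectory refinement does not backpropagate into
        # the endpoint prediction, keeping the two objectives decoupled.
        cond = torch.cat([agent_feats, endpoint.detach()], dim=-1)
        traj = self.traj_head(cond).view(-1, self.horizon, 2)    # (N, T, 2)
        return endpoint, traj
```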

Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks

  • paper_url: http://arxiv.org/abs/2307.14179
  • repo_url: None
  • paper_authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim
  • for: Providing practical guidelines for setting the atrous rates of semantic segmentation networks to improve segmentation accuracy.
  • methods: An analysis of the effective receptive field of the atrous spatial pyramid pooling (ASPP) module in DeepLab, from which practical guidelines are derived for choosing the optimal atrous rate as a function of the input image size.
  • results: Compared with other values, the optimal atrous rate consistently improves segmentation results across multiple datasets, including STARE, CHASE_DB1, HRF, Cityscapes, and iSAID.
    Abstract DeepLab is a widely used deep neural network for semantic segmentation, whose success is attributed to its parallel architecture called atrous spatial pyramid pooling (ASPP). ASPP uses multiple atrous convolutions with different atrous rates to extract both local and global information. However, fixed values of atrous rates are used for the ASPP module, which restricts the size of its field of view. In principle, atrous rate should be a hyperparameter to change the field of view size according to the target task or dataset. However, the manipulation of atrous rate is not governed by any guidelines. This study proposes practical guidelines for obtaining an optimal atrous rate. First, an effective receptive field for semantic segmentation is introduced to analyze the inner behavior of segmentation networks. We observed that the use of ASPP module yielded a specific pattern in the effective receptive field, which was traced to reveal the module's underlying mechanism. Accordingly, we derive practical guidelines for obtaining the optimal atrous rate, which should be controlled based on the size of input image. Compared to other values, using the optimal atrous rate consistently improved the segmentation results across multiple datasets, including the STARE, CHASE_DB1, HRF, Cityscapes, and iSAID datasets.
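
Since the paper's guideline is to control the atrous rate with the input image size rather than fixing it, a minimal PyTorch sketch of an ASPP block with configurable rates is given below, together with a simple linear scaling rule. The scaling rule and base values are illustrative assumptions, not the formula derived in the paper, and batch normalization, activations, and the image-pooling branch of the full DeepLab ASPP are omitted for brevity.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling with configurable atrous rates."""

    def __init__(self, in_ch, out_ch, atrous_rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +            # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in atrous_rates]                                 # atrous branches
        )
        self.project = nn.Conv2d(out_ch * (1 + len(atrous_rates)), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


def resolution_aware_rates(input_size, base_size=512, base_rates=(6, 12, 18)):
    """Illustrative assumption: scale the atrous rates linearly with input size,
    so the field of view grows with higher-resolution inputs."""
    scale = input_size / base_size
    return tuple(max(1, round(r * scale)) for r in base_rates)


# Example: build ASPP for 1024x1024 inputs instead of using fixed rates.
aspp = ASPP(in_ch=2048, out_ch=256, atrous_rates=resolution_aware_rates(1024))
```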

High-definition event frame generation using SoC FPGA devices

  • paper_url: http://arxiv.org/abs/2307.14177
  • repo_url: None
  • paper_authors: Krzysztof Blachut, Tomasz Kryjak
  • for: Implementing the accumulation and projection of a high-definition event data stream (HD, 1280 x 720 pixels) onto the image plane in SoC FPGA devices.
  • methods: An FPGA implementation of event frame generation; the hardware resources required by selected data representations (binary frame, event frame, exponentially decaying time surface, event frequency) are compared against those available on popular AMD Xilinx platforms.
  • results: The results confirm the feasibility of the approach while highlighting several challenges, limitations, and trade-offs; the resulting event frames can be used for typical vision algorithms such as object classification and detection, with both classical and deep neural network methods.
    Abstract In this paper we have addressed the implementation of the accumulation and projection of high-resolution event data stream (HD -1280 x 720 pixels) onto the image plane in FPGA devices. The results confirm the feasibility of this approach, but there are a number of challenges, limitations and trade-offs to be considered. The required hardware resources of selected data representations, such as binary frame, event frame, exponentially decaying time surface and event frequency, were compared with those available on several popular platforms from AMD Xilinx. The resulting event frames can be used for typical vision algorithms, such as object classification and detection, using both classical and deep neural network methods.
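
Two of the representations compared in the paper, the event (count) frame and the exponentially decaying time surface, can be summarized in a few lines of NumPy. The sketch below only illustrates the underlying math in software, not the FPGA implementation; the time constant and the (x, y, timestamp, polarity) event layout are assumptions.

```python
import numpy as np

def event_frame(events, height=720, width=1280):
    """Accumulate events into a per-pixel count frame.
    events: array with rows (x, y, timestamp, polarity)."""
    frame = np.zeros((height, width), dtype=np.uint16)
    np.add.at(frame, (events[:, 1].astype(int), events[:, 0].astype(int)), 1)
    return frame

def time_surface(events, t_ref, tau=50e3, height=720, width=1280):
    """Exponentially decaying time surface: each pixel stores
    exp(-(t_ref - t_last) / tau) for its most recent event."""
    last_t = np.full((height, width), -np.inf)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ts = events[:, 2]
    np.maximum.at(last_t, (ys, xs), ts)     # keep the latest timestamp per pixel
    surface = np.exp(-(t_ref - last_t) / tau)
    surface[np.isinf(last_t)] = 0.0         # pixels that never fired stay at zero
    return surface
```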

A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot

  • paper_url: http://arxiv.org/abs/2307.14397
  • repo_url: https://github.com/sutd-visual-computing-group/awesome-generative-modeling-under-data-constraints
  • paper_authors: Milad Abdollahzadeh, Touba Malekzadeh, Christopher T. H. Teo, Keshigeyan Chandrasegaran, Guimeng Liu, Ngai-Man Cheung
  • for: Surveying generative modeling under data constraints (GM-DC), i.e., learning generative models with limited data, few shots, and zero shots.
  • methods: Two taxonomies are proposed, one over GM-DC tasks and one over GM-DC approaches, together with an analysis of the interactions between different tasks and approaches.
  • results: The survey highlights research gaps, research trends, and potential avenues for future exploration, such as improving generative models under data constraints, applying them in domains like healthcare, and combining them with other techniques.
    Abstract In machine learning, generative modeling aims to learn to generate new data statistically similar to the training data distribution. In this paper, we survey learning generative models under limited data, few shots and zero shot, referred to as Generative Modeling under Data Constraint (GM-DC). This is an important topic when data acquisition is challenging, e.g. healthcare applications. We discuss background, challenges, and propose two taxonomies: one on GM-DC tasks and another on GM-DC approaches. Importantly, we study interactions between different GM-DC tasks and approaches. Furthermore, we highlight research gaps, research trends, and potential avenues for future exploration. Project website: https://gmdc-survey.github.io.

Creative Birds: Self-Supervised Single-View 3D Style Transfer

  • paper_url: http://arxiv.org/abs/2307.14127
  • repo_url: https://github.com/wrk226/creative_birds
  • paper_authors: Renke Wang, Guimin Que, Shuo Chen, Xiang Li, Jun Li, Jian Yang
  • for: Single-view 3D style transfer, focusing on birds, for which no existing single-view 3D transfer methods had been developed; the goal is to generate a 3D mesh whose shape and texture are transferred from two single-view images.
  • methods: A shape transfer generator comprising a dual residual gated network (DRGNet) and a multi-layer perceptron (MLP): DRGNet extracts features of the source and target images with a shared coordinate gate unit, while the MLP generates spatial coordinates for building the 3D mesh. A semantic UV texture transfer module performs textural style transfer via semantic UV segmentation and can be combined with many existing approaches.
  • results: Experiments on the CUB dataset show state-of-the-art performance on the single-view 3D style transfer task and high-quality 3D bird reconstructions.
    Abstract In this paper, we propose a novel method for single-view 3D style transfer that generates a unique 3D object with both shape and texture transfer. Our focus lies primarily on birds, a popular subject in 3D reconstruction, for which no existing single-view 3D transfer methods have been developed. The method we propose seeks to generate a 3D mesh shape and texture of a bird from two single-view images. To achieve this, we introduce a novel shape transfer generator that comprises a dual residual gated network (DRGNet), and a multi-layer perceptron (MLP). DRGNet extracts the features of source and target images using a shared coordinate gate unit, while the MLP generates spatial coordinates for building a 3D mesh. We also introduce a semantic UV texture transfer module that implements textural style transfer using semantic UV segmentation, which ensures consistency in the semantic meaning of the transferred regions. This module can be widely adapted to many existing approaches. Finally, our method constructs a novel 3D bird using a differentiable renderer. Experimental results on the CUB dataset verify that our method achieves state-of-the-art performance on the single-view 3D style transfer task. Code is available in https://github.com/wrk226/creative_birds.

Multi-modal Learning with Missing Modality via Shared-Specific Feature Modelling

  • paper_url: http://arxiv.org/abs/2307.14126
  • repo_url: None
  • paper_authors: Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, Gustavo Carneiro
  • for: Addressing the missing-modality problem in multi-modal tasks with a simpler and more effective method.
  • methods: Shared-Specific Feature Modelling (ShaSpec) learns shared and specific features from all available input modalities, using auxiliary tasks based on distribution alignment and domain classification together with a residual feature fusion procedure.
  • results: ShaSpec outperforms competing methods by a large margin on both medical image segmentation and computer vision classification, e.g., improving the state of the art on BraTS2018 by more than 3% for enhancing tumour, 5% for tumour core, and 3% for whole tumour.
    Abstract The missing modality issue is critical but non-trivial to be solved by multi-modal models. Current methods aiming to handle the missing modality problem in multi-modal tasks, either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved from a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour.

Memory-Efficient Graph Convolutional Networks for Object Classification and Detection with Event Cameras

  • paper_url: http://arxiv.org/abs/2307.14124
  • repo_url: None
  • paper_authors: Kamil Jeziorek, Andrea Pinna, Tomasz Kryjak
  • for: Improving the efficiency and accuracy of event camera data processing while accounting for both memory and computational costs.
  • methods: Graph convolutional networks (GCNs) for event data analysis, with a comparative study of different graph convolution operations (execution time, number of trainable parameters, data format requirements, and training outcomes) to select the most suitable one.
  • results: The proposed approach reaches 52.3% classification accuracy while reducing the number of parameters of the feature extraction module by a factor of 450 and the size of the data representation by a factor of 4.5; for object detection on the N-Caltech101 dataset it achieves 53.7% mAP@0.5 at an execution rate of 82 graphs per second.
    Abstract Recent advances in event camera research emphasize processing data in its original sparse form, which allows the use of its unique features such as high temporal resolution, high dynamic range, low latency, and resistance to image blur. One promising approach for analyzing event data is through graph convolutional networks (GCNs). However, current research in this domain primarily focuses on optimizing computational costs, neglecting the associated memory costs. In this paper, we consider both factors together in order to achieve satisfying results and relatively low model complexity. For this purpose, we performed a comparative analysis of different graph convolution operations, considering factors such as execution time, the number of trainable model parameters, data format requirements, and training outcomes. Our results show a 450-fold reduction in the number of parameters for the feature extraction module and a 4.5-fold reduction in the size of the data representation while maintaining a classification accuracy of 52.3%, which is 6.3% higher compared to the operation used in state-of-the-art approaches. To further evaluate performance, we implemented the object detection architecture and evaluated its performance on the N-Caltech101 dataset. The results showed an accuracy of 53.7 % mAP@0.5 and reached an execution rate of 82 graphs per second.
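
As a rough illustration of how sparse event data can be turned into a graph before applying graph convolutions, the sketch below makes one node per event and connects events that are close in space and (scaled) time. The radius, neighbor limit, and node features are illustrative assumptions, not the construction used in the paper, and the brute-force neighbor search is only meant for small event batches.

```python
import numpy as np

def events_to_graph(events, radius=5.0, time_scale=1e-4, max_neighbors=16):
    """Build a graph from events (x, y, t, polarity).

    Nodes: one per event, feature = polarity.
    Edges: pairs of events whose (x, y, scaled t) distance is below `radius`.
    Returns node positions (N, 3), node features (N, 1), and an edge list (2, E).
    """
    pos = np.stack([events[:, 0],
                    events[:, 1],
                    events[:, 2] * time_scale], axis=1).astype(np.float32)
    feat = events[:, 3:4].astype(np.float32)

    edges = []
    for i in range(len(pos)):                              # O(N^2) brute force
        d = np.linalg.norm(pos - pos[i], axis=1)
        neighbors = np.argsort(d)[1:max_neighbors + 1]     # skip the event itself
        for j in neighbors:
            if d[j] <= radius:
                edges.append((i, j))
    edge_index = (np.asarray(edges, dtype=np.int64).T
                  if edges else np.zeros((2, 0), dtype=np.int64))
    return pos, feat, edge_index
```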

Periocular biometrics: databases, algorithms and directions

  • paper_url: http://arxiv.org/abs/2307.14111
  • repo_url: None
  • paper_authors: Fernando Alonso-Fernandez, Josef Bigun
  • for: A review of the state of the art in periocular biometrics, giving an overview of the existing literature, the most relevant issues, and a brief discussion of future research trends.
  • methods: Coverage of periocular feature extraction and its use for gender and ethnicity classification, as well as studies of how gender transformation or plastic surgery affects recognition performance.
  • results: A thorough summary of periocular biometric research, its most relevant issues, and the existing literature.
    Abstract Periocular biometrics has been established as an independent modality due to concerns on the performance of iris or face systems in uncontrolled conditions. Periocular refers to the facial region in the eye vicinity, including eyelids, lashes and eyebrows. It is available over a wide range of acquisition distances, representing a trade-off between the whole face (which can be occluded at close distances) and the iris texture (which do not have enough resolution at long distances). Since the periocular region appears in face or iris images, it can be used also in conjunction with these modalities. Features extracted from the periocular region have been also used successfully for gender classification and ethnicity classification, and to study the impact of gender transformation or plastic surgery in the recognition performance. This paper presents a review of the state of the art in periocular biometric research, providing an insight of the most relevant issues and giving a thorough coverage of the existing literature. Future research trends are also briefly discussed.

VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

  • paper_url: http://arxiv.org/abs/2307.14073
  • repo_url: https://github.com/ZhihaoHu/VideoControlNet
  • paper_authors: Zhihao Hu, Dong Xu
  • for: A motion-guided video-to-video translation framework, VideoControlNet, that generates videos based on given prompts and the condition extracted from an input video.
  • methods: A diffusion model with ControlNet, combined with motion-guided P-frame generation (MgPG) and motion-guided B-frame interpolation (MgBI) modules that use motion information to avoid regenerating redundant content.
  • results: Experiments show that VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video domain; more results are provided on the project page.
    Abstract Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we proposed a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frame) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the rest frames (i.e., the B-frame) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page.

PNT-Edge: Towards Robust Edge Detection with Noisy Labels by Learning Pixel-level Noise Transitions

  • paper_url: http://arxiv.org/abs/2307.14070
  • repo_url: None
  • paper_authors: Wenjie Xuan, Shanshan Zhao, Yu Yao, Juhua Liu, Tongliang Liu, Yixin Chen, Bo Du, Dacheng Tao
  • for: Addressing the label-noise problem in edge detection, where accurately labeling edges by hand is difficult, especially for large-scale training data.
  • methods: The proposed model learns Pixel-level Noise Transitions (PNT) to model the label-corruption process; a Pixel-wise Shift Learning (PSL) module estimates the transition from clean to noisy labels as a displacement field, complemented by a local edge density regularization term.
  • results: Experiments show that the model effectively relieves the impact of label noise while maintaining strong edge detection performance.
    Abstract Relying on large-scale training data with pixel-level labels, previous edge detection methods have achieved high performance. However, it is hard to manually label edges accurately, especially for large datasets, and thus the datasets inevitably contain noisy labels. This label-noise issue has been studied extensively for classification, while still remaining under-explored for edge detection. To address the label-noise issue for edge detection, this paper proposes to learn Pixel-level Noise Transitions to model the label-corruption process. To achieve it, we develop a novel Pixel-wise Shift Learning (PSL) module to estimate the transition from clean to noisy labels as a displacement field. Exploiting the estimated noise transitions, our model, named PNT-Edge, is able to fit the prediction to clean labels. In addition, a local edge density regularization term is devised to exploit local structure information for better transition learning. This term encourages learning large shifts for the edges with complex local structures. Experiments on SBD and Cityscapes demonstrate the effectiveness of our method in relieving the impact of label noise. Codes will be available at github.

Pre-Training with Diffusion models for Dental Radiography segmentation

  • paper_url: http://arxiv.org/abs/2307.14066
  • repo_url: None
  • paper_authors: Jérémy Rousseau, Christian Alaka, Emma Covili, Hippolyte Mayard, Laura Misrachi, Willy Au
  • for: Medical radiography segmentation, specifically dental radiography, where labeling is limited by its high cost.
  • methods: A straightforward pre-training method for semantic segmentation based on Denoising Diffusion Probabilistic Models (DDPM): a U-Net is first pre-trained with the DDPM objective and then fine-tuned on the segmentation task.
  • results: Experiments on dental radiograph segmentation show remarkable label efficiency, competitive with state-of-the-art pre-training methods, without requiring architectural modifications between pre-training and downstream tasks.
    Abstract Medical radiography segmentation, and specifically dental radiography, is highly limited by the cost of labeling which requires specific expertise and labor-intensive annotations. In this work, we propose a straightforward pre-training method for semantic segmentation leveraging Denoising Diffusion Probabilistic Models (DDPM), which have shown impressive results for generative modeling. Our straightforward approach achieves remarkable performance in terms of label efficiency and does not require architectural modifications between pre-training and downstream tasks. We propose to first pre-train a Unet by exploiting the DDPM training objective, and then fine-tune the resulting model on a segmentation task. Our experimental results on the segmentation of dental radiographs demonstrate that the proposed method is competitive with state-of-the-art pre-training methods.
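
The pre-training stage described here is the standard DDPM noise-prediction objective applied to a U-Net. A minimal sketch of one training step under that objective is shown below, with `unet` acting as a placeholder for any timestep-conditioned network; the actual pre-training and fine-tuning schedules are not specified in the abstract, so nothing beyond the loss itself is implied.

```python
import torch
import torch.nn.functional as F

def ddpm_pretraining_step(unet, x0, betas):
    """One DDPM pre-training step: add noise at a random timestep and
    train the U-Net to predict that noise (epsilon-prediction objective)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    b = x0.shape[0]
    t = torch.randint(0, betas.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)

    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise  # forward diffusion

    pred_noise = unet(x_t, t)                 # timestep-conditioned U-Net (placeholder)
    return F.mse_loss(pred_noise, noise)      # simple DDPM training loss

# After pre-training on unlabeled radiographs, the same U-Net weights are
# fine-tuned on the (small) labeled set with a standard segmentation loss.
```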

ECO: Ensembling Context Optimization for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.14063
  • repo_url: None
  • paper_authors: Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo
  • for: Few-shot image classification with vision-language models, using textual prompts and the zero-shot transfer capability of CLIP.
  • methods: An ensembling approach that learns multiple diverse (and possibly shorter) textual prompt contexts instead of relying on a single trainable prompt.
  • results: Learning an ensemble of prompts considerably and consistently improves few-shot classification with no additional cost at inference time, demonstrated on 11 benchmarks.
    Abstract Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
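
The core ensembling idea can be illustrated with the openai/clip package: compute text features for several prompt contexts per class, average them, and classify against the averaged features. In ECO the contexts are learned token embeddings rather than the fixed templates used below, so this sketch only mirrors the inference-time behavior under that simplifying assumption; class names are examples.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

classnames = ["airplane", "dog", "strawberry"]            # example classes
contexts = ["a photo of a {}.",                           # ensemble of prompt
            "a cropped photo of the {}.",                 # contexts; in ECO these
            "a low resolution photo of a {}."]            # would be learned vectors

@torch.no_grad()
def ensemble_text_features():
    feats = []
    for ctx in contexts:
        tokens = clip.tokenize([ctx.format(c) for c in classnames]).to(device)
        f = model.encode_text(tokens)
        feats.append(f / f.norm(dim=-1, keepdim=True))
    return torch.stack(feats).mean(dim=0)                 # average over the ensemble

@torch.no_grad()
def classify(image):
    img = preprocess(image).unsqueeze(0).to(device)
    img_f = model.encode_image(img)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_f @ ensemble_text_features().t()
    return logits.softmax(dim=-1)
```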

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models

  • paper_url: http://arxiv.org/abs/2307.14061
  • repo_url: https://github.com/Zoky-2020/Set-level_Guidance_Attack
  • paper_authors: Dong Lu, Zhiqiang Wang, Teng Wang, Weili Guan, Hongchang Gao, Feng Zheng
  • for: Investigating the adversarial transferability of recent vision-language pre-training (VLP) models.
  • methods: A highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages cross-modal interactions and incorporates alignment-preserving augmentation with cross-modal guidance.
  • results: SGA generates adversarial examples that transfer strongly across different VLP models on multiple downstream vision-language tasks; for example, on image-text retrieval it enhances the success rate of transfer attacks from ALBEF to TCL by at least 9.78% and up to 30.21% over the state of the art.
    Abstract Vision-language pre-training (VLP) models have shown vulnerability to adversarial examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately transferred to attack other black-box models. However, existing work has mainly focused on investigating white-box attacks. In this paper, we present the first study to investigate the adversarial transferability of recent VLP models. We observe that existing methods exhibit much lower transferability, compared to the strong attack performance in white-box settings. The transferability degradation is partly caused by the under-utilization of cross-modal interactions. Particularly, unlike unimodal learning, VLP models rely heavily on cross-modal interactions and the multimodal alignments are many-to-many, e.g., an image can be described in various natural languages. To this end, we propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Experimental results demonstrate that SGA could generate adversarial examples that can strongly transfer across different VLP models on multiple downstream vision-language tasks. On image-text retrieval, SGA significantly enhances the attack success rate for transfer attacks from ALBEF to TCL by a large margin (at least 9.78% and up to 30.21%), compared to the state-of-the-art.
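
A rough, simplified sketch of the set-level flavor of such an attack is given below: a PGD-style perturbation pushes the image embedding away from a whole set of matched caption embeddings rather than a single pair. The encoder interface, step sizes, and loss are assumptions made for illustration; the released SGA implementation (linked above) differs, for instance by adding alignment-preserving augmentation and cross-modal guidance.

```python
import torch

def set_level_image_attack(image, caption_embs, encode_image,
                           eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style sketch of a set-level image attack.

    image        : (1, 3, H, W) tensor in [0, 1].
    caption_embs : (K, D) L2-normalized embeddings of the matched captions.
    encode_image : callable returning an L2-normalized image embedding (assumed).
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = encode_image(image + delta)
        # Reduce alignment with *every* caption in the set, not just one pair.
        loss = (emb @ caption_embs.t()).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()      # descend on the similarity
            delta.clamp_(-eps, eps)                 # stay within the L-inf budget
            delta.copy_((image + delta).clamp(0.0, 1.0) - image)  # valid pixel range
            delta.grad.zero_()
    return (image + delta).detach()
```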

Towards Establishing Systematic Classification Requirements for Automated Driving

  • paper_url: http://arxiv.org/abs/2307.14058
  • repo_url: None
  • paper_authors: Ken T. Mori, Trent Brown, Steven Peters
  • for: Defining a structured method for deriving consistent classification requirements for perception in automated driving.
  • methods: Legal categories are first identified from behavioral requirements for the vehicle; this structure is then substantiated by considering collision safety of objects and perceptual categories, yielding a classification hierarchy.
  • results: Applying the method to an exemplary legal text and comparing the result with benchmark dataset categories shows limited agreement, indicating the need to explicitly consider legal requirements regarding perception.
    Abstract Despite the presence of the classification task in many different benchmark datasets for perception in the automotive domain, few efforts have been undertaken to define consistent classification requirements. This work addresses the topic by proposing a structured method to generate a classification structure. First, legal categories are identified based on behavioral requirements for the vehicle. This structure is further substantiated by considering the two aspects of collision safety for objects as well as perceptual categories. A classification hierarchy is obtained by applying the method to an exemplary legal text. A comparison of the results with benchmark dataset categories shows limited agreement. This indicates the necessity for explicit consideration of legal requirements regarding perception.

Unite-Divide-Unite: Joint Boosting Trunk and Structure for High-accuracy Dichotomous Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.14052
  • repo_url: https://github.com/pjlallen/udun
  • paper_authors: Jialun Pei, Zhangjun Zhou, Yueming Jin, He Tang, Pheng-Ann Heng
  • for: High-accuracy dichotomous image segmentation (DIS), i.e., pinpointing category-agnostic foreground objects in natural scenes.
  • methods: The Unite-Divide-Unite Network (UDUN) restructures and bipartitely arranges complementary features to jointly boost trunk and structure identification.
  • results: UDUN outperforms state-of-the-art competitors on all six evaluation metrics on overall DIS-TE (e.g., 0.772 weighted F-measure and 977 HCE) and achieves real-time inference at 65.3 fps with 1024x1024 input.
    Abstract High-accuracy Dichotomous Image Segmentation (DIS) aims to pinpoint category-agnostic foreground objects from natural scenes. The main challenge for DIS involves identifying the highly accurate dominant area while rendering detailed object structure. However, directly using a general encoder-decoder architecture may result in an oversupply of high-level features and neglect the shallow spatial information necessary for partitioning meticulous structures. To fill this gap, we introduce a novel Unite-Divide-Unite Network (UDUN) that restructures and bipartitely arranges complementary features to simultaneously boost the effectiveness of trunk and structure identification. The proposed UDUN proceeds from several strengths. First, a dual-size input feeds into the shared backbone to produce more holistic and detailed features while keeping the model lightweight. Second, a simple Divide-and-Conquer Module (DCM) is proposed to decouple multiscale low- and high-level features into our structure decoder and trunk decoder to obtain structure and trunk information respectively. Moreover, we design a Trunk-Structure Aggregation module (TSA) in our union decoder that performs cascade integration for uniform high-accuracy segmentation. As a result, UDUN performs favorably against state-of-the-art competitors in all six evaluation metrics on overall DIS-TE, i.e., achieving 0.772 weighted F-measure and 977 HCE. Using 1024*1024 input, our model enables real-time inference at 65.3 fps with ResNet-18.

3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing Capability

  • paper_url: http://arxiv.org/abs/2307.14051
  • repo_url: https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser
  • paper_authors: Ruowei Wang, Yu Liu, Pei Su, Jianwei Zhang, Qijun Zhao
  • for: A 3D generative model that exploits semantic attributes for category-specific 3D shape generation and editing, preserving the semantic consistency of shape structure during 3D content creation.
  • methods: Implicit functions serve as the 3D shape representation; a novel latent-space GAN is combined with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes, where each dimension corresponds to a particular semantic attribute that can be edited by traversing its coefficients.
  • results: Experiments show that the method produces plausible shapes with complex structures and enables editing of their semantic attributes; code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser.
    Abstract Shape generation is the practice of producing 3D shapes as various representations for 3D content creation. Previous studies on 3D shape generation have focused on shape quality and structure, without or less considering the importance of semantic information. Consequently, such generative models often fail to preserve the semantic consistency of shape structure or enable manipulation of the semantic attributes of shapes during generation. In this paper, we proposed a novel semantic generative model named 3D Semantic Subspace Traverser that utilizes semantic attributes for category-specific 3D shape generation and editing. Our method utilizes implicit functions as the 3D shape representation and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes. Each dimension of the subspace corresponds to a particular semantic attribute, and we can edit the attributes of generated shapes by traversing the coefficients of those dimensions. Experimental results demonstrate that our method can produce plausible shapes with complex structures and enable the editing of semantic attributes. The code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser

Controllable Guide-Space for Generalizable Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.14039
  • repo_url: None
  • paper_authors: Ying Guo, Cheng Zhen, Pengfei Yan
  • for: Improving the generalization of face forgery detection.
  • methods: A controllable guide-space (GS) method that enhances the discrimination between forgery domains and enlarges the distance between real and forgery domains; a decoupling module weakens the interference of forgery-irrelevant correlations, and the decision boundary manifold is adjusted according to the clustering degree of same-domain features within the neighborhood.
  • results: Extensive experiments in multiple in-domain and cross-domain settings confirm state-of-the-art generalization.
    Abstract Recent studies on face forgery detection have shown satisfactory performance for methods involved in training datasets, but are not ideal enough for unknown domains. This motivates many works to improve the generalization, but forgery-irrelevant information, such as image background and identity, still exists in different domain features and causes unexpected clustering, limiting the generalization. In this paper, we propose a controllable guide-space (GS) method to enhance the discrimination of different forgery domains, so as to increase the forgery relevance of features and thereby improve the generalization. The well-designed guide-space can simultaneously achieve both the proper separation of forgery domains and the large distance between real-forgery domains in an explicit and controllable manner. Moreover, for better discrimination, we use a decoupling module to weaken the interference of forgery-irrelevant correlations between domains. Furthermore, we make adjustments to the decision boundary manifold according to the clustering degree of the same domain features within the neighborhood. Extensive experiments in multiple in-domain and cross-domain settings confirm that our method can achieve state-of-the-art generalization.

Human-centric Scene Understanding for 3D Large-scale Scenarios

  • paper_url: http://arxiv.org/abs/2307.14392
  • repo_url: https://github.com/4dvlab/hucenlife
  • paper_authors: Yiteng Xu, Peishan Cong, Yichen Yao, Runnan Chen, Yuenan Hou, Xinge Zhu, Xuming He, Jingyi Yu, Yuexin Ma
  • for: Providing a large-scale multi-modal dataset for human-centric scene understanding to advance 3D perception.
  • methods: Multi-modal LiDAR-based capture of diverse daily-life scenarios with rich, fine-grained annotations, plus novel modules for LiDAR-based segmentation and action recognition designed for large-scale human-centric scenes.
  • results: The dataset supports many 3D perception tasks (segmentation, detection, action recognition, etc.) with benchmarks for each, and the proposed modules achieve state-of-the-art performance on human-centric scene understanding.
    Abstract Human-centric scene understanding is significant for real-world applications, but it is extremely challenging due to the existence of diverse human poses and actions, complex human-environment interactions, severe occlusions in crowds, etc. In this paper, we present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife, which is collected in diverse daily-life scenarios with rich and fine-grained annotations. Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc., and we also provide benchmarks for these tasks to facilitate related research. In addition, we design novel modules for LiDAR-based segmentation and action recognition, which are more applicable for large-scale human-centric scenarios and achieve state-of-the-art performance.

Consensus-Adaptive RANSAC

  • paper_url: http://arxiv.org/abs/2307.14030
  • repo_url: https://github.com/cavalli1234/ca-ransac
  • paper_authors: Luca Cavalli, Daniel Barath, Marc Pollefeys, Viktor Larsson
  • for: Improving the accuracy of robust estimation by allowing RANSAC to adapt to different datasets and tasks.
  • methods: A novel attention layer operating on batches of point-to-model residuals, together with a lightweight one-step transformer, lets the framework explore the parameter space while taking the residuals seen so far into account; the resulting per-point state guides both minimal sampling between iterations and model refinement.
  • results: On essential and fundamental matrix estimation across several indoor and outdoor datasets, the method outperforms state-of-the-art estimators by a significant margin with only a small runtime overhead, and the trained model generalizes well across datasets and tasks.
    Abstract RANSAC and its variants are widely used for robust estimation, however, they commonly follow a greedy approach to finding the highest scoring model while ignoring other model hypotheses. In contrast, Iteratively Reweighted Least Squares (IRLS) techniques gradually approach the model by iteratively updating the weight of each correspondence based on the residuals from previous iterations. Inspired by these methods, we propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer. The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer. This rich state then guides the minimal sampling between iterations as well as the model refinement. We evaluate the proposed approach on essential and fundamental matrix estimation on a number of indoor and outdoor datasets. It outperforms state-of-the-art estimators by a significant margin adding only a small runtime overhead. Moreover, we demonstrate good generalization properties of our trained model, indicating its effectiveness across different datasets and tasks. The proposed attention mechanism and one-step transformer provide an adaptive behavior that enhances the performance of RANSAC, making it a more effective tool for robust estimation. Code is available at https://github.com/cavalli1234/CA-RANSAC.
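
The abstract contrasts RANSAC with IRLS, which re-estimates the model while reweighting each correspondence from its previous residuals; the small sketch below shows that background idea for robust line fitting. CA-RANSAC replaces this hand-crafted reweighting with a learned attention over batches of residuals, so the code illustrates only the classical baseline, not the proposed method, and the Huber threshold is an arbitrary choice.

```python
import numpy as np

def irls_line_fit(points, iters=20, c=1.0):
    """Iteratively Reweighted Least Squares for fitting y = a*x + b.

    Each iteration refits a weighted least-squares model, with per-point
    weights derived from the previous residuals (Huber-style weighting).
    """
    x, y = points[:, 0], points[:, 1]
    A = np.stack([x, np.ones_like(x)], axis=1)
    w = np.ones_like(x)
    for _ in range(iters):
        sw = np.sqrt(w)                                     # weighted LS via scaling
        params, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        residuals = np.abs(A @ params - y)
        # Huber-style weights: inliers keep full weight, outliers are downweighted.
        w = np.where(residuals <= c, 1.0, c / np.maximum(residuals, 1e-12))
    return params  # slope a and intercept b of y = a*x + b
```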

Topologically-Regularized Multiple Instance Learning for Red Blood Cell Disease Classification

  • paper_url: http://arxiv.org/abs/2307.14025
  • repo_url: None
  • paper_authors: Salome Kazeminia, Ario Sadafi, Asya Makhro, Anna Bogdanova, Carsten Marr, Bastian Rieck
  • for: Automated classification of rare anemia disorders from microscopic images of single red blood cells.
  • methods: A topology-based approach that extracts multi-scale topological features from bags of single red blood cell images and uses them to regularize the multiple-instance learning (MIL) model, enforcing the preservation of characteristic topological properties of the data.
  • results: On a dataset of 71 patients with rare anemia disorders and 521 microscopic images of red blood cells, topological regularization improves automated classification performance by more than 3%; this is the first approach that uses topological properties to regularize the MIL process.
    Abstract Diagnosing rare anemia disorders using microscopic images is challenging for skilled specialists and machine-learning methods alike. Due to thousands of disease-relevant cells in a single blood sample, this constitutes a complex multiple-instance learning (MIL) problem. While the spatial neighborhood of red blood cells is not meaningful per se, the topology, i.e., the geometry of blood samples as a whole, contains informative features to remedy typical MIL issues, such as vanishing gradients and overfitting when training on limited data. We thus develop a topology-based approach that extracts multi-scale topological features from bags of single red blood cell images. The topological features are used to regularize the model, enforcing the preservation of characteristic topological properties of the data. Applied to a dataset of 71 patients suffering from rare anemia disorders with 521 microscopic images of red blood cells, our experiments show that topological regularization is an effective method that leads to more than 3% performance improvements for the automated classification of rare anemia disorders based on single-cell images. This is the first approach that uses topological properties for regularizing the MIL process.
    摘要 利用显微图像诊断罕见贫血疾病,对资深专家和机器学习方法而言都具有挑战性。由于单个血液样本中含有数千个与疾病相关的细胞,这构成了一个复杂的多示例学习(MIL)问题。红细胞的空间邻域本身并无直接意义,但血液样本整体的拓扑(几何)结构包含有用的特征,有助于缓解典型的MIL问题,例如在有限数据上训练时的梯度消失和过拟合。因此,我们开发了一种基于拓扑的方法,从单个红细胞图像构成的包中提取多尺度拓扑特征,并利用这些拓扑特征对模型进行正则化,强制其保留数据的特征性拓扑性质。在包含71名罕见贫血疾病患者、521张红细胞显微图像的数据集上,实验表明拓扑正则化是一种有效的方法,可将基于单细胞图像的罕见贫血疾病自动分类性能提升3%以上。这是首个利用拓扑性质来正则化MIL过程的方法。
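
The abstract does not spell out the exact multi-scale topological features, so the sketch below illustrates one standard choice under that assumption: 0-dimensional persistence of a bag's feature cloud (equivalent to the edge lengths of its minimum spanning tree), compared between the raw bag and its learned embedding as a regularizer. Function names and the loss form are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def zero_dim_persistence(features, k=16):
    """0-dim persistence (death times) of a Vietoris-Rips filtration over a bag of
    per-cell features; these equal the MST edge lengths of the pairwise-distance
    graph. Returns the k most persistent values, zero-padded."""
    dist = squareform(pdist(features))              # (n, n) pairwise distances
    mst = minimum_spanning_tree(dist).toarray()
    deaths = np.sort(mst[mst > 0])[::-1]            # finite bars, most persistent first
    out = np.zeros(k)
    out[: min(k, len(deaths))] = deaths[:k]
    return out

def topo_regularizer(input_bag, embedded_bag, k=16):
    """Hypothetical penalty: keep the topological summary of the learned embedding
    close to that of the raw bag (a sketch, not the authors' exact loss)."""
    return float(np.mean((zero_dim_persistence(input_bag, k)
                          - zero_dim_persistence(embedded_bag, k)) ** 2))

bag_in = np.random.rand(50, 64)    # 50 red-blood-cell images, 64-d raw descriptors
bag_emb = np.random.rand(50, 32)   # their learned embeddings
loss_topo = topo_regularizer(bag_in, bag_emb)
```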

Retinotopy Inspired Brain Encoding Model and the All-for-One Training Recipe

  • paper_url: http://arxiv.org/abs/2307.14021
  • repo_url: None
  • paper_authors: Huzheng Yang, Jianbo Shi, James Gee
  • for: 预测脑细胞响应图像刺激,实现脑信号捕捉技术的复制。
  • methods: 引入多种多样性优势,包括个体脑功能多样性、个体差异和成像模式差异,并通过分解大型模型问题而解决困难。
  • results: 采用多种多样性,具有3D脑图像映射的学习,并在五个公共数据集上预训练一个全面的脑编码模型,以及证明该模型可以作为视觉后处理模型的替换。进一步应用脑解码。
    Abstract Brain encoding models aim to predict brain voxel-wise responses to stimuli images, replicating brain signals captured by neuroimaging techniques. There is a large volume of publicly available data, but training a comprehensive brain encoding model is challenging. The main difficulties stem from a) diversity within individual brain, with functional heterogeneous brain regions; b) diversity of brains from different subjects, due to genetic and developmental differences; c) diversity of imaging modalities and processing pipelines. We use this diversity to our advantage by introducing the All-for-One training recipe, which divides the challenging one-big-model problem into multiple small models, with the small models aggregating the knowledge while preserving the distinction between the different functional regions. Agnostic of the training recipe, we use biological knowledge of the brain, specifically retinotopy, to introduce inductive bias to learn a 3D brain-to-image mapping that ensures a) each neuron knows which image regions and semantic levels to gather information, and b) no neurons are left behind in the model. We pre-trained a brain encoding model using over one million data points from five public datasets spanning three imaging modalities. To the best of our knowledge, this is the most comprehensive brain encoding model to the date. We demonstrate the effectiveness of the pre-trained model as a drop-in replacement for commonly used vision backbone models. Furthermore, we demonstrate the application of the model to brain decoding. Code and the model checkpoint will be made available.
    摘要 脑编码模型旨在预测大脑各体素对刺激图像的响应,复现由神经影像技术捕获的脑信号。目前有大量公开数据可用,但训练一个全面的脑编码模型仍具有挑战性,主要困难来自:a) 个体大脑内部的多样性,即功能异质的脑区;b) 不同被试大脑之间由基因和发育差异带来的多样性;c) 成像模态和处理流程的多样性。我们将这些多样性化为优势,提出 All-for-One 训练方案,把困难的"单个大模型"问题分解为多个小模型,小模型在汇聚知识的同时保留不同功能区之间的差异。与训练方案无关,我们还利用脑科学知识(具体为视网膜拓扑映射 retinotopy)引入归纳偏置,学习一个 3D 脑-图像映射,以确保:a) 每个神经元知道应从哪些图像区域和语义层次收集信息;b) 模型中没有神经元被遗漏。我们使用来自五个公开数据集、覆盖三种成像模态的逾一百万个数据点预训练了一个脑编码模型;据我们所知,这是迄今最全面的脑编码模型。我们证明该预训练模型可以直接替换常用的视觉骨干模型,并进一步展示了其在脑解码上的应用。代码和模型检查点将会公开。

RPG-Palm: Realistic Pseudo-data Generation for Palmprint Recognition

  • paper_url: http://arxiv.org/abs/2307.14016
  • repo_url: None
  • paper_authors: Lei Shen, Jianlong Jin, Ruixin Zhang, Huaen Li, Kai Zhao, Yingyi Zhang, Jingyun Zhang, Shouhong Ding, Yang Zhao, Wei Jia
  • for: 提高palmprint认证模型性能,addressing the lack of large-scale public palmprint datasets.
  • methods: 提出了一种新的真实伪掌纹生成(RPG)模型,利用条件调制生成器提升类内多样性,提出身份感知损失以在非配对训练下保证身份一致性,并改进贝塞尔掌纹折痕生成策略以保证身份独立性。
  • results: 实验结果表明,合成数据预训练可显著提升掌纹识别模型的性能。例如,在 $1:1$ 和 $1:3$ 开集协议下(TAR@FAR=1e-6),该模型比最先进的 BézierPalm 分别提升超过 $5%$ 和 $14%$;仅使用 $10%$ 的真实训练数据时,该方法仍优于使用 $100%$ 真实训练数据的 ArcFace。
    Abstract Palmprint recently shows great potential in recognition applications as it is a privacy-friendly and stable biometric. However, the lack of large-scale public palmprint datasets limits further research and development of palmprint recognition. In this paper, we propose a novel realistic pseudo-palmprint generation (RPG) model to synthesize palmprints with massive identities. We first introduce a conditional modulation generator to improve the intra-class diversity. Then an identity-aware loss is proposed to ensure identity consistency against unpaired training. We further improve the B\'ezier palm creases generation strategy to guarantee identity independence. Extensive experimental results demonstrate that synthetic pretraining significantly boosts the recognition model performance. For example, our model improves the state-of-the-art B\'ezierPalm by more than $5\%$ and $14\%$ in terms of TAR@FAR=1e-6 under the $1:1$ and $1:3$ Open-set protocol. When accessing only $10\%$ of the real training data, our method still outperforms ArcFace with $100\%$ real training data, indicating that we are closer to real-data-free palmprint recognition.
    摘要 掌纹作为一种隐私友好且稳定的生物特征,近来在识别应用中展现出巨大潜力。然而,大规模公开掌纹数据集的缺乏限制了掌纹识别的进一步研究与发展。本文提出了一种新颖的真实伪掌纹生成(RPG)模型,用于合成具有海量身份的掌纹图像。我们首先引入条件调制生成器以提升类内多样性;随后提出身份感知损失,以在非配对训练下保证身份一致性;并进一步改进贝塞尔(Bézier)掌纹折痕生成策略,以保证身份独立性。大量实验结果表明,合成数据预训练能显著提升识别模型的性能:例如在 $1:1$ 和 $1:3$ 开集协议下(TAR@FAR=1e-6),我们的模型比最先进的 BézierPalm 分别提升超过 $5%$ 和 $14%$;当仅使用 $10%$ 的真实训练数据时,我们的方法仍优于使用 $100%$ 真实训练数据的 ArcFace,表明我们离无需真实数据的掌纹识别又近了一步。

Car-Studio: Learning Car Radiance Fields from Single-View and Endless In-the-wild Images

  • paper_url: http://arxiv.org/abs/2307.14009
  • repo_url: https://github.com/lty2226262/Car_studio
  • paper_authors: Tianyu Liu, Hao Zhao, Yang Yu, Guyue Zhou, Ming Liu
  • for: 研究人员希望通过自适应驾驶模拟器中的编辑功能来提高自适应驾驶系统的性能。
  • methods: 作者提出了一个从无约束的野外图像中学习并据此构建数据集的管道,并针对城市场景前景中的关键部分(车辆)设计了满足模拟器需求的辐射场。
  • results: 实验表明,该模型与基线相比具有竞争力,并逐步实现了可控的外观编辑功能。
    Abstract Compositional neural scene graph studies have shown that radiance fields can be an efficient tool in an editable autonomous driving simulator. However, previous studies learned within a sequence of autonomous driving datasets, resulting in unsatisfactory blurring when rotating the car in the simulator. In this letter, we propose a pipeline for learning unconstrained images and building a dataset from processed images. To meet the requirements of the simulator, which demands that the vehicle maintain clarity when the perspective changes and that the contour remains sharp from the background to avoid artifacts when editing, we design a radiation field of the vehicle, a crucial part of the urban scene foreground. Through experiments, we demonstrate that our model achieves competitive performance compared to baselines. Using the datasets built from in-the-wild images, our method gradually presents a controllable appearance editing function. We will release the dataset and code on https://lty2226262.github.io/car-studio/ to facilitate further research in the field.
    摘要 组合式神经场景图研究表明,辐射场可以成为可编辑自动驾驶模拟器中的高效工具。然而,以往的研究是在一系列自动驾驶数据集上学习的,导致在模拟器中旋转车辆时出现令人不满意的模糊。在这封信中,我们提出了一个从无约束图像中学习并据此构建数据集的管道。为满足模拟器的要求,即视角变化时车辆保持清晰、轮廓与背景之间保持锐利以避免编辑时产生伪影,我们专门设计了车辆(城市场景前景的关键部分)的辐射场。实验表明,我们的模型与基线相比具有竞争力。借助由野外图像构建的数据集,我们的方法逐步实现了可控的外观编辑功能。我们将在 https://lty2226262.github.io/car-studio/ 发布数据集和代码,以促进该领域的进一步研究。

Adaptive Frequency Filters As Efficient Global Token Mixers

  • paper_url: http://arxiv.org/abs/2307.14008
  • repo_url: https://github.com/microsoft/TokenMixers
  • paper_authors: Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Zheng-Jun Zha, Yan Lu, Baining Guo
  • for: 本文 targets 广泛视觉任务中的效率和准确性问题,旨在提出一种有效的减少计算成本的方法,以便在移动设备上部署深度学习模型。
  • methods: 本文将传统的卷积定理引入深度学习,并发现自适应频率滤波器可以充当高效的全局令牌混合器(token mixer)。
  • results: 实验表明,所提出的 AFF 令牌混合器能在降低计算成本的同时保持乃至提升准确率;以其为主要算子构建的 AFFNet 在广泛的视觉任务中取得了更优的精度-效率平衡。
    Abstract Recent vision transformers, large-kernel CNNs and MLPs have attained remarkable successes in broad vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployments, especially on mobile devices, still suffer from noteworthy challenges due to the heavy computational costs of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply conventional convolution theorem to deep learning for addressing this and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which mathematically equals to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieve superior accuracy and efficiency trade-offs compared to other lightweight network designs on broad visual tasks, including visual recognition and dense prediction tasks.
    摘要 近年来,视觉transformer、大核CNN和MLP凭借其在全局范围内的有效信息融合,在广泛的视觉任务中取得了显著成功。然而,由于自注意力机制、大卷积核或全连接层带来的高计算开销,它们在移动设备等场景下的高效部署仍面临不小挑战。在本工作中,我们将传统的卷积定理应用于深度学习以解决这一问题,并揭示自适应频率滤波器可以充当高效的全局令牌混合器。基于这一见解,我们提出了自适应频率滤波(AFF)令牌混合器:该神经算子先通过傅里叶变换把潜在表示转换到频域,再通过逐元素相乘执行语义自适应的频率滤波,这在数学上等价于在原潜在空间中用一个与该潜在表示空间分辨率同样大小的动态卷积核进行令牌混合。我们以 AFF 令牌混合器为主要神经算子构建了轻量级网络 AFFNet。大量实验证明了所提 AFF 令牌混合器的有效性,并表明 AFFNet 在包括视觉识别和密集预测在内的广泛视觉任务上,相比其他轻量级网络设计取得了更优的精度与效率平衡。
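
A minimal sketch of the operation the abstract describes: transform the latent to the frequency domain, multiply elementwise by a content-adaptive filter, and transform back, which corresponds to token mixing with a dynamic kernel as large as the feature map. The filter-prediction branch and names below are assumptions, not the AFFNet implementation.

```python
import torch
import torch.nn as nn

class AFFTokenMixerSketch(nn.Module):
    """Sketch of adaptive frequency filtering as a global token mixer:
    elementwise multiplication in the frequency domain equals a circular
    convolution with a kernel as large as the feature map in space."""

    def __init__(self, channels):
        super().__init__()
        # predict a per-sample, per-channel complex filter from the features (one
        # possible instantiation; the real branch design is not reproduced here)
        self.to_filter = nn.Conv2d(channels, 2 * channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W)
        freq = torch.fft.fft2(x, norm="ortho")   # complex spectrum
        fr, fi = self.to_filter(x).chunk(2, dim=1)
        filt = torch.complex(fr, fi)             # semantic-adaptive frequency filter
        mixed = torch.fft.ifft2(freq * filt, norm="ortho").real
        return x + mixed                         # residual connection

x = torch.randn(2, 64, 32, 32)
y = AFFTokenMixerSketch(64)(x)                   # same shape, globally mixed tokens
```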

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2307.14006
  • repo_url: None
  • paper_authors: Xinshun Wang, Qiongjie Cui, Chen Chen, Shen Zhao, Mengyuan Liu
  • for: 这篇论文的目的是提出一个多阶段框架,以便更好地预测人类动作。
  • methods: 该方法提出了一种新的统一图建模方式,相比分离的时空建模能够实现直接且有效的特征传播。
  • results: 实验结果显示,该方法可以在Human 3.6M、CMU Mocap和3DPW datasets上取得最佳性能,并且比之前的方法有更好的预测性。
    Abstract Existing Graph Convolutional Networks to achieve human motion prediction largely adopt a one-step scheme, which output the prediction straight from history input, failing to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets representative of each transition. Each snippet can be reconstructed from its starting and ending poses referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we propose to first predict only the transitional poses. Then we use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally we refine them to produce the final prediction output. To implement the network, we propose a novel unified graph modeling, which allows for direct and effective feature propagation compared to existing approaches which rely on separate space-time modeling. Extensive experiments on Human 3.6M, CMU Mocap and 3DPW datasets verify the effectiveness of our method which achieves state-of-the-art performance.
    摘要 现有用于人体运动预测的图卷积网络大多采用一步式方案,直接从历史输入输出预测结果,未能利用人体运动的模式。我们观察到人体运动具有过渡性模式,可以被划分为代表各个过渡的片段,而每个片段都可以由其起始和结束姿态(称为过渡姿态)重建。我们提出了"片段到运动"的多阶段框架,将运动预测分解为更易完成的子任务。每个子任务包含三个模块:过渡姿态预测、片段重建和片段到运动预测。具体而言,我们首先仅预测过渡姿态,然后利用它们重建相应的片段,得到对真实运动序列的近似,最后再进行细化以产生最终预测输出。在网络实现上,我们提出了一种新的统一图建模方式,相比依赖分离时空建模的现有方法,它能实现直接而有效的特征传播。在 Human3.6M、CMU Mocap 和 3DPW 数据集上的大量实验验证了我们方法的有效性,并达到了最先进的性能。

Causal reasoning in typical computer vision tasks

  • paper_url: http://arxiv.org/abs/2307.13992
  • repo_url: None
  • paper_authors: Kexuan Zhang, Qiyu Sun, Chaoqiang Zhao, Yang Tang
  • for: This paper aims to comprehensively review existing causal methods in typical vision and vision-language tasks, and provide future roadmaps for the development and application of causal theory in computer vision.
  • methods: The paper uses a causal paradigm to model the intrinsic causal structure of vision and vision-language tasks, and reviews existing causal methods in semantic segmentation, object detection, and image captioning.
  • results: The paper discusses the advantages of using causality in deep learning-based computer vision tasks and proposes future roadmaps for the development and application of causal theory in other complex scenes and systems.
    Abstract Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision has contributed to tremendous growth in areas like autonomous driving and robotics. Despite being the basis of deep learning, such correlation is not stable and is susceptible to uncontrolled factors. In the absence of the guidance of prior knowledge, statistical correlations can easily turn into spurious correlations and cause confounders. As a result, researchers are now trying to enhance deep learning methods with causal theory. Causal theory models the intrinsic causal structure unaffected by data bias and is effective in avoiding spurious correlations. This paper aims to comprehensively review the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms will be summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenes and systems.
    摘要 深度学习已经革命化人工智能领域。基于深度学习方法发现的统计相关性,计算机视觉在自动驾驶和机器人等领域带来了巨大的成长。然而,这种相关性并不稳定,容易受到外部因素的影响。在知识导向的指导下 absence,统计相关性可以轻易变成假 correlate 和干扰因素。因此,研究人员现在尝试通过 causal theory 增强深度学习方法。causal theory 模型了不受数据偏见影响的内在 causal 结构,可以减少假 correlate 的出现。本文将对常见视觉和语言视觉任务,如 semantic segmentation、object detection 和 image captioning 等进行全面的review。causality 的优势和建立 causal 模型的方法将被总结。未来的路线图还将包括在其他复杂的场景和系统中应用 causal theory 的发展,以及促进 causal theory 的应用。

METAVerse: Meta-Learning Traversability Cost Map for Off-Road Navigation

  • paper_url: http://arxiv.org/abs/2307.13991
  • repo_url: None
  • paper_authors: Junwon Seo, Taekyung Kim, Seongyong Ahn, Kiho Kwak
  • for: 这篇论文的目的是为了实现自主导航在非道路环境中,准确地估算地形通行性。
  • methods: 这篇论文使用了元学习框架,通过自动驾驶数据收集自多种环境,训练了一个全球模型,以估算地形通行性。
  • results: 研究人员通过在多种地形上收集驾驶数据,训练了一个全球模型,可以准确地估算地形通行性,并且通过与控制器集成,实现了安全和稳定的自主导航。
    Abstract Autonomous navigation in off-road conditions requires an accurate estimation of terrain traversability. However, traversability estimation in unstructured environments is subject to high uncertainty due to the variability of numerous factors that influence vehicle-terrain interaction. Consequently, it is challenging to obtain a generalizable model that can accurately predict traversability in a variety of environments. This paper presents METAVerse, a meta-learning framework for learning a global model that accurately and reliably predicts terrain traversability across diverse environments. We train the traversability prediction network to generate a dense and continuous-valued cost map from a sparse LiDAR point cloud, leveraging vehicle-terrain interaction feedback in a self-supervised manner. Meta-learning is utilized to train a global model with driving data collected from multiple environments, effectively minimizing estimation uncertainty. During deployment, online adaptation is performed to rapidly adapt the network to the local environment by exploiting recent interaction experiences. To conduct a comprehensive evaluation, we collect driving data from various terrains and demonstrate that our method can obtain a global model that minimizes uncertainty. Moreover, by integrating our model with a model predictive controller, we demonstrate that the reduced uncertainty results in safe and stable navigation in unstructured and unknown terrains.
    摘要 自主导航在未知预期环境中需要准确地估计地形可行性。然而,在无结构环境中 traversability 估计受到多种因素的变化所带来的不确定性的影响,因此很难取得一个通用的模型,可以准确地预测不同环境中的地形可行性。这篇论文提出了 METAVerse,一个基于 meta-学 的框架,用于学习一个准确且可靠地预测地形可行性的全球模型。我们在自然supervised 的方式下,使用汽车-地面交互反馈来训练一个 dense 和连续值的 cost 图,从 LiDAR 点云中生成一个粗略的地形可行性预测结果。通过 meta-学 的使用,我们可以在多个环境中收集的驾驶数据上训练一个全球模型,以实现最小化估计uncertainty。在部署时,我们通过在线适应来迅速地适应当地环境,并且通过利用最近的交互经验来进行更新。我们通过收集不同 terrains 的驾驶数据,并进行了全面的评估,得到了一个准确且可靠的全球模型,并且通过与预测控制器集成,实现了安全和稳定的自主导航。

Hybrid Representation-Enhanced Sampling for Bayesian Active Learning in Musculoskeletal Segmentation of Lower Extremities

  • paper_url: http://arxiv.org/abs/2307.13986
  • repo_url: None
  • paper_authors: Ganping Li, Yoshito Otake, Mazen Soufi, Masashi Taniguchi, Masahide Yagi, Noriaki Ichihashi, Keisuke Uemura, Masaki Takao, Nobuhiko Sugano, Yoshinobu Sato
  • for: 该研究旨在提高医学影像标注的效率,利用贝叶斯主动学习(BAL)选择最具信息量的标注样本,以减少人工标注的时间和精力。
  • methods: 该研究采用一种混合表示增强采样法,选取兼具高密度和高多样性的不确定样本供人工修正,使其与未标注样本的相似性最大、与现有训练样本的相似性最小。
  • results: 实验结果显示,所提采样法在两个下肢(LE)数据集、两种采集规则下均优于或不劣于其他方法;消融实验进一步表明,在肌肉骨骼分割中,密度与多样性准则的组合优于单独使用任一准则。
    Abstract Purpose: Obtaining manual annotations to train deep learning (DL) models for auto-segmentation is often time-consuming. Uncertainty-based Bayesian active learning (BAL) is a widely-adopted method to reduce annotation efforts. Based on BAL, this study introduces a hybrid representation-enhanced sampling strategy that integrates density and diversity criteria to save manual annotation costs by efficiently selecting the most informative samples. Methods: The experiments are performed on two lower extremity (LE) datasets of MRI and CT images by a BAL framework based on Bayesian U-net. Our method selects uncertain samples with high density and diversity for manual revision, optimizing for maximal similarity to unlabeled instances and minimal similarity to existing training data. We assess the accuracy and efficiency using Dice and a proposed metric called reduced annotation cost (RAC), respectively. We further evaluate the impact of various acquisition rules on BAL performance and design an ablation study for effectiveness estimation. Results: The proposed method showed superiority or non-inferiority to other methods on both datasets across two acquisition rules, and quantitative results reveal the pros and cons of the acquisition rules. Our ablation study in volume-wise acquisition shows that the combination of density and diversity criteria outperforms solely using either of them in musculoskeletal segmentation. Conclusion: Our sampling method is proven efficient in reducing annotation costs in image segmentation tasks. The combination of the proposed method and our BAL framework provides a semi-automatic way for efficient annotation of medical image datasets.
    摘要 目的:获取用于训练深度学习(DL)自动分割模型的手动标注通常十分耗时。基于不确定性的贝叶斯主动学习(BAL)是一种被广泛采用的降低标注工作量的方法。基于BAL,本研究提出了一种融合密度与多样性准则的混合表示增强采样策略,通过高效地选取信息量最大的样本来节省手动标注成本。方法:实验在两个下肢(LE)MRI和CT图像数据集上进行,采用基于贝叶斯U-net的BAL框架。我们的方法选取不确定性高、同时兼具高密度和高多样性的样本供人工修正,优化目标是与未标注样本的相似性最大、与现有训练数据的相似性最小。我们分别用Dice和提出的"降低标注成本"(RAC)指标评估精度与效率,进一步评估了不同采集规则对BAL性能的影响,并设计了消融实验来估计各部分的有效性。结果:在两种采集规则下,所提方法在两个数据集上均优于或不劣于其他方法,定量结果揭示了各采集规则的优缺点。按体积采集的消融实验表明,在肌肉骨骼分割中,密度与多样性准则的组合优于单独使用任一准则。结论:所提采样方法被证明能有效降低图像分割任务中的标注成本;与我们的BAL框架结合,可为医学图像数据集提供一种高效的半自动标注方式。
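
A rough sketch of a density-plus-diversity acquisition rule of the kind described above, assuming per-volume embeddings and Bayesian uncertainty scores are already available; the paper's exact criteria and weighting may differ.

```python
import numpy as np

def hybrid_select(emb_unlabeled, emb_labeled, uncertainty, n_select=10,
                  w_density=1.0, w_diversity=1.0):
    """Sketch of a hybrid acquisition rule: rank uncertain candidates by
    similarity to the unlabeled pool (density) minus similarity to the
    already-labeled set (diversity)."""
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a @ b.T

    density = cosine(emb_unlabeled, emb_unlabeled).mean(axis=1)    # representativeness
    diversity = -cosine(emb_unlabeled, emb_labeled).max(axis=1)    # distance to training set
    score = uncertainty + w_density * density + w_diversity * diversity
    return np.argsort(score)[::-1][:n_select]                      # indices to annotate

emb_u = np.random.rand(500, 128)   # embeddings of unlabeled volumes
emb_l = np.random.rand(40, 128)    # embeddings of the current training set
unc = np.random.rand(500)          # e.g. MC-dropout uncertainty per volume
to_annotate = hybrid_select(emb_u, emb_l, unc)
```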

Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models

  • paper_url: http://arxiv.org/abs/2307.13985
  • repo_url: None
  • paper_authors: Ryota Iijima, Miki Tanaka, Sayaka Shiota, Hitoshi Kiya
  • for: 防御深度神经网络(DNNs)受到敌意攻击(Adversarial Examples,AE)的袭击。
  • methods: 提出了一种随机ensemble的加密ViT模型来实现更加可靠的防御。
  • results: 实验表明,所提方案不仅对黑盒攻击、也对白盒攻击比传统方法更加鲁棒。
    Abstract Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means AEs generated for a source model can fool another black-box model (target model) with a non-trivial probability. In previous studies, it was confirmed that the vision transformer (ViT) is more robust against the property of adversarial transferability than convolutional neural network (CNN) models such as ConvMixer, and moreover encrypted ViT is more robust than ViT without any encryption. In this article, we propose a random ensemble of encrypted ViT models to achieve much more robust models. In experiments, the proposed scheme is verified to be more robust against not only black-box attacks but also white-box ones than convention methods.
    摘要 众所周知,深度神经网络(DNN)容易受到对抗样本(AE)的攻击。此外,对抗样本还具有对抗可迁移性,即针对源模型生成的对抗样本能够以不可忽略的概率欺骗另一个黑盒模型(目标模型)。先前的研究证实,视觉transformer(ViT)对对抗可迁移性的鲁棒性优于ConvMixer等卷积神经网络(CNN)模型,而加密后的ViT又比未加密的ViT更加鲁棒。在本文中,我们提出一种由加密ViT模型组成的随机集成,以获得更加鲁棒的模型。实验表明,所提方案不仅对黑盒攻击、也对白盒攻击比传统方法更加鲁棒。
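
The encryption itself comes from prior learnable-encryption work and is not detailed in the abstract; below is a minimal sketch of one common choice, key-dependent scrambling of the ViT patch grid, with a separate secret key per ensemble member. It is an illustration under that assumption, not the exact scheme used in the paper.

```python
import torch

def block_scramble(images, key, patch=16):
    """Sketch of key-based perceptual encryption compatible with ViT patching:
    shuffle the patch grid with a secret key."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    perm = torch.randperm(gh * gw, generator=torch.Generator().manual_seed(key))
    patches = images.reshape(b, c, gh, patch, gw, patch)
    patches = patches.permute(0, 1, 2, 4, 3, 5).reshape(b, c, gh * gw, patch, patch)
    patches = patches[:, :, perm]                      # key-dependent permutation
    patches = patches.reshape(b, c, gh, gw, patch, patch).permute(0, 1, 2, 4, 3, 5)
    return patches.reshape(b, c, h, w)

x = torch.rand(4, 3, 224, 224)
keys = [11, 22, 33]                                    # one secret key per ensemble member
views = [block_scramble(x, k) for k in keys]           # each fed to its own encrypted ViT
# at inference, a randomly picked member of the ensemble processes the query image
```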

Analysis of Video Quality Datasets via Design of Minimalistic Video Quality Models

  • paper_url: http://arxiv.org/abs/2307.13981
  • repo_url: None
  • paper_authors: Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, Kede Ma
  • for: 这篇论文的目的是为了评估视频质量评估(BVQA)模型的进步,以及现有的视频质量评估数据集的评估。
  • methods: 这篇论文使用了一种简单的BVQA模型,包括视频预处理器、空间质量分析器、可选的时间质量分析器和质量回归器,并对八个视频质量评估数据集进行了比较。
  • results: 研究发现,大多数数据集受到容易的数据集问题的影响,一些数据集甚至可以使用盲图质量评估(BIQA)解决方案。研究还发现,不同的BVQA设计选择对于不同的数据集有着不同的影响。这些结果表明,当前的BVQA领域需要进一步的改进,同时也提供了构建下一代视频质量评估数据集和模型的好做法。
    Abstract Blind video quality assessment (BVQA) plays an indispensable role in monitoring and improving the end-users' viewing experience in various real-world video-enabled media applications. As an experimental field, the improvements of BVQA models have been measured primarily on a few human-rated VQA datasets. Thus, it is crucial to gain a better understanding of existing VQA datasets in order to properly evaluate the current progress in BVQA. Towards this goal, we conduct a first-of-its-kind computational analysis of VQA datasets via designing minimalistic BVQA models. By minimalistic, we restrict our family of BVQA models to build only upon basic blocks: a video preprocessor (for aggressive spatiotemporal downsampling), a spatial quality analyzer, an optional temporal quality analyzer, and a quality regressor, all with the simplest possible instantiations. By comparing the quality prediction performance of different model variants on eight VQA datasets with realistic distortions, we find that nearly all datasets suffer from the easy dataset problem of varying severity, some of which even admit blind image quality assessment (BIQA) solutions. We additionally justify our claims by contrasting our model generalizability on these VQA datasets, and by ablating a dizzying set of BVQA design choices related to the basic building blocks. Our results cast doubt on the current progress in BVQA, and meanwhile shed light on good practices of constructing next-generation VQA datasets and models.
    摘要 视频质量评估(BVQA)在各种视频媒体应用中扮演着不可或缺的角色,评估用户在实际场景中的观看体验。作为实验领域,BVQA模型的改进主要基于一些人工评估的VQA数据集。因此,更深刻地理解现有VQA数据集是关键的。为了实现这个目标,我们通过设计最简单的BVQA模型进行计算分析。我们限制我们的BVQA模型只能使用基本块:视频预处理器(用于激进的时空下采样)、空间质量分析器、可选的时间质量分析器和质量回归器,其中所有的实现都是最简单的。通过对不同模型变体在八个VQA数据集上的质量预测性能进行比较,我们发现大多数数据集受到不同程度的易于评估问题的影响,一些甚至接受盲图质量评估(BIQA)解决方案。此外,我们还通过对这些VQA数据集的模型普适性进行比较,以及对BVQA设计决策的绝对多样化进行排除,来证明我们的结论。结果表明,目前BVQA领域的进步存在很大的问题,同时也提供了构建下一代VQA数据集和模型的好做法。

Tracking Anything in High Quality

  • paper_url: http://arxiv.org/abs/2307.13974
  • repo_url: https://github.com/jiawen-zhu/hqtrack
  • paper_authors: Jiawen Zhu, Zhenyu Chen, Zeqi Hao, Shijie Chang, Lu Zhang, Dong Wang, Huchuan Lu, Bin Luo, Jun-Yan He, Jin-Peng Lan, Hanyuan Chen, Chenyang Li
  • for: 本文提出了一种高质量视频对象跟踪框架(HQTrack),用于高精度地跟踪视频中的任意对象。
  • methods: 该框架包括视频多对象分割器(VMOS)和掩码精度提升器(MR)两部分。VMOS通过卷积神经网络进行对象分割,而MR使用预训练的模型来精度地改善跟踪结果。
  • results: 对比其他参赛方法,HQTrack在Visual Object Tracking and Segmentation(VOTS2023)挑战中得到了第二名的成绩,而不使用任何套路如测试时数据增强和模型ensemble。
    Abstract Visual object tracking is a fundamental video task in computer vision. Recently, the notably increasing power of perception algorithms allows the unification of single/multiobject and box/mask-based tracking. Among them, the Segment Anything Model (SAM) attracts much attention. In this report, we propose HQTrack, a framework for High Quality Tracking anything in videos. HQTrack mainly consists of a video multi-object segmenter (VMOS) and a mask refiner (MR). Given the object to be tracked in the initial frame of a video, VMOS propagates the object masks to the current frame. The mask results at this stage are not accurate enough since VMOS is trained on several closeset video object segmentation (VOS) datasets, which has limited ability to generalize to complex and corner scenes. To further improve the quality of tracking masks, a pretrained MR model is employed to refine the tracking results. As a compelling testament to the effectiveness of our paradigm, without employing any tricks such as test-time data augmentations and model ensemble, HQTrack ranks the 2nd place in the Visual Object Tracking and Segmentation (VOTS2023) challenge. Code and models are available at https://github.com/jiawen-zhu/HQTrack.
    摘要 视觉目标跟踪是计算机视觉中的基本任务。近年来,感知算法能力的显著提升使得单目标/多目标跟踪与基于框/基于掩码的跟踪得以统一,其中 Segment Anything Model(SAM)备受关注。在本报告中,我们提出了高质量跟踪任意目标的框架 HQTrack。HQTrack 主要由视频多目标分割器(VMOS)和掩码精修器(MR)两部分组成。给定视频初始帧中待跟踪的目标,VMOS 将目标掩码传播到当前帧。由于 VMOS 是在若干近集的视频目标分割(VOS)数据集上训练的,对复杂场景和边角场景的泛化能力有限,此阶段得到的掩码还不够准确。为进一步提高跟踪掩码的质量,我们采用预训练的 MR 模型对跟踪结果进行精修。在 Visual Object Tracking and Segmentation(VOTS2023)挑战赛中,HQTrack 在不使用测试时数据增强、模型集成等任何技巧的情况下取得了第二名,有力地证明了该范式的有效性。代码和模型见 https://github.com/jiawen-zhu/HQTrack。

Visual Prompt Flexible-Modal Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2307.13958
  • repo_url: None
  • paper_authors: Zitong Yu, Rizhao Cai, Yawen Cui, Ajian Liu, Changsheng Chen
  • for: 提高face anti-spoofing(FAS)系统的 robustness,使用视觉转换器基于多模态学习方法。
  • methods: 提出了一种名为Vision Prompt flexible-modal FAS(VP-FAS)的方法,通过在固定预训练基模型上学习模式相关的提示来适应流处理时缺失的模态。
  • results: 在两个多模态FAS benchmark数据集上进行了广泛的实验,证明了VP-FAS框架在不同缺失模态情况下的高效性,同时减少了模型重新训练的需求。
    Abstract Recently, vision transformer based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, multimodal face data collected from the real world is often imperfect due to missing modalities from various imaging sensors. Recently, flexible-modal FAS~\cite{yu2023flexible} has attracted more attention, which aims to develop a unified multimodal FAS model using complete multimodal face data but is insensitive to test-time missing modalities. In this paper, we tackle one main challenge in flexible-modal FAS, i.e., when missing modality occurs either during training or testing in real-world situations. Inspired by the recent success of the prompt learning in language models, we propose \textbf{V}isual \textbf{P}rompt flexible-modal \textbf{FAS} (VP-FAS), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to downstream flexible-modal FAS task. Specifically, both vanilla visual prompts and residual contextual prompts are plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 4\% learnable parameters compared to training the entire model. Furthermore, missing-modality regularization is proposed to force models to learn consistent multimodal feature embeddings when missing partial modalities. Extensive experiments conducted on two multimodal FAS benchmark datasets demonstrate the effectiveness of our VP-FAS framework that improves the performance under various missing-modality cases while alleviating the requirement of heavy model re-training.
    摘要 最近,基于视觉变换器的多Modal学习方法被提议以提高face anti-spoofing(FAS)系统的 Robustness。然而,从实际世界中收集的多Modal face数据经常受到不同感知器的数据损失。近期,flexible-modal FAS 在这些损失中吸引了更多的关注,它的目标是开发一个可以使用完整的多Modal face数据进行融合的 FAS模型,但是不敏感于测试时缺失的模态。在这篇论文中,我们解决了flexible-modal FAS中的一个主要挑战,即在训练或测试过程中缺失模态。我们灵感自近期的语言模型的Prompt学习的成功,我们提出了Visual Prompt flexible-modal FAS(VP-FAS),它通过学习模态相关的Prompt来适应冻结预训练基础模型到下游多Modal FAS任务。特别是,我们在多Modal transformer中插入了vanilla visual prompt和 residual contextual prompt,以处理一般缺失模态的情况,而无需更新整个模型。此外,我们还提出了缺失模态的Regularization,以强制模型学习一致的多Modal特征嵌入,即使缺失部分模态。我们在两个多Modal FAS benchmark数据集上进行了广泛的实验, demonstarted VP-FAS框架可以在不同的缺失模态情况下提高性能,而且降低模型重新训练的需求。
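
A minimal sketch of visual prompting on a frozen backbone, as the abstract describes: a handful of learnable prompt tokens are prepended to the patch tokens and only they (plus the head) are trained. The backbone, token shapes, and head below are assumptions, not the VP-FAS modules.

```python
import torch
import torch.nn as nn

class PromptedFrozenViT(nn.Module):
    """Sketch of visual prompting: learn a few modal-relevant prompt tokens and
    prepend them to the patch tokens of a frozen, pre-trained transformer."""

    def __init__(self, backbone, embed_dim=768, n_prompts=8, n_classes=2):
        super().__init__()
        self.backbone = backbone                       # frozen pre-trained encoder
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_tokens):                   # (B, N, D) tokens per modality
        b = patch_tokens.size(0)
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        feats = self.backbone(tokens)                  # frozen transformer blocks
        return self.head(feats.mean(dim=1))            # live/spoof logits

# assumed backbone: any module mapping (B, N, D) -> (B, N, D) with fixed weights
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, batch_first=True), 2)
model = PromptedFrozenViT(backbone)
logits = model(torch.randn(4, 196, 768))
```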

Heterogeneous Embodied Multi-Agent Collaboration

  • paper_url: http://arxiv.org/abs/2307.13957
  • repo_url: None
  • paper_authors: Xinzhu Liu, Di Guo, Huaping Liu
  • for: 这个论文研究了多智能体在复杂的室内视觉环境中完成多智能体任务的协作方法。
  • methods: 该论文提出了一种基于多智能体探测异常物品并预测合理容器的层次决策模型,以及一种基于手势交换的群体通信机制。
  • results: 经过广泛的实验,论文证明了提出的模型的效果。项目官方网站和实验视频可以在https://hetercol.github.io/查看。
    Abstract Multi-agent embodied tasks have recently been studied in complex indoor visual environments. Collaboration among multiple agents can improve work efficiency and has significant practical value. However, most of the existing research focuses on homogeneous multi-agent tasks. Compared with homogeneous agents, heterogeneous agents can leverage their different capabilities to allocate corresponding sub-tasks and cooperate to complete complex tasks. Heterogeneous multi-agent tasks are common in real-world scenarios, and the collaboration strategy among heterogeneous agents is a challenging and important problem to be solved. To study collaboration among heterogeneous agents, we propose the heterogeneous multi-agent tidying-up task, in which multiple heterogeneous agents with different capabilities collaborate with each other to detect misplaced objects and place them in reasonable locations. This is a demanding task since it requires agents to make the best use of their different capabilities to conduct reasonable task planning and complete the whole task. To solve this task, we build a heterogeneous multi-agent tidying-up benchmark dataset in a large number of houses with multiple rooms based on ProcTHOR-10K. We propose the hierarchical decision model based on misplaced object detection, reasonable receptacle prediction, as well as the handshake-based group communication mechanism. Extensive experiments are conducted to demonstrate the effectiveness of the proposed model. The project's website and videos of experiments can be found at https://hetercol.github.io/.
    摘要 多智能体任务在复杂的室内视觉环境中最近得到了研究。多个代理机器人可以增加工作效率,具有重要的实用价值。然而,大多数现有研究都集中在同类代理机器人任务上。相比同类代理机器人,多种代理机器人可以利用不同的能力来分配相应的子任务并合作完成复杂任务。多种代理机器人任务在实际场景中很常见,合作策略中的多种代理机器人是一个挑战性和重要的问题。为了研究多种代理机器人之间的合作,我们提出了多种代理机器人整理任务,在多个房间的多个室内进行探测落入的物品并将其置于合理的位置。这是一项需要代理机器人利用不同的能力进行合理的任务规划,以完成整个任务的任务。为解决这个任务,我们建立了多种代理机器人整理任务的基准数据集,基于ProcTHOR-10K。我们提出了层次决策模型,包括落入物品探测、合理容器预测以及手势交换机制。我们对这些实验进行了广泛的实验,以证明提案的模型的有效性。项目的官方网站和实验视频可以在https://hetercol.github.io/查看。

  • paper_url: http://arxiv.org/abs/2307.13953
  • repo_url: None
  • paper_authors: Liao Qu, Xianwei Zou, Xiang Li, Yandong Wen, Rita Singh, Bhiksha Raj
  • for: 这篇论文探讨了声音和面部特征之间的关系。传统的声音-面部相关性研究通常需要使用长时间的声音输入,包括从声音生成面像和从声音重建3D面膜。但在voice-based犯罪调查中,可能只有有限的声音证据。此外,从 физиологи学角度来看,每个 segment of speech (phoneme)对应不同的空气流和面部运动。因此,发现声音和面部特征之间的隐藏关系是有利的。
  • methods: 我们提出了一个分析管道,以细粒度的方式探讨音素与面部特征之间的关系:为每个"音素-面部测量值"对构建估计器,并通过假设检验评估其相关性。我们发现,与辅音(特别是爆破音)相比,从元音更容易预测面部测量值;此外,在发音过程中运动幅度更大的面部测量值也更容易被预测。
  • results: 我们的结果支持 physiology 中关于声音和面部特征之间的相关性的发现。我们的研究为未来的声音-面部多模态学习奠基。
    Abstract This work unveils the enigmatic link between phonemes and facial features. Traditional studies on voice-face correlations typically involve using a long period of voice input, including generating face images from voices and reconstructing 3D face meshes from voices. However, in situations like voice-based crimes, the available voice evidence may be short and limited. Additionally, from a physiological perspective, each segment of speech -- phoneme -- corresponds to different types of airflow and movements in the face. Therefore, it is advantageous to discover the hidden link between phonemes and face attributes. In this paper, we propose an analysis pipeline to help us explore the voice-face relationship in a fine-grained manner, i.e., phonemes v.s. facial anthropometric measurements (AM). We build an estimator for each phoneme-AM pair and evaluate the correlation through hypothesis testing. Our results indicate that AMs are more predictable from vowels compared to consonants, particularly with plosives. Additionally, we observe that if a specific AM exhibits more movement during phoneme pronunciation, it is more predictable. Our findings support those in physiology regarding correlation and lay the groundwork for future research on speech-face multimodal learning.
    摘要 本工作揭示了音素与面部特征之间的微妙联系。传统的语音-人脸相关性研究通常依赖较长的语音输入,例如由语音生成人脸图像、由语音重建3D人脸网格。然而,在基于语音的犯罪调查等场景中,可用的语音证据可能短暂而有限。此外,从生理学角度看,语音的每个片段(音素)对应着不同类型的气流和面部运动,因此挖掘音素与面部属性之间的潜在联系是有益的。本文提出了一个分析管道,以细粒度的方式(即音素对面部人体测量值,AM)探索语音与人脸的关系:我们为每个"音素-AM"对构建估计器,并通过假设检验评估相关性。结果表明,与辅音(特别是爆破音)相比,AM 从元音中更容易被预测;并且若某个 AM 在音素发音过程中运动幅度更大,则其更容易被预测。我们的发现与生理学中关于相关性的结论一致,为未来的语音-人脸多模态学习研究奠定了基础。

Rethinking Voice-Face Correlation: A Geometry View

  • paper_url: http://arxiv.org/abs/2307.13948
  • repo_url: https://github.com/lxa9867/VAF
  • paper_authors: Xiang Li, Yandong Wen, Muqiao Yang, Jinglu Wang, Rita Singh, Bhiksha Raj
  • for: 这种研究旨在探索voice和face之间的含义,从geometry角度来恢复3D面征。
  • methods: 该研究提出了一种voice-anthropometric measurement(AM)-face模式,通过利用AM作为voice和face之间的拟合器,消除不可预测的AM的影响,使face geometry变得可追踪。
  • results: 研究发现,voice和specific parts of the face geometry(如鼻腔和头骨)之间存在显著的相关性,这些结果可能为人骨学科提供新的视角。
    Abstract Previous works on voice-face matching and voice-guided face synthesis demonstrate strong correlations between voice and face, but mainly rely on coarse semantic cues such as gender, age, and emotion. In this paper, we aim to investigate the capability of reconstructing the 3D facial shape from voice from a geometry perspective without any semantic information. We propose a voice-anthropometric measurement (AM)-face paradigm, which identifies predictable facial AMs from the voice and uses them to guide 3D face reconstruction. By leveraging AMs as a proxy to link the voice and face geometry, we can eliminate the influence of unpredictable AMs and make the face geometry tractable. Our approach is evaluated on our proposed dataset with ground-truth 3D face scans and corresponding voice recordings, and we find significant correlations between voice and specific parts of the face geometry, such as the nasal cavity and cranium. Our work offers a new perspective on voice-face correlation and can serve as a good empirical study for anthropometry science.
    摘要 以往关于语音-人脸匹配和语音引导人脸合成的工作表明语音与人脸之间存在很强的相关性,但它们主要依赖性别、年龄和情绪等粗粒度的语义线索。本文旨在从几何角度、不借助任何语义信息,研究由语音重建3D面部形状的能力。我们提出了"语音-人体测量值(AM)-人脸"范式:先从语音中识别可预测的面部AM,再用它们引导3D人脸重建。通过以AM作为连接语音与人脸几何的中介,我们可以消除不可预测AM的影响,使人脸几何变得可处理。我们在自建的数据集(包含真实3D人脸扫描及对应的语音录音)上评估了该方法,发现语音与面部几何的特定部位(如鼻腔和颅骨)之间存在显著相关性。本工作为语音-人脸相关性提供了新的视角,也可作为人体测量学的一项有价值的实证研究。

Centroid-aware feature recalibration for cancer grading in pathology images

  • paper_url: http://arxiv.org/abs/2307.13947
  • repo_url: https://github.com/colin19950703/cafenet
  • paper_authors: Jaeung Lee, Keunho Byeon, Jin Tae Kwak
  • for: 癌症分级是病理学中的一项关键任务;人工神经网络的最新进展显示出其在提高癌症诊断准确性与质量方面的巨大潜力。
  • methods: 提议使用一种具有中心点感知的特征重新调整网络,该网络可以将输入病理图像映射到一个嵌入空间中,并通过注意机制将其调整为不同类型的肿瘤等级中心点嵌入向量。
  • results: 通过对COLON dataset进行实验,确认提议网络可以准确地进行病理图像评估,并且能够鲁棒地适应不同环境下的数据集。
    Abstract Cancer grading is an essential task in pathology. The recent developments of artificial neural networks in computational pathology have shown that these methods hold great potential for improving the accuracy and quality of cancer diagnosis. However, the issues with the robustness and reliability of such methods have not been fully resolved yet. Herein, we propose a centroid-aware feature recalibration network that can conduct cancer grading in an accurate and robust manner. The proposed network maps an input pathology image into an embedding space and adjusts it by using centroids embedding vectors of different cancer grades via attention mechanism. Equipped with the recalibrated embedding vector, the proposed network classifiers the input pathology image into a pertinent class label, i.e., cancer grade. We evaluate the proposed network using colorectal cancer datasets that were collected under different environments. The experimental results confirm that the proposed network is able to conduct cancer grading in pathology images with high accuracy regardless of the environmental changes in the datasets.
    摘要 癌症分级是病理学中的一项重要任务。人工神经网络在计算病理学中的最新进展表明,这类方法在提高癌症诊断的准确性和质量方面具有很大潜力,但其鲁棒性和可靠性问题尚未完全解决。为此,我们提出了一种感知中心点的特征重校准网络,能够以准确且鲁棒的方式进行癌症分级。该网络将输入病理图像映射到嵌入空间,并通过注意力机制利用不同癌症等级的中心点嵌入向量对其进行调整;基于重校准后的嵌入向量,网络将输入病理图像分类为相应的癌症等级。我们使用在不同环境下采集的结直肠癌数据集进行评估,实验结果证实,无论数据集环境如何变化,所提网络都能高精度地完成病理图像的癌症分级。
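
A minimal sketch of the recalibration idea: attend from the image embedding to learnable per-grade centroid embeddings and use the attended mixture to adjust the embedding before classification. Dimensions and the exact attention form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CentroidRecalibration(nn.Module):
    """Sketch: recalibrate an image embedding with attention over learnable
    centroid embeddings, one per cancer grade, then classify."""

    def __init__(self, dim=512, n_grades=4):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_grades, dim))
        self.scale = dim ** -0.5
        self.classifier = nn.Linear(dim, n_grades)

    def forward(self, emb):                            # emb: (B, dim) from a CNN/ViT encoder
        attn = torch.softmax(emb @ self.centroids.t() * self.scale, dim=-1)   # (B, n_grades)
        recalibrated = emb + attn @ self.centroids     # pull embedding toward grade centroids
        return self.classifier(recalibrated)

logits = CentroidRecalibration()(torch.randn(8, 512))
```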

Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

  • paper_url: http://arxiv.org/abs/2307.13938
  • repo_url: https://github.com/kunzhan/DSSN
  • paper_authors: Zhibo Tain, Xiaolin Zhang, Peng Zhang, Kun Zhan
  • for: 提高semantic segmentation任务中使用无标例数据的效果,减少标注训练示例的成本。
  • methods: 提出了用于像素级对比学习的双层孪生结构网络(DSSN),通过在低层图像空间和高层特征空间中利用强增广视图对正样本对进行对齐,最大化利用可用的无标注数据;同时引入一种新的类感知伪标签选择策略,解决大多数现有方法不做选择或对所有类别使用同一预设阈值的问题:具体而言,针对每个类别选取弱视图中置信度最高的预测作为伪标签,用于监督强增广视图,从而兼顾类别不均衡并提升长尾类别的性能。
  • results: 对 PASCAL VOC 2012 和 Cityscapes 两个 dataset 进行实验,实现了semantic segmentation任务中使用无标例数据的最佳效果,比其他 SSS 算法出色。
    Abstract Semi-supervised semantic segmentation (SSS) is an important task that utilizes both labeled and unlabeled data to reduce expenses on labeling training examples. However, the effectiveness of SSS algorithms is limited by the difficulty of fully exploiting the potential of unlabeled data. To address this, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. By aligning positive pairs with a pixel-wise contrastive loss using strong augmented views in both low-level image space and high-level feature space, the proposed DSSN is designed to maximize the utilization of available unlabeled data. Additionally, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that do not perform selection or apply a predefined threshold for all classes. Specifically, our strategy selects the top high-confidence prediction of the weak view for each class to generate pseudo labels that supervise the strong augmented views. This strategy is capable of taking into account the class imbalance and improving the performance of long-tailed classes. Our proposed method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, outperforming other SSS algorithms by a significant margin.
    摘要 semi-supervised semantic segmentation (SSS) 是一项重要的任务,它利用标注和无标注数据来降低标注训练示例的成本。然而,SSS 算法的效果受到无标注数据的利用的限制。为解决这个问题,我们提出了 dual-level Siamese structure network (DSSN) для像素级对比学习。DSSN 通过在低级图像空间和高级特征空间使用强大的扩展视图进行像素级对比损失,以 maximize 利用可用的无标注数据。此外,我们还引入了一种新的类感知 pseudo-label 选择策略,用于弱到强超vision。这种策略选择每个类的高信度预测值作为 pseudo label,以便在强augmented views中进行supervision。这种策略能够考虑类别不均衡和改进长尾类的性能。我们的提议方法在 PASCAL VOC 2012 和 Cityscapes 两个数据集上实现了state-of-the-art 的结果,比其他 SSS 算法有明显的优势。
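
A minimal sketch of class-aware pseudo-label selection as described above: for each class, only the most confident weak-view pixels become pseudo-labels for the strong view, instead of one global threshold. The keep ratio and other details are assumptions.

```python
import torch

def class_aware_pseudo_labels(weak_logits, keep_ratio=0.5, ignore_index=255):
    """Sketch: per class, keep the top `keep_ratio` most confident pixels of the
    weak view as pseudo-labels; everything else is ignored during supervision."""
    prob = torch.softmax(weak_logits, dim=1)           # (B, C, H, W)
    conf, pseudo = prob.max(dim=1)                     # confidence and argmax, (B, H, W)
    out = torch.full_like(pseudo, ignore_index)
    for c in range(weak_logits.size(1)):
        mask = pseudo == c
        if mask.sum() == 0:
            continue
        thresh = torch.quantile(conf[mask], 1.0 - keep_ratio)
        out[mask & (conf >= thresh)] = c               # class-wise confidence cut
    return out                                         # supervises the strong augmented view

weak_logits = torch.randn(2, 21, 128, 128)             # e.g. PASCAL VOC, 21 classes
targets = class_aware_pseudo_labels(weak_logits)       # use with ignore_index in the loss
```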

AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception

  • paper_url: http://arxiv.org/abs/2307.13933
  • repo_url: https://github.com/ydk122024/aide
  • paper_authors: Dingkang Yang, Shuai Huang, Zhi Xu, Zhenpeng Li, Shunli Wang, Mingcheng Li, Yuzheng Wang, Yang Liu, Kun Yang, Zhaoyu Chen, Yan Wang, Jing Liu, Peixuan Zhang, Peng Zhai, Lihua Zhang
  • for: 这篇论文主要旨在提供一个包含上下文信息的driver monitoring系统测试数据集,以提高交通安全和安全驾驶。
  • methods: 本论文使用多个视角设置、多Modal注释和四种实践任务来提供全面的Driver monitoring。同时, authors也提供了三种基线框架的实验比较,以及两种 fusions 策略来学习有效的多流/多Modal表示。
  • results: 研究人员通过对AIDE数据集进行了extensive 测试和分析,发现了数据集中的关键组成部分和基线框架的重要性和合理性。
    Abstract Driver distraction has become a significant cause of severe traffic accidents over the past decade. Despite the growing development of vision-driven driver monitoring systems, the lack of comprehensive perception datasets restricts road safety and traffic security. In this paper, we present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle in naturalistic scenarios. AIDE facilitates holistic driver monitoring through three distinctive characteristics, including multi-view settings of driver and scene, multi-modal annotations of face, body, posture, and gesture, and four pragmatic task designs for driving understanding. To thoroughly explore AIDE, we provide experimental benchmarks on three kinds of baseline frameworks via extensive methods. Moreover, two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations. We also systematically investigate the importance and rationality of the key components in AIDE and benchmarks. The project link is https://github.com/ydk122024/AIDE.
    摘要 驾驶员分心已成为过去十年内最重要的严重交通事故原因之一。尽管激发驾驶员视觉系统的发展,但由于缺乏全面的感知数据集,道路安全和交通安全仍然受到限制。本文提出了一个帮助驾驶员观察 dataset(AIDE),该dataset考虑了车内和车外情况的上下文信息,并包括驾驶员和场景的多视图设置、面部、身体、姿势和手势的多模式注解、以及适用于驾驶理解的四种实用任务设计。为了全面探索AIDE,我们提供了三种基eline框架的实验均衡,以及两种 fusión策略,以获得有效的多流/多模式表示。此外,我们系统地调查了AIDE和基eline的关键组件的重要性和合理性。项目链接:https://github.com/ydk122024/AIDE。

Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception

  • paper_url: http://arxiv.org/abs/2307.13929
  • repo_url: https://github.com/ydk122024/SCOPE
  • paper_authors: Kun Yang, Dingkang Yang, Jingyu Zhang, Mingcheng Li, Yang Liu, Jing Liu, Hanqi Wang, Peng Sun, Liang Song
  • for: 提高自动驾驶车辆的感知性能
  • methods: 提出了一种新的协同感知框架(SCOPE),通过综合考虑多个agent的空间和时间特征来提高目标agent的感知
  • results: 对实际和模拟的协同3D物体检测任务进行了广泛的实验,证明了我们的方法的优越性和必要性
    Abstract Multi-agent collaborative perception as a potential application for vehicle-to-everything communication could significantly improve the perception performance of autonomous vehicles over single-agent perception. However, several challenges remain in achieving pragmatic information sharing in this emerging research. In this paper, we propose SCOPE, a novel collaborative perception framework that aggregates the spatio-temporal awareness characteristics across on-road agents in an end-to-end manner. Specifically, SCOPE has three distinct strengths: i) it considers effective semantic cues of the temporal context to enhance current representations of the target agent; ii) it aggregates perceptually critical spatial information from heterogeneous agents and overcomes localization errors via multi-scale feature interactions; iii) it integrates multi-source representations of the target agent based on their complementary contributions by an adaptive fusion paradigm. To thoroughly evaluate SCOPE, we consider both real-world and simulated scenarios of collaborative 3D object detection tasks on three datasets. Extensive experiments demonstrate the superiority of our approach and the necessity of the proposed components.
    摘要 多智能合作感知作为自动驾驶车辆与所有东西通信的潜在应用,可以 significatively 提高自动驾驶车辆的感知性能。然而,在实现这项新研究领域中,仍有许多挑战。在这篇论文中,我们提出了 SCOPE,一种新的合作感知框架,可以在综合方式上聚合路上智能机器人的空间时间意识特征。具体来说,SCOPE具有以下三大优势:1. 它考虑有效的时间上下文semantic见解,以提高目标机器人的当前表示;2. 它聚合异ogeneous智能机器人的感知核心空间信息,并通过多尺度特征互动来超越地理化错误;3. 它通过适应融合方式,将多个来源的目标机器人表示融合,以便充分利用它们的补做贡献。为了全面评估 SCOPE,我们在三个数据集上进行了合作3D物体检测任务的实际和 simulated 实验。广泛的实验结果表明我们的方法的优越性,以及提案的组件的必要性。

DFR-Net: Density Feature Refinement Network for Image Dehazing Utilizing Haze Density Difference

  • paper_url: http://arxiv.org/abs/2307.13927
  • repo_url: None
  • paper_authors: Zhongze Wang, Haitao Zhao, Lujian Yao, Jingchao Peng, Kaijie Zhao
  • for: 该论文旨在提升图像去雾方法的性能,特别是利用雾密度差异来细化雾密度特征。
  • methods: 该方法采用由全局分支(GB)和局部分支(LB)组成的结构:GB 使用孪生网络进行特征提取,并提出全局密度特征细化(GDFR)模块来更新全局特征;LB 则利用局部密度差异更新局部特征,并引入中间去雾残差前馈(IDRF)模块。
  • results: 该方法在多个数据集上取得了超越现有方法的性能,并能更好地处理具有不同雾密度差异的图像。
    Abstract In image dehazing task, haze density is a key feature and affects the performance of dehazing methods. However, some of the existing methods lack a comparative image to measure densities, and others create intermediate results but lack the exploitation of their density differences, which can facilitate perception of density. To address these deficiencies, we propose a density-aware dehazing method named Density Feature Refinement Network (DFR-Net) that extracts haze density features from density differences and leverages density differences to refine density features. In DFR-Net, we first generate a proposal image that has lower overall density than the hazy input, bringing in global density differences. Additionally, the dehazing residual of the proposal image reflects the level of dehazing performance and provides local density differences that indicate localized hard dehazing or high density areas. Subsequently, we introduce a Global Branch (GB) and a Local Branch (LB) to achieve density-awareness. In GB, we use Siamese networks for feature extraction of hazy inputs and proposal images, and we propose a Global Density Feature Refinement (GDFR) module that can refine features by pushing features with different global densities further away. In LB, we explore local density features from the dehazing residuals between hazy inputs and proposal images and introduce an Intermediate Dehazing Residual Feedforward (IDRF) module to update local features and pull them closer to clear image features. Sufficient experiments demonstrate that the proposed method achieves results beyond the state-of-the-art methods on various datasets.
    摘要 在图像霾除任务中,霾 densities 是关键特征,影响霾除方法的性能。然而,一些现有方法缺乏对比图像,而其他方法创造了中间结果,但缺乏利用它们的density differences 来促进霾除性能。为解决这些不足,我们提出了一种名为 density feature refinement network (DFR-Net) 的霾除方法,它从 density differences 中提取霾 densities 特征,并利用 density differences 来精细化霾 densities 特征。在 DFR-Net 中,我们首先生成一个 proposal 图像,其全体density 较低于霾输入图像,从而带来全局 density differences。此外,提取霾除 residual 的 proposal 图像反映了霾除性能的水平,并提供了本地 density differences,表示本地强霾或高density 区域。接着,我们引入了全球分支 (GB) 和本地分支 (LB),以实现density-awareness。在 GB 中,我们使用 Siamese 网络 для霾输入图像和提案图像的特征提取,并提出了全球density feature refinement (GDFR) 模块,可以通过推动不同全局density 的特征更远的方式来精细化特征。在 LB 中,我们探索本地霾 densities 特征从霾除 residual 中的霾输入图像和提案图像之间的差异,并引入了中间霾除 residual feedforward (IDRF) 模块来更新本地特征并吸引它们更近于清晰图像特征。充分的实验结果表明,我们的方法可以在不同的 dataset 上达到现有方法的 state-of-the-art 性能。

EasyNet: An Easy Network for 3D Industrial Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.13925
  • repo_url: None
  • paper_authors: Ruitao Chen, Guoyang Xie, Jiaqi Liu, Jinbao Wang, Ziqi Luo, Jinfan Wang, Feng Zheng
  • for: 这个研究旨在提高工业制程中的3D异常检测,以应对现有的缺陷。
  • methods: 我们提出了一个简单易用的网络(称为EasyNet),不使用预训模型和内存库。我们设计了一个多维多模式特征编码解oder,以精准地重建异常区域的分类图像,并透过多模式异常分类网络获得精确的异常地图。最后,我们提出了一个注意力基于信息熵融合模组,用于Feature融合,使其适合实时部署。
  • results: 实验结果显示,EasyNet可以在不使用预训模型和内存库的情况下,达到92.6%的异常检测AUROC。此外,EasyNet比现有的方法更快,在Tesla V100 GPU上 achieve 94.55 FPS的高帧率。
    Abstract 3D anomaly detection is an emerging and vital computer vision task in industrial manufacturing (IM). Recently many advanced algorithms have been published, but most of them cannot meet the needs of IM. There are several disadvantages: i) difficult to deploy on production lines since their algorithms heavily rely on large pre-trained models; ii) hugely increase storage overhead due to overuse of memory banks; iii) the inference speed cannot be achieved in real-time. To overcome these issues, we propose an easy and deployment-friendly network (called EasyNet) without using pre-trained models and memory banks: firstly, we design a multi-scale multi-modality feature encoder-decoder to accurately reconstruct the segmentation maps of anomalous regions and encourage the interaction between RGB images and depth images; secondly, we adopt a multi-modality anomaly segmentation network to achieve a precise anomaly map; thirdly, we propose an attention-based information entropy fusion module for feature fusion during inference, making it suitable for real-time deployment. Extensive experiments show that EasyNet achieves an anomaly detection AUROC of 92.6% without using pre-trained models and memory banks. In addition, EasyNet is faster than existing methods, with a high frame rate of 94.55 FPS on a Tesla V100 GPU.
    摘要 3D异常检测是工业制造(IM)中一项新兴且至关重要的计算机视觉任务。近来已有许多先进算法发表,但大多数无法满足工业制造的需求,存在以下不足:i) 算法严重依赖大型预训练模型,难以部署到产线上;ii) 过度使用记忆库,大幅增加存储开销;iii) 推理速度无法达到实时。为克服这些问题,我们提出了一个无需预训练模型和记忆库、易于部署的网络(EasyNet):首先,我们设计了多尺度多模态特征编码-解码器,以准确重建异常区域的分割图,并促进RGB图像与深度图像之间的交互;其次,我们采用多模态异常分割网络来获得精确的异常图;第三,我们提出了基于注意力的信息熵融合模块用于推理阶段的特征融合,使其适合实时部署。大量实验表明,EasyNet 在不使用预训练模型和记忆库的情况下取得了 92.6% 的异常检测 AUROC;同时,EasyNet 比现有方法更快,在 Tesla V100 GPU 上达到 94.55 FPS 的高帧率。
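
A rough sketch of what an information-entropy-based fusion of RGB and depth feature maps could look like; the actual module in EasyNet is not described in detail above, so the weighting scheme below is an assumption.

```python
import torch

def entropy_fusion(feat_rgb, feat_depth, eps=1e-8):
    """Sketch: weight each modality's feature map by the negated entropy of its
    channel distribution, so the more informative modality dominates per location."""
    def spatial_entropy(f):                                   # f: (B, C, H, W)
        p = torch.softmax(f, dim=1)
        return -(p * torch.log(p + eps)).sum(dim=1, keepdim=True)   # (B, 1, H, W)

    ent = torch.cat([spatial_entropy(feat_rgb), spatial_entropy(feat_depth)], dim=1)
    w = torch.softmax(-ent, dim=1)                            # lower entropy -> higher weight
    return w[:, :1] * feat_rgb + w[:, 1:] * feat_depth

fused = entropy_fusion(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```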

trajdata: A Unified Interface to Multiple Human Trajectory Datasets

  • paper_url: http://arxiv.org/abs/2307.13924
  • repo_url: https://github.com/nvlabs/trajdata
  • paper_authors: Boris Ivanovic, Guanyu Song, Igor Gilitschenski, Marco Pavone
  • for: 本研究旨在提供一个统一的人行轨迹数据接口,以便研究人行轨迹预测和自动驾驶车辆的动作识别。
  • methods: 本研究使用了多个大规模的实际世界人行轨迹数据集,并提供了一个简单、统一的轨迹和地图数据表示方式和API。
  • results: 本研究通过对现有轨迹数据集进行广泛的实验性评估,为研究人行轨迹预测和自动驾驶车辆的动作识别提供了深入的数据理解,并提出了未来数据集的建议。
    Abstract The field of trajectory forecasting has grown significantly in recent years, partially owing to the release of numerous large-scale, real-world human trajectory datasets for autonomous vehicles (AVs) and pedestrian motion tracking. While such datasets have been a boon for the community, they each use custom and unique data formats and APIs, making it cumbersome for researchers to train and evaluate methods across multiple datasets. To remedy this, we present trajdata: a unified interface to multiple human trajectory datasets. At its core, trajdata provides a simple, uniform, and efficient representation and API for trajectory and map data. As a demonstration of its capabilities, in this work we conduct a comprehensive empirical evaluation of existing trajectory datasets, providing users with a rich understanding of the data underpinning much of current pedestrian and AV motion forecasting research, and proposing suggestions for future datasets from these insights. trajdata is permissively licensed (Apache 2.0) and can be accessed online at https://github.com/NVlabs/trajdata
    摘要 近年来,轨迹预测领域取得了长足发展,这在一定程度上得益于大量面向自动驾驶车辆(AV)和行人运动跟踪的大规模真实世界人类轨迹数据集的发布。这些数据集对社区大有裨益,但它们各自使用定制且互不相同的数据格式和API,使研究人员难以在多个数据集上训练和评估方法。为此,我们提出了 trajdata:面向多个人类轨迹数据集的统一接口。trajdata 的核心是为轨迹和地图数据提供简单、统一且高效的表示与API。作为其能力的展示,本文对现有轨迹数据集进行了全面的实证评估,帮助使用者深入理解当前行人与自动驾驶车辆运动预测研究所依赖的数据,并据此对未来数据集提出建议。trajdata 采用宽松的 Apache 2.0 许可证,可在 https://github.com/NVlabs/trajdata 获取。

Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2307.13908
  • repo_url: None
  • paper_authors: Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, Fan Wang
  • for: 提供一种基于稀疏3D点 cloud的文本到3D生成框架,以填补现有方法的束缚和不可控性问题。
  • methods: 使用Point-E生成的稀疏3D点云作为几何假设,并通过维护点云导向损失来适应NeRF的几何。同时,通过控制NeRF的外观分布来提高视角一致性。
  • results: 比较和分析表明,Points-to-3D可以提高视角一致性并实现良好的形状控制,从而为文本到3D生成提供一个新的控制方法。
    Abstract Text-to-3D generation has recently garnered significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation to leverage the 2D diffusion priors to supervise the generation of 3D models, e.g., NeRF. However, score distillation is prone to suffer the view inconsistency problem, and implicit NeRF modeling can also lead to an arbitrary shape, thus leading to less realistic and uncontrollable 3D generation. In this work, we propose a flexible framework of Points-to-3D to bridge the gap between sparse yet freely available 3D points and realistic shape-controllable 3D generation by distilling the knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide the text-to-3D generation. Specifically, we use the sparse point cloud generated from the 3D diffusion model, Point-E, as the geometric prior, conditioned on a single reference image. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss to adaptively drive the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we propose to optimize the NeRF for a more view-consistent appearance. To be specific, we perform score distillation to the publicly available 2D image diffusion model ControlNet, conditioned on text as well as depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D provides users with a new way to improve and control text-to-3D generation.
    摘要 文本到3D生成最近受到了广泛关注,受到了2D扩散模型在数百万张图像和文本对的训练。现有方法主要通过分数散熔炼来利用2D扩散先验来监督3D模型的生成,例如NeRF。然而,分数散熔炼容易受到视角不一致问题的影响,而半 implicit NeRF模型也可能导致不可预测的3D形态,因此导致文本到3D生成的真实性和可控性受到限制。在这种情况下,我们提出了一种灵活的点 clouds到3D框架,用于跨越稀疏可得到的3D点 cloud和真实形态可控的3D生成知识。核心思想是通过控制可控稀疏3D点来指导文本到3D生成。我们使用由3D扩散模型Point-E生成的稀疏点云作为几何优先,根据单个参考图像进行条件。为了更好地利用稀疏3D点,我们提出了一种高效的点云引导损失,以适应NeRF的几何进行适应。此外,我们还提出了优化NeRF以实现更加视角一致的外观。具体来说,我们通过分数散熔炼来ControlNet的2D图像扩散模型进行学习,并在文本和深度图中获得学习的紧凑geometry。经过质量和量度比较,我们发现Point-to-3D可以提高视角一致和实现好的形态可控性,提供了一种新的方法来改进和控制文本到3D生成。
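
A minimal sketch of a point-cloud guidance term under the assumption that points can be sampled from the current NeRF geometry: a Chamfer-style distance pulls the NeRF samples toward the sparse Point-E cloud. The adaptive weighting described in the paper is not modeled here, and the function names are hypothetical.

```python
import torch

def point_guidance_loss(sparse_points, nerf_surface_points):
    """Sketch of a geometry-guidance term: symmetric Chamfer-style distance between
    points sampled from the current NeRF geometry and the sparse Point-E cloud."""
    # sparse_points: (M, 3) from Point-E; nerf_surface_points: (N, 3) sampled from NeRF
    d = torch.cdist(nerf_surface_points, sparse_points)   # (N, M) pairwise distances
    to_prior = d.min(dim=1).values.mean()                 # every NeRF point near some prior point
    to_nerf = d.min(dim=0).values.mean()                  # every prior point covered by the NeRF
    return to_prior + to_nerf

loss = point_guidance_loss(torch.rand(1024, 3), torch.rand(4096, 3))
```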

YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems

  • paper_url: http://arxiv.org/abs/2307.13901
  • repo_url: https://github.com/deeplite/deeplite-torch-zoo
  • paper_authors: Ivan Lazarevich, Matteo Grimaldi, Ravish Kumar, Saptarshi Mitra, Shahrukh Khan, Sudhakar Sah
  • for: 本研究为了提供一个包含550多个YOLO基于物体检测模型的benchmark,以及4种不同的嵌入式硬件平台(x86 CPU、ARM CPU、Nvidia GPU、NPU)上的4个数据集。
  • methods: 本研究使用了一种控制性的比较方法来评估不同的YOLO基于一stage检测器,并在一个固定的训练环境下收集了准确率和延迟数据。
  • results: 研究发现,如果将现代检测头和训练技术integrated into the learning process,包括older模型如YOLOv3和YOLOv4在内的多个YOLO系列模型可以实现良好的准确率-延迟质量平衡。此外,研究还评估了在YOLOBench上使用的训练成本为零的准确率估计器,并发现其中一些可以有效地预测Pareto优化的检测模型。
    Abstract We present YOLOBench, a benchmark comprised of 550+ YOLO-based object detection models on 4 different datasets and 4 different embedded hardware platforms (x86 CPU, ARM CPU, Nvidia GPU, NPU). We collect accuracy and latency numbers for a variety of YOLO-based one-stage detectors at different model scales by performing a fair, controlled comparison of these detectors with a fixed training environment (code and training hyperparameters). Pareto-optimality analysis of the collected data reveals that, if modern detection heads and training techniques are incorporated into the learning process, multiple architectures of the YOLO series achieve a good accuracy-latency trade-off, including older models like YOLOv3 and YOLOv4. We also evaluate training-free accuracy estimators used in neural architecture search on YOLOBench and demonstrate that, while most state-of-the-art zero-cost accuracy estimators are outperformed by a simple baseline like MAC count, some of them can be effectively used to predict Pareto-optimal detection models. We showcase that by using a zero-cost proxy to identify a YOLO architecture competitive against a state-of-the-art YOLOv8 model on a Raspberry Pi 4 CPU. The code and data are available at https://github.com/Deeplite/deeplite-torch-zoo
    摘要 我们介绍YOLOBench，一个包含550多个基于YOLO的物件探测模型的benchmark，涵盖4个不同的数据集和4种不同的嵌入式硬件平台(x86 CPU、ARM CPU、Nvidia GPU、NPU)。我们在固定的训练环境(代码与训练超参数)下，对不同规模的基于YOLO的单阶段探测器进行公平、受控的比较，收集其精度与延迟数据。对所收集数据的Pareto最优性分析显示，若在学习过程中引入现代探测头与训练技术，多个YOLO系列架构都能取得良好的精度-延迟权衡，其中包括YOLOv3和YOLOv4等较早的模型。我们还在YOLOBench上评估了用于神经架构搜索的零训练成本精度估计器，发现虽然大多数最先进的零成本估计器的表现不及MAC数量这样的简单基准，但其中一些仍可有效地用于预测Pareto最优的探测模型。我们进一步展示了如何利用零成本代理指标，找出在Raspberry Pi 4 CPU上可与最先进YOLOv8模型竞争的YOLO架构。代码和数据可在https://github.com/Deeplite/deeplite-torch-zoo获取。
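The Pareto-optimality analysis described above can be illustrated with a small sketch. The model names, latency, and mAP values below are made up for the example; the closing comment notes how a zero-cost proxy such as MAC count would be checked against measured accuracy.

```python
def pareto_front(models):
    """models: list of dicts with 'name', 'latency_ms', 'map' (lower latency and higher mAP are better).
    Returns the subset of models not dominated by any other model."""
    front = []
    for m in models:
        dominated = any(
            (o["latency_ms"] <= m["latency_ms"] and o["map"] >= m["map"]) and
            (o["latency_ms"] < m["latency_ms"] or o["map"] > m["map"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m["latency_ms"])

# Hypothetical benchmark entries (numbers are illustrative, not from YOLOBench).
models = [
    {"name": "yolov3-variant", "latency_ms": 41.0, "map": 0.62},
    {"name": "yolov5n",        "latency_ms": 18.5, "map": 0.58},
    {"name": "yolov8s",        "latency_ms": 33.0, "map": 0.66},
]
print([m["name"] for m in pareto_front(models)])
# A zero-cost proxy such as total MAC count would then be correlated
# (e.g. via Kendall's tau) with the measured mAP to judge its usefulness.
```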

AViT: Adapting Vision Transformers for Small Skin Lesion Segmentation Datasets

  • paper_url: http://arxiv.org/abs/2307.13897
  • repo_url: https://github.com/siyi-wind/avit
  • paper_authors: Siyi Du, Nourhan Bayasi, Ghassan Harmarneh, Rafeef Garbi
  • for: 这个论文主要针对皮肤损伤分割(SLS)问题,旨在提高皮肤损伤分割的精度和效率。
  • methods: 这个论文提出了一种新的策略，将任意预训练的视觉Transformer(ViT)迁移到SLS任务上：在Transformer层中加入轻量级模块(adapters)来调整特征表示，而无需更新预训练权重。此外，论文还使用一个浅层卷积神经网络(CNN)作为提示生成器，从输入图像生成提示嵌入(prompt embedding)，以利用细粒度信息和CNN的归纳偏置来引导分割任务。
  • results: 论文在4个皮肤损伤数据集上的定量实验表明，AViT可以达到与当前最佳方法相当甚至更优的性能，而可训练参数数量显著更少。
    Abstract Skin lesion segmentation (SLS) plays an important role in skin lesion analysis. Vision transformers (ViTs) are considered an auspicious solution for SLS, but they require more training data compared to convolutional neural networks (CNNs) due to their inherent parameter-heavy structure and lack of some inductive biases. To alleviate this issue, current approaches fine-tune pre-trained ViT backbones on SLS datasets, aiming to leverage the knowledge learned from a larger set of natural images to lower the amount of skin training data needed. However, fully fine-tuning all parameters of large backbones is computationally expensive and memory intensive. In this paper, we propose AViT, a novel efficient strategy to mitigate ViTs' data-hunger by transferring any pre-trained ViTs to the SLS task. Specifically, we integrate lightweight modules (adapters) within the transformer layers, which modulate the feature representation of a ViT without updating its pre-trained weights. In addition, we employ a shallow CNN as a prompt generator to create a prompt embedding from the input image, which grasps fine-grained information and CNN's inductive biases to guide the segmentation task on small datasets. Our quantitative experiments on 4 skin lesion datasets demonstrate that AViT achieves competitive, and at times superior, performance to SOTA but with significantly fewer trainable parameters. Our code is available at https://github.com/siyi-wind/AViT.
    摘要 皮肤损伤分割(SLS)在皮肤损伤分析中扮演着重要角色。视觉Transformer(ViT)被认为是SLS的一个有前景的解决方案，但由于其参数量大且缺乏部分归纳偏置，相比卷积神经网络(CNN)需要更多的训练数据。为缓解这一问题，现有方法通常在SLS数据集上微调预训练的ViT骨干网络，希望借助从大规模自然图像中学到的知识来降低所需的皮肤训练数据量。然而，对大型骨干网络的所有参数进行完整微调在计算和内存上的开销都很大。在这篇论文中，我们提出了AViT，一种高效的新策略，通过将任意预训练的ViT迁移到SLS任务来缓解其对数据的渴求。具体来说，我们在Transformer层中集成轻量级模块(adapters)，在不更新预训练权重的情况下调整ViT的特征表示。此外，我们使用一个浅层CNN作为提示生成器，从输入图像生成提示嵌入，以捕捉细粒度信息并利用CNN的归纳偏置来引导小数据集上的分割任务。我们在4个皮肤损伤数据集上的定量实验表明，AViT取得了与当前最佳方法相当、有时更优的性能，而可训练参数数量显著更少。我们的代码可以在https://github.com/siyi-wind/AViT找到。
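A minimal sketch of the adapter idea described above is shown below, assuming a standard bottleneck design (down-projection, non-linearity, up-projection with a residual connection); the exact placement and width used in AViT may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter of the kind inserted into frozen ViT blocks
    (generic sketch; AViT's exact placement and dimensions may differ)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)
        nn.init.zeros_(self.up.weight)   # start as identity so pre-trained behavior is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (B, N, dim) token sequence
        return x + self.up(self.act(self.down(x)))

# Usage idea: freeze the ViT backbone, wrap each block's output with an Adapter,
# and train only the adapters plus the segmentation head and the shallow CNN
# prompt generator described in the abstract.
```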

Pretrained Deep 2.5D Models for Efficient Predictive Modeling from Retinal OCT

  • paper_url: http://arxiv.org/abs/2307.13865
  • repo_url: None
  • paper_authors: Taha Emre, Marzieh Oghbaie, Arunava Chakravarty, Antoine Rivail, Sophie Riedl, Julia Mai, Hendrik P. N. Scholl, Sobha Sivaprasad, Daniel Rueckert, Andrew Lotery, Ursula Schmidt-Erfurth, Hrvoje Bogunović
  • for: 这篇论文旨在探讨如何使用2.5D架构来优化医疗影像处理中的深度学习模型,以提高模型的性能和数据效率。
  • methods: 本论文探索了结合2D与3D技术的2.5D架构，包括卷积神经网络(CNN)、长短期记忆网络(LSTM)和Transformer的组合，并将最新的2D非对比(non-contrastive)预训练方法应用到2.5D架构中，以进一步提高模型的性能和数据效率。
  • results: 本论文在两个大型纵向OCT数据集上，针对6个月内进展为湿性年龄相关性黄斑变性(wet AMD)的风险预测任务，验证了所提2.5D架构及其预训练方法的有效性。
    Abstract In the field of medical imaging, 3D deep learning models play a crucial role in building powerful predictive models of disease progression. However, the size of these models presents significant challenges, both in terms of computational resources and data requirements. Moreover, achieving high-quality pretraining of 3D models proves to be even more challenging. To address these issues, hybrid 2.5D approaches provide an effective solution for utilizing 3D volumetric data efficiently using 2D models. Combining 2D and 3D techniques offers a promising avenue for optimizing performance while minimizing memory requirements. In this paper, we explore 2.5D architectures based on a combination of convolutional neural networks (CNNs), long short-term memory (LSTM), and Transformers. In addition, leveraging the benefits of recent non-contrastive pretraining approaches in 2D, we enhanced the performance and data efficiency of 2.5D techniques even further. We demonstrate the effectiveness of architectures and associated pretraining on a task of predicting progression to wet age-related macular degeneration (AMD) within a six-month period on two large longitudinal OCT datasets.
    摘要 在医学影像领域，3D深度学习模型在构建强大的疾病进展预测模型方面发挥着关键作用。然而，这类模型的规模在计算资源和数据需求方面都带来了重大挑战，而且对3D模型进行高质量预训练更加困难。为解决这些问题，混合式2.5D方法提供了一种有效方案，能够利用2D模型高效地处理3D体数据。结合2D与3D技术为在降低内存需求的同时优化性能提供了一条有前景的途径。在这篇论文中，我们探索了基于卷积神经网络(CNN)、长短期记忆网络(LSTM)和Transformer组合的2.5D架构，并借助近期2D非对比预训练方法的优势，进一步提升了2.5D技术的性能和数据效率。我们在两个大型纵向OCT数据集上，针对预测6个月内进展为湿性年龄相关性黄斑变性(AMD)的任务，验证了所提架构及相应预训练方法的有效性。
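The 2.5D idea can be sketched as a 2D slice encoder followed by a sequence model over slices; the backbone choice, fusion strategy, and head below are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Hybrid25D(nn.Module):
    """Minimal 2.5D sketch: a 2D CNN encodes each OCT B-scan, an LSTM fuses the
    slice sequence, and a linear head predicts progression risk. The paper also
    explores Transformer fusion and non-contrastive 2D pretraining of the encoder."""
    def __init__(self, num_classes=2, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)      # in practice: 2D (non-contrastive) pre-trained weights
        backbone.fc = nn.Identity()            # keep the 512-d feature per slice
        self.encoder = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, volume):                 # volume: (B, S, 3, H, W) slices of one OCT scan
        b, s = volume.shape[:2]
        feats = self.encoder(volume.flatten(0, 1)).view(b, s, -1)
        fused, _ = self.lstm(feats)
        return self.head(fused.mean(dim=1))    # average over slices before classification
```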

On the unreasonable vulnerability of transformers for image restoration – and an easy fix

  • paper_url: http://arxiv.org/abs/2307.13856
  • repo_url: None
  • paper_authors: Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper
  • for: 本论文研究视觉Transformer(ViT)在图像恢复任务中的对抗鲁棒性。
  • methods: 作者使用Projected Gradient Descent (PGD)和CosPGD等对抗攻击来评估这些模型的鲁棒性。
  • results: 作者发现这些模型在真实图像去模糊任务中极易受到对抗攻击的影响；对抗训练能显著提升Restormer的鲁棒性，但在其他网络上的效果不够理想。
    Abstract Following their success in visual recognition tasks, Vision Transformers(ViTs) are being increasingly employed for image restoration. As a few recent works claim that ViTs for image classification also have better robustness properties, we investigate whether the improved adversarial robustness of ViTs extends to image restoration. We consider the recently proposed Restormer model, as well as NAFNet and the "Baseline network" which are both simplified versions of a Restormer. We use Projected Gradient Descent (PGD) and CosPGD, a recently proposed adversarial attack tailored to pixel-wise prediction tasks for our robustness evaluation. Our experiments are performed on real-world images from the GoPro dataset for image deblurring. Our analysis indicates that contrary to as advocated by ViTs in image classification works, these models are highly susceptible to adversarial attacks. We attempt to improve their robustness through adversarial training. While this yields a significant increase in robustness for Restormer, results on other networks are less promising. Interestingly, the design choices in NAFNet and Baselines, which were based on iid performance, and not on robust generalization, seem to be at odds with the model robustness. Thus, we investigate this further and find a fix.
    摘要 继在视觉识别任务中取得成功之后，视觉Transformer(ViT)正被越来越多地用于图像恢复。由于一些近期工作声称用于图像分类的ViT还具有更好的鲁棒性，我们研究了ViT的对抗鲁棒性优势是否同样适用于图像恢复。我们考虑了最近提出的Restormer模型，以及NAFNet和“Baseline网络”这两个Restormer的简化版本。我们使用Projected Gradient Descent (PGD)和CosPGD(一种最近提出、专为像素级预测任务设计的对抗攻击)来评估其鲁棒性。实验在GoPro数据集的真实图像去模糊任务上进行。我们的分析表明，与图像分类工作中的结论相反，这些模型极易受到对抗攻击的影响。我们尝试通过对抗训练来提升其鲁棒性：这为Restormer带来了显著的鲁棒性提升，但在其他网络上的效果不尽理想。有趣的是，NAFNet和Baseline网络中基于独立同分布性能(而非鲁棒泛化)做出的设计选择，似乎与模型的鲁棒性相冲突。我们对此做了进一步研究，并找到了一个简单的修复方法。
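For reference, a plain L_inf PGD attack against a deblurring model could look like the sketch below (CosPGD additionally reweights the pixel-wise losses, which is not reproduced here); the epsilon, step size, and number of steps are arbitrary example values.

```python
import torch

def pgd_attack_restoration(model, blurry, sharp, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD against a deblurring model: perturb the input so that the
    restored output moves away from the ground-truth sharp image."""
    x_adv = (blurry.clone().detach() + torch.empty_like(blurry).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.mse_loss(model(x_adv), sharp)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                     # ascend the restoration loss
            x_adv = blurry + (x_adv - blurry).clamp(-eps, eps)      # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```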

Exploring the Sharpened Cosine Similarity

  • paper_url: http://arxiv.org/abs/2307.13855
  • repo_url: None
  • paper_authors: Skyler Wu, Fred Lu, Edward Raff, James Holt
  • for: 探究锐化余弦相似度(SCS)能否取代卷积层，以提升图像分类器的性能。
  • methods: 研究SCS参数的行为及其作为卷积层替代方案的潜力，并在CIFAR-10上对多个CNN架构进行基准测试。
  • results: SCS不一定带来明显的精度提升，但可能学到更易解释的特征；在某些情况下，SCS还可能略微提升对抗鲁棒性。
    Abstract Convolutional layers have long served as the primary workhorse for image classification. Recently, an alternative to convolution was proposed using the Sharpened Cosine Similarity (SCS), which in theory may serve as a better feature detector. While multiple sources report promising results, there has not been to date a full-scale empirical analysis of neural network performance using these new layers. In our work, we explore SCS's parameter behavior and potential as a drop-in replacement for convolutions in multiple CNN architectures benchmarked on CIFAR-10. We find that while SCS may not yield significant increases in accuracy, it may learn more interpretable representations. We also find that, in some circumstances, SCS may confer a slight increase in adversarial robustness.
    摘要 卷积层长期以来一直是图像分类的主要工具。最近，有人提出用锐化余弦相似度(Sharpened Cosine Similarity, SCS)替代卷积，理论上它可能是更好的特征检测器。虽然多方报告了有前景的结果，但迄今为止还没有对使用这种新层的神经网络性能进行全面的实证分析。在我们的工作中，我们研究了SCS的参数行为，以及它在多个CNN架构中作为卷积替代品在CIFAR-10上的表现。我们发现，虽然SCS未必带来显著的精度提升，但它可能学到更易解释的表示。我们还发现，在某些情况下，SCS可能带来轻微的对抗鲁棒性提升。
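For context, the commonly used SCS formulation computes a signed, sharpened cosine between each input patch and each kernel; the sketch below is one possible unfold-based implementation, with the learnable exponent p and norm offset q initialized to illustrative values that may differ from the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharpenedCosineSimilarity(nn.Module):
    """Drop-in convolution replacement computing sign(cos) * |cos|^p per patch,
    where cos is the cosine between an input patch and a kernel and q stabilizes
    small norms."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch * kernel_size * kernel_size) * 0.02)
        self.p = nn.Parameter(torch.full((out_ch,), 2.0))   # learnable sharpening exponent
        self.q = nn.Parameter(torch.tensor(0.1))            # learnable norm offset
        self.k, self.s, self.pad = kernel_size, stride, padding

    def forward(self, x):                                    # x: (B, C, H, W)
        patches = F.unfold(x, self.k, stride=self.s, padding=self.pad)   # (B, C*k*k, L)
        dot = torch.einsum("od,bdl->bol", self.weight, patches)
        norm = (patches.norm(dim=1, keepdim=True) + self.q.abs()) * \
               (self.weight.norm(dim=1).view(1, -1, 1) + self.q.abs())
        cos = dot / norm
        out = torch.sign(cos) * cos.abs().pow(self.p.abs().view(1, -1, 1))
        h = (x.shape[2] + 2 * self.pad - self.k) // self.s + 1
        w = (x.shape[3] + 2 * self.pad - self.k) // self.s + 1
        return out.view(x.shape[0], -1, h, w)
```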

SplitFed resilience to packet loss: Where to split, that is the question

  • paper_url: http://arxiv.org/abs/2307.13851
  • repo_url: None
  • paper_authors: Chamani Shiranthika, Zahra Hafezi Kafshgari, Parvaneh Saeedi, Ivan V. Bajić
  • for: 本文研究了Split Federated Learning(SplitFed或SFL)的可靠性问题，具体来说是通信链路上丢包对SFL性能的影响。
  • methods: 本文测试了多种SFL聚合策略，并将模型在两个不同的拆分点(浅层拆分和深层拆分)上进行拆分，以判断拆分点是否对最终模型的准确率产生统计上显著的影响。
  • results: 实验结果表明，使用更深的拆分点可以获得更高的准确率。
    Abstract Decentralized machine learning has broadened its scope recently with the invention of Federated Learning (FL), Split Learning (SL), and their hybrids like Split Federated Learning (SplitFed or SFL). The goal of SFL is to reduce the computational power required by each client in FL and parallelize SL while maintaining privacy. This paper investigates the robustness of SFL against packet loss on communication links. The performance of various SFL aggregation strategies is examined by splitting the model at two points -- shallow split and deep split -- and testing whether the split point makes a statistically significant difference to the accuracy of the final model. Experiments are carried out on a segmentation model for human embryo images and indicate the statistically significant advantage of a deeper split point.
    摘要 随着联邦学习(FL)、拆分学习(SL)及其混合方法(如Split Federated Learning, SplitFed或SFL)的提出，去中心化机器学习的应用范围近年来不断扩大。SFL的目标是在保持隐私的同时，降低FL中每个客户端所需的计算能力并将SL并行化。本文研究了SFL在通信链路丢包情况下的鲁棒性。我们将模型在两个位置(浅层拆分和深层拆分)进行拆分，考察不同SFL聚合策略的性能，并检验拆分点是否会对最终模型的准确率产生统计上显著的差异。实验在人类胚胎图像分割模型上进行，结果表明更深的拆分点具有统计上显著的优势。

CosSIF: Cosine similarity-based image filtering to overcome low inter-class variation in synthetic medical image datasets

  • paper_url: http://arxiv.org/abs/2307.13842
  • repo_url: https://github.com/mominul-ssv/cossif
  • paper_authors: Mominul Islam, Hasib Zunair, Nabeel Mohammed
  • for: 这项研究旨在提升医疗图像分析中深度学习模型的性能，特别是针对缺乏明显类间差异的医疗图像数据集。
  • methods: 本研究提出了一种新的筛选算法Cosine Similarity-based Image Filtering (CosSIF)，并在其基础上开发了两种筛选方法：GAN训练前筛选(FBGT)和GAN训练后筛选(FAGT)。
  • results: 实验结果表明，将FAGT或FBGT与现代Transformer及卷积网络结合使用，可以在多项评价指标上带来显著的性能提升。在ISIC-2016数据集上，FAGT在敏感度上比基线方法高1.59%，在AUC上高1.88%；在HAM10000数据集上，应用FABT使召回率提升13.75%，而仅使用FAGT即可达到94.44%的最高准确率。
    Abstract Crafting effective deep learning models for medical image analysis is a complex task, particularly in cases where the medical image dataset lacks significant inter-class variation. This challenge is further aggravated when employing such datasets to generate synthetic images using generative adversarial networks (GANs), as the output of GANs heavily relies on the input data. In this research, we propose a novel filtering algorithm called Cosine Similarity-based Image Filtering (CosSIF). We leverage CosSIF to develop two distinct filtering methods: Filtering Before GAN Training (FBGT) and Filtering After GAN Training (FAGT). FBGT involves the removal of real images that exhibit similarities to images of other classes before utilizing them as the training dataset for a GAN. On the other hand, FAGT focuses on eliminating synthetic images with less discriminative features compared to real images used for training the GAN. Experimental results reveal that employing either the FAGT or FBGT method with modern transformer and convolutional-based networks leads to substantial performance gains in various evaluation metrics. FAGT implementation on the ISIC-2016 dataset surpasses the baseline method in terms of sensitivity by 1.59% and AUC by 1.88%. Furthermore, for the HAM10000 dataset, applying FABT outperforms the baseline approach in terms of recall by 13.75%, and with the sole implementation of FAGT, achieves a maximum accuracy of 94.44%.
    摘要 制作深度学习模型用于医学影像分析是一个复杂的任务,特别在医学影像集lacks significant inter-class variation的情况下。这个挑战进一步加剧了在使用这些数据集来生成synthetic images using generative adversarial networks (GANs)时,GANs的输出 heavily relies on the input data。在这项研究中,我们提出了一种新的筛选算法called Cosine Similarity-based Image Filtering (CosSIF)。我们利用CosSIF开发了两种不同的筛选方法:Filtering Before GAN Training (FBGT)和Filtering After GAN Training (FAGT)。FBGT involves the removal of real images that exhibit similarities to images of other classes before using them as the training dataset for a GAN。 On the other hand, FAGT focuses on eliminating synthetic images with less discriminative features compared to real images used for training the GAN。实验结果表明,使用FABT或FBGT方法并与现代转换和卷积网络结合使用,可以实现明显的性能提升在多个评价指标中。FAGT实现在ISIC-2016数据集上超越基准方法,敏感性提升1.59%,AUC提升1.88%。此外,对HAM10000数据集应用FAGT,可以提高记忆率by 13.75%,并且只通过FAGT实现最高的准确率为94.44%。

A real-time material breakage detection for offshore wind turbines based on improved neural network algorithm

  • paper_url: http://arxiv.org/abs/2307.13765
  • repo_url: None
  • paper_authors: Yantong Liu
  • for: 提升海上风电机表面缺陷检测的稳定性，为可持续能源生产和风机维护做出贡献。
  • methods: 使用改进版YOLOv8物体检测模型，配备卷积块注意模块(CBAM)以增强特征识别能力，并进一步优化了损失函数。
  • results: 通过使用来自Saemangeum海上风电场的5,432张图像及一个公开数据集进行严谨测试，缺陷检测的稳定性得到显著提升，为海上风电机的高效维护迈出了重要一步。
    Abstract The integrity of offshore wind turbines, pivotal for sustainable energy generation, is often compromised by surface material defects. Despite the availability of various detection techniques, limitations persist regarding cost-effectiveness, efficiency, and applicability. Addressing these shortcomings, this study introduces a novel approach leveraging an advanced version of the YOLOv8 object detection model, supplemented with a Convolutional Block Attention Module (CBAM) for improved feature recognition. The optimized loss function further refines the learning process. Employing a dataset of 5,432 images from the Saemangeum offshore wind farm and a publicly available dataset, our method underwent rigorous testing. The findings reveal a substantial enhancement in defect detection stability, marking a significant stride towards efficient turbine maintenance. This study's contributions illuminate the path for future research, potentially revolutionizing sustainable energy practices.
    摘要 海上风电机的完整性是可持续能源生产的关键，但常因表面材料缺陷而受损。尽管已有多种检测技术，但在成本效益、效率和适用性方面仍存在限制。针对这些不足，本研究提出了一种新方法：利用改进版的YOLOv8物体检测模型，并辅以卷积块注意模块(CBAM)来提升特征识别能力，同时通过优化的损失函数进一步改善学习过程。我们使用来自Saemangeum海上风电场的5,432张图像和一个公开数据集，对该方法进行了严格测试。结果显示，缺陷检测的稳定性得到了大幅提升，朝着高效的风机维护迈出了重要一步。本研究的贡献为后续研究指明了方向，有望推动可持续能源实践的发展。
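The CBAM module mentioned above is a standard channel-plus-spatial attention block; a typical PyTorch implementation looks roughly like the following (the reduction ratio and spatial kernel size are the usual defaults, not necessarily the values used in this paper).

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial
    attention, as typically inserted into YOLO-style backbones or necks."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention from global average- and max-pooled descriptors.
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        # Spatial attention from channel-wise average and max maps.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa
```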

Implementing and Benchmarking the Locally Competitive Algorithm on the Loihi 2 Neuromorphic Processor

  • paper_url: http://arxiv.org/abs/2307.13762
  • repo_url: None
  • paper_authors: Gavin Parpart, Sumedh R. Risbud, Garrett T. Kenyon, Yijing Watkins
  • for: 这项研究旨在证明神经形态处理器可以实现高效、低功耗的数据处理，尤其适用于小型机器人、卫星等对SWaP(小型、轻量、低功耗)有严格要求的应用。
  • methods: 这项研究在神经形态处理器上实现了局部竞争算法(Locally Competitive Algorithm, LCA)，并针对Loihi 2处理器进行了专门的实现与优化。
  • results: 研究发现，在较大的稀疏惩罚下，基于Loihi 2的LCA实现比CPU和GPU设备高效且快速数个数量级，同时保持相近的重建质量；并且随着LCA参数向更高的表示稀疏度调整，这种性能优势进一步扩大。这些结果表明神经形态处理器能够在资源受限的设备上进行高效、准确的数据处理。
    Abstract Neuromorphic processors have garnered considerable interest in recent years for their potential in energy-efficient and high-speed computing. The Locally Competitive Algorithm (LCA) has been utilized for power efficient sparse coding on neuromorphic processors, including the first Loihi processor. With the Loihi 2 processor enabling custom neuron models and graded spike communication, more complex implementations of LCA are possible. We present a new implementation of LCA designed for the Loihi 2 processor and perform an initial set of benchmarks comparing it to LCA on CPU and GPU devices. In these experiments LCA on Loihi 2 is orders of magnitude more efficient and faster for large sparsity penalties, while maintaining similar reconstruction quality. We find this performance improvement increases as the LCA parameters are tuned towards greater representation sparsity. Our study highlights the potential of neuromorphic processors, particularly Loihi 2, in enabling intelligent, autonomous, real-time processing on small robots, satellites where there are strict SWaP (small, lightweight, and low power) requirements. By demonstrating the superior performance of LCA on Loihi 2 compared to conventional computing device, our study suggests that Loihi 2 could be a valuable tool in advancing these types of applications. Overall, our study highlights the potential of neuromorphic processors for efficient and accurate data processing on resource-constrained devices.
    摘要 近年来，神经形态处理器因其在高能效和高速计算方面的潜力而受到广泛关注。局部竞争算法(LCA)已被用于在神经形态处理器(包括第一代Loihi处理器)上实现低功耗的稀疏编码。随着Loihi 2处理器支持自定义神经元模型和分级脉冲通信，LCA可以有更复杂的实现方式。我们提出了一种专为Loihi 2处理器设计的新LCA实现，并进行了初步基准测试，将其与CPU和GPU设备上的LCA进行比较。实验结果显示，在较大的稀疏惩罚下，Loihi 2上的LCA效率和速度高出数个数量级，同时保持相近的重建质量；并且随着LCA参数向更高的表示稀疏度调整，这种性能提升进一步加大。我们的研究凸显了神经形态处理器(尤其是Loihi 2)在小型机器人、卫星等具有严格SWaP(小型、轻量、低功耗)要求的场景中实现智能、自主、实时处理的潜力。通过展示LCA在Loihi 2上相对于传统计算设备的优越性能，我们的研究表明Loihi 2有望成为推进此类应用的有力工具。总体而言，本研究凸显了神经形态处理器在资源受限设备上进行高效、准确数据处理的潜力。
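The LCA dynamics referred to above can be written down compactly; the NumPy sketch below shows the conventional rate-based formulation (soft-thresholded membrane potentials with lateral inhibition), whereas the Loihi 2 implementation realizes analogous dynamics with spiking neurons and graded spikes.

```python
import numpy as np

def lca_sparse_code(x, D, lam=0.1, tau=10.0, steps=500):
    """Locally Competitive Algorithm, reference (rate-based) NumPy version.
    x: (m,) input signal, D: (m, n) dictionary with unit-norm columns, lam: sparsity penalty.
    Returns a sparse code a such that x is approximately D @ a."""
    b = D.T @ x                          # feed-forward drive
    G = D.T @ D - np.eye(D.shape[1])     # lateral inhibition (competition) weights
    u = np.zeros(D.shape[1])             # membrane potentials
    for _ in range(steps):
        a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # soft-threshold activations
        u = u + (1.0 / tau) * (b - u - G @ a)                # leaky integration dynamics
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```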

PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View

  • paper_url: http://arxiv.org/abs/2307.13756
  • repo_url: https://github.com/sjingjia/planerectr
  • paper_authors: Jingjia Shi, Shuaifeng Zhi, Kai Xu
  • for: 本研究旨在提出一种仅从单张图像恢复3D平面的全新框架，即PlaneRecTR，该框架可以同时处理多个相关的子任务，包括平面检测、分割、参数估计和深度估计。
  • methods: PlaneRecTR 使用 Transformer 架构，通过基于查询(query)的统一学习，将所有相关子任务集成到一个紧凑的模型中。
  • results: 经过广泛的定量和定性实验，我们提出的统一学习方法使各子任务相互受益，并在公开的 ScanNet 和 NYUv2-Plane 数据集上取得了新的最优(state-of-the-art)性能。
    Abstract 3D plane recovery from a single image can usually be divided into several subtasks of plane detection, segmentation, parameter estimation and possibly depth estimation. Previous works tend to solve this task by either extending the RCNN-based segmentation network or the dense pixel embedding-based clustering framework. However, none of them tried to integrate above related subtasks into a unified framework but treat them separately and sequentially, which we suspect is potentially a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR, a Transformer-based architecture, which for the first time unifies all subtasks related to single-view plane recovery with a single compact model. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across subtasks, obtaining a new state-of-the-art performance on public ScanNet and NYUv2-Plane datasets. Codes are available at https://github.com/SJingjia/PlaneRecTR.
    摘要 从单张图像恢复3D平面通常可以分解为平面检测、分割、参数估计以及可能的深度估计等多个子任务。以往的工作通常通过扩展基于RCNN的分割网络或基于稠密像素嵌入的聚类框架来解决这一任务。然而，它们都没有尝试将上述相关子任务集成到一个统一的框架中，而是将它们分开、按顺序处理，我们认为这可能是现有方法性能受限的主要原因之一。受这一发现以及基于查询的学习在增强语义实体间推理方面取得成功的启发，我们在本文中提出了PlaneRecTR，一种基于Transformer的架构，首次用单个紧凑模型统一了单视图平面恢复的所有相关子任务。大量的定量和定性实验表明，我们提出的统一学习使各子任务相互受益，并在公开的ScanNet和NYUv2-Plane数据集上取得了新的最优性能。代码可以在https://github.com/SJingjia/PlaneRecTR获取。

ChildGAN: Large Scale Synthetic Child Facial Data Using Domain Adaptation in StyleGAN

  • paper_url: http://arxiv.org/abs/2307.13746
  • repo_url: None
  • paper_authors: Muhammad Ali Farooq, Wang Yao, Gabriel Costache, Peter Corcoran
  • for: 这篇论文提出了ChildGAN，一对由StyleGAN2衍生的GAN网络，用于生成男孩和女孩的合成人脸数据。
  • methods: 论文通过迁移学习实现平滑的领域迁移来构建ChildGAN，并渲染了包含多种智能面部变换的大规模数据集，涵盖表情、年龄增长、眨眼效果、头部姿态、肤色与发色变化以及不同的光照条件。
  • results: 论文通过多种计算机视觉应用测试验证了所生成儿童面部特征的独特性与真实性，包括基于CNN的儿童性别分类器、面部定位与面部关键点检测测试、使用ArcFace的身份相似度评估，以及眼睛检测和眼睛纵横比测试。
    Abstract In this research work, we proposed a novel ChildGAN, a pair of GAN networks for generating synthetic boys and girls facial data derived from StyleGAN2. ChildGAN is built by performing smooth domain transfer using transfer learning. It provides photo-realistic, high-quality data samples. A large-scale dataset is rendered with a variety of smart facial transformations: facial expressions, age progression, eye blink effects, head pose, skin and hair color variations, and variable lighting conditions. The dataset comprises more than 300k distinct data samples. Further, the uniqueness and characteristics of the rendered facial features are validated by running different computer vision application tests which include CNN-based child gender classifier, face localization and facial landmarks detection test, identity similarity evaluation using ArcFace, and lastly running eye detection and eye aspect ratio tests. The results demonstrate that synthetic child facial data of high quality offers an alternative to the cost and complexity of collecting a large-scale dataset from real children.
    摘要 在这项研究中，我们提出了一种新的ChildGAN，即由StyleGAN2衍生的一对GAN网络，用于生成男孩和女孩的合成面部数据。ChildGAN通过迁移学习实现平滑的领域迁移，能够提供照片级真实感的高质量数据样本。我们渲染了一个大规模数据集，包含多种智能面部变换：表情、年龄增长、眨眼效果、头部姿态、肤色与发色变化以及可变光照条件。该数据集包含超过30万个不同的数据样本。此外，我们通过运行多种计算机视觉应用测试来验证所渲染面部特征的独特性与特点，包括基于CNN的儿童性别分类器、面部定位与面部关键点检测测试、使用ArcFace的身份相似度评估，以及眼睛检测和眼睛纵横比测试。结果表明，高质量的合成儿童面部数据为从真实儿童收集大规模数据所需的成本和复杂度提供了一种可行的替代方案。

A Comprehensive Analysis on the Leakage of Fuzzy Matchers

  • paper_url: http://arxiv.org/abs/2307.13717
  • repo_url: None
  • paper_authors: Axel Durbet, Paul-Marie Grollemund, Kevin Thiry-Atighehchi
  • for: 本文对距离计算过程中的信息泄露进行了全面分析，重点关注基于阈值的模糊化距离(即模糊匹配器, Fuzzy Matcher)。
  • methods: 本文给出了一个详尽的信息泄露场景目录，以及这些场景对数据隐私安全性的影响。每个场景都对应一类通用攻击，并以攻击的计算成本来刻画安全水平的上界。
  • results: 分析结果显示，信息泄露可能源于恶意软件感染或使用隐私保护较弱的匹配器(例如侧信道攻击或部分模糊化的设计)，从而使攻击者得以获取敏感数据。
    Abstract This paper provides a comprehensive analysis of information leakage during distance evaluation, with an emphasis on threshold-based obfuscated distance (i.e., Fuzzy Matcher). Leakage can occur due to a malware infection or the use of a weakly privacy-preserving matcher, exemplified by side channel attacks or partially obfuscated designs. We provide an exhaustive catalog of information leakage scenarios as well as their impacts on the security concerning data privacy. Each of the scenarios leads to generic attacks whose impacts are expressed in terms of computational costs, hence allowing the establishment of upper bounds on the security level.
    摘要 本文对距离计算过程中的信息泄露进行了全面分析，重点关注基于阈值的模糊化距离(即模糊匹配器)。泄露可能由恶意软件感染或使用隐私保护较弱的匹配器引起，例如侧信道攻击或部分模糊化的设计。我们给出了一个详尽的信息泄露场景目录及其对数据隐私安全性的影响。每个场景都对应一类通用攻击，其影响以计算成本来表示，从而可以确定安全水平的上界。
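To make the threat model concrete, the toy sketch below shows a threshold-based fuzzy matcher on binary templates and how a matcher that leaks the raw distance (rather than a single accept/reject bit) enables a simple hill-climbing recovery of the template; the bit-string setting and the attack are illustrative and not taken from the paper.

```python
import numpy as np

def fuzzy_match(template, probe, threshold=30):
    """Threshold-based fuzzy matcher on binary templates: accept if the Hamming
    distance is below the threshold. Returning only the accept/reject decision is
    the privacy-friendly behavior; leaking the distance itself (e.g. via a side
    channel or a weak design) enables generic attacks of the kind analyzed here."""
    dist = int(np.count_nonzero(template != probe))
    return dist <= threshold

# Toy illustration of exploiting a distance-leaking matcher: flip one bit at a
# time and keep the flip whenever the leaked distance decreases (hypothetical setting).
rng = np.random.default_rng(0)
template = rng.integers(0, 2, size=64, dtype=np.uint8)
probe = rng.integers(0, 2, size=64, dtype=np.uint8)
for i in range(64):
    trial = probe.copy()
    trial[i] ^= 1
    if np.count_nonzero(template != trial) < np.count_nonzero(template != probe):
        probe = trial                 # each accepted query reduces the distance by one
assert np.array_equal(probe, template)
```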

Personal Protective Equipment Detection in Extreme Construction Conditions

  • paper_url: http://arxiv.org/abs/2307.13654
  • repo_url: None
  • paper_authors: Yuexiong Ding, Xiaowei Luo
  • for: 这项研究旨在开发一个鲁棒的个人防护装备(PPE)检测模型，以应对建筑工程中的极端环境。
  • methods: 这项研究结合了神经风格迁移(NST)和YOLOv5技术，构建了鲁棒的检测模型NST-YOLOv5。
  • results: 实验结果显示，NST模块比传统图像处理算法更擅长模拟极端环境，并帮助NST-YOLOv5模型在合成与真实的极端数据上分别获得0.141和0.083的mAP_(05:95)提升。
    Abstract Object detection has been widely applied for construction safety management, especially personal protective equipment (PPE) detection. Though the existing PPE detection models trained on conventional datasets have achieved excellent results, their performance dramatically declines in extreme construction conditions. A robust detection model NST-YOLOv5 is developed by combining the neural style transfer (NST) and YOLOv5 technologies. Five extreme conditions are considered and simulated via the NST module to endow the detection model with excellent robustness, including low light, intense light, sand dust, fog, and rain. Experiments show that the NST has great potential as a tool for extreme data synthesis since it is better at simulating extreme conditions than other traditional image processing algorithms and helps the NST-YOLOv5 achieve 0.141 and 0.083 mAP_(05:95) improvements in synthesized and real-world extreme data. This study provides a new feasible way to obtain a more robust detection model for extreme construction conditions.
    摘要 目标检测已被广泛应用于建筑安全管理，尤其是个人防护装备(PPE)检测。现有的PPE检测模型在常规数据集上训练后已取得出色的结果，但在极端施工条件下其性能会显著下降。为此，本研究结合神经风格迁移(NST)和YOLOv5技术，开发了鲁棒的检测模型NST-YOLOv5。我们考虑了低光照、强光照、沙尘、雾和雨等五种极端条件，并通过NST模块进行模拟，使检测模型具备出色的鲁棒性。实验表明，NST在模拟极端条件方面优于其他传统图像处理算法，是一种很有潜力的极端数据合成工具，它帮助NST-YOLOv5在合成和真实的极端数据上分别取得0.141和0.083的mAP_(05:95)提升。本研究为获得适用于极端施工条件的更鲁棒检测模型提供了一条新的可行途径。

Learning Transferable Object-Centric Diffeomorphic Transformations for Data Augmentation in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.13645
  • repo_url: None
  • paper_authors: Nilesh Kumar, Prashnna K. Gyawali, Sandesh Ghimire, Linwei Wang
  • for: 这篇论文旨在解决医疗图像分割中获取标注数据的挑战，尤其是需要专家逐像素标注的问题。
  • methods: 现有方法通常用可变形变换来扩增感兴趣对象以缓解这一挑战，但这些变换是针对整幅图像全局学习的，难以跨数据集迁移或应用于图像对齐困难的问题。为此，论文提出了一种新的以对象为中心的数据增强模型，能够学习感兴趣对象的形状变化，并在不修改图像其他部分的情况下就地增强对象。
  • results: 论文在肾肿瘤分割任务上验证了该模型的有效性：无论形状变化来自同一数据集内部，还是从外部数据集迁移而来，都能提升分割性能。
    Abstract Obtaining labelled data in medical image segmentation is challenging due to the need for pixel-level annotations by experts. Recent works have shown that augmenting the object of interest with deformable transformations can help mitigate this challenge. However, these transformations have been learned globally for the image, limiting their transferability across datasets or applicability in problems where image alignment is difficult. While object-centric augmentations provide a great opportunity to overcome these issues, existing works are only focused on position and random transformations without considering shape variations of the objects. To this end, we propose a novel object-centric data augmentation model that is able to learn the shape variations for the objects of interest and augment the object in place without modifying the rest of the image. We demonstrated its effectiveness in improving kidney tumour segmentation when leveraging shape variations learned both from within the same dataset and transferred from external datasets.
    摘要 Simplified Chinese:医学图像分割中获取标注数据具有挑战性,因为需要专家进行像素级别的标注。最近的研究表明,将对象兴趣添加到扩展变换可以减轻这些挑战。然而,这些变换通常是基于整个图像而学习的,导致其在不同数据集或图像对齐问题中的传输性不佳。而对象中心的扩展变换提供了一个大好的机会来超越这些问题,但现有的工作仅关注位置和随机变换而忽略对象形状的变化。为此,我们提议一种新的对象中心数据增强模型,能够学习对象兴趣的形状变化并在图像中增强对象而不改变其他部分。我们在使用同一个数据集中的形状变化和外部数据集中的形状变化来优化肾癌分 segmentation时进行了证明。

Optical Flow boosts Unsupervised Localization and Segmentation

  • paper_url: http://arxiv.org/abs/2307.13640
  • repo_url: https://github.com/mlzxy/flowdino
  • paper_authors: Xinyu Zhang, Abdeslam Boularias
  • for: 本研究旨在提出一种基于运动线索的无监督定位与分割方法，以应对自主机器人视觉中的这一长期挑战。
  • methods: 我们提出了一种新的损失函数形式，利用无标注视频中的光流，鼓励空间位置运动相似的自监督视觉Transformer(ViT)特征相互靠近，反之则相互远离。
  • results: 我们的微调过程在线性探测下超过了当前最佳的无监督语义分割方法，并且在无监督物体定位和语义分割基准上也优于原始ViT网络。
    Abstract Unsupervised localization and segmentation are long-standing robot vision challenges that describe the critical ability for an autonomous robot to learn to decompose images into individual objects without labeled data. These tasks are important because of the limited availability of dense image manual annotation and the promising vision of adapting to an evolving set of object categories in lifelong learning. Most recent methods focus on using visual appearance continuity as object cues by spatially clustering features obtained from self-supervised vision transformers (ViT). In this work, we leverage motion cues, inspired by the common fate principle that pixels that share similar movements tend to belong to the same object. We propose a new loss term formulation that uses optical flow in unlabeled videos to encourage self-supervised ViT features to become closer to each other if their corresponding spatial locations share similar movements, and vice versa. We use the proposed loss function to finetune vision transformers that were originally trained on static images. Our fine-tuning procedure outperforms state-of-the-art techniques for unsupervised semantic segmentation through linear probing, without the use of any labeled data. This procedure also demonstrates increased performance over original ViT networks across unsupervised object localization and semantic segmentation benchmarks.
    摘要 无监督定位与分割是机器人视觉领域长期存在的挑战，它们刻画了自主机器人在没有标注数据的情况下学习将图像分解为单个物体的关键能力。这些任务之所以重要，一方面是因为稠密的图像人工标注十分稀缺，另一方面是因为它们有望在终身学习中适应不断演化的物体类别集合。最新的方法大多利用视觉外观的连续性作为物体线索，对自监督视觉Transformer(ViT)提取的特征进行空间聚类。在本工作中，我们受“共同命运”原则(运动相似的像素往往属于同一物体)的启发，引入运动线索。我们提出了一种新的损失项，利用无标注视频中的光流：若两个空间位置的运动相似，则鼓励对应的自监督ViT特征相互靠近，反之则相互远离。我们使用该损失函数微调原本在静态图像上训练的视觉Transformer。在不使用任何标注数据的情况下，我们的微调过程在线性探测下超越了当前最佳的无监督语义分割技术，并且在无监督物体定位和语义分割基准上均优于原始ViT网络。
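One plausible form of such a motion-guided loss is sketched below: patch features whose optical-flow vectors are similar are pulled together, and other pairs are pushed apart. The pair sampling, similarity threshold, and hinge margin are assumptions made for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def flow_guided_feature_loss(feats, flow, num_pairs=4096, sim_threshold=0.9):
    """Hypothetical sketch of a motion-guided loss in the spirit of this work.
    feats: (N, d) per-patch ViT features; flow: (N, 2) per-patch mean optical flow
    on the same grid. Pairs with similar flow ("common fate") should have similar
    features; other pairs should not."""
    idx_a = torch.randint(0, feats.shape[0], (num_pairs,))
    idx_b = torch.randint(0, feats.shape[0], (num_pairs,))
    f_sim = F.cosine_similarity(feats[idx_a], feats[idx_b], dim=1)
    m_sim = F.cosine_similarity(flow[idx_a], flow[idx_b], dim=1)
    target = (m_sim > sim_threshold).float()          # 1 = same motion, 0 = different
    # Hinge-style objective: similar motion -> high feature similarity, else low.
    loss = target * (1 - f_sim) + (1 - target) * F.relu(f_sim - 0.3)
    return loss.mean()
```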

Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Shape Estimation

  • paper_url: http://arxiv.org/abs/2307.13639
  • repo_url: None
  • paper_authors: Will Rowan, Patrik Huber, Nick Pears, Andrew Keeling
  • for: bridging the gap between 2D and 3D face shape estimation
  • methods: conditioned stable diffusion model for face image generation, leveraging abundant 2D facial information to inform 3D space
  • results: large-scale synthesized dataset of 250K photorealistic images and corresponding 3DMM parameters, and a deep neural network (ControlFace) that achieves competitive performance on the NoW benchmark without requiring 3D supervision or manual 3D asset creation.
    Abstract Accurate 3D face shape estimation is an enabling technology with applications in healthcare, security, and creative industries, yet current state-of-the-art methods either rely on self-supervised training with 2D image data or supervised training with very limited 3D data. To bridge this gap, we present a novel approach which uses a conditioned stable diffusion model for face image generation, leveraging the abundance of 2D facial information to inform 3D space. By conditioning stable diffusion on depth maps sampled from a 3D Morphable Model (3DMM) of the human face, we generate diverse and shape-consistent images, forming the basis of SynthFace. We introduce this large-scale synthesised dataset of 250K photorealistic images and corresponding 3DMM parameters. We further propose ControlFace, a deep neural network, trained on SynthFace, which achieves competitive performance on the NoW benchmark, without requiring 3D supervision or manual 3D asset creation.
    摘要 精确的3D人脸形状估计是一项赋能技术，在医疗、安防和创意产业中都有应用，然而当前最先进的方法要么依赖于使用2D图像数据的自监督训练，要么依赖于使用非常有限的3D数据的监督训练。为弥合这一差距，我们提出了一种新方法：使用条件化的Stable Diffusion模型生成人脸图像，借助丰富的2D人脸信息来指导3D空间。通过以从人脸3D形变模型(3DMM)采样得到的深度图为条件进行生成，我们得到了形状一致且多样的图像，并以此构建了SynthFace——一个包含25万张照片级真实感图像及对应3DMM参数的大规模合成数据集。我们进一步提出了ControlFace，一个在SynthFace上训练的深度神经网络，它在NoW基准上取得了有竞争力的表现，且无需3D监督或手动创建3D资产。

RecursiveDet: End-to-End Region-based Recursive Object Detection

  • paper_url: http://arxiv.org/abs/2307.13619
  • repo_url: https://github.com/bravezzzzzz/recursivedet
  • paper_authors: Jing Zhao, Li Sun, Qingli Li
  • for: 在减少参数量的同时提升端到端基于区域的目标检测器的性能，使其更加高效。
  • methods: 提出一种递归解码方法：让各解码阶段共享参数，构成递归解码器，从而提升检测器的性能；并在解码器中加入候选框的位置编码(PE)，使其感知输入bounding box的具体位置和尺寸，从而在递归过程中自适应不同阶段的候选框。
  • results: 通过对多个主流的基于区域的检测器进行大量消融实验并构建完整模型，在参数更少、计算成本仅略有增加的情况下实现了明显的性能提升。
    Abstract End-to-end region-based object detectors like Sparse R-CNN usually have multiple cascade bounding box decoding stages, which refine the current predictions according to their previous results. Model parameters within each stage are independent, evolving a huge cost. In this paper, we find the general setting of decoding stages is actually redundant. By simply sharing parameters and making a recursive decoder, the detector already obtains a significant improvement. The recursive decoder can be further enhanced by positional encoding (PE) of the proposal box, which makes it aware of the exact locations and sizes of input bounding boxes, thus becoming adaptive to proposals from different stages during the recursion. Moreover, we also design centerness-based PE to distinguish the RoI feature element and dynamic convolution kernels at different positions within the bounding box. To validate the effectiveness of the proposed method, we conduct intensive ablations and build the full model on three recent mainstream region-based detectors. The RecusiveDet is able to achieve obvious performance boosts with even fewer model parameters and slightly increased computation cost. Codes are available at https://github.com/bravezzzzzz/RecursiveDet.
    摘要 Sparse R-CNN等端到端基于区域的目标检测器通常包含多个级联的bounding box解码阶段，每个阶段根据前一阶段的结果细化当前预测。各阶段的模型参数相互独立，带来巨大的开销。在本文中，我们发现这种解码阶段的通用设置实际上是冗余的：只需共享参数并构成递归解码器，检测器就已能获得显著提升。递归解码器还可以通过候选框的位置编码(PE)进一步增强，使其感知输入bounding box的具体位置和尺寸，从而在递归过程中自适应来自不同阶段的候选框。此外，我们还设计了基于中心度(centerness)的PE，用于区分bounding box内不同位置的RoI特征元素和动态卷积核。为验证所提方法的有效性，我们进行了大量消融实验，并在三个近期主流的基于区域的检测器上构建了完整模型。RecursiveDet能够在模型参数更少、计算成本仅略有增加的情况下实现明显的性能提升。代码可以在https://github.com/bravezzzzzz/RecursiveDet获取。
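The recursion idea can be sketched as a single shared decoder stage applied repeatedly, with the current boxes re-embedded as positional encodings at each step; the sketch below simplifies away RoI re-pooling and the centerness-based PE described in the abstract.

```python
import torch
import torch.nn as nn

class RecursiveDecoder(nn.Module):
    """Sketch of the recursion idea: one shared decoder stage is applied repeatedly,
    and the current boxes are re-encoded as positional embeddings at every step so
    the same stage can adapt to proposals of different refinement stages."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.stage = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.box_pe = nn.Linear(4, dim)     # embed (cx, cy, w, h) of each proposal box
        self.box_head = nn.Linear(dim, 4)   # predicts a box refinement (delta)

    def forward(self, queries, roi_feats, boxes, num_recursions=6):
        # queries: (B, Q, dim), roi_feats: (B, Q, dim), boxes: (B, Q, 4) normalized.
        # In the real method the RoI features would be re-pooled from the updated
        # boxes at each step; here they are kept fixed for simplicity.
        for _ in range(num_recursions):     # the same weights are reused at every stage
            q = queries + self.box_pe(boxes)
            queries = self.stage(q, roi_feats)
            boxes = (boxes + self.box_head(queries)).clamp(0, 1)
        return queries, boxes
```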

Object-based Probabilistic Similarity Evidence of Sparse Latent Features from Fully Convolutional Networks

  • paper_url: http://arxiv.org/abs/2307.13606
  • repo_url: https://github.com/cjuliani/probabilistic-similarity-evidence-FCN
  • paper_authors: Cyril Juliani
  • for: 本研究旨在探讨基于神经网络的相似性分析方法，以便更好地理解和归类复杂模式。
  • methods: 该研究利用全卷积网络(FCN)学到的潜在表示，并通过模糊推理来估计2D图像中分割出的对象之间的视觉相似性。
  • results: 研究发现，引入考虑潜在变量重要性的加权方案可以进一步改进相似性推断，分析结果揭示了基于神经网络的相似性分析在有效辨别数据模式方面的优势与挑战。
    Abstract Similarity analysis using neural networks has emerged as a powerful technique for understanding and categorizing complex patterns in various domains. By leveraging the latent representations learned by neural networks, data objects such as images can be compared effectively. This research explores the utilization of latent information generated by fully convolutional networks (FCNs) in similarity analysis, notably to estimate the visual resemblance of objects segmented in 2D pictures. To do this, the analytical scheme comprises two steps: (1) extracting and transforming feature patterns per 2D object from a trained FCN, and (2) identifying the most similar patterns through fuzzy inference. The step (2) can be further enhanced by incorporating a weighting scheme that considers the significance of latent variables in the analysis. The results provide valuable insights into the benefits and challenges of employing neural network-based similarity analysis for discerning data patterns effectively.
    摘要 基于神经网络的相似性分析已成为在各个领域理解和归类复杂模式的有力技术。借助神经网络学到的潜在表示，可以有效地比较图像等数据对象。本研究探讨了如何在相似性分析中利用全卷积网络(FCN)生成的潜在信息，特别是用于估计2D图像中分割出的对象之间的视觉相似程度。分析流程包括两步：(1)从训练好的FCN中为每个2D对象提取并变换特征模式；(2)通过模糊推理确定最相似的模式。第二步还可以通过引入考虑潜在变量重要性的加权方案进一步增强。研究结果为利用基于神经网络的相似性分析有效辨别数据模式的优势与挑战提供了有价值的见解。

Decisive Data using Multi-Modality Optical Sensors for Advanced Vehicular Systems

  • paper_url: http://arxiv.org/abs/2307.13600
  • repo_url: None
  • paper_authors: Muhammad Ali Farooq, Waseem Shariff, Mehdi Sefidgar Dilmaghani, Wang Yao, Moazam Soomro, Peter Corcoran
  • for: 这篇论文主要针对多种光学技术的设计与开发，以实现先进的舱外前视系统和舱内驾驶员监测系统。
  • methods: 论文讨论的光学传感器包括长波红外热成像(LWIR)相机、近红外(NIR)相机、神经形态/事件相机、可见光CMOS相机和深度相机。
  • results: 论文描述了这些光学技术在真实环境中的应用,以及它们在不同应用中的独特优势。
    Abstract Optical sensors have played a pivotal role in acquiring real world data for critical applications. This data, when integrated with advanced machine learning algorithms provides meaningful information thus enhancing human vision. This paper focuses on various optical technologies for design and development of state-of-the-art out-cabin forward vision systems and in-cabin driver monitoring systems. The focused optical sensors include Longwave Thermal Imaging (LWIR) cameras, Near Infrared (NIR), Neuromorphic/ event cameras, Visible CMOS cameras and Depth cameras. Further the paper discusses different potential applications which can be employed using the unique strengths of each these optical modalities in real time environment.
    摘要 光学传感器在为关键应用采集真实世界数据方面发挥了重要作用；将这些数据与先进的机器学习算法结合，可以提供有意义的信息，从而增强人类视觉。这篇论文聚焦于用于设计和开发最先进的舱外前视系统与舱内驾驶员监测系统的多种光学技术，涉及的光学传感器包括长波红外热成像(LWIR)相机、近红外(NIR)相机、神经形态/事件相机、可见光CMOS相机和深度相机。论文还讨论了在实时环境中利用每种光学模态的独特优势可以实现的各种潜在应用。