cs.CV - 2023-07-01

Learning Content-enhanced Mask Transformer for Domain Generalized Urban-Scene Segmentation

  • paper_url: http://arxiv.org/abs/2307.00371
  • repo_url: None
  • paper_authors: Qi Bi, Shaodi You, Theo Gevers
  • for: This work aims to develop a semantic segmentation method that generalizes across diverse urban-scene styles (domain-generalized urban-scene semantic segmentation, USSS).
  • methods: The paper proposes a Transformer-based Content-enhanced Mask TransFormer (CMFormer), which strengthens the mask attention mechanism to improve the model's focus on content information.
  • results: Experiments show that CMFormer performs strongly on semantic segmentation across diverse urban-scene styles, achieving up to a 14.00% mIoU (mean intersection over union) improvement over existing CNN-based methods.
    Abstract Domain-generalized urban-scene semantic segmentation (USSS) aims to learn generalized semantic predictions across diverse urban-scene styles. Unlike domain gap challenges, USSS is unique in that the semantic categories are often similar in different urban scenes, while the styles can vary significantly due to changes in urban landscapes, weather conditions, lighting, and other factors. Existing approaches typically rely on convolutional neural networks (CNNs) to learn the content of urban scenes. In this paper, we propose a Content-enhanced Mask TransFormer (CMFormer) for domain-generalized USSS. The main idea is to enhance the focus of the fundamental component, the mask attention mechanism, in Transformer segmentation models on content information. To achieve this, we introduce a novel content-enhanced mask attention mechanism. It learns mask queries from both the image feature and its down-sampled counterpart, as lower-resolution image features usually contain more robust content information and are less sensitive to style variations. These features are fused into a Transformer decoder and integrated into a multi-resolution content-enhanced mask attention learning scheme. Extensive experiments conducted on various domain-generalized urban-scene segmentation datasets demonstrate that the proposed CMFormer significantly outperforms existing CNN-based methods for domain-generalized semantic segmentation, achieving improvements of up to 14.00\% in terms of mIoU (mean intersection over union). The source code for CMFormer will be made available at this \href{https://github.com/BiQiWHU/domain-generalized-urban-scene-segmentation}{repository}.
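
The core mechanism above can be pictured with a small PyTorch sketch: mask queries attend to the full-resolution feature map and to a down-sampled copy, and the two results are fused. This is only an illustration of the multi-resolution idea under assumed shapes; the real CMFormer also applies predicted-mask attention masking inside a Mask2Former-style decoder, which is omitted here.

```python
# Illustrative sketch (not the authors' code) of multi-resolution,
# content-enhanced query attention; dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEnhancedMaskAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn_hi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_lo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, queries, feat):            # queries: (B, Q, C), feat: (B, C, H, W)
        feat_lo = F.avg_pool2d(feat, 2)          # lower resolution: more content, less style
        hi = feat.flatten(2).transpose(1, 2)     # (B, H*W, C)
        lo = feat_lo.flatten(2).transpose(1, 2)  # (B, H*W/4, C)
        q_hi, _ = self.attn_hi(queries, hi, hi)  # mask queries vs. full-resolution features
        q_lo, _ = self.attn_lo(queries, lo, lo)  # mask queries vs. down-sampled features
        return self.fuse(torch.cat([q_hi, q_lo], dim=-1))  # fused, content-enhanced queries

# queries = torch.randn(2, 100, 256); feat = torch.randn(2, 256, 32, 32)
# out = ContentEnhancedMaskAttention()(queries, feat)      # (2, 100, 256)
```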

Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.00347
  • repo_url: None
  • paper_authors: Yifan Zhang, Zhiyu Zhu, Junhui Hou
  • for: This paper studies DETR-style models for multi-frame 3D object detection and proposes STEMD, a DETR-based end-to-end framework for this setting.
  • methods: STEMD follows a DETR-like paradigm, casting multi-frame 3D object detection as a sequence-to-sequence task, and adds a spatial-temporal graph attention network to better capture the spatial-temporal dependencies between objects.
  • results: Experiments show that STEMD achieves better multi-frame 3D detection in challenging scenarios while adding only a minor computational overhead.
    Abstract The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in the domain of multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework for multi-frame 3D object detection based on the DETR-like paradigm. Our approach treats multi-frame 3D object detection as a sequence-to-sequence task and effectively captures spatial-temporal dependencies at both the feature and query levels. To model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network. This network represents queries as nodes in a graph and enables effective modeling of object interactions within a social context. In addition, to solve the problem of missing hard cases in the proposed output of the encoder in the current frame, we incorporate the output of the previous frame to initialize the query input of the decoder. Moreover, we tackle the issue of redundant detection results, where the model generates numerous overlapping boxes from similar queries. To mitigate this, we introduce an IoU regularization term in the loss function. This term aids in distinguishing between queries matched with the ground-truth box and queries that are similar but unmatched during the refinement process, leading to reduced redundancy and more accurate detections. Through extensive experiments, we demonstrate the effectiveness of our approach in handling challenging scenarios, while incurring only a minor additional computational overhead. The code will be available at \url{https://github.com/Eaphan/STEMD}.
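
The spatial-temporal graph attention idea (object queries as graph nodes exchanging information) can be sketched roughly as follows; the distance-based edge rule, the module name, and all shapes are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: queries attend to each other, with edges limited to
# spatially nearby detections (a crude "social context" graph).
import torch
import torch.nn as nn

class QueryGraphAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, radius=4.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_heads = num_heads
        self.radius = radius                       # assumed distance threshold for an edge

    def forward(self, queries, centers):
        # queries: (B, N, C) decoder query embeddings; centers: (B, N, 2) predicted box centers
        dist = torch.cdist(centers, centers)       # (B, N, N) pairwise center distances
        no_edge = dist > self.radius               # True = attention disallowed
        mask = no_edge.repeat_interleave(self.num_heads, dim=0)  # (B*heads, N, N)
        out, _ = self.attn(queries, queries, queries, attn_mask=mask)
        return queries + out                       # residual update of the graph nodes
```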

SDRCNN: A single-scale dense residual connected convolutional neural network for pansharpening

  • paper_url: http://arxiv.org/abs/2307.00327
  • repo_url: None
  • paper_authors: Yuan Fang, Yuanzhi Cai, Lei Fan
  • for: This work develops a single-branch, single-scale lightweight convolutional neural network (SDRCNN) for fusing high-resolution panchromatic and low-resolution multispectral images (pansharpening).
  • methods: SDRCNN uses a novel dense residual connected structure and convolution block to achieve a better trade-off between accuracy and efficiency.
  • results: On four datasets from the WorldView-3, WorldView-2 and QuickBird satellites, SDRCNN performed best in both visual inspection and quantitative evaluation compared with traditional methods and lightweight deep learning methods.
    Abstract Pansharpening is a process of fusing a high spatial resolution panchromatic image and a low spatial resolution multispectral image to create a high-resolution multispectral image. A novel single-branch, single-scale lightweight convolutional neural network, named SDRCNN, is developed in this study. By using a novel dense residual connected structure and convolution block, SDRCNN achieved a better trade-off between accuracy and efficiency. The performance of SDRCNN was tested using four datasets from the WorldView-3, WorldView-2 and QuickBird satellites. The compared methods include eight traditional methods (i.e., GS, GSA, PRACS, BDSD, SFIM, GLP-CBD, CDIF and LRTCFPan) and five lightweight deep learning methods (i.e., PNN, PanNet, BayesianNet, DMDNet and FusionNet). Based on a visual inspection of the pansharpened images created and the associated absolute residual maps, SDRCNN exhibited least spatial detail blurring and spectral distortion, amongst all the methods considered. The values of the quantitative evaluation metrics were closest to their ideal values when SDRCNN was used. The processing time of SDRCNN was also the shortest among all methods tested. Finally, the effectiveness of each component in the SDRCNN was demonstrated in ablation experiments. All of these confirmed the superiority of SDRCNN.
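
A rough sketch of what a "dense residual connected" convolution block could look like is given below; channel counts, growth rate, and depth are placeholders, and the real SDRCNN block may differ in its exact wiring.

```python
# Illustrative sketch: dense connections inside the block plus a residual
# skip around it, approximating the structure the abstract describes.
import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    def __init__(self, channels=64, growth=32, layers=3):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = channels
        for _ in range(layers):
            self.convs.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth                          # dense: each layer sees all earlier outputs
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # residual skip preserves spectral content
```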

DeepMediX: A Deep Learning-Driven Resource-Efficient Medical Diagnosis Across the Spectrum

  • paper_url: http://arxiv.org/abs/2307.00324
  • repo_url: None
  • paper_authors: Kishore Babu Nampalle, Pradeep Singh, Uppala Vivek Narayan, Balasubramanian Raman
  • for: This work proposes a highly accurate yet computationally efficient medical image diagnosis model to address the challenges of medical imaging diagnostics.
  • methods: The model is built on the MobileNetV2 architecture and incorporates Federated Learning, enabling collaborative learning across institutions without direct access to sensitive patient data, thereby preserving data privacy and integrity.
  • results: Rigorous testing shows that DeepMediX has strong diagnostic capabilities, matching or surpassing existing models on most tasks, and its low computational footprint makes it suitable for deployment on handheld devices for real-time diagnostic support.
    Abstract In the rapidly evolving landscape of medical imaging diagnostics, achieving high accuracy while preserving computational efficiency remains a formidable challenge. This work presents \texttt{DeepMediX}, a groundbreaking, resource-efficient model that significantly addresses this challenge. Built on top of the MobileNetV2 architecture, DeepMediX excels in classifying brain MRI scans and skin cancer images, with superior performance demonstrated on both binary and multiclass skin cancer datasets. It provides a solution to labor-intensive manual processes, the need for large datasets, and complexities related to image properties. DeepMediX's design also includes the concept of Federated Learning, enabling a collaborative learning approach without compromising data privacy. This approach allows diverse healthcare institutions to benefit from shared learning experiences without the necessity of direct data access, enhancing the model's predictive power while preserving the privacy and integrity of sensitive patient data. Its low computational footprint makes DeepMediX suitable for deployment on handheld devices, offering potential for real-time diagnostic support. Through rigorous testing on standard datasets, including the ISIC2018 for dermatological research, DeepMediX demonstrates exceptional diagnostic capabilities, matching the performance of existing models on almost all tasks and even outperforming them in some cases. The findings of this study underline significant implications for the development and deployment of AI-based tools in medical imaging and their integration into point-of-care settings. The source code and models generated would be released at https://github.com/kishorebabun/DeepMediX.
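
The federated learning ingredient can be illustrated with a minimal FedAvg-style sketch: each site trains a MobileNetV2 locally and only weights are averaged, so patient data never leaves the institution. The helper names and the number of classes are assumptions, and local training is elided.

```python
# Minimal federated-averaging sketch (not the authors' code).
import copy
import torch
from torchvision.models import mobilenet_v2

def federated_average(client_state_dicts):
    """Average floating-point parameters from several locally trained client models."""
    avg = copy.deepcopy(client_state_dicts[0])
    for key in avg:
        if avg[key].is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in client_state_dicts]).mean(dim=0)
    return avg

global_model = mobilenet_v2(num_classes=7)  # e.g. a multiclass skin-lesion head (assumed)
# Hypothetical round: each site starts from the global weights, trains on its private
# scans with a local function such as train_locally(...), and returns only a state_dict.
# client_state_dicts = [train_locally(copy.deepcopy(global_model), site) for site in sites]
# global_model.load_state_dict(federated_average(client_state_dicts))
```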

Automatic Solver Generator for Systems of Laurent Polynomial Equations

  • paper_url: http://arxiv.org/abs/2307.00320
  • repo_url: None
  • paper_authors: Evgeniy Martyushev, Snehal Bhayani, Tomas Pajdla
  • for: Solving families of (Laurent) polynomial systems with the same monomial structure but varying coefficients, i.e., finding a solver that computes solutions for any family member as fast as possible.
  • methods: The paper proposes a new practical algorithm that checks whether a given set of Laurent polynomials is sufficient to construct an elimination template; based on this algorithm, an automatic solver generator for systems of Laurent polynomial equations is built.
  • results: The generator is simple and fast, applies to ideals with positive-dimensional components, and automatically uncovers partial $p$-fold symmetries; on various minimal problems, mostly in geometric computer vision, the generated solvers are numerically accurate and mostly faster than the state of the art.
    Abstract In computer vision applications, the following problem often arises: Given a family of (Laurent) polynomial systems with the same monomial structure but varying coefficients, find a solver that computes solutions for any family member as fast as possible. Under appropriate genericity assumptions, the dimension and degree of the respective polynomial ideal remain unchanged for each particular system in the same family. The state-of-the-art approach to solving such problems is based on elimination templates, which are the coefficient (Macaulay) matrices that encode the transformation from the initial polynomials to the polynomials needed to construct the action matrix. Knowing an action matrix, the solutions of the system are computed from its eigenvectors. The important property of an elimination template is that it applies to all polynomial systems in the family. In this paper, we propose a new practical algorithm that checks whether a given set of Laurent polynomials is sufficient to construct an elimination template. Based on this algorithm, we propose an automatic solver generator for systems of Laurent polynomial equations. The new generator is simple and fast; it applies to ideals with positive-dimensional components; it allows one to uncover partial $p$-fold symmetries automatically. We test our generator on various minimal problems, mostly in geometric computer vision. The speed of the generated solvers exceeds the state-of-the-art in most cases. In particular, we propose the solvers for the following problems: optimal 3-view triangulation, semi-generalized hybrid pose estimation and minimal time-of-arrival self-calibration. The experiments on synthetic scenes show that our solvers are numerically accurate and either comparable to or significantly faster than the state-of-the-art solvers.
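
The final step mentioned in the abstract, reading solutions off an action matrix, is easiest to see in the univariate case, where the action matrix of multiplication by x is simply the companion matrix and its eigenvalues are the roots. The toy example below only illustrates that eigen step; real elimination templates handle multivariate Laurent systems.

```python
# Toy illustration of the action-matrix step: roots of p(x) = x^2 - 3x + 2
# from the eigenvalues of the "multiply by x" matrix in the quotient ring.
import numpy as np

companion = np.array([[0.0, -2.0],
                      [1.0,  3.0]])   # x*1 = x, x*x = 3x - 2 in basis {1, x}
eigvals, eigvecs = np.linalg.eig(companion)
print(sorted(eigvals.real))            # -> [1.0, 2.0]
```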

Detection of River Sandbank for Sand Mining with the Presence of Other High Mineral Content Regions Using Multi-spectral Images

  • paper_url: http://arxiv.org/abs/2307.00314
  • repo_url: None
  • paper_authors: Jit Mukherjee
  • for: Detecting potential river sandbank regions for sand mining, which directly impacts the economy, society, and the environment.
  • methods: Prior work uses semi-supervised and supervised techniques, some with multi-modal analysis combining multi-spectral imaging, synthetic aperture radar (SAR) imaging, aerial images, and point cloud data, but the distinguishing spectral characteristics of river sandbank regions remain underexplored.
  • results: The paper proposes a novel multi-spectral method that detects river sandbank regions without any labeled data. It exploits the association with a river stream and the abundance of minerals to derive a robust spectral signature, achieving an average accuracy, precision, and recall of 90.75%, 85.47%, and 73.5%, respectively, across seasons on Landsat 8 images.
    Abstract Sand mining is a booming industry. The river sandbank is one of the primary sources of sand mining. Detection of potential river sandbank regions for sand mining directly impacts the economy, society, and environment. In the past, semi-supervised and supervised techniques have been used to detect mining regions including sand mining. A few techniques employ multi-modal analysis combining different modalities such as multi-spectral imaging, synthetic aperture radar (\emph{SAR}) imaging, aerial images, and point cloud data. However, the distinguishing spectral characteristics of river sandbank regions are yet to be fully explored. This paper provides a novel method to detect river sandbank regions for sand mining using multi-spectral images without any labeled data over the seasons. Association with a river stream and the abundance of minerals are the most prominent features of such a region. The proposed work uses these distinguishing features to determine the spectral signature of a river sandbank region, which is robust to other high mineral abundance regions. It follows a two-step approach, where first, potential high mineral regions are detected and next, they are segregated using the presence of a river stream. The proposed technique provides average accuracy, precision, and recall of 90.75%, 85.47%, and 73.5%, respectively over the seasons from Landsat 8 images without using any labeled dataset.
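
The two-step idea (find high-mineral regions, then keep only those associated with a river stream) might be sketched as below for Landsat 8 style bands. The specific indices and thresholds used here (a SWIR band ratio for minerals, NDWI for water) are assumptions for illustration, not the spectral signature derived in the paper.

```python
# Hedged sketch of the two-step detection; band choices and thresholds are assumed.
import numpy as np
from scipy.ndimage import binary_dilation

def detect_sandbank(green, nir, swir1, swir2, mineral_thr=1.1, ndwi_thr=0.1):
    mineral_ratio = swir1 / (swir2 + 1e-6)              # step 1: high mineral abundance
    high_mineral = mineral_ratio > mineral_thr
    ndwi = (green - nir) / (green + nir + 1e-6)         # water index to locate the river stream
    river = ndwi > ndwi_thr
    near_river = binary_dilation(river, iterations=20)  # step 2: keep regions touching a river
    return high_mineral & near_river & ~river
```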

PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers

  • paper_url: http://arxiv.org/abs/2307.00313
  • repo_url: None
  • paper_authors: Peidong Jia, Jiaming Liu, Senqiao Yang, Jiarui Wu, Xiaodong Xie, Shanghang Zhang
  • for: This work addresses the performance degradation of detection transformers (DETR) under distribution shift between different data domains.
  • methods: A hierarchical Prompt Domain Memory (PDM) is proposed for adapting detection transformers. PDM pairs each prompt with its corresponding distribution value to extract domain-specific knowledge and injects the top-M distribution-similar prompts into the input and multi-level embeddings of DETR. A Prompt Memory Alignment (PMA) module further reduces the discrepancy between source and target domains by fully exploiting the extracted domain-specific knowledge.
  • results: The method outperforms state-of-the-art domain adaptive object detection approaches on three benchmarks covering scene, synthetic-to-real, and weather adaptation.
    Abstract The Transformer-based detectors (i.e., DETR) have demonstrated impressive performance on end-to-end object detection. However, transferring DETR to different data distributions may lead to a significant performance degradation. Existing adaptation techniques focus on model-based approaches, which aim to leverage feature alignment to narrow the distribution shift between different domains. In this study, we propose a hierarchical Prompt Domain Memory (PDM) for adapting detection transformers to different distributions. PDM comprehensively leverages the prompt memory to extract domain-specific knowledge and explicitly constructs a long-term memory space for the data distribution, which represents better domain diversity compared to existing methods. Specifically, each prompt and its corresponding distribution value are paired in the memory space, and we inject top M distribution-similar prompts into the input and multi-level embeddings of DETR. Additionally, we introduce the Prompt Memory Alignment (PMA) to reduce the discrepancy between the source and target domains by fully leveraging the domain-specific knowledge extracted from the prompt domain memory. Extensive experiments demonstrate that our method outperforms state-of-the-art domain adaptive object detection methods on three benchmarks, including scene, synthetic to real, and weather adaptation. Codes will be released.
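
A rough sketch of the prompt-memory lookup: each stored prompt is paired with a distribution key, and the top-M prompts whose keys best match the current image's domain statistics are prepended to the DETR input tokens. Shapes, the similarity measure, and the domain descriptor are assumptions, not the paper's implementation.

```python
# Illustrative prompt-memory lookup (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptDomainMemory(nn.Module):
    def __init__(self, num_prompts=32, dim=256, top_m=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))  # learnable prompt vectors
        self.keys = nn.Parameter(torch.randn(num_prompts, dim))     # paired distribution values
        self.top_m = top_m

    def forward(self, tokens):                       # tokens: (B, N, C) DETR input embeddings
        domain_stat = tokens.mean(dim=1)             # crude per-image distribution descriptor
        sim = F.cosine_similarity(domain_stat.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_m, dim=1).indices    # (B, top_m) distribution-similar prompts
        selected = self.prompts[idx]                 # (B, top_m, C)
        return torch.cat([selected, tokens], dim=1)  # inject prompts into the input sequence
```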

Adversarial Attacks and Defenses on 3D Point Cloud Classification: A Survey

  • paper_url: http://arxiv.org/abs/2307.00309
  • repo_url: None
  • paper_authors: Hanieh Naderi, Ivan V. Bajić
  • for: This survey reviews the current state of adversarial attack and defense techniques for 3D point cloud classification in order to encourage future research.
  • methods: The paper first introduces the principles and characteristics of adversarial attacks and then summarizes and analyzes recent adversarial example generation methods; it categorizes defense strategies into input transformation, data optimization, and deep model modification.
  • results: The survey synthesizes the effectiveness of existing defenses and outlines several challenging issues and future research directions in this domain.
    Abstract Deep learning has successfully solved a wide range of tasks in 2D vision as a dominant AI technique. Recently, deep learning on 3D point clouds is becoming increasingly popular for addressing various tasks in this field. Despite remarkable achievements, deep learning algorithms are vulnerable to adversarial attacks. These attacks are imperceptible to the human eye but can easily fool deep neural networks in the testing and deployment stage. To encourage future research, this survey summarizes the current progress on adversarial attack and defense techniques on point cloud classification. This paper first introduces the principles and characteristics of adversarial attacks and summarizes and analyzes the adversarial example generation methods in recent years. Besides, it classifies defense strategies as input transformation, data optimization, and deep model modification. Finally, it presents several challenging issues and future research directions in this domain.

DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation

  • paper_url: http://arxiv.org/abs/2307.00300
  • repo_url: None
  • paper_authors: Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Yongdong Zhang, Zhendong Mao
  • for: This work aims to improve the editability of text-to-image models while preserving face identity in conditioned face images.
  • methods: An optimization-free approach is proposed: a face-identity encoder learns multi-scale face features and applies a multi-embedding projector to directly generate pseudo words in the text embedding space. A self-augmented editability learning scheme further enhances editability by constructing paired generated and edited face images using celebrity names.
  • results: The method generates identity-preserved images across different scenes at a much faster speed while improving the model's editability.
    Abstract While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centric images, an intractable problem is how to preserve the face identity for conditioned face images. Existing methods either require time-consuming optimization for each face-identity or learning an efficient encoder at the cost of harming the editability of models. In this work, we present an optimization-free method for each face identity, meanwhile keeping the editability for text-to-image models. Specifically, we propose a novel face-identity encoder to learn an accurate representation of human faces, which applies multi-scale face features followed by a multi-embedding projector to directly generate the pseudo words in the text embedding space. Besides, we propose self-augmented editability learning to enhance the editability of models, which is achieved by constructing paired generated face and edited face images using celebrity names, aiming at transferring mature ability of off-the-shelf text-to-image models in celebrity faces to unseen faces. Extensive experiments show that our methods can generate identity-preserved images under different scenes at a much faster speed.
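
The face-identity encoder can be pictured as below: pooled multi-scale face features are each projected into the text-embedding space, yielding several "pseudo word" embeddings that can replace a placeholder token in the prompt. The backbone, dimensions, and names are placeholders, not the authors' implementation.

```python
# Illustrative sketch of multi-scale features -> pseudo words in text-embedding space.
import torch
import torch.nn as nn

class FaceIdentityEncoder(nn.Module):
    def __init__(self, feat_dims=(256, 512, 1024), text_dim=768):
        super().__init__()
        # one projector per feature scale -> one pseudo word per scale
        self.projectors = nn.ModuleList([nn.Linear(d, text_dim) for d in feat_dims])

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of (B, C_i) pooled face features at different scales
        pseudo_words = [proj(f) for proj, f in zip(self.projectors, multi_scale_feats)]
        return torch.stack(pseudo_words, dim=1)      # (B, num_scales, text_dim)

# These embeddings would replace a placeholder token's embedding in the frozen text
# encoder, so each new face needs only a forward pass, not per-identity optimization.
```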

AutoST: Training-free Neural Architecture Search for Spiking Transformers

  • paper_url: http://arxiv.org/abs/2307.00293
  • repo_url: None
  • paper_authors: Ziqing Wang, Qidong Zhao, Jinku Cui, Xu Liu, Dongkuan Xu
  • for: AutoST is designed to rapidly identify high-performance and energy-efficient Spiking Transformer architectures, addressing the limitations of traditional approaches.
  • methods: AutoST uses Floating-Point Operations (FLOPs) as a performance metric, which is independent of model computations and training dynamics, leading to a stronger correlation with performance. Additionally, activation patterns are leveraged during initialization to estimate the energy consumption of Spiking Transformers.
  • results: AutoST models outperform state-of-the-art manually or automatically designed SNN architectures on static and neuromorphic datasets, while significantly reducing energy consumption.
    Abstract Spiking Transformers have gained considerable attention because they achieve both the energy efficiency of Spiking Neural Networks (SNNs) and the high capacity of Transformers. However, the existing Spiking Transformer architectures, derived from ANNs, exhibit a notable architectural gap, resulting in suboptimal performance compared to their ANN counterparts. Traditional approaches to discovering optimal architectures primarily rely on either manual procedures, which are time-consuming, or Neural Architecture Search (NAS) methods, which are usually expensive in terms of memory footprints and computation time. To address these limitations, we introduce AutoST, a training-free NAS method for Spiking Transformers, to rapidly identify high-performance and energy-efficient Spiking Transformer architectures. Unlike existing training-free NAS methods, which struggle with the non-differentiability and high sparsity inherent in SNNs, we propose to utilize Floating-Point Operations (FLOPs) as a performance metric, which is independent of model computations and training dynamics, leading to a stronger correlation with performance. Moreover, to enable the search for energy-efficient architectures, we leverage activation patterns during initialization to estimate the energy consumption of Spiking Transformers. Our extensive experiments show that AutoST models outperform state-of-the-art manually or automatically designed SNN architectures on static and neuromorphic datasets, while significantly reducing energy consumption.
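
A crude sketch of the training-free search loop implied above: sample candidate configurations, score them with a FLOPs proxy, filter by an (assumed) energy estimate, and keep the best. The search space, FLOPs formula, and energy model here are placeholders, not those of AutoST.

```python
# Hedged sketch of a FLOPs-proxy, training-free architecture search.
import random

def flops_proxy(cfg):
    # rough transformer-block FLOPs: attention + MLP, per layer
    d, layers, tokens = cfg["dim"], cfg["layers"], cfg["tokens"]
    return layers * (4 * tokens * d * d + 2 * tokens * tokens * d + 8 * tokens * d * d)

def energy_estimate(cfg, spike_rate=0.2):
    return spike_rate * flops_proxy(cfg)       # stand-in for activation-pattern-based energy

def search(num_candidates=500, energy_budget=5e8):
    space = {"dim": [192, 256, 384], "layers": [6, 8, 12], "tokens": [64, 196]}
    candidates = [{k: random.choice(v) for k, v in space.items()} for _ in range(num_candidates)]
    feasible = [c for c in candidates if energy_estimate(c) <= energy_budget]
    return max(feasible, key=flops_proxy)      # higher FLOPs proxy ~ higher expected accuracy

print(search())
```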

All-in-SAM: from Weak Annotation to Pixel-wise Nuclei Segmentation with Prompt-based Finetuning

  • paper_url: http://arxiv.org/abs/2307.00290
  • repo_url: None
  • paper_authors: Can Cui, Ruining Deng, Quan Liu, Tianyuan Yao, Shunxing Bao, Lucas W. Remedios, Yucheng Tang, Yuankai Huo
  • for: This paper aims to improve the efficiency of the Segment Anything Model (SAM) for biomedical image segmentation tasks by eliminating the need for manual prompts during the inference stage.
  • methods: The proposed pipeline utilizes SAM to generate pixel-level annotations from weak prompts (e.g., points, bounding boxes), which are then used to finetune the SAM segmentation model without requiring manual prompts during the inference stage.
  • results: The proposed pipeline achieved competitive performance compared to using strong pixel-wise annotated data, and surpassed the state-of-the-art (SOTA) methods in a nuclei segmentation task on the public Monuseg dataset.
    Abstract The Segment Anything Model (SAM) is a recently proposed prompt-based segmentation model in a generic zero-shot segmentation approach. With the zero-shot segmentation capacity, SAM achieved impressive flexibility and precision on various segmentation tasks. However, the current pipeline requires manual prompts during the inference stage, which is still resource intensive for biomedical image segmentation. In this paper, instead of using prompts during the inference stage, we introduce a pipeline that utilizes the SAM, called all-in-SAM, through the entire AI development workflow (from annotation generation to model finetuning) without requiring manual prompts during the inference stage. Specifically, SAM is first employed to generate pixel-level annotations from weak prompts (e.g., points, bounding box). Then, the pixel-level annotations are used to finetune the SAM segmentation model rather than training from scratch. Our experimental results reveal two key findings: 1) the proposed pipeline surpasses the state-of-the-art (SOTA) methods in a nuclei segmentation task on the public Monuseg dataset, and 2) the utilization of weak and few annotations for SAM finetuning achieves competitive performance compared to using strong pixel-wise annotated data.
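
Stage 1 of the pipeline (weak prompts to pixel-level pseudo-labels) can be sketched with the public segment-anything API; the checkpoint path is a placeholder, and the subsequent prompt-free finetuning stage is omitted.

```python
# Sketch of SAM turning weak bounding-box annotations into pixel-level pseudo-masks,
# which would then be used to finetune the segmentation model (finetuning not shown).
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)

def boxes_to_masks(image_rgb, boxes_xyxy):
    """image_rgb: HxWx3 uint8; boxes_xyxy: list of [x0, y0, x1, y1] weak annotations."""
    predictor.set_image(image_rgb)
    masks = []
    for box in boxes_xyxy:
        m, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        masks.append(m[0])               # (H, W) boolean pseudo-label for this nucleus
    return np.stack(masks)
```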

Common Knowledge Learning for Generating Transferable Adversarial Examples

  • paper_url: http://arxiv.org/abs/2307.00274
  • repo_url: None
  • paper_authors: Ruijie Yang, Yuanfang Guo, Junfu Wang, Jiantao Zhou, Yunhong Wang
  • for: This paper focuses on improving the adversarial transferability of transfer-based black-box attacks, where the attacker uses a substitute (source) model to generate adversarial examples for an unseen target model.
  • methods: The proposed method uses a common knowledge learning (CKL) framework to learn better network weights for generating adversarial examples with better transferability, under fixed network architectures. The CKL framework involves constructing a multi-teacher framework where the knowledge is distilled from different teacher architectures into one student network, and imposing constraints on the gradients between the student and teacher models to alleviate the output inconsistency problem.
  • results: The proposed method significantly improves the adversarial transferability, as demonstrated by extensive experiments.
    Abstract This paper focuses on an important type of black-box attacks, i.e., transfer-based adversarial attacks, where the adversary generates adversarial examples by a substitute (source) model and utilize them to attack an unseen target model, without knowing its information. Existing methods tend to give unsatisfactory adversarial transferability when the source and target models are from different types of DNN architectures (e.g. ResNet-18 and Swin Transformer). In this paper, we observe that the above phenomenon is induced by the output inconsistency problem. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework to learn better network weights to generate adversarial examples with better transferability, under fixed network architectures. Specifically, to reduce the model-specific features and obtain better output distributions, we construct a multi-teacher framework, where the knowledge is distilled from different teacher architectures into one student network. By considering that the gradient of input is usually utilized to generated adversarial examples, we impose constraints on the gradients between the student and teacher models, to further alleviate the output inconsistency problem and enhance the adversarial transferability. Extensive experiments demonstrate that our proposed work can significantly improve the adversarial transferability.
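
One plausible reading of the framework is a multi-teacher distillation loss plus an input-gradient alignment term, sketched below; the loss weights and the exact form of the alignment are assumptions, not the paper's definition.

```python
# Hedged sketch: distill several teachers into one student and align input gradients,
# since input gradients are what transfer-based attacks use.
import torch
import torch.nn.functional as F

def ckl_loss(student, teachers, x, labels, alpha=1.0, beta=0.1):
    x = x.clone().requires_grad_(True)
    s_logits = student(x)
    loss = F.cross_entropy(s_logits, labels)
    s_grad = torch.autograd.grad(loss, x, create_graph=True)[0]   # student's input gradient

    kd, grad_align = 0.0, 0.0
    for teacher in teachers:
        t_logits = teacher(x)
        kd = kd + F.kl_div(F.log_softmax(s_logits, -1), F.softmax(t_logits.detach(), -1),
                           reduction="batchmean")
        t_loss = F.cross_entropy(t_logits, labels)
        t_grad = torch.autograd.grad(t_loss, x)[0]                # teacher's input gradient
        grad_align = grad_align + F.mse_loss(s_grad, t_grad.detach())
    return loss + alpha * kd / len(teachers) + beta * grad_align / len(teachers)
```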

HrSegNet : Real-time High-Resolution Neural Network with Semantic Guidance for Crack Segmentation

  • paper_url: http://arxiv.org/abs/2307.00270
  • repo_url: https://github.com/CHDyshli/HrSegNet4CrackSegmentation
  • paper_authors: Yongshang Li, Ronggui Ma, Han Liu, Gaoli Cheng
  • for: This work aims to improve the accuracy and efficiency of structural crack detection, particularly for real-time applications.
  • methods: A high-resolution model with semantic guidance, designed specifically for real-time crack segmentation, is proposed. The model maintains high resolution throughout the entire process and uses low-resolution semantic features to guide the reconstruction of high-resolution features. A simple yet effective mechanism controls the model's computational cost while providing strong scalability.
  • results: On the crack dataset CrackSeg9k, HrSegNet achieves the best trade-off between efficiency and effectiveness: the fastest variant, HrSegNet-B16, reaches 78.43% mIoU at 182 FPS, and the most accurate variant, HrSegNet-B48, reaches 80.32% mIoU at 140.3 FPS.
    Abstract Through extensive research on deep learning in recent years and its application in construction, crack detection has evolved rapidly from rough detection at the image-level and patch-level to fine-grained detection at the pixel-level, which better suits the nature of this field. Despite numerous existing studies utilizing off-the-shelf deep learning models or enhancing them, these models are not always effective or efficient in real-world applications. In order to bridge this gap, we propose a High-resolution model with Semantic guidance, specifically designed for real-time crack segmentation, referred to as HrSegNet. Our model maintains high resolution throughout the entire process, as opposed to recovering from low-resolution features to high-resolution ones, thereby maximizing the preservation of crack details. Moreover, to enhance the context information, we use low-resolution semantic features to guide the reconstruction of high-resolution features. To ensure the efficiency of the algorithm, we design a simple yet effective method to control the computation cost of the entire model by controlling the capacity of high-resolution channels, while providing the model with extremely strong scalability. Extensive quantitative and qualitative evaluations demonstrate that our proposed HrSegNet has exceptional crack segmentation capabilities, and that maintaining high resolution and semantic guidance are crucial to the final prediction. Compared to state-of-the-art segmentation models, HrSegNet achieves the best trade-off between efficiency and effectiveness. Specifically, on the crack dataset CrackSeg9k, our fastest model HrSegNet-B16 achieves a speed of 182 FPS with 78.43% mIoU, while our most accurate model HrSegNet-B48 achieves 80.32% mIoU with an inference speed of 140.3 FPS.

AE-RED: A Hyperspectral Unmixing Framework Powered by Deep Autoencoder and Regularization by Denoising

  • paper_url: http://arxiv.org/abs/2307.00269
  • repo_url: None
  • paper_authors: Min Zhao, Jie Chen, Nicolas Dobigeon
  • for: This paper proposes a novel framework for spectral unmixing that integrates autoencoder networks with regularization by denoising (RED) to enhance unmixing performance.
  • methods: The paper uses a generic unmixing framework that combines deep autoencoder networks with regularization by denoising (RED) to solve the blind unmixing problem. The framework consists of two subproblems: the first one is solved using deep autoencoders to implicitly regularize the estimates and model the mixture mechanism, while the second one leverages denoising techniques to bring in explicit information.
  • results: The paper reports superior unmixing performance compared to state-of-the-art approaches on both synthetic and real data sets. The proposed framework is able to effectively integrate the advantages of deep autoencoder based unmixing methods and priors provided by denoisers, leading to improved unmixing results.
    Abstract Spectral unmixing has been extensively studied with a variety of methods and used in many applications. Recently, data-driven techniques with deep learning methods have obtained great attention to spectral unmixing for its superior learning ability to automatically learn the structure information. In particular, autoencoder based architectures are elaborately designed to solve blind unmixing and model complex nonlinear mixtures. Nevertheless, these methods perform unmixing task as blackboxes and lack of interpretability. On the other hand, conventional unmixing methods carefully design the regularizer to add explicit information, in which algorithms such as plug-and-play (PnP) strategies utilize off-the-shelf denoisers to plug powerful priors. In this paper, we propose a generic unmixing framework to integrate the autoencoder network with regularization by denoising (RED), named AE-RED. More specially, we decompose the unmixing optimized problem into two subproblems. The first one is solved using deep autoencoders to implicitly regularize the estimates and model the mixture mechanism. The second one leverages the denoiser to bring in the explicit information. In this way, both the characteristics of the deep autoencoder based unmixing methods and priors provided by denoisers are merged into our well-designed framework to enhance the unmixing performance. Experiment results on both synthetic and real data sets show the superiority of our proposed framework compared with state-of-the-art unmixing approaches.
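
For reference, the regularization-by-denoising term that this kind of framework plugs in (in Romano et al.'s original RED formulation; the exact variant used in AE-RED may differ) is $\rho(\mathbf{x}) = \tfrac{\lambda}{2}\,\mathbf{x}^{\top}\bigl(\mathbf{x} - f(\mathbf{x})\bigr)$, where $f(\cdot)$ is an off-the-shelf denoiser. Under RED's local-homogeneity and symmetric-Jacobian assumptions, its gradient reduces to $\nabla\rho(\mathbf{x}) = \lambda\,(\mathbf{x} - f(\mathbf{x}))$, which is what makes the denoiser-driven subproblem in the splitting scheme easy to solve.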

Deep Angiogram: Trivializing Retinal Vessel Segmentation

  • paper_url: http://arxiv.org/abs/2307.00245
  • repo_url: None
  • paper_authors: Dewei Hu, Xing Yao, Jiacheng Wang, Yuankai K. Tao, Ipek Oguz
  • for: This work proposes a deep learning model that can robustly segment retinal vessels across different domains.
  • methods: The model is a contrastive variational auto-encoder that filters out irrelevant features and synthesizes a latent image, called a deep angiogram, representing only the retinal vessels; segmentation is then accomplished simply by thresholding the deep angiogram.
  • results: Compared to baseline deep segmentation networks, the model achieves higher segmentation performance on different target domains and generates stable angiograms, providing excellent vessel visualization and a non-invasive, safe alternative to fluorescein angiography.
    Abstract Among the research efforts to segment the retinal vasculature from fundus images, deep learning models consistently achieve superior performance. However, this data-driven approach is very sensitive to domain shifts. For fundus images, such data distribution changes can easily be caused by variations in illumination conditions as well as the presence of disease-related features such as hemorrhages and drusen. Since the source domain may not include all possible types of pathological cases, a model that can robustly recognize vessels on unseen domains is desirable but remains elusive, despite many proposed segmentation networks of ever-increasing complexity. In this work, we propose a contrastive variational auto-encoder that can filter out irrelevant features and synthesize a latent image, named deep angiogram, representing only the retinal vessels. Then segmentation can be readily accomplished by thresholding the deep angiogram. The generalizability of the synthetic network is improved by the contrastive loss that makes the model less sensitive to variations of image contrast and noisy features. Compared to baseline deep segmentation networks, our model achieves higher segmentation performance via simple thresholding. Our experiments show that the model can generate stable angiograms on different target domains, providing excellent visualization of vessels and a non-invasive, safe alternative to fluorescein angiography.

S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture

  • paper_url: http://arxiv.org/abs/2307.00226
  • repo_url: None
  • paper_authors: Ye Xue, Diego Klabjan, Jean Utke
  • for: This paper addresses multimodal multitask learning, i.e., learning multiple tasks across different modalities, including structured data.
  • methods: A new multimodal model, Structured-data-enhanced Omninet (S-Omninet), is proposed. It introduces cross-cache attention, integrates patch embeddings for vision inputs, and adds support for structured data, enabling interactions among spatial, temporal, and structured features.
  • results: Evaluated on several multimodal datasets, S-Omninet demonstrates a significant improvement over the baseline Omninet.
    Abstract Multimodal multitask learning has attracted an increasing interest in recent years. Singlemodal models have been advancing rapidly and have achieved astonishing results on various tasks across multiple domains. Multimodal learning offers opportunities for further improvements by integrating data from multiple modalities. Many methods are proposed to learn on a specific type of multimodal data, such as vision and language data. A few of them are designed to handle several modalities and tasks at a time. In this work, we extend and improve Omninet, an architecture that is capable of handling multiple modalities and tasks at a time, by introducing cross-cache attention, integrating patch embeddings for vision inputs, and supporting structured data. The proposed Structured-data-enhanced Omninet (S-Omninet) is a universal model that is capable of learning from structured data of various dimensions effectively with unstructured data through cross-cache attention, which enables interactions among spatial, temporal, and structured features. We also enhance spatial representations in a spatial cache with patch embeddings. We evaluate the proposed model on several multimodal datasets and demonstrate a significant improvement over the baseline, Omninet.

StyleStegan: Leak-free Style Transfer Based on Feature Steganography

  • paper_url: http://arxiv.org/abs/2307.00225
  • repo_url: None
  • paper_authors: Xiujian Liang, Bingshan Liu, Qichao Ying, Zhenxing Qian, Xinpeng Zhang
  • for: Addressing the content leakage problem in style transfer on social networks, so as to enable serial and reversible stylization.
  • methods: A leak-free style transfer method based on feature steganography, consisting of two main components: a style transfer method that performs artistic stylization and an image steganography method that embeds content feature secrets in the stylized image.
  • results: Comprehensive experiments on the publicly available MS-COCO and Wikiart datasets show that StyleStegan successfully mitigates the content leakage issue; its SSIM performance on serial and reversible style transfer is 14.98% and 7.28% higher, respectively, than a suboptimal baseline model.
    Abstract In modern social networks, existing style transfer methods suffer from a serious content leakage issue, which hampers the ability to achieve serial and reversible stylization, thereby hindering the further propagation of stylized images in social networks. To address this problem, we propose a leak-free style transfer method based on feature steganography. Our method consists of two main components: a style transfer method that accomplishes artistic stylization on the original image and an image steganography method that embeds content feature secrets on the stylized image. The main contributions of our work are as follows: 1) We identify and explain the phenomenon of content leakage and its underlying causes, which arise from content inconsistencies between the original image and its subsequent stylized image. 2) We design a neural flow model for achieving loss-free and biased-free style transfer. 3) We introduce steganography to hide content feature information on the stylized image and control the subsequent usage rights. 4) We conduct comprehensive experimental validation using publicly available datasets MS-COCO and Wikiart. The results demonstrate that StyleStegan successfully mitigates the content leakage issue in serial and reversible style transfer tasks. The SSIM performance metrics for these tasks are 14.98% and 7.28% higher, respectively, compared to a suboptimal baseline model.

Q-YOLO: Efficient Inference for Real-time Object Detection

  • paper_url: http://arxiv.org/abs/2307.04816
  • repo_url: None
  • paper_authors: Mingze Wang, Huixin Sun, Jun Shi, Xuhui Liu, Baochang Zhang, Xianbin Cao
  • for: This work aims to enable efficient deployment of real-time object detection models on resource-constrained edge devices, achieving real-time detection with reduced computational and memory overhead.
  • methods: A low-bit quantization method, Q-YOLO, is proposed to build highly efficient one-stage detectors. Q-YOLO uses a fully end-to-end Post-Training Quantization (PTQ) pipeline with a Unilateral Histogram-based (UH) activation quantization scheme, which determines the maximum truncation values through histogram analysis by minimizing the Mean Squared Error (MSE) of quantization.
  • results: Extensive experiments on the COCO dataset demonstrate the effectiveness of Q-YOLO, which outperforms other PTQ methods while achieving a more favorable balance between accuracy and computational cost, facilitating real-time detection on resource-limited edge devices.
    Abstract Real-time object detection plays a vital role in various computer vision applications. However, deploying real-time object detectors on resource-constrained platforms poses challenges due to high computational and memory requirements. This paper describes a low-bit quantization method to build a highly efficient one-stage detector, dubbed as Q-YOLO, which can effectively address the performance degradation problem caused by activation distribution imbalance in traditional quantized YOLO models. Q-YOLO introduces a fully end-to-end Post-Training Quantization (PTQ) pipeline with a well-designed Unilateral Histogram-based (UH) activation quantization scheme, which determines the maximum truncation values through histogram analysis by minimizing the Mean Squared Error (MSE) quantization errors. Extensive experiments on the COCO dataset demonstrate the effectiveness of Q-YOLO, outperforming other PTQ methods while achieving a more favorable balance between accuracy and computational cost. This research contributes to advancing the efficient deployment of object detection models on resource-limited edge devices, enabling real-time detection with reduced computational and memory overhead.
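
The UH calibration step can be sketched as a search over candidate clipping values that minimizes the MSE of uniform quantization. Here the search runs over raw activation samples rather than an explicit histogram, and the search grid and one-sided treatment of post-ReLU activations are assumptions for illustration.

```python
# Hedged sketch of MSE-minimizing truncation search for activation quantization.
import numpy as np

def find_truncation(acts, num_bits=8, num_candidates=100):
    acts = acts[acts > 0]                            # unilateral: only the positive side matters
    best_t, best_mse = None, np.inf
    for t in np.linspace(acts.max() / num_candidates, acts.max(), num_candidates):
        scale = t / (2 ** num_bits - 1)
        q = np.clip(np.round(acts / scale), 0, 2 ** num_bits - 1) * scale
        mse = np.mean((acts - q) ** 2)
        if mse < best_mse:
            best_t, best_mse = t, mse
    return best_t                                     # clipping threshold used for quantization
```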

More for Less: Compact Convolutional Transformers Enable Robust Medical Image Classification with Limited Data

  • paper_url: http://arxiv.org/abs/2307.00213
  • repo_url: None
  • paper_authors: Andrew Kean Gao
  • for: This study investigates the feasibility of Compact Convolutional Transformers (CCT) for robust biomedical image classification with limited data, extending the applicability of transformers in the biomedical domain.
  • methods: The study uses CCTs, a hybrid of transformers and convolutional layers, to overcome the large-data requirement of conventional Vision Transformers.
  • results: The model achieved a classification accuracy of 92.49% and a micro-averaged ROC AUC of 0.9935, showing that CCTs can deliver high accuracy under limited data.
    Abstract Transformers are very powerful tools for a variety of tasks across domains, from text generation to image captioning. However, transformers require substantial amounts of training data, which is often a challenge in biomedical settings, where high quality labeled data can be challenging or expensive to obtain. This study investigates the efficacy of Compact Convolutional Transformers (CCT) for robust medical image classification with limited data, addressing a key issue faced by conventional Vision Transformers - their requirement for large datasets. A hybrid of transformers and convolutional layers, CCTs demonstrate high accuracy on modestly sized datasets. We employed a benchmark dataset of peripheral blood cell images of eight distinct cell types, each represented by approximately 2,000 low-resolution (28x28x3 pixel) samples. Despite the dataset size being smaller than those typically used with Vision Transformers, we achieved a commendable classification accuracy of 92.49% and a micro-average ROC AUC of 0.9935. The CCT also learned quickly, exceeding 80% validation accuracy after five epochs. Analysis of per-class precision, recall, F1, and ROC showed that performance was strong across cell types. Our findings underscore the robustness of CCTs, indicating their potential as a solution to data scarcity issues prevalent in biomedical imaging. We substantiate the applicability of CCTs in data-constrained areas and encourage further work on CCTs.

Internal-External Boundary Attention Fusion for Glass Surface Segmentation

  • paper_url: http://arxiv.org/abs/2307.00212
  • repo_url: None
  • paper_authors: Dongshen Han, Seungkyu Lee
  • for: This work proposes a deep-learning-based method for characterizing glass surfaces so that glass regions can be extracted from a single-color image.
  • methods: A semantic-segmentation-based approach with separated internal and external boundary attention modules that individually learn and selectively integrate the visual characteristics of the regions inside and outside the glass surface.
  • results: The method is evaluated on six public benchmarks and compared with state-of-the-art methods, showing promising results.
    Abstract Glass surfaces of transparent objects and mirrors are not able to be uniquely and explicitly characterized by their visual appearances because they contain the visual appearance of other reflected or transmitted surfaces as well. Detecting glass regions from a single-color image is a challenging task. Recent deep-learning approaches have paid attention to the description of glass surface boundary where the transition of visual appearances between glass and non-glass surfaces are observed. In this work, we analytically investigate how glass surface boundary helps to characterize glass objects. Inspired by prior semantic segmentation approaches with challenging image types such as X-ray or CT scans, we propose separated internal-external boundary attention modules that individually learn and selectively integrate visual characteristics of the inside and outside region of glass surface from a single color image. Our proposed method is evaluated on six public benchmarks comparing with state-of-the-art methods showing promising results.

AIGCIQA2023: A Large-scale Image Quality Assessment Database for AI Generated Images: from the Perspectives of Quality, Authenticity and Correspondence

  • paper_url: http://arxiv.org/abs/2307.00211
  • repo_url: https://github.com/wangjiarui153/aigciqa2023
  • paper_authors: Jiarui Wang, Huiyu Duan, Jing Liu, Shi Chen, Xiongkuo Min, Guangtao Zhai
  • for: The goal of this work is to better understand human visual preferences for AI-generated images (AIGIs).
  • methods: Over 2000 images were generated with 6 state-of-the-art text-to-image generation models using 100 prompts, and a well-organized subjective experiment assessed human visual preferences for each image from three perspectives: quality, authenticity, and correspondence.
  • results: Based on the resulting large-scale database (AIGCIQA2023), a benchmark experiment evaluates the performance of several state-of-the-art IQA metrics, revealing human visual preferences for AI-generated images.
    Abstract In this paper, in order to get a better understanding of the human visual preferences for AIGIs, a large-scale IQA database for AIGC is established, which is named as AIGCIQA2023. We first generate over 2000 images based on 6 state-of-the-art text-to-image generation models using 100 prompts. Based on these images, a well-organized subjective experiment is conducted to assess the human visual preferences for each image from three perspectives including quality, authenticity and correspondence. Finally, based on this large-scale database, we conduct a benchmark experiment to evaluate the performance of several state-of-the-art IQA metrics on our constructed database.

Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler

  • paper_url: http://arxiv.org/abs/2307.00198
  • repo_url: https://github.com/osilly/kdfs
  • paper_authors: Shaohui Lin, Wenxuan Huang, Jiao Xie, Baochang Zhang, Yunhang Shen, Zhou Yu, Jungong Han, David Doermann
  • for: This paper targets reducing the computation and memory overhead of Convolutional Neural Networks (CNNs) so they can be deployed on edge devices and cloud services.
  • methods: A novel Knowledge-driven Differential Filter Sampler (KDFS) framework with Masked Filter Modeling (MFM) is proposed for filter pruning. KDFS uses a learnable differential sampler to build a binary mask vector for each layer, determining whether the corresponding filters are redundant, and the masks are optimized end-to-end with a Gumbel-Softmax Straight-Through Gradient Estimator.
  • results: Extensive experiments on multiple datasets demonstrate KDFS's effectiveness: for example, the pruned ResNet-50 on ImageNet achieves a 55.36% computation reduction and a 42.86% parameter reduction while dropping only 0.35% Top-1 accuracy, significantly outperforming existing methods.
    Abstract Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs, which can be effectively applied to edge devices and cloud services. In this paper, we propose a novel Knowledge-driven Differential Filter Sampler~(KDFS) with Masked Filter Modeling~(MFM) framework for filter pruning, which globally prunes the redundant filters based on the prior knowledge of a pre-trained model in a differential and non-alternative optimization. Specifically, we design a differential sampler with learnable sampling parameters to build a binary mask vector for each layer, determining whether the corresponding filters are redundant. To learn the mask, we introduce masked filter modeling to construct PCA-like knowledge by aligning the intermediate features from the pre-trained teacher model and the outputs of the student decoder taking sampling features as the input. The mask and sampler are directly optimized by the Gumbel-Softmax Straight-Through Gradient Estimator in an end-to-end manner in combination with global pruning constraint, MFM reconstruction error, and dark knowledge. Extensive experiments demonstrate the proposed KDFS's effectiveness in compressing the base models on various datasets. For instance, the pruned ResNet-50 on ImageNet achieves $55.36\%$ computation reduction, and $42.86\%$ parameter reduction, while only dropping $0.35\%$ Top-1 accuracy, significantly outperforming the state-of-the-art methods. The code is available at \url{https://github.com/Osilly/KDFS}.
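The differentiable masking idea at the core of KDFS, sampling a per-filter keep/prune decision with the Gumbel-Softmax Straight-Through estimator, can be sketched as follows. This is a minimal illustration of that single component under assumed names; the masked filter modeling, teacher alignment, and global pruning constraint from the paper are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterSampler(nn.Module):
    """Samples a binary keep/prune mask per output filter of a conv layer."""
    def __init__(self, num_filters: int, tau: float = 1.0):
        super().__init__()
        # Two logits per filter: index 0 = prune, index 1 = keep.
        self.logits = nn.Parameter(torch.zeros(num_filters, 2))
        self.tau = tau

    def forward(self) -> torch.Tensor:
        # hard=True gives a one-hot sample in the forward pass while
        # gradients flow through the soft relaxation (straight-through).
        sample = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        return sample[:, 1]  # 1.0 for kept filters, 0.0 for pruned ones

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
sampler = FilterSampler(num_filters=16)

x = torch.randn(2, 3, 32, 32)
mask = sampler()                          # shape (16,)
y = conv(x) * mask.view(1, -1, 1, 1)      # zero out pruned output channels
sparsity = 1.0 - mask.mean()              # fraction of filters pruned
print(y.shape, float(sparsity))
```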

Long-Tailed Continual Learning For Visual Food Recognition

  • paper_url: http://arxiv.org/abs/2307.00183
  • repo_url: None
  • paper_authors: Jiangpeng He, Luotao Lin, Jack Ma, Heather A. Eicher-Miller, Fengqing Zhu
  • for: This work addresses two major obstacles in deep learning based food recognition: as new foods appear over time, the model must continually learn new classes without catastrophically forgetting the food types it already knows, and real-world food images follow a long-tailed distribution in which a few popular food types appear far more often than the rest.
  • methods: It proposes a novel end-to-end framework for long-tailed continual learning, which uses an additional predictor for knowledge distillation to avoid misalignment of representations during continual learning, and introduces a new data augmentation technique that integrates class-activation maps (CAM) with CutMix to improve generalization on instance-rare food classes.
  • results: Compared with existing methods, the proposed approach shows large performance improvements, especially on instance-rare food classes.
    Abstract Deep learning based food recognition has achieved remarkable progress in predicting food types given an eating occasion image. However, there are two major obstacles that hinder deployment in real world scenario. First, as new foods appear sequentially overtime, a trained model needs to learn the new classes continuously without causing catastrophic forgetting for already learned knowledge of existing food types. Second, the distribution of food images in real life is usually long-tailed as a small number of popular food types are consumed more frequently than others, which can vary in different populations. This requires the food recognition method to learn from class-imbalanced data by improving the generalization ability on instance-rare food classes. In this work, we focus on long-tailed continual learning and aim to address both aforementioned challenges. As existing long-tailed food image datasets only consider healthy people population, we introduce two new benchmark food image datasets, VFN-INSULIN and VFN-T2D, which exhibits on the real world food consumption for insulin takers and individuals with type 2 diabetes without taking insulin, respectively. We propose a novel end-to-end framework for long-tailed continual learning, which effectively addresses the catastrophic forgetting by applying an additional predictor for knowledge distillation to avoid misalignment of representation during continual learning. We also introduce a novel data augmentation technique by integrating class-activation-map (CAM) and CutMix, which significantly improves the generalization ability for instance-rare food classes to address the class-imbalance issue. The proposed method show promising performance with large margin improvements compared with existing methods.
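As a rough illustration of the augmentation direction described above, the sketch below implements plain CutMix with label interpolation. The paper's variant additionally uses class-activation maps to guide the mixing, which is not reproduced here; all names and defaults are illustrative:

```python
import torch
import torch.nn.functional as F

def cutmix(images: torch.Tensor, labels: torch.Tensor, num_classes: int,
           alpha: float = 1.0):
    """Mix each image with a randomly paired one and interpolate labels
    in proportion to the pasted area (standard CutMix)."""
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    # Cut a box whose area is roughly (1 - lam) of the image.
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]

    # Recompute lam from the actual pasted area, then mix one-hot labels.
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed, mixed_labels

imgs, lbls = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
mixed_imgs, mixed_lbls = cutmix(imgs, lbls, num_classes=10)
print(mixed_imgs.shape, mixed_lbls.shape)
```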

Single-Stage Heavy-Tailed Food Classification

  • paper_url: http://arxiv.org/abs/2307.00182
  • repo_url: None
  • paper_authors: Jiangpeng He, Fengqing Zhu
  • for: This paper aims to overcome the two major obstacles to applying food classification in real-world scenarios.
  • methods: It proposes a novel single-stage (i.e., end-to-end) heavy-tailed food classification framework that addresses the severe class-imbalance issue and the difficulty of training an end-to-end model under a heavy-tailed data distribution.
  • results: Evaluated on two heavy-tailed food benchmark datasets, Food101-LT and VFN-LT, the method achieves the best performance, with over 5% improvement in top-1 accuracy compared with existing work.
    Abstract Deep learning based food image classification has enabled more accurate nutrition content analysis for image-based dietary assessment by predicting the types of food in eating occasion images. However, there are two major obstacles to apply food classification in real life applications. First, real life food images are usually heavy-tailed distributed, resulting in severe class-imbalance issue. Second, it is challenging to train a single-stage (i.e. end-to-end) framework under heavy-tailed data distribution, which cause the over-predictions towards head classes with rich instances and under-predictions towards tail classes with rare instance. In this work, we address both issues by introducing a novel single-stage heavy-tailed food classification framework. Our method is evaluated on two heavy-tailed food benchmark datasets, Food101-LT and VFN-LT, and achieves the best performance compared to existing work with over 5% improvements for top-1 accuracy.

Unsupervised Coordinate-Based Video Denoising

  • paper_url: http://arxiv.org/abs/2307.00179
  • repo_url: None
  • paper_authors: Mary Damilola Aiyetigbo, Dineshchandar Ravichandran, Reda Chalhoub, Peter Kalivas, Nianyi Li
  • for: This paper presents a novel unsupervised deep learning approach for video denoising that helps mitigate data scarcity issues and is robust to different noise patterns, broadening its applicability.
  • methods: The method comprises three modules: a feature generator that produces feature maps, a Denoise-Net that generates denoised but slightly blurry reference frames, and a Refine-Net that re-introduces high-frequency details. By leveraging a coordinate-based network, the network structure is greatly simplified while high-frequency details are preserved in the denoised video frames.
  • results: Extensive experiments on both simulated and real-captured sequences show that the method effectively denoises real-world calcium imaging videos without prior knowledge of noise models or data augmentation during training.
    Abstract In this paper, we introduce a novel unsupervised video denoising deep learning approach that can help to mitigate data scarcity issues and shows robustness against different noise patterns, enhancing its broad applicability. Our method comprises three modules: a Feature generator creating features maps, a Denoise-Net generating denoised but slightly blurry reference frames, and a Refine-Net re-introducing high-frequency details. By leveraging the coordinate-based network, we can greatly simplify the network structure while preserving high-frequency details in the denoised video frames. Extensive experiments on both simulated and real-captured demonstrate that our method can effectively denoise real-world calcium imaging video sequences without prior knowledge of noise models and data augmentation during training.
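A minimal sketch of the coordinate-based representation mentioned in the abstract is shown below: an MLP maps Fourier-encoded (x, y, t) coordinates to pixel intensity and can be fit directly to a noisy video. It illustrates only the coordinate-network idea; the paper's three-module pipeline (feature generator, Denoise-Net, Refine-Net) is not reproduced, and all names are assumptions:

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Encode low-dimensional coordinates with sin/cos features so the MLP
    can represent high-frequency detail."""
    def __init__(self, in_dim: int = 3, num_freqs: int = 8):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs)
        self.out_dim = in_dim * num_freqs * 2

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        x = coords.unsqueeze(-1) * self.freqs          # (..., in_dim, F)
        x = torch.cat([torch.sin(x), torch.cos(x)], dim=-1)
        return x.flatten(start_dim=-2)                 # (..., in_dim*2F)

class CoordinateMLP(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.enc = FourierFeatures()
        self.net = nn.Sequential(
            nn.Linear(self.enc.out_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # predicted intensity
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(self.enc(coords))

# Fit the network to (x, y, t) -> intensity samples from a noisy video.
model = CoordinateMLP()
coords = torch.rand(4096, 3)            # normalized (x, y, t) in [0, 1]
noisy_intensity = torch.rand(4096, 1)   # placeholder noisy observations
loss = nn.functional.mse_loss(model(coords), noisy_intensity)
loss.backward()
print(float(loss))
```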

Multiscale Progressive Text Prompt Network for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.00174
  • repo_url: https://github.com/codehxj/MPTPN-for--Medical-Image-Segmentation
  • paper_authors: Xianjun Han, Qianqian Chen, Zhaoyang Xie, Xuejun Li, Hongyu Yang
  • for: This paper proposes using progressive text prompts as prior knowledge to guide medical image segmentation, reducing the need for large amounts of labeled data while improving segmentation accuracy.
  • methods: The method has two stages. The first stage performs contrastive learning on natural images to pretrain a powerful prior prompt encoder (PPE) that produces multimodality features. The second stage feeds medical images and text prior prompts into the PPE for the downstream segmentation task, and a multiscale feature fusion block (MSFF) combines the PPE features into multiscale multimodality features.
  • results: Compared with using images alone, the model achieves high-quality results at low annotation cost; it is reliable and effective on medical images and also performs well on natural images, as shown by experiments on different image datasets.
    Abstract The accurate segmentation of medical images is a crucial step in obtaining reliable morphological statistics. However, training a deep neural network for this task requires a large amount of labeled data to ensure high-accuracy results. To address this issue, we propose using progressive text prompts as prior knowledge to guide the segmentation process. Our model consists of two stages. In the first stage, we perform contrastive learning on natural images to pretrain a powerful prior prompt encoder (PPE). This PPE leverages text prior prompts to generate multimodality features. In the second stage, medical image and text prior prompts are sent into the PPE inherited from the first stage to achieve the downstream medical image segmentation task. A multiscale feature fusion block (MSFF) combines the features from the PPE to produce multiscale multimodality features. These two progressive features not only bridge the semantic gap but also improve prediction accuracy. Finally, an UpAttention block refines the predicted results by merging the image and text features. This design provides a simple and accurate way to leverage multiscale progressive text prior prompts for medical image segmentation. Compared with using only images, our model achieves high-quality results with low data annotation costs. Moreover, our model not only has excellent reliability and validity on medical images but also performs well on natural images. The experimental results on different image datasets demonstrate that our model is effective and robust for image segmentation.
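The contrastive pretraining of the prior prompt encoder in the first stage is commonly realized with a symmetric InfoNCE objective between image and text embeddings. The sketch below shows that generic loss, not the paper's exact training recipe; embedding sources and names are placeholders:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched image/text pairs along the diagonal are
    positives, all other pairs in the batch are negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (B, B) similarity
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

img_features = torch.randn(16, 512)   # placeholder image embeddings
txt_features = torch.randn(16, 512)   # placeholder text-prompt embeddings
print(float(contrastive_loss(img_features, txt_features)))
```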

Hierarchical Neural Coding for Controllable CAD Model Generation

  • paper_url: http://arxiv.org/abs/2307.00149
  • repo_url: https://github.com/samxuxiang/hnc-cad
  • paper_authors: Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Karl D. D. Willis, Yasutaka Furukawa
  • for: This paper presents a novel generative model for Computer Aided Design (CAD).
  • methods: High-level design concepts of a CAD model are represented as a three-level hierarchical tree of neural codes, from global part arrangement down to local curve geometry. A novel variant of a vector-quantized VAE with "masked skip connection" extracts design variations as neural codebooks at three levels, and two-stage cascaded auto-regressive transformers learn to generate code trees from incomplete CAD models and then complete the CAD models following the intended design.
  • results: Extensive experiments demonstrate superior performance on conventional tasks such as random generation while enabling novel interaction capabilities on conditional generation tasks. The code is available at https://github.com/samxuxiang/hnc-cad.
    Abstract This paper presents a novel generative model for Computer Aided Design (CAD) that 1) represents high-level design concepts of a CAD model as a three-level hierarchical tree of neural codes, from global part arrangement down to local curve geometry; and 2) controls the generation or completion of CAD models by specifying the target design using a code tree. Concretely, a novel variant of a vector quantized VAE with "masked skip connection" extracts design variations as neural codebooks at three levels. Two-stage cascaded auto-regressive transformers learn to generate code trees from incomplete CAD models and then complete CAD models following the intended design. Extensive experiments demonstrate superior performance on conventional tasks such as random generation while enabling novel interaction capabilities on conditional generation tasks. The code is available at https://github.com/samxuxiang/hnc-cad.
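The vector-quantization step underlying the neural codebooks can be sketched as follows: continuous features are snapped to their nearest codebook entry, with a straight-through estimator and the usual codebook and commitment losses. This shows plain VQ only; the "masked skip connection" variant and the three-level code hierarchy are not reproduced, and names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (N, dim) continuous features -> nearest codebook entries.
        dists = torch.cdist(z, self.codebook.weight)          # (N, num_codes)
        indices = dists.argmin(dim=-1)                        # discrete codes
        z_q = self.codebook(indices)

        # Codebook loss pulls codes toward encoder outputs; commitment loss
        # keeps encoder outputs close to their assigned codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: forward uses z_q, gradients flow to z.
        z_q = z + (z_q - z).detach()
        return z_q, indices, loss

vq = VectorQuantizer()
z = torch.randn(128, 64, requires_grad=True)
z_q, codes, vq_loss = vq(z)
print(z_q.shape, codes.shape, float(vq_loss))
```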

An End-to-End Review of Gaze Estimation and its Interactive Applications on Handheld Mobile Devices

  • paper_url: http://arxiv.org/abs/2307.00122
  • repo_url: None
  • paper_authors: Yaxiong Lei, Shijing He, Mohamed Khamis, Juan Ye
  • for: This work reviews interactive systems on handheld mobile devices that use gaze as a single or complementary interaction modality.
  • methods: It surveys gaze capturing sensors, gaze estimation workflows, and the machine learning techniques, especially deep learning, used to improve gaze estimation accuracy.
  • results: The review presents an end-to-end holistic view of the state of the art, delineates the boundary of the field, and identifies key research challenges and opportunities in gaze estimation and interaction.
    Abstract In recent years we have witnessed an increasing number of interactive systems on handheld mobile devices which utilise gaze as a single or complementary interaction modality. This trend is driven by the enhanced computational power of these devices, higher resolution and capacity of their cameras, and improved gaze estimation accuracy obtained from advanced machine learning techniques, especially in deep learning. As the literature is fast progressing, there is a pressing need to review the state of the art, delineate the boundary, and identify the key research challenges and opportunities in gaze estimation and interaction. This paper aims to serve this purpose by presenting an end-to-end holistic view in this area, from gaze capturing sensors, to gaze estimation workflows, to deep learning techniques, and to gaze interactive applications.

Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.00097
  • repo_url: https://github.com/rb080/wss_pole
  • paper_authors: Balamurali Murugesan, Rukhshanda Hussain, Rajarshi Bhattacharya, Ismail Ben Ayed, Jose Dolz
  • for: This paper studies weakly supervised semantic segmentation (WSSS) and asks whether prompt tuning can improve WSSS performance.
  • methods: It builds on CLIP-based pre-trained language-vision models and adapts them by modifying the text prompt, in particular the class token.
  • results: The study finds that modifying only the class token of the text prompt has a greater impact on the Class Activation Map (CAM) than arguably more complex strategies that optimize the context, and that the class token associated with the image ground truth does not necessarily yield the best CAM. Motivated by these observations, the authors propose a PrOmpt cLass lEarning (POLE) strategy and show through extensive experiments that it achieves SOTA performance on a well-known WSSS benchmark.
    Abstract Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot learning tasks, fueled by the power of contrastive language-vision pre-training. In particular, prompt tuning has emerged as an effective strategy to adapt the pre-trained language-vision models to downstream tasks by employing task-related textual tokens. Motivated by this progress, in this work we question whether other fundamental problems, such as weakly supervised semantic segmentation (WSSS), can benefit from prompt tuning. Our findings reveal two interesting observations that shed light on the impact of prompt tuning on WSSS. First, modifying only the class token of the text prompt results in a greater impact on the Class Activation Map (CAM), compared to arguably more complex strategies that optimize the context. And second, the class token associated with the image ground truth does not necessarily correspond to the category that yields the best CAM. Motivated by these observations, we introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy. Through extensive experiments we demonstrate that our simple, yet efficient approach achieves SOTA performance in a well-known WSSS benchmark. These results highlight not only the benefits of language-vision models in WSSS but also the potential of prompt learning for this problem. The code is available at https://github.com/rB080/WSS_POLE.
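The finding that tuning only the class token matters can be pictured with a generic prompt-tuning sketch: the context token embeddings stay frozen while a single learnable class-token embedding is optimized to align the text feature with an image feature. The tiny stand-in text encoder and all names below are assumptions for illustration; the paper builds on an actual CLIP text encoder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, ctx_len = 512, 4

# Frozen context embeddings standing in for "a photo of a" style tokens.
context_tokens = torch.randn(ctx_len, embed_dim)

# The only trainable part in this sketch: the class-token embedding.
class_token = nn.Parameter(torch.randn(1, embed_dim))

# Stand-in for a (frozen) pretrained text encoder: mean-pool + projection.
text_proj = nn.Linear(embed_dim, embed_dim)
for p in text_proj.parameters():
    p.requires_grad_(False)

def encode_prompt() -> torch.Tensor:
    tokens = torch.cat([context_tokens, class_token], dim=0)  # (ctx_len+1, D)
    return F.normalize(text_proj(tokens.mean(dim=0)), dim=-1)

# Placeholder image feature from a frozen image encoder.
image_feature = F.normalize(torch.randn(embed_dim), dim=-1)

optimizer = torch.optim.Adam([class_token], lr=1e-2)
for _ in range(10):
    optimizer.zero_grad()
    loss = 1.0 - torch.dot(encode_prompt(), image_feature)  # maximize cosine
    loss.backward()
    optimizer.step()
print(float(loss))
```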

A Parts Based Registration Loss for Detecting Knee Joint Areas

  • paper_url: http://arxiv.org/abs/2307.00083
  • repo_url: None
  • paper_authors: Juha Tiirola
  • for: This paper aims to improve the accuracy and efficiency of fine-tuned registration of knee joint areas.
  • methods: It uses a parts-based loss for fine-tuning registration, where parts are abstract feature vectors with locations that are automatically selected from a reference image; for a test image, the detected parts are encouraged to have a spatial configuration similar to that of the corresponding parts in the reference image.
  • results: Experiments indicate that the proposed fine-tuning approach improves the accuracy and efficiency of matching knee joint areas.
    Abstract In this paper, a parts based loss is considered for finetune registering knee joint areas. Here the parts are defined as abstract feature vectors with location and they are automatically selected from a reference image. For a test image the detected parts are encouraged to have a similar spatial configuration than the corresponding parts in the reference image.
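One possible reading of the "similar spatial configuration" constraint is a penalty on how the pairwise geometry of matched parts differs between the reference and test images. The sketch below is a guess along those lines, not the paper's actual loss; all names are assumptions:

```python
import torch

def spatial_configuration_loss(ref_locs: torch.Tensor,
                               test_locs: torch.Tensor) -> torch.Tensor:
    """Penalize differences between the pairwise distance matrices of
    corresponding part locations in the reference and test images."""
    ref_d = torch.cdist(ref_locs, ref_locs)     # (P, P) reference geometry
    test_d = torch.cdist(test_locs, test_locs)  # (P, P) detected geometry
    return (ref_d - test_d).abs().mean()

ref_parts = torch.rand(8, 2) * 224                  # (x, y) of reference parts
test_parts = ref_parts + torch.randn(8, 2) * 3.0    # noisy detections
print(float(spatial_configuration_loss(ref_parts, test_parts)))
```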

Situated Cameras, Situated Knowledges: Towards an Egocentric Epistemology for Computer Vision

  • paper_url: http://arxiv.org/abs/2307.00064
  • repo_url: None
  • paper_authors: Samuel Goree, David Crandall
  • for: To relate feminist epistemology of scientific knowledge to egocentric computer vision.
  • methods: Collapsing Haraway's vision-and-perspective metaphor into a literal vision context, the paper explores the interactions between feminist epistemology and egocentric CV as "Egocentric Epistemology".
  • results: It argues for the use of qualitative, human-centric methods as a complement to performance benchmarks, to center both the literal and metaphorical perspective of human crowd workers in CV.
    Abstract In her influential 1988 paper, Situated Knowledges, Donna Haraway uses vision and perspective as a metaphor to discuss scientific knowledge. Today, egocentric computer vision discusses many of the same issues, except in a literal vision context. In this short position paper, we collapse that metaphor, and explore the interactions between feminist epistemology and egocentric CV as "Egocentric Epistemology." Using this framework, we argue for the use of qualitative, human-centric methods as a complement to performance benchmarks, to center both the literal and metaphorical perspective of human crowd workers in CV.

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

  • paper_url: http://arxiv.org/abs/2306.17848
  • repo_url: None
  • paper_authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
  • for: The paper investigates whether CNNs can be trained to match the ability of ViTs to ignore out-of-context information and handle occlusions, using a data augmentation method called Patch Mixing.
  • methods: Patch Mixing is used to train both CNNs and ViTs, and their performance is assessed on occlusion benchmarks.
  • results: ViTs neither improve nor degrade when trained with Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks.
    Abstract Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance. For example, ViTs have interesting properties with respect to early layer non-local feature dependence, as well as self-attention mechanisms which enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. We hypothesize that this power to ignore out-of-context information (which we name $\textit{patch selectivity}$), while integrating in-context information in a non-local manner in early layers, allows ViTs to more easily handle occlusion. In this study, our aim is to see whether we can have CNNs $\textit{simulate}$ this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes. Specifically, we use Patch Mixing to train state-of-the-art ViTs and CNNs, assessing its impact on their ability to ignore out-of-context patches and handle natural occlusions. We find that ViTs do not improve nor degrade when trained using Patch Mixing, but CNNs acquire new capabilities to ignore out-of-context information and improve on occlusion benchmarks, leaving us to conclude that this training method is a way of simulating in CNNs the abilities that ViTs already possess. We will release our Patch Mixing implementation and proposed datasets for public use. Project page: https://arielnlee.github.io/PatchMixing/
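The Patch Mixing augmentation described above, pasting patches from a second image onto the training image and interpolating the labels, can be sketched as follows. This is a straightforward reading of the abstract rather than the authors' released implementation, and all parameter names and defaults are assumptions:

```python
import torch
import torch.nn.functional as F

def patch_mixing(images: torch.Tensor, labels: torch.Tensor, num_classes: int,
                 patch: int = 16, mix_ratio: float = 0.3):
    """Replace a random subset of patches in each image with patches from a
    randomly paired image; labels are interpolated by the replaced fraction."""
    b, c, h, w = images.shape
    perm = torch.randperm(b)
    gh, gw = h // patch, w // patch

    # Random binary mask over the patch grid, upsampled to pixel resolution.
    grid_mask = (torch.rand(b, 1, gh, gw) < mix_ratio).float()
    pixel_mask = F.interpolate(grid_mask, size=(h, w), mode="nearest")

    mixed = images * (1 - pixel_mask) + images[perm] * pixel_mask

    lam = grid_mask.mean(dim=(1, 2, 3))                  # replaced fraction
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = (1 - lam).unsqueeze(1) * one_hot + lam.unsqueeze(1) * one_hot[perm]
    return mixed, mixed_labels

imgs, lbls = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
mixed_imgs, mixed_lbls = patch_mixing(imgs, lbls, num_classes=1000)
print(mixed_imgs.shape, mixed_lbls.shape)
```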

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

  • paper_url: http://arxiv.org/abs/2306.17843
  • repo_url: https://github.com/guochengqian/magic123
  • paper_authors: Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem
  • for: Generating high-quality, textured 3D meshes from a single unposed image in the wild.
  • methods: A two-stage coarse-to-fine approach: the first stage optimizes a neural radiance field to produce coarse geometry, and the second stage adopts a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture, with both stages supervised by reference views and novel views guided by a combination of 2D and 3D diffusion priors.
  • results: Demonstrates significant improvements over previous image-to-3D techniques on synthetic benchmarks and diverse real-world images; code, models, and generated 3D assets are available at https://github.com/guochengqian/Magic123.
    Abstract We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors. We introduce a single trade-off parameter between the 2D and 3D priors to control exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, we employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images. Our code, models, and generated 3D assets are available at https://github.com/guochengqian/Magic123.

Federated Ensemble YOLOv5 - A Better Generalized Object Detection Algorithm

  • paper_url: http://arxiv.org/abs/2306.17829
  • repo_url: None
  • paper_authors: Vinit Hegiste, Tatjana Legler, Martin Ruskowski
  • for: This work examines the application of Federated Learning (FL) to object detection as a way to enhance generalizability and compares its performance against a centralized training approach.
  • methods: A YOLOv5 model is trained with FL (in the spirit of Federated Averaging, FedAvg, and Federated SGD, FedSGD) across multiple clients, using a random sampling strategy without replacement so that each client holds a portion of the same dataset used for centralized training.
  • results: The FL-trained global model generates more accurate bounding boxes on a test set mixing objects from two distinct clients not represented in the training data, suggesting that FL can be viewed as an ensemble-style aggregation akin to a synergistic blend of Bagging and Boosting, and thus not only as a privacy-preserving method but also as one that can improve model performance.
    Abstract Federated learning (FL) has gained significant traction as a privacy-preserving algorithm, but the underlying resembles of federated learning algorithm like Federated averaging (FED Avg) or Federated SGD (FED SGD) to ensemble learning algorithms has not been fully explored. The purpose of this paper is to examine the application of FL to object detection as a method to enhance generalizability, and to compare its performance against a centralized training approach for an object detection algorithm. Specifically, we investigate the performance of a YOLOv5 model trained using FL across multiple clients and employ a random sampling strategy without replacement, so each client holds a portion of the same dataset used for centralized training. Our experimental results showcase the superior efficiency of the FL object detector's global model in generating accurate bounding boxes for unseen objects, with the test set being a mixture of objects from two distinct clients not represented in the training dataset. These findings suggest that FL can be viewed from an ensemble algorithm perspective, akin to a synergistic blend of Bagging and Boosting techniques. As a result, FL can be seen not only as a method to enhance privacy, but also as a method to enhance the performance of a machine learning model.
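The aggregation behind Federated Averaging, which the abstract relates to ensemble methods, amounts to a data-weighted average of client model weights that forms the global model. A minimal sketch with a toy model standing in for YOLOv5 is shown below; names are illustrative:

```python
import copy
import torch
import torch.nn as nn

def fed_avg(client_states, client_sizes):
    """Weighted average of client state_dicts, weights proportional to the
    number of local samples (standard FedAvg aggregation)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Toy detector backbone standing in for YOLOv5.
def make_model():
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 8, 3, padding=1))

clients = [make_model() for _ in range(3)]      # locally trained copies
sizes = [1200, 800, 1500]                       # local dataset sizes
global_state = fed_avg([m.state_dict() for m in clients], sizes)

global_model = make_model()
global_model.load_state_dict(global_state)
print({k: v.shape for k, v in global_state.items()})
```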

Look, Remember and Reason: Visual Reasoning with Grounded Rationales

  • paper_url: http://arxiv.org/abs/2306.17778
  • repo_url: None
  • paper_authors: Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic
  • for: This work studies how well large language models perform complex visual reasoning and how human-style visual problem solving can improve it.
  • methods: Inspired by the human three-step process of "Look, Remember, Reason", visual information is extracted incrementally using low-level visual routines; rationales over the visual input allow low-level visual capabilities such as object recognition and tracking to be integrated as surrogate tasks, with minimal changes to the model architecture.
  • results: The approach achieves competitive performance on diverse visual reasoning tasks from the CLEVR, CATER, and ACRE datasets, compared with state-of-the-art models designed specifically for these tasks.
    Abstract Large language models have recently shown human level performance on a variety of reasoning tasks. However, the ability of these models to perform complex visual reasoning has not been studied in detail yet. A key challenge in many visual reasoning tasks is that the visual information needs to be tightly integrated in the reasoning process. We propose to address this challenge by drawing inspiration from human visual problem solving which depends on a variety of low-level visual capabilities. It can often be cast as the three step-process of ``Look, Remember, Reason'': visual information is incrementally extracted using low-level visual routines in a step-by-step fashion until a final answer is reached. We follow the same paradigm to enable existing large language models, with minimal changes to the architecture, to solve visual reasoning problems. To this end, we introduce rationales over the visual input that allow us to integrate low-level visual capabilities, such as object recognition and tracking, as surrogate tasks. We show competitive performance on diverse visual reasoning tasks from the CLEVR, CATER, and ACRE datasets over state-of-the-art models designed specifically for these tasks.

MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying

  • paper_url: http://arxiv.org/abs/2306.17770
  • repo_url: https://github.com/sshaoshuai/mtr
  • paper_authors: Shaoshuai Shi, Li Jiang, Dengxin Dai, Bernt Schiele
  • for: To improve the motion prediction capability of autonomous driving systems, enabling a better understanding of traffic participants' behaviors and complex environmental contexts.
  • methods: Proposes the Motion TRansformer (MTR) framework, which uses a transformer encoder-decoder structure with learnable intention queries for efficient and accurate prediction of future trajectories, comprising global intention localization and local movement refinement.
  • results: MTR achieves state-of-the-art performance on highly competitive motion prediction benchmarks, while the extended MTR++ framework, which predicts multimodal motion for multiple agents simultaneously, surpasses MTR with enhanced performance and efficiency.
    Abstract Motion prediction is crucial for autonomous driving systems to understand complex driving scenarios and make informed decisions. However, this task is challenging due to the diverse behaviors of traffic participants and complex environmental contexts. In this paper, we propose Motion TRansformer (MTR) frameworks to address these challenges. The initial MTR framework utilizes a transformer encoder-decoder structure with learnable intention queries, enabling efficient and accurate prediction of future trajectories. By customizing intention queries for distinct motion modalities, MTR improves multimodal motion prediction while reducing reliance on dense goal candidates. The framework comprises two essential processes: global intention localization, identifying the agent's intent to enhance overall efficiency, and local movement refinement, adaptively refining predicted trajectories for improved accuracy. Moreover, we introduce an advanced MTR++ framework, extending the capability of MTR to simultaneously predict multimodal motion for multiple agents. MTR++ incorporates symmetric context modeling and mutually-guided intention querying modules to facilitate future behavior interaction among multiple agents, resulting in scene-compliant future trajectories. Extensive experimental results demonstrate that the MTR framework achieves state-of-the-art performance on the highly-competitive motion prediction benchmarks, while the MTR++ framework surpasses its precursor, exhibiting enhanced performance and efficiency in predicting accurate multimodal future trajectories for multiple agents.
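The learnable intention queries at the heart of MTR can be pictured as a small set of query embeddings, one per motion mode, that cross-attend to the encoded scene and regress one trajectory per mode. The sketch below is a schematic reading of that idea built from standard PyTorch transformer modules, not the released MTR code; shapes and names are assumptions:

```python
import torch
import torch.nn as nn

class IntentionDecoder(nn.Module):
    def __init__(self, d_model: int = 128, num_queries: int = 6,
                 horizon: int = 80, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        # One learnable intention query per motion mode.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.traj_head = nn.Linear(d_model, horizon * 2)   # (x, y) per step
        self.score_head = nn.Linear(d_model, 1)            # mode confidence
        self.horizon = horizon

    def forward(self, scene_tokens: torch.Tensor):
        # scene_tokens: (B, N, d_model) encoded map + agent features.
        b = scene_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1).contiguous()
        h = self.decoder(tgt=q, memory=scene_tokens)       # cross-attend scene
        trajs = self.traj_head(h).view(b, -1, self.horizon, 2)
        scores = self.score_head(h).squeeze(-1)            # (B, K)
        return trajs, scores

decoder = IntentionDecoder()
scene = torch.randn(2, 100, 128)        # placeholder encoder output
trajectories, mode_scores = decoder(scene)
print(trajectories.shape, mode_scores.shape)   # (2, 6, 80, 2), (2, 6)
```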