cs.CV - 2023-07-09

Histopathology Whole Slide Image Analysis with Heterogeneous Graph Representation Learning

  • paper_url: http://arxiv.org/abs/2307.04189
  • repo_url: https://github.com/hku-medai/wsi-hgnn
  • paper_authors: Tsai Hor Chan, Fernando Julio Cendra, Lan Ma, Guosheng Yin, Lequan Yu
  • for: This work proposes a heterogeneous-graph-based framework for whole-slide histopathology image (WSI) analysis that exploits the complex structural relationships among different types of nuclei.
  • methods: The WSI is formulated as a heterogeneous graph, a heterogeneous-graph edge attribute transformer (HEAT) performs message aggregation, and a pseudo-label-based semantic-consistent pooling mechanism produces graph-level features.
  • results: Extensive experiments on three public TCGA benchmark datasets show that the framework outperforms state-of-the-art methods by considerable margins across various tasks.
    Abstract Graph-based methods have been extensively applied to whole-slide histopathology image (WSI) analysis due to the advantage of modeling the spatial relationships among different entities. However, most of the existing methods focus on modeling WSIs with homogeneous graphs (e.g., with homogeneous node type). Despite their successes, these works are incapable of mining the complex structural relations between biological entities (e.g., the diverse interaction among different cell types) in the WSI. We propose a novel heterogeneous graph-based framework to leverage the inter-relationships among different types of nuclei for WSI analysis. Specifically, we formulate the WSI as a heterogeneous graph with "nucleus-type" attribute to each node and a semantic similarity attribute to each edge. We then present a new heterogeneous-graph edge attribute transformer (HEAT) to take advantage of the edge and node heterogeneity during message aggregation. Further, we design a new pseudo-label-based semantic-consistent pooling mechanism to obtain graph-level features, which can mitigate the over-parameterization issue of conventional cluster-based pooling. Additionally, observing the limitations of existing association-based localization methods, we propose a causal-driven approach attributing the contribution of each node to improve the interpretability of our framework. Extensive experiments on three public TCGA benchmark datasets demonstrate that our framework outperforms the state-of-the-art methods with considerable margins on various tasks. Our codes are available at https://github.com/HKU-MedAI/WSI-HGNN.
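As a rough illustration of the heterogeneous-graph formulation described in the abstract (not the authors' implementation), the sketch below builds a toy WSI graph in which each node carries a nucleus-type attribute and each edge a semantic-similarity attribute computed from patch features; the function name and shapes are hypothetical.

```python
# Hypothetical sketch: nodes are nuclei with a "nucleus-type" attribute, edges carry a
# semantic similarity attribute computed from node features.
import torch
import torch.nn.functional as F

def build_heterogeneous_wsi_graph(coords, feats, nucleus_types, k=6):
    """coords: (N, 2) nucleus centroids, feats: (N, D) patch features,
    nucleus_types: (N,) integer type ids (e.g. tumor / lymphocyte / stroma)."""
    dist = torch.cdist(coords, coords)                      # (N, N) spatial distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]    # k nearest neighbours, drop self
    src = torch.arange(coords.size(0)).repeat_interleave(k)
    dst = knn.reshape(-1)
    edge_index = torch.stack([src, dst])                    # (2, E)
    # semantic-similarity edge attribute: cosine similarity of endpoint features
    sim = F.cosine_similarity(feats[src], feats[dst], dim=-1)
    return {"edge_index": edge_index,
            "edge_attr": sim.unsqueeze(-1),                 # (E, 1)
            "node_type": nucleus_types,                     # node heterogeneity
            "x": feats}

if __name__ == "__main__":
    N, D = 32, 16
    graph = build_heterogeneous_wsi_graph(torch.rand(N, 2), torch.randn(N, D),
                                          torch.randint(0, 4, (N,)))
    print(graph["edge_index"].shape, graph["edge_attr"].shape)
```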

Predictive Coding For Animation-Based Video Compression

  • paper_url: http://arxiv.org/abs/2307.04187
  • repo_url: None
  • paper_authors: Goluck Konuko, Stéphane Lathuilière, Giuseppe Valenzise
  • for: Improving video compression efficiency for conferencing-type applications.
  • methods: A predictive coding scheme based on image animation that uses the animation model as a predictor and codes the residual with respect to the actual target frame.
  • results: Bitrate reductions of more than 70% compared to HEVC and more than 30% compared to VVC on a dataset of talking-head videos.
    Abstract We address the problem of efficiently compressing video for conferencing-type applications. We build on recent approaches based on image animation, which can achieve good reconstruction quality at very low bitrate by representing face motions with a compact set of sparse keypoints. However, these methods encode video in a frame-by-frame fashion, i.e. each frame is reconstructed from a reference frame, which limits the reconstruction quality when the bandwidth is larger. Instead, we propose a predictive coding scheme which uses image animation as a predictor, and codes the residual with respect to the actual target frame. The residuals can be in turn coded in a predictive manner, thus efficiently removing temporal dependencies. Our experiments indicate a significant bitrate gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC, on a dataset of talking-head videos.
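A minimal sketch of the predictive-coding idea described above: the keypoint-based animation model acts as a predictor and only the residual to the true target frame is coded. `animate` and `quantize` are hypothetical stand-ins for the animation generator and residual codec used in the paper.

```python
# Sketch of animation-based predictive coding: predict, code the residual, reconstruct.
import numpy as np

def animate(reference, keypoints):
    # placeholder predictor: the real system warps the reference frame
    # according to sparse face keypoints; here we simply return the reference.
    return reference

def quantize(residual, step=8.0):
    return np.round(residual / step) * step        # coarse scalar quantization

def encode_sequence(reference, targets, keypoints_seq):
    coded = []
    prev_recon = reference
    for target, kps in zip(targets, keypoints_seq):
        prediction = animate(prev_recon, kps)      # animation-based prediction
        residual_q = quantize(target - prediction) # code residual w.r.t. target
        prev_recon = prediction + residual_q       # decoder-side reconstruction
        coded.append(residual_q)
    return coded

if __name__ == "__main__":
    ref = np.zeros((4, 4), dtype=np.float32)
    frames = [ref + i for i in range(1, 4)]
    residuals = encode_sequence(ref, frames, [None] * 3)
    print([float(np.abs(r).mean()) for r in residuals])
```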

Reducing False Alarms in Video Surveillance by Deep Feature Statistical Modeling

  • paper_url: http://arxiv.org/abs/2307.04159
  • repo_url: None
  • paper_authors: Xavier Bou, Aitor Artola, Thibaud Ehret, Gabriele Facciolo, Jean-Michel Morel, Rafael Grompone von Gioi
  • for: Reducing the number of false alarms in video surveillance change detection.
  • methods: A method-agnostic, weakly supervised a-contrario validation process based on high-dimensional statistical modeling of deep features.
  • results: Evaluation at both pixel and object level on six methods and several sequences from different datasets shows that the proposed a-contrario validation largely reduces the number of false alarms.
    Abstract Detecting relevant changes is a fundamental problem of video surveillance. Because of the high variability of data and the difficulty of properly annotating changes, unsupervised methods dominate the field. Arguably one of the most critical issues to make them practical is to reduce their false alarm rate. In this work, we develop a method-agnostic weakly supervised a-contrario validation process, based on high dimensional statistical modeling of deep features, to reduce the number of false alarms of any change detection algorithm. We also raise the insufficiency of the conventionally used pixel-wise evaluation, as it fails to precisely capture the performance needs of most real applications. For this reason, we complement pixel-wise metrics with object-wise metrics and evaluate the impact of our approach at both pixel and object levels, on six methods and several sequences from different datasets. Experimental results reveal that the proposed a-contrario validation is able to largely reduce the number of false alarms at both pixel and object levels.
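The sketch below is a generic a-contrario validation step in the spirit of the abstract: deep features of candidate change regions are modeled with a high-dimensional Gaussian background model, and a detection is kept only if its Number of False Alarms (NFA) is small. It is an illustrative approximation under assumed Gaussianity, not the paper's statistical model.

```python
# Generic a-contrario validation: background Gaussian model + NFA thresholding.
import numpy as np
from scipy.stats import chi2

def fit_background(features):
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, np.linalg.inv(cov)

def nfa(feature, mu, cov_inv, n_tests):
    # Mahalanobis distance of the candidate to the background model;
    # under the background hypothesis d2 follows a chi-square with D degrees of freedom.
    d2 = float((feature - mu) @ cov_inv @ (feature - mu))
    p_value = chi2.sf(d2, df=len(feature))
    return n_tests * p_value                 # expected number of false alarms

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = rng.normal(size=(500, 8))       # deep features of "normal" regions
    mu, cov_inv = fit_background(background)
    candidate = rng.normal(size=8) + 4.0         # a strongly deviating detection
    keep = nfa(candidate, mu, cov_inv, n_tests=10_000) < 1.0
    print("detection kept:", keep)
```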

DIFF-NST: Diffusion Interleaving For deFormable Neural Style Transfer

  • paper_url: http://arxiv.org/abs/2307.04157
  • repo_url: None
  • paper_authors: Dan Ruta, Gemma Canet Tarrés, Andrew Gilbert, Eli Shechtman, Nicholas Kolkin, John Collomosse
  • for: This work studies how neural techniques can modify the artistic appearance of a content image to match a reference style image.
  • methods: It leverages recent diffusion models such as Stable Diffusion, whose stronger image-generation priors enable new possibilities.
  • results: The proposed approach performs deformable style transfer on top of diffusion models, a capability previous models lacked, and exposes new artistic controls at inference time by leveraging the models' priors.
    Abstract Neural Style Transfer (NST) is the field of study applying neural techniques to modify the artistic appearance of a content image to match the style of a reference style image. Traditionally, NST methods have focused on texture-based image edits, affecting mostly low level information and keeping most image structures the same. However, style-based deformation of the content is desirable for some styles, especially in cases where the style is abstract or the primary concept of the style is in its deformed rendition of some content. With the recent introduction of diffusion models, such as Stable Diffusion, we can access far more powerful image generation techniques, enabling new possibilities. In our work, we propose using this new class of models to perform style transfer while enabling deformable style transfer, an elusive capability in previous models. We show how leveraging the priors of these models can expose new artistic controls at inference time, and we document our findings in exploring this new direction for the field of style transfer.

A Survey on Figure Classification Techniques in Scientific Documents

  • paper_url: http://arxiv.org/abs/2307.05694
  • repo_url: None
  • paper_authors: Anurag Dhote, Mohammed Javed, David S Doermann
  • for: A systematic categorization of figures in scientific documents into five classes (tables, photos, diagrams, maps, and plots), together with a review of existing methods and datasets for figure classification.
  • methods: The survey covers approaches that apply Artificial Intelligence and Machine Learning techniques to extract data directly from figures, particularly tables, diagrams, and plots.
  • results: Existing methodologies and datasets are critically reviewed, current research gaps are identified, and possible directions for further research on figure classification are suggested.
    Abstract Figures visually represent an essential piece of information and provide an effective means to communicate scientific facts. Recently there have been many efforts toward extracting data directly from figures, specifically from tables, diagrams, and plots, using different Artificial Intelligence and Machine Learning techniques. This is because extracting information from figures could lead to deeper insights into the concepts highlighted in the scientific documents. In this survey paper, we systematically categorize figures into five classes - tables, photos, diagrams, maps, and plots, and subsequently present a critical review of the existing methodologies and data sets that address the problem of figure classification. Finally, we identify the current research gaps and provide possible directions for further research on figure classification.

ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2307.04136
  • repo_url: https://github.com/zylbuaa/ecl
  • paper_authors: Yilan Zhang, Jianqi Chen, Ke Wang, Fengying Xie
  • for: Addressing the imbalanced (long-tailed) class distribution of skin lesion image datasets, which makes computer-aided skin disease diagnosis harder.
  • methods: Class-Enhancement Contrastive Learning (ECL), which enriches the information of minority classes and treats classes equally: a hybrid-proxy model generates class-dependent proxies optimized with a cycle update strategy, a balanced-hybrid-proxy loss exploits relations between samples and proxies, and a balanced-weighted cross-entropy loss follows a curriculum learning schedule to account for both imbalanced data and imbalanced diagnosis difficulty.
  • results: Experiments on imbalanced skin lesion classification demonstrate the superiority and effectiveness of the method.
    Abstract Skin image datasets often suffer from imbalanced data distribution, exacerbating the difficulty of computer-aided skin disease diagnosis. Some recent works exploit supervised contrastive learning (SCL) for this long-tailed challenge. Despite achieving significant performance, these SCL-based methods focus more on head classes, yet ignoring the utilization of information in tail classes. In this paper, we propose class-Enhancement Contrastive Learning (ECL), which enriches the information of minority classes and treats different classes equally. For information enhancement, we design a hybrid-proxy model to generate class-dependent proxies and propose a cycle update strategy for parameters optimization. A balanced-hybrid-proxy loss is designed to exploit relations between samples and proxies with different classes treated equally. Taking both "imbalanced data" and "imbalanced diagnosis difficulty" into account, we further present a balanced-weighted cross-entropy loss following curriculum learning schedule. Experimental results on the classification of imbalanced skin lesion data have demonstrated the superiority and effectiveness of our method.
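An illustrative sketch of a balanced-weighted cross-entropy with a curriculum-style schedule, in the spirit of the loss described above; the exact ECL formulation, including the hybrid proxies and balanced-hybrid-proxy loss, is more involved than this, and the schedule here is an assumption.

```python
# Class-balanced weighted cross-entropy ramped in with a curriculum-style schedule.
import torch
import torch.nn.functional as F

def balanced_weighted_ce(logits, targets, class_counts, epoch, max_epoch):
    # inverse-frequency class weights, interpolated from uniform weights
    # to fully balanced weights as training progresses (curriculum schedule)
    inv_freq = 1.0 / class_counts.float()
    balanced_w = inv_freq * len(class_counts) / inv_freq.sum()
    alpha = min(epoch / max_epoch, 1.0)
    weights = (1 - alpha) * torch.ones_like(balanced_w) + alpha * balanced_w
    return F.cross_entropy(logits, targets, weight=weights)

if __name__ == "__main__":
    logits = torch.randn(16, 7)                       # 7 skin-lesion classes
    targets = torch.randint(0, 7, (16,))
    counts = torch.tensor([4000, 2500, 900, 600, 300, 120, 80])
    print(balanced_weighted_ce(logits, targets, counts, epoch=10, max_epoch=50))
```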

Ultrasonic Image’s Annotation Removal: A Self-supervised Noise2Noise Approach

  • paper_url: http://arxiv.org/abs/2307.04133
  • repo_url: https://github.com/grandarth/ultrasonicimage-n2n-approach
  • paper_authors: Yuanheng Zhang, Nan Jiang, Zhaoheng Xie, Junying Cao, Yueyang Teng
  • for: Automatically detecting and removing annotations in medical ultrasound images, avoiding cumbersome manual inspection and the human labour of collecting paired training data.
  • methods: A self-supervised pretext task that treats annotations as noise and trains a model under the Noise2Noise scheme to restore the image to a clean state.
  • results: Most models trained under the Noise2Noise scheme outperform counterparts trained with noisy-clean data pairs; a customized U-Net achieves the best results on the body-marker annotation dataset, with high segmentation precision and reconstruction similarity.
    Abstract Accurately annotated ultrasonic images are vital components of a high-quality medical report. Hospitals often have strict guidelines on the types of annotations that should appear on imaging results. However, manually inspecting these images can be a cumbersome task. While a neural network could potentially automate the process, training such a model typically requires a dataset of paired input and target images, which in turn involves significant human labour. This study introduces an automated approach for detecting annotations in images. This is achieved by treating the annotations as noise, creating a self-supervised pretext task and using a model trained under the Noise2Noise scheme to restore the image to a clean state. We tested a variety of model structures on the denoising task against different types of annotation, including body marker annotation, radial line annotation, etc. Our results demonstrate that most models trained under the Noise2Noise scheme outperformed their counterparts trained with noisy-clean data pairs. The customized U-Net yielded the best outcome on the body marker annotation dataset, with high scores on segmentation precision and reconstruction similarity. We released our code at https://github.com/GrandArth/UltrasonicImage-N2N-Approach.
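A sketch of the Noise2Noise-style training idea described above: annotations are treated as noise, so the network is trained to map one annotated version of an image to another, independently annotated version of the same image. The overlay function and tiny network below are simplified placeholders, not the paper's models.

```python
# Noise2Noise training on synthetically annotated ultrasound patches.
import torch
import torch.nn as nn

def paste_random_annotation(img):
    # crude stand-in for body-marker / radial-line overlays: draw a bright bar
    out = img.clone()
    b, _, h, w = out.shape
    for i in range(b):
        y = torch.randint(0, h - 2, (1,)).item()
        out[i, :, y:y + 2, :] = 1.0
    return out

class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x)

if __name__ == "__main__":
    model = TinyDenoiser()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    clean = torch.rand(4, 1, 64, 64)                 # unlabeled ultrasound patches
    for _ in range(5):
        noisy_in = paste_random_annotation(clean)    # two independent corruptions
        noisy_tgt = paste_random_annotation(clean)   # serve as input / target pair
        loss = nn.functional.mse_loss(model(noisy_in), noisy_tgt)
        opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```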

Cross-modal Orthogonal High-rank Augmentation for RGB-Event Transformer-trackers

  • paper_url: http://arxiv.org/abs/2307.04129
  • repo_url: https://github.com/ZHU-Zhiyu/High-Rank_RGB-Event_Tracker
  • paper_authors: Zhiyu Zhu, Junhui Hou, Dapeng Oliver Wu
  • for: Cross-modal object tracking from RGB videos and event data.
  • methods: Instead of building a complex cross-modal fusion network, the work explores the potential of a pre-trained vision Transformer (ViT) with plug-and-play training augmentations: a mask modeling strategy that randomly masks tokens of a specific modality to encourage proactive cross-modal interaction, and an orthogonal high-rank loss that regularizes the attention matrix to mitigate the oscillations caused by masking.
  • results: The plug-and-play training augmentations significantly boost state-of-the-art one-stream and two-stream trackers in both tracking precision and success rate; the code will be released publicly.
    Abstract This paper addresses the problem of cross-modal object tracking from RGB videos and event data. Rather than constructing a complex cross-modal fusion network, we explore the great potential of a pre-trained vision Transformer (ViT). Particularly, we delicately investigate plug-and-play training augmentations that encourage the ViT to bridge the vast distribution gap between the two modalities, enabling comprehensive cross-modal information interaction and thus enhancing its ability. Specifically, we propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce tokens from different modalities to interact proactively. To mitigate network oscillations resulting from the masking strategy and further amplify its positive effect, we then theoretically propose an orthogonal high-rank loss to regularize the attention matrix. Extensive experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers to a large extent in terms of both tracking precision and success rate. Our new perspective and findings will potentially bring insights to the field of leveraging powerful pre-trained ViTs to model cross-modal data. The code will be publicly available.
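A hedged sketch of the mask modeling strategy described above: for a fraction of aligned token positions, the tokens of one randomly chosen modality (RGB or event) are replaced by a learnable mask token, forcing the attention to rely on the other modality. Shapes and names are illustrative, not the authors' implementation, and the orthogonal high-rank loss is not reproduced.

```python
# Random modality-token masking for a two-modality ViT input.
import torch
import torch.nn as nn

class ModalityMasker(nn.Module):
    def __init__(self, dim, mask_ratio=0.3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.mask_ratio = mask_ratio

    def forward(self, rgb_tokens, event_tokens):
        # rgb_tokens, event_tokens: (B, N, D) aligned token sequences
        b, n, _ = rgb_tokens.shape
        masked = torch.rand(b, n, device=rgb_tokens.device) < self.mask_ratio
        pick_rgb = torch.rand(b, n, device=rgb_tokens.device) < 0.5
        rgb_out = torch.where((masked & pick_rgb).unsqueeze(-1),
                              self.mask_token.expand_as(rgb_tokens), rgb_tokens)
        evt_out = torch.where((masked & ~pick_rgb).unsqueeze(-1),
                              self.mask_token.expand_as(event_tokens), event_tokens)
        return rgb_out, evt_out

if __name__ == "__main__":
    masker = ModalityMasker(dim=256)
    r, e = masker(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
    print(r.shape, e.shape)
```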

Marine Debris Detection in Satellite Surveillance using Attention Mechanisms

  • paper_url: http://arxiv.org/abs/2307.04128
  • repo_url: None
  • paper_authors: Ao Shen, Yijie Zhu, Richard Jiang
  • for: Improving the efficiency and applicability of marine debris localization by combining YOLOv7 instance segmentation with different attention mechanisms.
  • methods: Using a labelled dataset of satellite images containing ocean debris, three attention models are evaluated: lightweight coordinate attention, CBAM (combining spatial and channel attention), and a bottleneck transformer (based on self-attention).
  • results: In box detection, CBAM achieves the best result (F1 score of 77%), ahead of coordinate attention (71%) and YOLOv7/bottleneck transformer (both around 66%). In mask evaluation, CBAM again leads with an F1 score of 73%, coordinate attention and YOLOv7 perform comparably (around 68%/69%), and the bottleneck transformer lags at 56%. CBAM is therefore the most suitable choice for marine debris detection, although the bottleneck transformer detected some regions missed by manual annotation and showed better mask precision on larger debris pieces, suggesting potentially superior practical performance.
    Abstract Marine debris is an important issue for environmental protection, but current methods for locating marine debris are yet limited. In order to achieve higher efficiency and wider applicability in the localization of Marine debris, this study tries to combine the instance segmentation of YOLOv7 with different attention mechanisms and explores the best model. By utilizing a labelled dataset consisting of satellite images containing ocean debris, we examined three attentional models including lightweight coordinate attention, CBAM (combining spatial and channel focus), and bottleneck transformer (based on self-attention). Box detection assessment revealed that CBAM achieved the best outcome (F1 score of 77%) compared to coordinate attention (F1 score of 71%) and YOLOv7/bottleneck transformer (both F1 scores around 66%). Mask evaluation showed CBAM again leading with an F1 score of 73%, whereas coordinate attention and YOLOv7 had comparable performances (around F1 score of 68%/69%) and bottleneck transformer lagged behind at F1 score of 56%. These findings suggest that CBAM offers optimal suitability for detecting marine debris. However, it should be noted that the bottleneck transformer detected some areas missed by manual annotation and displayed better mask precision for larger debris pieces, signifying potentially superior practical performance.
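For reference, below is a compact implementation of CBAM (channel attention followed by spatial attention), the module that performed best in the comparison above. It follows the standard CBAM formulation; how the paper wires it into YOLOv7 is not reproduced here.

```python
# CBAM: channel attention (shared MLP over avg/max pooled descriptors) then spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: conv over channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

if __name__ == "__main__":
    print(CBAM(64)(torch.randn(1, 64, 32, 32)).shape)
```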

HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding

  • paper_url: http://arxiv.org/abs/2307.05721
  • repo_url: None
  • paper_authors: Hao Zheng, Regina Lee, Yuqian Lu
  • for: Understanding comprehensive assembly knowledge from videos, a prerequisite for a future ultra-intelligent industry.
  • methods: HA-ViD, a human assembly video dataset featuring representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations.
  • results: Four foundational video understanding tasks (action recognition, action segmentation, object detection, and multi-object tracking) are benchmarked, and their performance is analyzed for comprehending assembly progress, process efficiency, task collaboration, skill parameters, and human intention.
    Abstract Understanding comprehensive assembly knowledge from videos is critical for futuristic ultra-intelligent industry. To enable technological breakthrough, we present HA-ViD - the first human assembly video dataset that features representative industrial assembly scenarios, natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and granulate action annotations to subject, action verb, manipulated object, target object, and tool. We provide 3222 multi-view, multi-modality videos (each video contains one assembly task), 1.5M frames, 96K temporal labels and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection and multi-object tracking. Importantly, we analyze their performance for comprehending knowledge in assembly progress, process efficiency, task collaboration, skill parameters and human intention. Details of HA-ViD is available at: https://iai-hrc.github.io/ha-vid.

Enhancing Low-Light Images Using Infrared-Encoded Images

  • paper_url: http://arxiv.org/abs/2307.04122
  • repo_url: https://github.com/wyf0912/ELIEI
  • paper_authors: Shulin Tian, Yufei Wang, Renjie Wan, Wenhan Yang, Alex C. Kot, Bihan Wen
  • for: Enhancing the visibility and detail of images captured in low-light environments.
  • methods: Removing the in-camera infrared (IR) cut-off filter so that more photons, including information from the IR spectrum, are captured, improving the signal-to-noise ratio; a paired dataset of such images with long-exposure references taken with an external filter is collected for verification.
  • results: Experiments on the proposed dataset show that the method performs better both quantitatively and qualitatively, preserving brightness, contrast, and texture details.
    Abstract Low-light image enhancement task is essential yet challenging as it is ill-posed intrinsically. Previous arts mainly focus on the low-light images captured in the visible spectrum using pixel-wise loss, which limits the capacity of recovering the brightness, contrast, and texture details due to the small number of income photons. In this work, we propose a novel approach to increase the visibility of images captured under low-light environments by removing the in-camera infrared (IR) cut-off filter, which allows for the capture of more photons and results in improved signal-to-noise ratio due to the inclusion of information from the IR spectrum. To verify the proposed strategy, we collect a paired dataset of low-light images captured without the IR cut-off filter, with corresponding long-exposure reference images with an external filter. The experimental results on the proposed dataset demonstrate the effectiveness of the proposed method, showing better performance quantitatively and qualitatively. The dataset and code are publicly available at https://wyf0912.github.io/ELIEI/

Mitosis Detection from Partial Annotation by Dataset Generation via Frame-Order Flipping

  • paper_url: http://arxiv.org/abs/2307.04113
  • repo_url: https://github.com/naivete5656/mdpafof
  • paper_authors: Kazuya Nishimura, Ami Katanaya, Shinichiro Chuma, Ryoma Bise
  • for: Improving the accuracy of mitosis detection in biomedical research.
  • methods: Training a deep-learning detection model from partially annotated sequences by generating a fully labeled dataset via frame-order flipping and alpha-blending pasting.
  • results: On four datasets, the method outperforms other approaches that use partially labeled sequences.
    Abstract Detection of mitosis events plays an important role in biomedical research. Deep-learning-based mitosis detection methods have achieved outstanding performance with a certain amount of labeled data. However, these methods require annotations for each imaging condition. Collecting labeled data involves time-consuming human labor. In this paper, we propose a mitosis detection method that can be trained with partially annotated sequences. The base idea is to generate a fully labeled dataset from the partial labels and train a mitosis detection model with the generated dataset. First, we generate an image pair not containing mitosis events by frame-order flipping. Then, we paste mitosis events to the image pair by alpha-blending pasting and generate a fully labeled dataset. We demonstrate the performance of our method on four datasets, and we confirm that our method outperforms other comparisons which use partially labeled sequences.
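An illustrative sketch of the dataset-generation idea described above: a frame pair without mitosis appearance is created by flipping the temporal order of two frames, and annotated mitosis patches are then alpha-blended into the pair at a chosen position to obtain a fully labeled training sample. This is a simplification, not the paper's code.

```python
# Frame-order flipping + alpha-blending pasting to synthesize fully labeled pairs.
import numpy as np

def flip_frame_order(frame_t, frame_t1):
    # reversing time removes the characteristic appearance of a division event
    return frame_t1, frame_t

def alpha_blend_paste(frame, patch, top, left, alpha=0.7):
    out = frame.copy()
    h, w = patch.shape
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1 - alpha) * region
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0, f1 = rng.random((128, 128)), rng.random((128, 128))
    bg0, bg1 = flip_frame_order(f0, f1)               # mitosis-free pair
    mitosis_patch = rng.random((16, 16))              # cropped from a partial label
    y, x = 40, 60                                     # pasted location becomes the positive label
    pair = (alpha_blend_paste(bg0, mitosis_patch, y, x),
            alpha_blend_paste(bg1, mitosis_patch, y, x))
    print(pair[0].shape, pair[1].shape, (y, x))
```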

Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird’s Eye View

  • paper_url: http://arxiv.org/abs/2307.04106
  • repo_url: None
  • paper_authors: Jiayu Yang, Enze Xie, Miaomiao Liu, Jose M. Alvarez
  • for: Improving vision-only perception models for autonomous driving that encode multi-view image features into Bird's-Eye-View (BEV) space.
  • methods: Parametric depth distribution modeling for the feature transformation: 2D image features are lifted into 3D space using a predicted parametric depth distribution for each pixel in each view, and the 3D feature volume is aggregated into the BEV frame based on the derived 3D space occupancy.
  • results: The method outperforms existing approaches on object detection and semantic segmentation, and a novel visibility-aware evaluation metric is proposed that mitigates the hallucination problem.
    Abstract Recent vision-only perception models for autonomous driving achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works rely on non-parametric depth distribution modeling leading to significant memory consumption, or ignore the geometry information to address this problem. In contrast, we propose to use parametric depth distribution modeling for feature transformation. We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view. Then, we aggregate the 3D feature volume based on the 3D space occupancy derived from depth to the BEV frame. Finally, we use the transformed features for downstream tasks such as object detection and semantic segmentation. Existing semantic segmentation methods do also suffer from an hallucination problem as they do not take visibility information into account. This hallucination can be particularly problematic for subsequent modules such as control and planning. To mitigate the issue, our method provides depth uncertainty and reliable visibility-aware estimations. We further leverage our parametric depth modeling to present a novel visibility-aware evaluation metric that, when taken into account, can mitigate the hallucination problem. Extensive experiments on object detection and semantic segmentation on the nuScenes datasets demonstrate that our method outperforms existing methods on both tasks.
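A hedged sketch of the feature-lifting step described above: each pixel predicts a parametric depth distribution (a Gaussian is assumed here), the distribution is evaluated on a set of depth bins, and the 2D feature is spread along the camera ray weighted by those probabilities. Splatting the resulting 3D volume into the BEV grid is omitted, and the parametric family used by the paper may differ.

```python
# Lift 2D features into a camera frustum volume weighted by a parametric depth distribution.
import torch

def lift_with_parametric_depth(feats, mu, sigma, depth_bins):
    """feats: (B, C, H, W), mu/sigma: (B, 1, H, W) predicted depth parameters,
    depth_bins: (D,) discrete depths. Returns (B, C, D, H, W) frustum features."""
    d = depth_bins.view(1, -1, 1, 1)                            # (1, D, 1, 1)
    prob = torch.exp(-0.5 * ((d - mu) / sigma) ** 2)            # Gaussian over bins
    prob = prob / prob.sum(dim=1, keepdim=True).clamp_min(1e-6) # normalize per pixel
    return feats.unsqueeze(2) * prob.unsqueeze(1)               # outer product

if __name__ == "__main__":
    B, C, H, W, D = 1, 8, 16, 40, 32
    frustum = lift_with_parametric_depth(torch.randn(B, C, H, W),
                                         mu=10 + 20 * torch.rand(B, 1, H, W),
                                         sigma=1 + torch.rand(B, 1, H, W),
                                         depth_bins=torch.linspace(2, 50, D))
    print(frustum.shape)   # (1, 8, 32, 16, 40)
```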

CA-CentripetalNet: A novel anchor-free deep learning framework for hardhat wearing detection

  • paper_url: http://arxiv.org/abs/2307.04103
  • repo_url: None
  • paper_authors: Zhijian Liu, Nian Cai, Wensheng Ouyang, Chengbin Zhang, Nili Tian, Han Wang
  • for: Strengthening safety management on construction sites through automatic hardhat-wearing detection.
  • methods: An anchor-free deep learning framework, CA-CentripetalNet, with two novel schemes to improve feature extraction and utilization: vertical-horizontal corner pooling and bounding constrained center attention.
  • results: CA-CentripetalNet achieves 86.63% mAP with less memory consumption at a reasonable speed compared with existing deep learning methods, and performs especially well on small-scale hardhats and non-worn hardhats.
    Abstract Automatic hardhat wearing detection can strengthen the safety management in construction sites, which is still challenging due to complicated video surveillance scenes. To deal with the poor generalization of previous deep learning based methods, a novel anchor-free deep learning framework called CA-CentripetalNet is proposed for hardhat wearing detection. Two novel schemes are proposed to improve the feature extraction and utilization ability of CA-CentripetalNet, which are vertical-horizontal corner pooling and bounding constrained center attention. The former is designed to realize the comprehensive utilization of marginal features and internal features. The latter is designed to enforce the backbone to pay attention to internal features, which is only used during the training rather than during the detection. Experimental results indicate that the CA-CentripetalNet achieves better performance with the 86.63% mAP (mean Average Precision) with less memory consumption at a reasonable speed than the existing deep learning based methods, especially in case of small-scale hardhats and non-worn-hardhats.
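As background for the vertical-horizontal corner pooling mentioned above, the sketch below shows CornerNet/CentripetalNet-style top-left corner pooling: a corner response is built from the running maximum of features to the right (horizontal) and below (vertical). It illustrates the pooling idea only; the paper's bounding constrained center attention is not reproduced.

```python
# Top-left corner pooling via cumulative maxima along the two image directions.
import torch

def top_left_corner_pool(x):
    # x: (B, C, H, W)
    right_to_left = x.flip(-1).cummax(dim=-1).values.flip(-1)   # max over columns to the right
    bottom_to_top = x.flip(-2).cummax(dim=-2).values.flip(-2)   # max over rows below
    return right_to_left + bottom_to_top

if __name__ == "__main__":
    x = torch.randn(1, 4, 8, 8)
    print(top_left_corner_pool(x).shape)
```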

Enhancing Building Semantic Segmentation Accuracy with Super Resolution and Deep Learning: Investigating the Impact of Spatial Resolution on Various Datasets

  • paper_url: http://arxiv.org/abs/2307.04101
  • repo_url: None
  • paper_authors: Zhiling Guo, Xiaodan Shi, Haoran Zhang, Dou Huang, Xiaoya Song, Jinyue Yan, Ryosuke Shibasaki
  • for: Investigating the impact of spatial resolution on deep-learning-based building semantic segmentation and providing insights for data selection and preparation.
  • methods: Remote sensing images of three study areas are resampled to multiple spatial resolutions via super-resolution and down-sampling, and two representative deep learning architectures, UNet and FPN, are trained and tested on them.
  • results: Experiments over three cities with the two models show that spatial resolution greatly influences building segmentation results, with the best cost-effectiveness around 0.3 m.
    Abstract The development of remote sensing and deep learning techniques has enabled building semantic segmentation with high accuracy and efficiency. Despite their success in different tasks, the discussions on the impact of spatial resolution on deep learning based building semantic segmentation are quite inadequate, which makes choosing a higher cost-effective data source a big challenge. To address the issue mentioned above, in this study, we create remote sensing images among three study areas into multiple spatial resolutions by super-resolution and down-sampling. After that, two representative deep learning architectures: UNet and FPN, are selected for model training and testing. The experimental results obtained from three cities with two deep learning models indicate that the spatial resolution greatly influences building segmentation results, and with a better cost-effectiveness around 0.3m, which we believe will be an important insight for data selection and preparation.

Visible and infrared self-supervised fusion trained on a single example

  • paper_url: http://arxiv.org/abs/2307.04100
  • repo_url: None
  • paper_authors: Nati Ofir
  • for: Fusing visible (RGB) and Near-Infrared (NIR) images.
  • methods: A Convolutional Neural Network (CNN) trained by Self-Supervised Learning (SSL) on a single example, using a Structural Similarity (SSIM) loss combined with an Edge-Preservation (EP) loss, with the input channels themselves serving as labels.
  • results: The fusion preserves the relevant detail of each spectral channel without a heavy training process and achieves better qualitative and quantitative multispectral fusion results than other recent methods that do not rely on large-dataset training.
    Abstract This paper addresses the problem of visible (RGB) to Near-Infrared (NIR) image fusion. Multispectral imaging is an important task relevant to image processing and computer vision, even more, since the development of the RGBT sensor. While the visible image sees color and suffers from noise, haze, and clouds, the NIR channel captures a clearer picture and it is significantly required by applications such as dehazing or object detection. The proposed approach fuses these two aligned channels by training a Convolutional-Neural-Network (CNN) by a Self-Supervised-Learning (SSL) on a single example. For each such pair, RGB and IR, the network is trained for seconds to deduce the final fusion. The SSL is based on Structural-Similarity (SSIM) loss combined with Edge-Preservation (EP) loss. The labels for the SSL are the input channels themselves. This fusion preserves the relevant detail of each spectral channel while not based on a heavy training process. In the experiments section, the proposed approach achieves better qualitative and quantitative multispectral fusion results with respect to other recent methods, that are not based on large dataset training.

GNP Attack: Transferable Adversarial Examples via Gradient Norm Penalty

  • paper_url: http://arxiv.org/abs/2307.04099
  • repo_url: None
  • paper_authors: Tao Wu, Tie Luo, Donald C. Wunsch
  • for: Enhancing the cross-model transferability of adversarial examples, enabling practical black-box attacks on diverse target models without insider knowledge.
  • methods: A Gradient Norm Penalty (GNP) attack that drives the loss-function optimization toward flat regions of local optima in the loss landscape, which improves transferability.
  • results: Attacking 11 state-of-the-art deep learning models and 6 advanced defense methods shows that GNP is highly effective at generating adversarial examples with high transferability, and it can easily be combined with other gradient-based methods for stronger transfer-based attacks.
    Abstract Adversarial examples (AE) with good transferability enable practical black-box attacks on diverse target models, where insider knowledge about the target models is not required. Previous methods often generate AE with no or very limited transferability; that is, they easily overfit to the particular architecture and feature representation of the source, white-box model and the generated AE barely work for target, black-box models. In this paper, we propose a novel approach to enhance AE transferability using Gradient Norm Penalty (GNP). It drives the loss function optimization procedure to converge to a flat region of local optima in the loss landscape. By attacking 11 state-of-the-art (SOTA) deep learning models and 6 advanced defense methods, we empirically show that GNP is very effective in generating AE with high transferability. We also demonstrate that it is very flexible in that it can be easily integrated with other gradient based methods for stronger transfer-based attacks.
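A hedged sketch of the gradient norm penalty idea described above: the attack objective is the classification loss minus a penalty on the input-gradient norm, which biases the crafted example toward flat regions of the loss landscape. A single FGSM-like step on a toy model is shown; the paper's full attack, step schedule, and exact formulation may differ.

```python
# One gradient-norm-penalized attack step (requires double backward).
import torch
import torch.nn as nn

def gnp_step(model, x, y, eps=8 / 255, lam=0.1):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x, create_graph=True)[0]
    # penalized objective: high loss but small input-gradient norm (flat optimum)
    objective = loss - lam * grad.flatten(1).norm(dim=1).mean()
    adv_grad = torch.autograd.grad(objective, x)[0]
    return (x + eps * adv_grad.sign()).clamp(0, 1).detach()

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy surrogate model
    x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
    x_adv = gnp_step(model, x, y)
    print((x_adv - x).abs().max())
```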

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.04091
  • repo_url: https://github.com/Jun-CEN/CMDFusion
  • paper_authors: Jun Cen, Shiwei Zhang, Yixuan Pei, Kun Li, Hang Zheng, Maochun Luo, Yingya Zhang, Qifeng Chen
  • for: A bidirectional fusion network with cross-modality knowledge distillation (CMDFusion) for LIDAR semantic segmentation in autonomous driving, addressing the shortcomings of existing 2D-3D fusion schemes.
  • methods: Two contributions: a bidirectional fusion scheme that explicitly and implicitly enhances the 3D features via both 2D-to-3D and 3D-to-2D fusion, surpassing either single scheme; and distillation of 2D knowledge from a camera branch into a 3D network so that 2D information can be generated even for points outside the camera's field of view, removing the need for RGB images during inference.
  • results: CMDFusion achieves the best performance among all fusion-based methods on the SemanticKITTI and nuScenes datasets.
    Abstract 2D RGB images and 3D LIDAR point clouds provide complementary knowledge for the perception system of autonomous vehicles. Several 2D and 3D fusion methods have been explored for the LIDAR semantic segmentation task, but they suffer from different problems. 2D-to-3D fusion methods require strictly paired data during inference, which may not be available in real-world scenarios, while 3D-to-2D fusion methods cannot explicitly make full use of the 2D information. Therefore, we propose a Bidirectional Fusion Network with Cross-Modality Knowledge Distillation (CMDFusion) in this work. Our method has two contributions. First, our bidirectional fusion scheme explicitly and implicitly enhances the 3D feature via 2D-to-3D fusion and 3D-to-2D fusion, respectively, which surpasses either one of the single fusion schemes. Second, we distillate the 2D knowledge from a 2D network (Camera branch) to a 3D network (2D knowledge branch) so that the 3D network can generate 2D information even for those points not in the FOV (field of view) of the camera. In this way, RGB images are not required during inference anymore since the 2D knowledge branch provides 2D information according to the 3D LIDAR input. We show that our CMDFusion achieves the best performance among all fusion-based methods on SemanticKITTI and nuScenes datasets. The code will be released at https://github.com/Jun-CEN/CMDFusion.

SVIT: Scaling up Visual Instruction Tuning

  • paper_url: http://arxiv.org/abs/2307.04087
  • repo_url: https://github.com/baai-dcai/visual-instruction-tuning
  • paper_authors: Bo Zhao, Boya Wu, Tiejun Huang
  • for: Improving the visual understanding, reasoning, and planning capabilities of multimodal models.
  • methods: Scaling up Visual Instruction Tuning (SVIT) with a dataset of 3.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions, generated by prompting GPT-4 with abundant manual image annotations.
  • results: Training multimodal models on SVIT significantly improves multimodal performance in terms of visual perception, reasoning, and planning.
    Abstract Thanks to the emergence of foundation models, the large language and vision models are integrated to acquire the multimodal ability of visual captioning, dialogue, question answering, etc. Although existing multimodal models present impressive performance of visual understanding and reasoning, their limits are still largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning data including 1.6M conversation question-answer (QA) pairs and 1.6M complex reasoning QA pairs and 106K detailed image descriptions. Besides the volume, the proposed dataset is also featured by the high quality and rich diversity, which is generated by prompting GPT-4 with the abundant manual annotations of images. We empirically verify that training multimodal models on SVIT can significantly improve the multimodal performance in terms of visual perception, reasoning and planning.

Score-based Conditional Generation with Fewer Labeled Data by Self-calibrating Classifier Guidance

  • paper_url: http://arxiv.org/abs/2307.04081
  • repo_url: None
  • paper_authors: Paul Kuo-Ming Huang, Si-An Chen, Hsuan-Tien Lin
  • for: Improving the conditional generation quality of classifier-guided score-based generative models (SGMs), especially when fewer labeled data are available.
  • methods: Letting the classifier calibrate itself: principles from energy-based models are used to view the classifier as another view of the unconditional SGM, so that the existing unconditional SGM loss can calibrate the classifier using both labeled and unlabeled data.
  • results: The proposed approach significantly improves conditional generation quality across different percentages of labeled data and is consistently superior to other conditional SGMs when using fewer labeled data.
    Abstract Score-based Generative Models (SGMs) are a popular family of deep generative models that achieves leading image generation quality. Earlier studies have extended SGMs to tackle class-conditional generation by coupling an unconditional SGM with the guidance of a trained classifier. Nevertheless, such classifier-guided SGMs do not always achieve accurate conditional generation, especially when trained with fewer labeled data. We argue that the issue is rooted in unreliable gradients of the classifier and the inability to fully utilize unlabeled data during training. We then propose to improve classifier-guided SGMs by letting the classifier calibrate itself. Our key idea is to use principles from energy-based models to convert the classifier as another view of the unconditional SGM. Then, existing loss for the unconditional SGM can be adopted to calibrate the classifier using both labeled and unlabeled data. Empirical results validate that the proposed approach significantly improves the conditional generation quality across different percentages of labeled data. The improved performance makes the proposed approach consistently superior to other conditional SGMs when using fewer labeled data. The results confirm the potential of the proposed approach for generative modeling with limited labeled data.
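For context, the sketch below shows the standard classifier-guided score combination that the method above builds on: the conditional score is the unconditional SGM score plus the gradient of the classifier's log-probability for the target class. The paper's contribution, calibrating the classifier via an energy-based view and unlabeled data, is not reproduced; `score_model` and `classifier` are toy stand-ins.

```python
# Classifier-guided score: s(x,t | y) = s(x,t) + scale * d/dx log p(y | x).
import torch
import torch.nn as nn

def guided_score(score_model, classifier, x, t, y, guidance_scale=1.0):
    x = x.clone().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x), dim=-1)
    log_p_y = log_probs[torch.arange(x.size(0)), y].sum()
    grad_log_p = torch.autograd.grad(log_p_y, x)[0]       # d/dx log p(y | x_t)
    return score_model(x, t).detach() + guidance_scale * grad_log_p

if __name__ == "__main__":
    D = 16
    score_model = lambda x, t: -x                          # toy unconditional score
    classifier = nn.Linear(D, 5)                           # toy noise-aware classifier
    x, y = torch.randn(8, D), torch.randint(0, 5, (8,))
    print(guided_score(score_model, classifier, x, t=0.5, y=y).shape)
```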

Random Position Adversarial Patch for Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.04066
  • repo_url: None
  • paper_authors: Mingzhen Shao
  • for: Proposing a method for generating adversarial patches that can launch targeted attacks on vision transformers from any position within the field of view, overcoming the alignment constraint of previous studies.
  • methods: A GAN-like structure generates the adversarial patch (G-Patch) instead of directly optimizing the patch with gradients.
  • results: The generated patch achieves universal attacks on vision transformers in both digital and physical-world scenarios and is robust to brightness restriction, color transfer, and random noise; real-world attack experiments validate its effectiveness even under very challenging conditions.
    Abstract Previous studies have shown the vulnerability of vision transformers to adversarial patches, but these studies all rely on a critical assumption: the attack patches must be perfectly aligned with the patches used for linear projection in vision transformers. Due to this stringent requirement, deploying adversarial patches for vision transformers in the physical world becomes impractical, unlike their effectiveness on CNNs. This paper proposes a novel method for generating an adversarial patch (G-Patch) that overcomes the alignment constraint, allowing the patch to launch a targeted attack at any position within the field of view. Specifically, instead of directly optimizing the patch using gradients, we employ a GAN-like structure to generate the adversarial patch. Our experiments show the effectiveness of the adversarial patch in achieving universal attacks on vision transformers, both in digital and physical-world scenarios. Additionally, further analysis reveals that the generated adversarial patch exhibits robustness to brightness restriction, color transfer, and random noise. Real-world attack experiments validate the effectiveness of the G-Patch to launch robust attacks even under some very challenging conditions.
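A hedged sketch of the training idea described above: a generator network produces the patch, the patch is pasted at a random position in each image, and the frozen victim model's loss toward the target class drives the generator, so the patch cannot rely on alignment with the model's projection grid. The GAN-style architecture and losses of the paper are simplified away; all names and the linear stand-in for the victim ViT are illustrative.

```python
# Generator-produced patch pasted at random positions, optimized for a targeted attack.
import torch
import torch.nn as nn

def paste_at_random_position(images, patch):
    b, _, h, w = images.shape
    ph, pw = patch.shape[-2:]
    out = images.clone()
    for i in range(b):
        top = torch.randint(0, h - ph + 1, (1,)).item()
        left = torch.randint(0, w - pw + 1, (1,)).item()
        out[i, :, top:top + ph, left:left + pw] = patch   # gradient flows back to patch
    return out

if __name__ == "__main__":
    victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # stand-in for a frozen ViT
    generator = nn.Sequential(nn.Linear(32, 3 * 16 * 16), nn.Sigmoid())
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    target_class = torch.full((8,), 3, dtype=torch.long)
    for _ in range(3):
        z = torch.randn(1, 32)
        patch = generator(z).view(3, 16, 16)
        patched = paste_at_random_position(torch.rand(8, 3, 64, 64), patch)
        loss = nn.functional.cross_entropy(victim(patched), target_class)  # targeted objective
        opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))
```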

Combining transmission speckle photography and convolutional neural network for determination of fat content in cow’s milk – an exercise in classification of parameters of a complex suspension

  • paper_url: http://arxiv.org/abs/2307.15069
  • repo_url: None
  • paper_authors: Kwasi Nyandey, Daniel Jakubczyk
  • for: Direct classification and recognition of milk fat content classes.
  • methods: Combining transmission speckle photography with machine learning: a convolutional neural network is trained to classify laser speckle patterns from different fat content classes.
  • results: Test and independent classification accuracies of 100% and ~99%, respectively.
    Abstract We have combined transmission speckle photography and machine learning for direct classification and recognition of milk fat content classes. Our aim was hinged on the fact that parameters of scattering particles (and the dispersion medium) can be linked to the intensity distribution (speckle) observed when coherent light is transmitted through a scattering medium. For milk, it is primarily the size distribution and concentration of fat globules, which constitutes the total fat content. Consequently, we trained convolutional neural network to recognise and classify laser speckle from different fat content classes (0.5, 1.5, 2.0 and 3.2%). We investigated four exposure-time protocols and obtained the highest performance for shorter exposure times, in which the intensity histograms are kept similar for all images and the most probable intensity in the speckle pattern is close to zero. Our neural network was able to recognize the milk fat content classes unambiguously and we obtained the highest test and independent classification accuracies of 100 and ~99% respectively. It indicates that the parameters of other complex realistic suspensions could be classified with similar methods.

Deep Unsupervised Learning Using Spike-Timing-Dependent Plasticity

  • paper_url: http://arxiv.org/abs/2307.04054
  • repo_url: None
  • paper_authors: Sen Lu, Abhronil Sengupta
  • for: Scaling STDP-based local learning to deeper networks and larger-scale tasks.
  • methods: A Deep-STDP framework in which a convolutional network is trained in tandem with pseudo-labels generated by the STDP clustering process on the network outputs.
  • results: Compared with a $k$-means clustering approach, the method achieves $24.56\%$ higher accuracy and $3.5\times$ faster convergence at iso-accuracy on a 10-class subset of the Tiny ImageNet dataset.
    Abstract Spike-Timing-Dependent Plasticity (STDP) is an unsupervised learning mechanism for Spiking Neural Networks (SNNs) that has received significant attention from the neuromorphic hardware community. However, scaling such local learning techniques to deeper networks and large-scale tasks has remained elusive. In this work, we investigate a Deep-STDP framework where a convolutional network is trained in tandem with pseudo-labels generated by the STDP clustering process on the network outputs. We achieve $24.56\%$ higher accuracy and $3.5\times$ faster convergence speed at iso-accuracy on a 10-class subset of the Tiny ImageNet dataset in contrast to a $k$-means clustering approach.

Calibration-Aware Margin Loss: Pushing the Accuracy-Calibration Consistency Pareto Frontier for Deep Metric Learning

  • paper_url: http://arxiv.org/abs/2307.04047
  • repo_url: None
  • paper_authors: Qin Zhang, Linghan Xu, Qingming Tang, Jun Fang, Ying Nian Wu, Joe Tighe, Yifan Xing
  • for: Proposing a new way to evaluate the accuracy-calibration consistency of deep metric learning models, so that a single distance threshold can be deployed across different test classes and distributions.
  • methods: A new metric, the Operating-Point-Inconsistency-Score (OPIS), measures the variance in operating characteristics across classes within a target calibration range, and a new regularization, the Calibration-Aware Margin (CAM) loss, encourages uniform representation structures across classes during training.
  • results: The CAM regularization improves calibration consistency while retaining or even enhancing accuracy, outperforming state-of-the-art deep metric learning methods.
    Abstract The ability to use the same distance threshold across different test classes / distributions is highly desired for a frictionless deployment of commercial image retrieval systems. However, state-of-the-art deep metric learning losses often result in highly varied intra-class and inter-class embedding structures, making threshold calibration a non-trivial process in practice. In this paper, we propose a novel metric named Operating-Point-Inconsistency-Score (OPIS) that measures the variance in the operating characteristics across different classes in a target calibration range, and demonstrate that high accuracy of a metric learning embedding model does not guarantee calibration consistency for both seen and unseen classes. We find that, in the high-accuracy regime, there exists a Pareto frontier where accuracy improvement comes at the cost of calibration consistency. To address this, we develop a novel regularization, named Calibration-Aware Margin (CAM) loss, to encourage uniformity in the representation structures across classes during training. Extensive experiments demonstrate CAM's effectiveness in improving calibration-consistency while retaining or even enhancing accuracy, outperforming state-of-the-art deep metric learning methods.

High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition

  • paper_url: http://arxiv.org/abs/2307.05541
  • repo_url: https://github.com/tyluann/freqhand
  • paper_authors: Tianyu Luan, Yuanhao Zhai, Jingjing Meng, Zhong Li, Zhang Chen, Yi Xu, Junsong Yuan
  • for: High-fidelity 3D hand shape reconstruction that preserves personalized details.
  • methods: A frequency split network generates the 3D hand mesh in a coarse-to-fine manner using different frequency bands, and a novel frequency decomposition loss supervises each frequency component of the mesh transformed into the frequency domain.
  • results: High-frequency personalized details are preserved; the network is scalable and can stop inference at any resolution level, and a new Mean Signal-to-Noise Ratio (MSNR) metric is introduced to measure the signal-to-noise ratio of each mesh frequency component.
    Abstract Despite the impressive performance obtained by recent single-image hand modeling techniques, they lack the capability to capture sufficient details of the 3D hand mesh. This deficiency greatly limits their applications when high-fidelity hand modeling is required, e.g., personalized hand modeling. To address this problem, we design a frequency split network to generate 3D hand mesh using different frequency bands in a coarse-to-fine manner. To capture high-frequency personalized details, we transform the 3D mesh into the frequency domain, and propose a novel frequency decomposition loss to supervise each frequency component. By leveraging such a coarse-to-fine scheme, hand details that correspond to the higher frequency domain can be preserved. In addition, the proposed network is scalable, and can stop the inference at any resolution level to accommodate different hardware with varying computational powers. To quantitatively evaluate the performance of our method in terms of recovering personalized shape details, we introduce a new evaluation metric named Mean Signal-to-Noise Ratio (MSNR) to measure the signal-to-noise ratio of each mesh frequency component. Extensive experiments demonstrate that our approach generates fine-grained details for high-fidelity 3D hand reconstruction, and our evaluation metric is more effective for measuring mesh details compared with traditional metrics.

  • paper_url: http://arxiv.org/abs/2307.04014
  • repo_url: None
  • paper_authors: Amirhossein Askari-Farsangi, Ali Sharifi-Zarchi, Mohammad Hossein Rohban
  • for: Providing a deep-learning-based diagnosis method for Acute Lymphoblastic Leukemia (ALL) that avoids the shortcut learning caused by small medical training datasets.
  • methods: A pipeline inspired by hematologists' workflow that uses multiple blood smear images per patient, reformulating diagnosis as a multiple-instance learning problem.
  • results: The model achieves 96.15% accuracy, a 94.24% F1-score, 97.56% sensitivity, and 90.91% specificity on ALL IDB 1, and shows acceptable performance on a challenging out-of-distribution dataset.
    Abstract Acute Lymphoblastic Leukemia (ALL) is one of the most common types of childhood blood cancer. The quick start of the treatment process is critical to saving the patient's life, and for this reason, early diagnosis of this disease is essential. Examining the blood smear images of these patients is one of the methods used by expert doctors to diagnose this disease. Deep learning-based methods have numerous applications in medical fields, as they have significantly advanced in recent years. ALL diagnosis is not an exception in this field, and several machine learning-based methods for this problem have been proposed. In previous methods, high diagnostic accuracy was reported, but our work showed that this alone is not sufficient, as it can lead to models taking shortcuts and not making meaningful decisions. This issue arises due to the small size of medical training datasets. To address this, we constrained our model to follow a pipeline inspired by experts' work. We also demonstrated that, since a judgement based on only one image is insufficient, redefining the problem as a multiple-instance learning problem is necessary for achieving a practical result. Our model is the first to provide a solution to this problem in a multiple-instance learning setup. We introduced a novel pipeline for diagnosing ALL that approximates the process used by hematologists, is sensitive to disease biomarkers, and achieves an accuracy of 96.15%, an F1-score of 94.24%, a sensitivity of 97.56%, and a specificity of 90.91% on ALL IDB 1. Our method was further evaluated on an out-of-distribution dataset, which posed a challenging test and had acceptable performance. Notably, our model was trained on a relatively small dataset, highlighting the potential for our approach to be applied to other medical datasets with limited data availability.
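The sketch below illustrates the multiple-instance learning formulation mentioned above: several blood smear images from one patient form a "bag", per-image embeddings are aggregated with a learned attention pooling, and a single bag-level prediction is made. This is the generic attention-MIL operator, not necessarily the paper's exact pipeline; embedding size and class count are assumptions.

```python
# Attention-based multiple-instance learning pooling over per-image embeddings.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(in_dim, n_classes)

    def forward(self, instance_feats):
        # instance_feats: (num_images_in_bag, in_dim), one patient per bag
        weights = torch.softmax(self.attn(instance_feats), dim=0)   # (N, 1)
        bag_feat = (weights * instance_feats).sum(dim=0)            # weighted average
        return self.head(bag_feat), weights.squeeze(-1)

if __name__ == "__main__":
    bag = torch.randn(12, 512)               # 12 smear-image embeddings for one patient
    logits, attn = AttentionMIL()(bag)
    print(logits.shape, attn.shape)
```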

BPNet: Bézier Primitive Segmentation on 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2307.04013
  • repo_url: https://github.com/bizerfr/bpnet
  • paper_authors: Rao Fu, Cheng Wen, Qian Li, Xiao Xiao, Pierre Alliez
  • for: BPNet, an end-to-end deep learning framework for B\'ezier primitive segmentation on 3D point clouds; existing works treat different primitive types separately and thus handle only a limited range of shapes, so a generalized primitive segmentation is sought.
  • methods: B\'ezier decomposition, inspired by its use on NURBS models, guides point cloud segmentation independently of primitive type; a joint optimization framework learns B\'ezier primitive segmentation and geometric fitting simultaneously on a cascaded architecture, with a soft voting regularizer to improve primitive segmentation and an auto-weight embedding module to cluster point features, making the network more robust and generic.
  • results: Extensive experiments on the synthetic ABC dataset and real-scan datasets show superior segmentation performance over baseline methods with a substantially faster inference speed; a reconstruction module processes multiple CAD models with different primitives simultaneously.
    Abstract This paper proposes BPNet, a novel end-to-end deep learning framework to learn B\'ezier primitive segmentation on 3D point clouds. The existing works treat different primitive types separately, thus limiting them to finite shape categories. To address this issue, we seek a generalized primitive segmentation on point clouds. Taking inspiration from B\'ezier decomposition on NURBS models, we transfer it to guide point cloud segmentation casting off primitive types. A joint optimization framework is proposed to learn B\'ezier primitive segmentation and geometric fitting simultaneously on a cascaded architecture. Specifically, we introduce a soft voting regularizer to improve primitive segmentation and propose an auto-weight embedding module to cluster point features, making the network more robust and generic. We also introduce a reconstruction module where we successfully process multiple CAD models with different primitives simultaneously. We conducted extensive experiments on the synthetic ABC dataset and real-scan datasets to validate and compare our approach with different baseline methods. Experiments show superior performance over previous work in terms of segmentation, with a substantially faster inference speed.