cs.CV - 2023-07-31

From Generation to Suppression: Towards Effective Irregular Glow Removal for Nighttime Visibility Enhancement

  • paper_url: http://arxiv.org/abs/2307.16783
  • repo_url: None
  • paper_authors: Wanyu Wu, Wei Wang, Zheng Wang, Kui Jiang, Xin Xu
  • for: Enhancing brightness and visibility in nighttime images while removing the glow effects caused by scattering from artificial light sources.
  • methods: Models physical glow generation via multiple scattering estimation based on the Atmospheric Point Spread Function (APSF), enabling glow suppression under uneven light intensities and varying source shapes.
  • results: Proposes a scalable Light-aware Blind Deconvolution Network (LBDN) for glow suppression, followed by a Retinex-based Enhancement Module for brightening, achieving strong results on both glow suppression and low-light enhancement without paired or unpaired training data.
    Abstract Most existing Low-Light Image Enhancement (LLIE) methods are primarily designed to improve brightness in dark regions, which suffer from severe degradation in nighttime images. However, these methods have limited exploration in another major visibility damage, the glow effects in real night scenes. Glow effects are inevitable in the presence of artificial light sources and cause further diffused blurring when directly enhanced. To settle this issue, we innovatively consider the glow suppression task as learning physical glow generation via multiple scattering estimation according to the Atmospheric Point Spread Function (APSF). In response to the challenges posed by uneven glow intensity and varying source shapes, an APSF-based Nighttime Imaging Model with Near-field Light Sources (NIM-NLS) is specifically derived to design a scalable Light-aware Blind Deconvolution Network (LBDN). The glow-suppressed result is then brightened via a Retinex-based Enhancement Module (REM). Remarkably, the proposed glow suppression method is based on zero-shot learning and does not rely on any paired or unpaired training data. Empirical evaluations demonstrate the effectiveness of the proposed method in both glow suppression and low-light enhancement tasks.

Lightweight Super-Resolution Head for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.16765
  • repo_url: https://github.com/haonanwang0522/srpose
  • paper_authors: Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu
  • for: Addressing the quantization errors of heatmap-based methods to improve the performance of heatmap-based pose estimation.
  • methods: Proposes an SR head that predicts heatmaps at a resolution higher than the input feature maps, reducing quantization error and the need for further post-processing; SRPose applies SR heads at each stage to gradually recover high-resolution heatmaps and to supervise intermediate features.
  • results: Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based approaches.
    Abstract Heatmap-based methods have become the mainstream method for pose estimation due to their superior performance. However, heatmap-based approaches suffer from significant quantization errors with downscale heatmaps, which result in limited performance and the detrimental effects of intermediate supervision. Previous heatmap-based methods relied heavily on additional post-processing to mitigate quantization errors. Some heatmap-based approaches improve the resolution of feature maps by using multiple costly upsampling layers to improve localization precision. To solve the above issues, we creatively view the backbone network as a degradation process and thus reformulate the heatmap prediction as a Super-Resolution (SR) task. We first propose the SR head, which predicts heatmaps with a spatial resolution higher than the input feature maps (or even consistent with the input image) by super-resolution, to effectively reduce the quantization error and the dependence on further post-processing. Besides, we propose SRPose to gradually recover the HR heatmaps from LR heatmaps and degraded features in a coarse-to-fine manner. To reduce the training difficulty of HR heatmaps, SRPose applies SR heads to supervise the intermediate features in each stage. In addition, the SR head is a lightweight and generic head that applies to top-down and bottom-up methods. Extensive experiments on the COCO, MPII, and CrowdPose datasets show that SRPose outperforms the corresponding heatmap-based approaches. The code and models are available at https://github.com/haonanwang0522/SRPose.
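
A minimal sketch of what such a super-resolution head could look like in PyTorch (the channel count, keypoint count, and upscaling factor are illustrative assumptions, not SRPose's exact configuration):

```python
import torch
import torch.nn as nn

class SRHead(nn.Module):
    """Predict keypoint heatmaps at `scale`x the spatial resolution of the input
    feature map via sub-pixel (PixelShuffle) upsampling, reducing quantization error."""
    def __init__(self, in_channels: int, num_keypoints: int, scale: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, num_keypoints * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # (B, K*s*s, H, W) -> (B, K, s*H, s*W)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats)

# Example: 17 COCO keypoints predicted at 4x the backbone feature resolution.
heatmaps = SRHead(in_channels=256, num_keypoints=17, scale=4)(torch.randn(1, 256, 64, 48))
print(heatmaps.shape)  # torch.Size([1, 17, 256, 192])
```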

High-Performance Fine Defect Detection in Artificial Leather Using Dual Feature Pool Object Detection

  • paper_url: http://arxiv.org/abs/2307.16751
  • repo_url: None
  • paper_authors: Lin Huang, Weisheng Li, Linlin Shen, Xue Xiao, Suihan Xiao
  • for: Fine defect detection, in particular detecting fine defects in artificial leather.
  • methods: Proposes four new structures, namely DFP, IFF, AMP, and EOS, to address structural problems of the YOLOv5 model.
  • results: The resulting YOLOD model performs strongly on the artificial leather defect dataset, improving AP_50 by 11.7% - 13.5% over YOLOv5 while reducing the error detection rate by 5.2% - 7.2%. On the general MS-COCO dataset, YOLOD improves AP by 0.4% - 2.6% and AP_S by 2.5% - 4.1% over YOLOv5, demonstrating its effectiveness and reliability for both artificial leather defect detection and general object detection in real-world applications.
    Abstract In this study, the structural problems of the YOLOv5 model were analyzed emphatically. Based on the characteristics of fine defects in artificial leather, four innovative structures, namely DFP, IFF, AMP, and EOS, were designed. These advancements led to the proposal of a high-performance artificial leather fine defect detection model named YOLOD. YOLOD demonstrated outstanding performance on the artificial leather defect dataset, achieving an impressive increase of 11.7% - 13.5% in AP_50 compared to YOLOv5, along with a significant reduction of 5.2% - 7.2% in the error detection rate. Moreover, YOLOD also exhibited remarkable performance on the general MS-COCO dataset, with an increase of 0.4% - 2.6% in AP compared to YOLOv5, and a rise of 2.5% - 4.1% in AP_S compared to YOLOv5. These results demonstrate the superiority of YOLOD in both artificial leather defect detection and general object detection tasks, making it a highly efficient and effective model for real-world applications.

Multi-Spectral Image Stitching via Spatial Graph Reasoning

  • paper_url: http://arxiv.org/abs/2307.16741
  • repo_url: https://github.com/Jzy2017/SGR-MSIS
  • paper_authors: Zhiying Jiang, Zengxi Zhang, Jinyuan Liu, Xin Fan, Risheng Liu
  • for: Proposing a spatial graph reasoning based multi-spectral image stitching method, built on Graph Convolutional Networks (GCNs), to generate robust and reliable wide field-of-view scenes from multi-view inputs.
  • methods: Embeds multi-scale complementary features from the same view position into graph nodes, models their relationships with GCNs, and learns cross-view correspondence through dense feature embeddings with both inter- and intra-correlations.
  • results: Experiments show that the method effectively handles the deformation and integration of multi-spectral images across viewpoints and surpasses existing approaches; a challenging dataset named ChaMS, comprising both real-world and synthetic sets with significant parallax, is also released for comprehensive evaluation.
    Abstract Multi-spectral image stitching leverages the complementarity between infrared and visible images to generate a robust and reliable wide field-of-view (FOV) scene. The primary challenge of this task is to explore the relations between multi-spectral images for aligning and integrating multi-view scenes. Capitalizing on the strengths of Graph Convolutional Networks (GCNs) in modeling feature relationships, we propose a spatial graph reasoning based multi-spectral image stitching method that effectively distills the deformation and integration of multi-spectral images across different viewpoints. To accomplish this, we embed multi-scale complementary features from the same view position into a set of nodes. The correspondence across different views is learned through powerful dense feature embeddings, where both inter- and intra-correlations are developed to exploit cross-view matching and enhance inner feature disparity. By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features, generating informative and reliable wide FOV scenes. Moreover, we release a challenging dataset named ChaMS, comprising both real-world and synthetic sets with significant parallax, providing a new option for comprehensive evaluation. Extensive experiments demonstrate that our method surpasses the state-of-the-arts.

UniVTG: Towards Unified Video-Language Temporal Grounding

  • paper_url: http://arxiv.org/abs/2307.16715
  • repo_url: https://github.com/showlab/univtg
  • paper_authors: Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou
  • for: Video browsing on social media, specifically grounding target clips in videos according to custom language queries (e.g., sentences or words).
  • methods: Unifies diverse Video Temporal Grounding (VTG) labels and tasks along three directions: reformulating the various labels and tasks into a unified form, developing an effective and flexible grounding model that can address each task and make use of each label, and leveraging the unified framework for temporal grounding pretraining on large-scale diverse labels to obtain stronger grounding abilities such as zero-shot grounding.
  • results: Experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of the proposed framework.
    Abstract Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG.
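
For intuition, a unified per-clip label along the lines the abstract describes might be represented as below; the field names and toy values are illustrative assumptions, not UniVTG's exact definition:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ClipLabel:
    """One label per video clip: moment retrieval can supervise the foreground flag and
    the boundary offsets, while highlight detection can supervise the saliency score."""
    foreground: bool               # does this clip lie inside the queried moment?
    offsets: Tuple[float, float]   # distances (seconds) from the clip to the moment's start/end
    saliency: float                # relevance of the clip to the language query in [0, 1]

# Toy labels for a 3-clip video and the query "the dog jumps over the fence".
labels = [
    ClipLabel(foreground=False, offsets=(0.0, 0.0), saliency=0.05),
    ClipLabel(foreground=True,  offsets=(0.5, 2.5), saliency=0.90),
    ClipLabel(foreground=True,  offsets=(2.5, 0.5), saliency=0.70),
]
```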

Investigating and Improving Latent Density Segmentation Models for Aleatoric Uncertainty Quantification in Medical Imaging

  • paper_url: http://arxiv.org/abs/2307.16694
  • repo_url: None
  • paper_authors: M. M. Amaan Valiuddin, Christiaan G. A. Viviers, Ruud J. G. van Sloun, Peter H. N. de With, Fons van der Sommen
  • for: Addressing aleatoric (data) uncertainty in image segmentation with latent density models.
  • methods: Builds on the Probabilistic U-Net (PU-Net), which uses latent Normal densities to optimize the conditional data log-likelihood Evidence Lower Bound.
  • results: The PU-Net latent space is found to be severely inhomogeneous, which inhibits gradient descent and makes the model extremely sensitive to the localization of latent-space samples, leading to defective predictions. The proposed Sinkhorn PU-Net (SPU-Net) uses the Sinkhorn Divergence to promote homogeneity across all latent dimensions, improving gradient-descent updates and model robustness; on public clinical datasets it gains up to 11% on the Hungarian-Matched metric over preceding latent variable models for probabilistic segmentation.
    Abstract Data uncertainties, such as sensor noise or occlusions, can introduce irreducible ambiguities in images, which result in varying, yet plausible, semantic hypotheses. In Machine Learning, this ambiguity is commonly referred to as aleatoric uncertainty. Latent density models can be utilized to address this problem in image segmentation. The most popular approach is the Probabilistic U-Net (PU-Net), which uses latent Normal densities to optimize the conditional data log-likelihood Evidence Lower Bound. In this work, we demonstrate that the PU-Net latent space is severely inhomogeneous. As a result, the effectiveness of gradient descent is inhibited and the model becomes extremely sensitive to the localization of the latent space samples, resulting in defective predictions. To address this, we present the Sinkhorn PU-Net (SPU-Net), which uses the Sinkhorn Divergence to promote homogeneity across all latent dimensions, effectively improving gradient-descent updates and model robustness. Our results show that by applying this on public datasets of various clinical segmentation problems, the SPU-Net receives up to 11% performance gains compared against preceding latent variable models for probabilistic segmentation on the Hungarian-Matched metric. The results indicate that by encouraging a homogeneous latent space, one can significantly improve latent density modeling for medical image segmentation.
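
As a rough illustration of the kind of regularizer involved, here is a small log-domain Sinkhorn divergence between two point clouds in PyTorch (a generic textbook formulation; the epsilon, iteration count, and how SPU-Net actually applies it to the latent space are assumptions, not taken from the paper):

```python
import math
import torch

def entropic_ot(x: torch.Tensor, y: torch.Tensor, eps: float = 0.1, iters: int = 100) -> torch.Tensor:
    """Entropy-regularized OT cost between uniform point clouds x (n, d) and y (m, d),
    computed with log-domain Sinkhorn iterations (returns the transport cost <P, C>)."""
    n, m = x.shape[0], y.shape[0]
    cost = torch.cdist(x, y) ** 2                      # squared Euclidean ground cost
    log_a, log_b = -math.log(n), -math.log(m)          # uniform marginal weights
    f = x.new_zeros(n)
    g = x.new_zeros(m)
    for _ in range(iters):                             # dual potential updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b, dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a, dim=0)
    plan = torch.exp((f[:, None] + g[None, :] - cost) / eps + log_a + log_b)
    return (plan * cost).sum()

def sinkhorn_divergence(x: torch.Tensor, y: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Debiased Sinkhorn divergence between the two point clouds."""
    return entropic_ot(x, y, eps) - 0.5 * entropic_ot(x, x, eps) - 0.5 * entropic_ot(y, y, eps)

# e.g. pull a batch of latent samples toward a standard Normal reference:
z_post = torch.randn(128, 6) * 3.0 + 1.0
z_ref = torch.randn(128, 6)
print(sinkhorn_divergence(z_post, z_ref).item())
```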

DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.16687
  • repo_url: None
  • paper_authors: Runyang Feng, Yixing Gao, Tze Ho Elden Tse, Xueqing Ma, Hyung Jin Chang
  • for: Multi-frame human pose estimation, extending diffusion probabilistic models to improve accuracy.
  • methods: Proposes DiffPose, a diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem; a SpatioTemporal Representation Learner aggregates visual evidence across frames, and a Lookup-based MultiScale Feature Interaction mechanism captures correlations between local joints and global contexts across multiple scales.
  • results: Sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21. DiffPose can also combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and can adjust the number of iterative refinement steps without retraining the model.
    Abstract Denoising diffusion probabilistic models that were initially proposed for realistic image generation have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are increasingly gaining attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial due to the presence of the additional temporal dimension in videos. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints. Nevertheless, the adaptation of the diffusion-based methods remains unclear on how to achieve such objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose SpatioTemporal Representation Learner which aggregates visual evidences across frames and uses the resulting features in each denoising step as a condition. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction that determines the correlations between local joints and global contexts across multiple scales. This mechanism generates delicate representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.

Guiding Image Captioning Models Toward More Specific Captions

  • paper_url: http://arxiv.org/abs/2307.16686
  • repo_url: None
  • paper_authors: Simon Kornblith, Lala Li, Zirui Wang, Thao Nguyen
  • for: Generating more specific and accurate image captions, evaluated without relying on reference captions.
  • methods: Applies classifier-free guidance to an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions.
  • results: Decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore and caption-to-image retrieval, at the cost of worse scores on standard reference-based captioning metrics.
    Abstract Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing $p(\mathrm{caption}|\mathrm{image})$ and $p(\mathrm{image}|\mathrm{caption})$. Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption$\to$image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the quality of captions generated from a model trained only on minimally curated web data.
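
A sketch of one classifier-free-guided decoding step for an autoregressive captioner; the model interface, the use of a "null" image embedding for the unconditional branch, and greedy selection are assumptions for illustration:

```python
import torch

@torch.no_grad()
def cfg_next_token(model, image_feats, null_feats, prefix_tokens, guidance_scale: float = 2.0):
    """Combine conditional and unconditional next-token logits with a guidance scale;
    scale = 1 recovers standard decoding, while larger scales trade reference-based
    metrics for more image-specific captions (as reported above)."""
    logits_cond = model(image_feats, prefix_tokens)[:, -1, :]    # ~ log p(token | image, prefix)
    logits_uncond = model(null_feats, prefix_tokens)[:, -1, :]   # ~ log p(token | prefix)
    logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
    return logits.argmax(dim=-1, keepdim=True)                   # greedy choice of the next token
```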

Conditioning Generative Latent Optimization to solve Imaging Inverse Problems

  • paper_url: http://arxiv.org/abs/2307.16670
  • repo_url: None
  • paper_authors: Thomas Braure, Kévin Ginsburger
  • for: Solving imaging inverse problems (IIPs) such as CT reconstruction from sparse X-ray projections, where data-driven methods must cope with degraded measurement setups.
  • methods: Proposes an unsupervised conditional Generative Latent Optimization framework (cGLO): a decoder is initialized on an unsupervised dataset, and reconstruction is performed by optimizing latent codes with a loss that compares simulated measurements of the proposed reconstruction to the experimental measurements; no backward operator is required, and the approach remains flexible to the imaging setup at test time.
  • results: On sparse-view CT with multiple training dataset sizes, cGLO achieves better reconstruction quality than state-of-the-art score-based strategies in most data regimes, with an increasing advantage for smaller training sets and fewer projection angles; since no backward operator is needed, it could extend to non-linear IIPs.
    Abstract Computed Tomography (CT) is a prominent example of Imaging Inverse Problem (IIP), highlighting the unrivalled performances of data-driven methods in degraded measurements setups like sparse X-ray projections. Although a significant proportion of deep learning approaches benefit from large supervised datasets to directly map experimental measurements to medical scans, they cannot generalize to unknown acquisition setups. In contrast, fully unsupervised techniques, most notably using score-based generative models, have recently demonstrated similar or better performances compared to supervised approaches to solve IIPs while being flexible at test time regarding the imaging setup. However, their use cases are limited by two factors: (a) they need considerable amounts of training data to have good generalization properties and (b) they require a backward operator, like Filtered-Back-Projection in the case of CT, to condition the learned prior distribution of medical scans to experimental measurements. To overcome these issues, we propose an unsupervised conditional approach to the Generative Latent Optimization framework (cGLO), in which the parameters of a decoder network are initialized on an unsupervised dataset. The decoder is then used for reconstruction purposes, by performing Generative Latent Optimization with a loss function directly comparing simulated measurements from proposed reconstructions to experimental measurements. The resulting approach, tested on sparse-view CT using multiple training dataset sizes, demonstrates better reconstruction quality compared to state-of-the-art score-based strategies in most data regimes and shows an increasing performance advantage for smaller training datasets and reduced projection angles. Furthermore, cGLO does not require any backward operator and could expand use cases even to non-linear IIPs.
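
A minimal sketch of the conditional GLO-style reconstruction loop described above, assuming a pretrained `decoder` and a differentiable `forward_op` (e.g., a sparse-view projector); the names, latent size, and optimizer settings are illustrative:

```python
import torch

def cglo_reconstruct(decoder, forward_op, y_meas, latent_dim: int = 64,
                     steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """Optimize a latent code (and, optionally, the decoder) so that simulated
    measurements of the decoded image match the experimental measurements y_meas."""
    z = torch.zeros(1, latent_dim, requires_grad=True)
    params = [z] + list(decoder.parameters())      # drop the decoder params to keep it frozen
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z)                                        # candidate scan
        loss = torch.nn.functional.mse_loss(forward_op(recon), y_meas)
        loss.backward()
        opt.step()
    return decoder(z).detach()
```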

Domain Adaptation for Medical Image Segmentation using Transformation-Invariant Self-Training

  • paper_url: http://arxiv.org/abs/2307.16660
  • repo_url: https://github.com/negin-ghamsarian/transformation-invariant-self-training-miccai23
  • paper_authors: Negin Ghamsarian, Javier Gamazo Tejero, Pablo Márquez Neila, Sebastian Wolf, Martin Zinkernagel, Klaus Schoeffmann, Raphael Sznitman
  • for: Domain adaptation for medical image segmentation across different imaging devices and configurations, using pseudo-labeling-based self-training to learn from unlabeled target data.
  • methods: Uses pseudo-labeling while assessing the reliability of pixel-wise pseudo labels and filtering out unreliable detections during self-training based on transformation invariance.
  • results: Experiments show that the proposed transformation-invariant self-training (TI-ST) method mitigates the lack of target-domain annotation and boosts segmentation performance in the target domain.
    Abstract Models capable of leveraging unlabelled data are crucial in overcoming large distribution gaps between the acquired datasets across different imaging devices and configurations. In this regard, self-training techniques based on pseudo-labeling have been shown to be highly effective for semi-supervised domain adaptation. However, the unreliability of pseudo labels can hinder the capability of self-training techniques to induce abstract representation from the unlabeled target dataset, especially in the case of large distribution gaps. Since the neural network performance should be invariant to image transformations, we look to this fact to identify uncertain pseudo labels. Indeed, we argue that transformation invariant detections can provide more reasonable approximations of ground truth. Accordingly, we propose a semi-supervised learning strategy for domain adaptation termed transformation-invariant self-training (TI-ST). The proposed method assesses pixel-wise pseudo-labels' reliability and filters out unreliable detections during self-training. We perform comprehensive evaluations for domain adaptation using three different modalities of medical images, two different network architectures, and several alternative state-of-the-art domain adaptation methods. Experimental results confirm the superiority of our proposed method in mitigating the lack of target domain annotation and boosting segmentation performance in the target domain.
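
The core filtering idea can be sketched as below: keep a pixel's pseudo-label only if the prediction is confident and agrees with the prediction on a transformed (here horizontally flipped) copy of the image. The transformation choice, threshold, and ignore index are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def transformation_invariant_pseudo_labels(model, image: torch.Tensor,
                                           thresh: float = 0.9, ignore_index: int = 255):
    """Return pixel-wise pseudo-labels for `image`, masking out pixels whose class
    prediction changes under a horizontal flip or whose confidence is below `thresh`."""
    probs = F.softmax(model(image), dim=1)                                # (B, C, H, W)
    flipped = F.softmax(model(torch.flip(image, dims=[-1])), dim=1)
    flipped = torch.flip(flipped, dims=[-1])                              # map back to the original frame
    conf, label = probs.max(dim=1)
    conf_t, label_t = flipped.max(dim=1)
    reliable = (label == label_t) & (conf > thresh) & (conf_t > thresh)
    label = label.clone()
    label[~reliable] = ignore_index                                       # skipped by the segmentation loss
    return label
```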

CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification

  • paper_url: http://arxiv.org/abs/2307.16634
  • repo_url: None
  • paper_authors: Rabab Abdelfattah, Qing Guo, Xiaoguang Li, Xiaofeng Wang, Song Wang
  • for: Developing an unsupervised learning method for annotation-free multi-label image classification.
  • methods: A three-stage pipeline (initialization, training, inference): at initialization, CLIP is extended to multi-label prediction via global-local image-text similarity aggregation; at training, an optimization framework trains the classification network and refines pseudo labels for unobserved labels; at inference, only the classification network is used to predict the labels of the input image.
  • results: Extensive experiments on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS show state-of-the-art performance among unsupervised methods, comparable even to weakly supervised classification methods.
    Abstract This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation. To be more specific, we split each image into snippets and leverage CLIP to generate the similarity vector for the whole image (global) as well as each snippet (local). Then a similarity aggregator is introduced to leverage the global and local similarity vectors. Using the aggregated similarity scores as the initial pseudo labels at the training stage, we propose an optimization framework to train the parameters of the classification network and refine pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets and even achieves comparable results to weakly supervised classification methods.
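
A rough sketch of the global-local similarity aggregation from the initialization stage, written against the OpenAI CLIP interface (`encode_image`); the snippet cropping, max-pooling over snippets, and the mixing weight `alpha` are simplifying assumptions rather than the paper's exact aggregator:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def global_local_pseudo_labels(clip_model, image: torch.Tensor, snippets: torch.Tensor,
                               text_feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """image: (1, 3, H, W); snippets: (S, 3, h, w) crops of the same image;
    text_feats: (C, D) normalized CLIP text embeddings, one per candidate label."""
    g = F.normalize(clip_model.encode_image(image), dim=-1)            # (1, D) global embedding
    l = F.normalize(clip_model.encode_image(snippets), dim=-1)         # (S, D) local embeddings
    sim_global = g @ text_feats.t()                                    # (1, C)
    sim_local = (l @ text_feats.t()).max(dim=0, keepdim=True).values   # best-matching snippet per label
    return alpha * sim_global + (1.0 - alpha) * sim_local              # initial multi-label pseudo scores
```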

Can Self-Supervised Representation Learning Methods Withstand Distribution Shifts and Corruptions?

  • paper_url: http://arxiv.org/abs/2308.02525
  • repo_url: https://github.com/prakashchhipa/robsutness-evaluation-of-self-supervised-methods-distribution-shifts-and-corruptions
  • paper_authors: Prakash Chandra Chhipa, Johan Rodahl Holmgren, Kanjar De, Rajkumar Saini, Marcus Liwicki
  • for: Investigating the robustness and reliability of self-supervised representation learning in computer vision under distribution shifts and image corruptions.
  • methods: Evaluates representations learned by several self-supervised paradigms, including contrastive learning, knowledge distillation, mutual information maximization, and clustering, under varying types and severities of distribution shift and corruption.
  • results: The performance of the learned representations is clearly tied to the severity of the shift or corruption, with higher severities significantly diminishing robustness, highlighting the need for effective strategies to mitigate these effects in practical applications.
    Abstract Self-supervised learning in computer vision aims to leverage the inherent structure and relationships within data to learn meaningful representations without explicit human annotation, enabling a holistic understanding of visual scenes. Robustness in vision machine learning ensures reliable and consistent performance, enhancing generalization, adaptability, and resistance to noise, variations, and adversarial attacks. Self-supervised paradigms, namely contrastive learning, knowledge distillation, mutual information maximization, and clustering, have been considered to have shown advances in invariant learning representations. This work investigates the robustness of learned representations of self-supervised learning approaches focusing on distribution shifts and image corruptions in computer vision. Detailed experiments have been conducted to study the robustness of self-supervised learning methods on distribution shifts and image corruptions. The empirical analysis demonstrates a clear relationship between the performance of learned representations within self-supervised paradigms and the severity of distribution shifts and corruptions. Notably, higher levels of shifts and corruptions are found to significantly diminish the robustness of the learned representations. These findings highlight the critical impact of distribution shifts and image corruptions on the performance and resilience of self-supervised learning methods, emphasizing the need for effective strategies to mitigate their adverse effects. The study strongly advocates for future research in the field of self-supervised representation learning to prioritize the key aspects of safety and robustness in order to ensure practical applicability. The source code and results are available on GitHub.
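
The evaluation protocol amounts to sweeping corruption severity and measuring downstream accuracy of a frozen encoder; a simplified sketch (with plain Gaussian noise standing in for the benchmark corruptions, and a pre-trained linear `probe`) might look like this:

```python
import torch

@torch.no_grad()
def robustness_curve(encoder, probe, loader, severities=(0.0, 0.05, 0.1, 0.2)):
    """Probe accuracy of a frozen self-supervised encoder as corruption severity grows."""
    results = {}
    for sigma in severities:
        correct, total = 0, 0
        for images, targets in loader:
            corrupted = (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
            preds = probe(encoder(corrupted)).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
        results[sigma] = correct / total
    return results   # e.g. {0.0: 0.74, 0.05: 0.69, ...}; accuracy typically drops with severity
```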

Detecting diabetic retinopathy severity through fundus images using an ensemble of classifiers

  • paper_url: http://arxiv.org/abs/2307.16622
  • repo_url: None
  • paper_authors: Eduard Popescu, Adrian Groza, Ioana Damian
  • for: Diagnosing diabetic retinopathy from fundus images and determining its severity level.
  • methods: A pipeline of data preprocessing (adaptive equalization, color normalization, Gaussian filtering, removal of the optic disc and blood vessels), image segmentation of relevant markers, feature extraction, and an ensemble of classifiers with an assessment of trust in the system.
  • results: The proposed method achieves high accuracy in detecting diabetic retinopathy and its severity levels.
    Abstract Diabetic retinopathy is an ocular condition that affects individuals with diabetes mellitus. It is a common complication of diabetes that can impact the eyes and lead to vision loss. One method for diagnosing diabetic retinopathy is the examination of the fundus of the eye. An ophthalmologist examines the back part of the eye, including the retina, optic nerve, and the blood vessels that supply the retina. In the case of diabetic retinopathy, the blood vessels in the retina deteriorate and can lead to bleeding, swelling, and other changes that affect vision. We propose a method for detecting diabetic retinopathy severity levels. First, a set of data-preprocessing steps is applied to the available data: adaptive equalisation, color normalisation, Gaussian filter, removal of the optic disc and blood vessels. Second, we perform image segmentation for relevant markers and extract features from the fundus images. Third, we apply an ensemble of classifiers and we assess the trust in the system.
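
The first preprocessing steps can be sketched with OpenCV as below (kernel sizes and CLAHE parameters are illustrative; the optic-disc and vessel removal the paper also performs is omitted here):

```python
import cv2
import numpy as np

def preprocess_fundus(path: str) -> np.ndarray:
    """Adaptive (CLAHE) equalization on the luminance channel, per-channel color
    normalization, and Gaussian smoothing of a fundus photograph."""
    img = cv2.imread(path)                                   # BGR uint8
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    img = img.astype(np.float32)
    img = (img - img.mean(axis=(0, 1))) / (img.std(axis=(0, 1)) + 1e-6)  # color normalization
    return cv2.GaussianBlur(img, (5, 5), 0)                  # suppress high-frequency noise
```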

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

  • paper_url: http://arxiv.org/abs/2307.16620
  • repo_url: None
  • paper_authors: Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu
  • for: Addressing the dataset bias of existing audio-visual segmentation (AVS) methods, which tend to segment the most salient object in a video regardless of the audio.
  • methods: An audio-visual instance-aware segmentation approach: potential sounding objects are first localized by an object segmentation network, and the sounding-object candidates are then associated with the given audio; a silent-object-aware segmentation objective and an audio-visual semantic correlation mechanism handle objects that are silent in some videos and audio with unknown categories.
  • results: Experiments on the AVS benchmarks show that the method effectively segments sounding objects without being biased toward salient objects.
    Abstract The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to achieve sounding object masks. However, we observed that prior arts are prone to segment a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent one in another video. This would bring ambiguity in training our object segmentation network as only sounding objects have corresponding segmentation masks. We thus propose a silent object-aware segmentation objective to alleviate the ambiguity. Moreover, since the category information of audio is unknown, especially for multiple sounding sources, we propose to explore the audio-visual semantic correlation and then associate audio with potential objects. Specifically, we attend predicted audio category scores to potential instance masks and these scores will highlight corresponding sounding instances while suppressing inaudible ones. When we enforce the attended instance masks to resemble the ground-truth mask, we are able to establish audio-visual semantics correlation. Experimental results on the AVS benchmarks demonstrate that our method can effectively segment sounding objects without being biased to salient objects.
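
The association step can be pictured as weighting candidate instance masks by how well each instance's visual feature matches the audio embedding; a toy sketch (the feature extractors and the softmax weighting are assumptions, not the paper's exact module):

```python
import torch

def weight_instances_by_audio(instance_masks: torch.Tensor, instance_feats: torch.Tensor,
                              audio_feat: torch.Tensor) -> torch.Tensor:
    """instance_masks: (N, H, W) candidate masks; instance_feats: (N, D); audio_feat: (D,).
    Returns a soft sounding-object mask in which inaudible instances are suppressed."""
    scores = torch.softmax(instance_feats @ audio_feat, dim=0)        # one relevance score per instance
    return (scores[:, None, None] * instance_masks).sum(dim=0)        # (H, W)
```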

FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration

  • paper_url: http://arxiv.org/abs/2307.16617
  • repo_url: None
  • paper_authors: Zhijian Huang, Sihao Lin, Guiyu Liu, Mukun Luo, Chaoqiang Ye, Hang Xu, Xiaojun Chang, Xiaodan Liang
  • for: Improving the stability and computational efficiency of multi-modality multi-task learning for 3D autonomous driving perception by mitigating the notorious modality bias and task conflict.
  • methods: Proposes a multi-level gradient calibration framework that calibrates the gradients produced by the task heads at the backbone's last layer and re-calibrates their magnitudes to the same level before they propagate to the modality branches, so that downstream tasks pay balanced attention to different modalities.
  • results: Experiments on the large-scale nuScenes benchmark show an absolute 14.4% mIoU improvement on map segmentation and a 1.4% mAP improvement on 3D detection, advancing multi-modality fusion and multi-task learning for 3D autonomous driving.
    Abstract Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenario, considering robust prediction and computation budget. However, naively extending the existing framework to the domain of multi-modality multi-task learning remains ineffective and even poisonous due to the notorious modality bias and task conflict. Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optima. To mitigate the issue, we propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization. Specifically, the gradients, produced by the task heads and used to update the shared backbone, will be calibrated at the backbone's last layer to alleviate the task conflict. Before the calibrated gradients are further propagated to the modality branches of the backbone, their magnitudes will be calibrated again to the same level, ensuring the downstream tasks pay balanced attention to different modalities. Experiments on large-scale benchmark nuScenes demonstrate the effectiveness of the proposed method, eg, an absolute 14.4% mIoU improvement on map segmentation and 1.4% mAP improvement on 3D detection, advancing the application of 3D autonomous driving in the domain of multi-modality fusion and multi-task learning. We also discuss the links between modalities and tasks.
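
One simple way to picture magnitude-level gradient calibration on shared parameters is sketched below (rescaling each task's gradient to a common norm before summing; the actual calibration rule and where FULLER applies it may differ):

```python
import torch

def calibrated_backbone_step(task_losses, shared_params, optimizer):
    """Compute each task's gradient on the shared backbone parameters, rescale the
    gradients to a common magnitude, sum them, and take an optimizer step."""
    flat_grads = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        flat_grads.append(torch.cat([g.reshape(-1) for g in grads]))
    target_norm = torch.stack([g.norm() for g in flat_grads]).mean()
    combined = sum(g * (target_norm / (g.norm() + 1e-12)) for g in flat_grads)
    offset = 0
    for p in shared_params:                       # write the calibrated gradient back
        n = p.numel()
        p.grad = combined[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
    optimizer.zero_grad()
```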

Sampling to Distill: Knowledge Transfer from Open-World Data

  • paper_url: http://arxiv.org/abs/2307.16601
  • repo_url: None
  • paper_authors: Yuzheng Wang, Zhaoyu Chen, Jie Zhang, Dingkang Yang, Zuhao Ge, Yang Liu, Siao Liu, Yunquan Sun, Wenqiang Zhang, Lizhe Qi
  • for: Training high-performance student models without access to the original training data (data-free knowledge distillation).
  • methods: Proposes an Open-world Data Sampling Distillation (ODSD) strategy without a redundant generation process: an adaptive sampling module collects open-world data close to the original data's distribution, a low-noise representation alleviates domain shifts, and a structured relationship among multiple data examples is built to exploit data knowledge.
  • results: Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet achieve state-of-the-art performance, improving accuracy on ImageNet by 1.50%-9.59% over existing results.
    Abstract Data-Free Knowledge Distillation (DFKD) is a novel task that aims to train high-performance student models using only the teacher network without original training data. Despite encouraging results, existing DFKD methods rely heavily on generation modules with high computational costs. Meanwhile, they ignore the fact that the generated and original data exist domain shifts due to the lack of supervision information. Moreover, knowledge is transferred through each example, ignoring the implicit relationship among multiple examples. To this end, we propose a novel Open-world Data Sampling Distillation (ODSD) method without a redundant generation process. First, we try to sample open-world data close to the original data's distribution by an adaptive sampling module. Then, we introduce a low-noise representation to alleviate the domain shifts and build a structured relationship of multiple data examples to exploit data knowledge. Extensive experiments on CIFAR-10, CIFAR-100, NYUv2, and ImageNet show that our ODSD method achieves state-of-the-art performance. Especially, we improve 1.50\%-9.59\% accuracy on the ImageNet dataset compared with the existing results.

SAMFlow: Eliminating Any Fragmentation in Optical Flow with Segment Anything Model

  • paper_url: http://arxiv.org/abs/2307.16586
  • repo_url: None
  • paper_authors: Shili Zhou, Ruian He, Weimin Tan, Bo Yan
  • for: Addressing the tendency of existing optical flow methods to rely too heavily on local clues and produce fragmented motion estimation, by embedding a pre-trained large vision model, the Segment Anything Model (SAM), to improve object perception.
  • methods: Embeds the frozen SAM image encoder into FlowFormer and proposes an Optical Flow Task-Specific Adaption scheme, with a Context Fusion Module that fuses the SAM encoder with the optical flow context encoder and a Context Adaption Module that adapts SAM features to the flow task via Learned Task-Specific Embedding.
  • results: The model reaches 0.86/2.10 clean/final EPE on the Sintel training set and 3.55/12.32 EPE/F1-all on the KITTI-15 training set, surpassing FlowFormer by 8.5%/9.9% and 13.2%/16.3%, and achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on the Sintel clean pass.
    Abstract Optical Flow Estimation aims to find the 2D dense motion field between two frames. Due to the limitation of model structures and training datasets, existing methods often rely too much on local clues and ignore the integrity of objects, resulting in fragmented motion estimation. Through theoretical analysis, we find the pre-trained large vision models are helpful in optical flow estimation, and we notice that the recently famous Segment Anything Model (SAM) demonstrates a strong ability to segment complete objects, which is suitable for solving the fragmentation problem. We thus propose a solution to embed the frozen SAM image encoder into FlowFormer to enhance object perception. To address the challenge of in-depth utilizing SAM in non-segmentation tasks like optical flow estimation, we propose an Optical Flow Task-Specific Adaption scheme, including a Context Fusion Module to fuse the SAM encoder with the optical flow context encoder, and a Context Adaption Module to adapt the SAM features for optical flow task with Learned Task-Specific Embedding. Our proposed SAMFlow model reaches 0.86/2.10 clean/final EPE and 3.55/12.32 EPE/F1-all on Sintel and KITTI-15 training set, surpassing Flowformer by 8.5%/9.9% and 13.2%/16.3%. Furthermore, our model achieves state-of-the-art performance on the Sintel and KITTI-15 benchmarks, ranking #1 among all two-frame methods on Sintel clean pass.
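
A minimal sketch of fusing a frozen SAM image embedding with optical-flow context features (channel sizes, the interpolation, and the 1x1-conv fusion are simplifying assumptions, not SAMFlow's actual Context Fusion Module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFusion(nn.Module):
    """Concatenate flow-context features with (resized) SAM encoder features and
    project them back to the context dimensionality."""
    def __init__(self, flow_dim: int = 256, sam_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(flow_dim + sam_dim, out_dim, kernel_size=1)

    def forward(self, flow_ctx: torch.Tensor, sam_feat: torch.Tensor) -> torch.Tensor:
        sam_feat = F.interpolate(sam_feat, size=flow_ctx.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.proj(torch.cat([flow_ctx, sam_feat], dim=1))

# fused = ContextFusion()(torch.randn(1, 256, 46, 62), torch.randn(1, 256, 64, 64))
```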

Audio-visual video-to-speech synthesis with synthesized input audio

  • paper_url: http://arxiv.org/abs/2307.16584
  • repo_url: None
  • paper_authors: Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
  • for: Investigating the effect of using both video and audio inputs for video-to-speech synthesis during training and inference.
  • methods: Pre-trained video-to-speech models synthesize the missing speech signals; an audio-visual-to-speech synthesis model is then trained with both the silent video and the synthesized speech as inputs to predict the final reconstructed speech.
  • results: The approach is successful with both raw waveforms and mel spectrograms as target outputs.
    Abstract Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that it is not useful for processing. Previous works in the literature either use video inputs only or employ both video and audio inputs during training, and discard the input audio pathway during inference. In this work we investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

  • paper_url: http://arxiv.org/abs/2307.16579
  • repo_url: None
  • paper_authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
  • for: Audio-visual segmentation (AVS), extensively exploring the contribution of audio.
  • methods: Interprets AVS as a conditional generation task with audio as the conditional variable, combining a latent diffusion model with contrastive learning for semantic-correlated representation learning.
  • results: Experimental results on the benchmark dataset verify the effectiveness of the solution; code and results are available via the project page: https://github.com/OpenNLPLab/DiffusionAVS.
    Abstract We propose a latent diffusion model with contrastive learning for audio-visual segmentation (AVS) to extensively explore the contribution of audio. We interpret AVS as a conditional generation task, where audio is defined as the conditional variable for sound producer(s) segmentation. With our new interpretation, it is especially necessary to model the correlation between audio and the final segmentation map to ensure its contribution. We introduce a latent diffusion model to our framework to achieve semantic-correlated representation learning. Specifically, our diffusion model learns the conditional generation process of the ground-truth segmentation map, leading to ground-truth aware inference when we perform the denoising process at the test stage. As a conditional diffusion model, we argue it is essential to ensure that the conditional variable contributes to model output. We then introduce contrastive learning to our framework to learn audio-visual correspondence, which is proven consistent with maximizing the mutual information between model prediction and the audio data. In this way, our latent diffusion model via contrastive learning explicitly maximizes the contribution of audio for AVS. Experimental results on the benchmark dataset verify the effectiveness of our solution. Code and results are online via our project page: https://github.com/OpenNLPLab/DiffusionAVS.

Transferable Attack for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.16572
  • repo_url: https://github.com/anucvers/tass
  • paper_authors: Mengqi He, Jing Zhang, Zhaoyuan Yang, Mingyi He, Nick Barnes, Yuchao Dai
  • for: Analyzing how semantic segmentation models behave under adversarial attacks, observing that adversarial examples generated from a source model fail to attack target models; conventional attacks such as PGD and FGSM do not transfer well, motivating the study of transferable attacks for semantic segmentation.
  • methods: Identifies two main factors for transferable attacks: effective data augmentation with translation-invariant features to cope with unseen models, and stabilized optimization strategies to find the optimal attack direction.
  • results: Based on these observations, an ensemble attack for semantic segmentation achieves more effective attacks with higher transferability across semantic segmentation models.
    Abstract We analyze the performance of semantic segmentation models w.r.t. adversarial attacks, and observe that the adversarial examples generated from a source model fail to attack the target models, i.e., the conventional attack methods, such as PGD and FGSM, do not transfer well to target models, making it necessary to study transferable attacks, especially transferable attacks for semantic segmentation. We find two main factors to achieve transferable attacks. Firstly, the attack should come with effective data augmentation and translation-invariant features to deal with unseen models. Secondly, stabilized optimization strategies are needed to find the optimal attack direction. Based on the above observations, we propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability. The source code and experimental results are publicly available via our project page: https://github.com/anucvers/TASS.
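
A simplified sketch of an ensemble attack with a translation-style input augmentation (hyper-parameters and the augmentation are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def ensemble_segmentation_attack(models, image, label, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted iterative attack: average the segmentation loss over several source
    models and over a randomly shifted copy of the input, then take a sign step and
    project back into the eps-ball around the clean image."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        dy, dx = [int(s) for s in torch.randint(-4, 5, (2,))]
        shifted = torch.roll(adv, shifts=(dy, dx), dims=(-2, -1))
        shifted_label = torch.roll(label, shifts=(dy, dx), dims=(-2, -1))
        loss = sum(F.cross_entropy(m(adv), label) + F.cross_entropy(m(shifted), shifted_label)
                   for m in models) / (2 * len(models))
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = image + (adv - image).clamp(-eps, eps)      # stay within the perturbation budget
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```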

Towards Unbalanced Motion: Part-Decoupling Network for Video Portrait Segmentation

  • paper_url: http://arxiv.org/abs/2307.16565
  • repo_url: None
  • paper_authors: Tianshu Yu, Changqun Xia, Jia Li
  • for: Improving the accuracy and reliability of video portrait segmentation, and providing a large-scale multi-scene video portrait segmentation dataset (MVPS) to support further research on the task.
  • methods: Proposes a Part-Decoupling Network (PDNet) whose Inter-frame Part-Discriminated Attention (IPDA) module unsupervisedly segments the portrait into parts and applies different attentiveness to the discriminative features of each part, exploiting the unbalanced motion of portrait parts.
  • results: Experimental results show that PDNet achieves leading performance compared with state-of-the-art methods.
    Abstract Video portrait segmentation (VPS), aiming at segmenting prominent foreground portraits from video frames, has received much attention in recent years. However, simplicity of existing VPS datasets leads to a limitation on extensive research of the task. In this work, we propose a new intricate large-scale Multi-scene Video Portrait Segmentation dataset MVPS consisting of 101 video clips in 7 scenario categories, in which 10,843 sampled frames are finely annotated at pixel level. The dataset has diverse scenes and complicated background environments, which is the most complex dataset in VPS to our best knowledge. Through the observation of a large number of videos with portraits during dataset construction, we find that due to the joint structure of human body, motion of portraits is part-associated, which leads that different parts are relatively independent in motion. That is, motion of different parts of the portraits is unbalanced. Towards this unbalance, an intuitive and reasonable idea is that different motion states in portraits can be better exploited by decoupling the portraits into parts. To achieve this, we propose a Part-Decoupling Network (PDNet) for video portrait segmentation. Specifically, an Inter-frame Part-Discriminated Attention (IPDA) module is proposed which unsupervisely segments portrait into parts and utilizes different attentiveness on discriminative features specified to each different part. In this way, appropriate attention can be imposed to portrait parts with unbalanced motion to extract part-discriminated correlations, so that the portraits can be segmented more accurately. Experimental results demonstrate that our method achieves leading performance with the comparison to state-of-the-art methods.

Simultaneous column-based deep learning progression analysis of atrophy associated with AMD in longitudinal OCT studies

  • paper_url: http://arxiv.org/abs/2307.16559
  • repo_url: None
  • paper_authors: Adi Szeskin, Roei Yehuda, Or Shmueli, Jaime Levy, Leo Joskowicz
  • for: 本文旨在在纵向 OCT 检查中准确量化与干性年龄相关性黄斑变性(AMD)相关的视网膜萎缩变化。
  • methods: 所提方法使用一种新颖的同步多通道、基于列的深度学习模型,在配准的 OCT 扫描对上训练,用于检测并分割连续 OCT 扫描中的视网膜萎缩节段。
  • results: 所提方法的萎缩节段检测平均精确率为 0.90±0.09,召回率为 0.95±0.06,在萎缩节段和病灶层面分别比单独分类方法高出 30±62% 和 27±0%。
    Abstract Purpose: Disease progression of retinal atrophy associated with AMD requires the accurate quantification of the retinal atrophy changes on longitudinal OCT studies. It is based on finding, comparing, and delineating subtle atrophy changes on consecutive pairs (prior and current) of unregistered OCT scans. Methods: We present a fully automatic end-to-end pipeline for the simultaneous detection and quantification of time-related atrophy changes associated with dry AMD in pairs of OCT scans of a patient. It uses a novel simultaneous multi-channel column-based deep learning model trained on registered pairs of OCT scans that concurrently detects and segments retinal atrophy segments in consecutive OCT scans by classifying light scattering patterns in matched pairs of vertical pixel-wide columns (A-scans) in registered prior and current OCT slices (B-scans). Results: Experimental results on 4,040 OCT slices with 5.2M columns from 40 scans pairs of 18 patients (66% training/validation, 33% testing) with 24.13+-14.0 months apart in which Complete RPE and Outer Retinal Atrophy (cRORA) was identified in 1,998 OCT slices (735 atrophy lesions from 3,732 segments, 0.45M columns) yield a mean atrophy segments detection precision, recall of 0.90+-0.09, 0.95+-0.06 and 0.74+-0.18, 0.94+-0.12 for atrophy lesions with AUC=0.897, all above observer variability. Simultaneous classification outperforms standalone classification precision and recall by 30+-62% and 27+-0% for atrophy segments and lesions. Conclusions: simultaneous column-based detection and quantification of retinal atrophy changes associated with AMD is accurate and outperforms standalone classification methods. Translational relevance: an automatic and efficient way to detect and quantify retinal atrophy changes associated with AMD.
    摘要 目的:评估与 AMD 相关的视网膜萎缩的疾病进展,需要在纵向 OCT 检查中准确量化视网膜萎缩的变化,即在未配准的前后两次 OCT 扫描中发现、比较并勾画细微的萎缩变化。方法:我们提出了一个全自动的端到端流程,用于在患者的成对 OCT 扫描中同时检测并量化与干性 AMD 相关的随时间变化的萎缩。该流程采用一种新颖的同步多通道、基于列的深度学习模型,在配准的 OCT 扫描对上训练,通过对配准的前后 OCT 切片(B-scan)中相互匹配的单像素宽垂直列(A-scan)的光散射模式进行分类,同时检测并分割连续 OCT 扫描中的视网膜萎缩节段。结果:在 18 名患者的 40 对扫描(间隔 24.13±14.0 个月,66% 训练/验证、33% 测试)共 4,040 张 OCT 切片(520 万列)中,1,998 张切片被判定存在完全性 RPE 及外层视网膜萎缩(cRORA),包含来自 3,732 个节段的 735 个萎缩病灶(45 万列);萎缩节段检测的平均精确率与召回率分别为 0.90±0.09 和 0.95±0.06,萎缩病灶为 0.74±0.18 和 0.94±0.12,AUC=0.897,均高于观察者变异水平。同步分类在节段与病灶层面的精确率和召回率分别比单独分类高 30±62% 和 27±0%。结论:基于列的同步检测与量化方法能够准确评估与 AMD 相关的视网膜萎缩变化,并优于单独分类方法。转化意义:为检测和量化与 AMD 相关的视网膜萎缩变化提供了一种自动且高效的途径。
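
The column-based formulation above is easy to prototype: a shared 1-D CNN takes a registered (prior, current) pair of A-scan columns and classifies whether the current column belongs to an atrophy segment. The following PyTorch sketch is only an illustration of that idea; the column depth, channel counts and head size are assumptions, not the authors' configuration.

```python
# Illustrative sketch (not the authors' code): classify registered pairs of
# A-scan columns (prior, current) into atrophy / no-atrophy per column.
import torch
import torch.nn as nn

class ColumnPairClassifier(nn.Module):
    def __init__(self, depth=496, hidden=64):  # depth = pixels per A-scan (assumed)
        super().__init__()
        # Two input channels: the prior and the current column, stacked.
        self.encoder = nn.Sequential(
            nn.Conv1d(2, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(hidden, 2)  # atrophy vs. background for the column

    def forward(self, prior_cols, current_cols):
        # prior_cols, current_cols: (batch, depth) intensity profiles of matched columns
        x = torch.stack([prior_cols, current_cols], dim=1)  # (batch, 2, depth)
        feats = self.encoder(x).squeeze(-1)                 # (batch, hidden)
        return self.head(feats)                             # (batch, 2) logits

# Toy usage with random columns standing in for registered B-scan slices.
model = ColumnPairClassifier()
prior, current = torch.rand(8, 496), torch.rand(8, 496)
print(model(prior, current).shape)  # torch.Size([8, 2])
```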

Uncertainty-Guided Spatial Pruning Architecture for Efficient Frame Interpolation

  • paper_url: http://arxiv.org/abs/2307.16555
  • repo_url: None
  • paper_authors: Ri Cheng, Xuhao Jiang, Ruian He, Shili Zhou, Weimin Tan, Bo Yan
  • for: 这个论文的目的是提出一种能够高效地实现视频帧 interpolating的方法,以降低计算量而不影响视觉质量。
  • methods: 该方法使用了动态空间剔除技术,通过将低不确定性的像素标记为易动区域,以避免无用的计算。此外, authors还提出了一种自动对比分支技术,以提高UGSP的表现。
  • results: 与不进行剪枝的基线相比,该方法在 Vimeo90K/UCF101/MiddleBury 数据集上分别减少 34%/52%/30% 的 FLOPs 且保持性能,并在多个基准上以更低的 FLOPs 达到最先进性能。
    Abstract The video frame interpolation (VFI) model applies the convolution operation to all locations, leading to redundant computations in regions with easy motion. We can use dynamic spatial pruning method to skip redundant computation, but this method cannot properly identify easy regions in VFI tasks without supervision. In this paper, we develop an Uncertainty-Guided Spatial Pruning (UGSP) architecture to skip redundant computation for efficient frame interpolation dynamically. Specifically, pixels with low uncertainty indicate easy regions, where the calculation can be reduced without bringing undesirable visual results. Therefore, we utilize uncertainty-generated mask labels to guide our UGSP in properly locating the easy region. Furthermore, we propose a self-contrast training strategy that leverages an auxiliary non-pruning branch to improve the performance of our UGSP. Extensive experiments show that UGSP maintains performance but reduces FLOPs by 34%/52%/30% compared to baseline without pruning on Vimeo90K/UCF101/MiddleBury datasets. In addition, our method achieves state-of-the-art performance with lower FLOPs on multiple benchmarks.
    摘要 视频帧 interpolate (VFI) 模型对所有位置进行 convolution 操作,导致在易动的区域进行重复计算。我们可以使用动态空间剔除方法来快速跳过重复计算,但这种方法无法在 VFI 任务中正确地标识易动区域。在这篇论文中,我们开发了一种不确定性指导的空间剔除 (UGSP) 架构,以便在高效的帧 interpolate 中跳过重复计算。具体来说,具有低不确定性的像素表示易动区域,可以通过减少计算而不会导致视觉效果受损。因此,我们利用不确定性生成的掩码标签来导引我们的 UGSP 在正确的易动区域进行剔除。此外,我们提出了一种自我对比训练策略,通过一个辅助的非剔除分支来提高我们的 UGSP 表现。广泛的实验表明,我们的 UGSP 可以维持性能,同时减少 FLOPs 比基eline 无剔除的34%/52%/30%。此外,我们的方法在多个标准benchmark上实现了最佳性能。
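
The pruning idea above reduces to: estimate a per-pixel uncertainty map, then run the expensive refinement branch only where uncertainty is high, keeping the cheap estimate elsewhere. The sketch below illustrates this with a dense convolution and a hard threshold purely for clarity; the threshold, mask use and module shapes are assumptions rather than the paper's exact design.

```python
# Illustrative sketch of uncertainty-guided spatial pruning: run an expensive
# refinement branch only where per-pixel uncertainty is high; elsewhere keep
# the cheap initial estimate. Thresholds and shapes are assumptions.
import torch
import torch.nn as nn

refine = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stands in for a costly branch

def prune_and_refine(coarse_frame, uncertainty, threshold=0.3):
    # coarse_frame: (B, 3, H, W) cheap interpolation result
    # uncertainty:  (B, 1, H, W) in [0, 1]; high values mark hard-motion regions
    mask = (uncertainty > threshold).float()   # 1 where refinement is needed
    refined = refine(coarse_frame)             # dense here for clarity; a real
                                               # implementation would use sparse
                                               # convolution / gather-scatter
    return mask * refined + (1.0 - mask) * coarse_frame

coarse = torch.rand(2, 3, 64, 64)
unc = torch.rand(2, 1, 64, 64)
out = prune_and_refine(coarse, unc)
print(out.shape, "refined fraction:", (unc > 0.3).float().mean().item())
```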

Towards General Visual-Linguistic Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.16545
  • repo_url: None
  • paper_authors: Ke Sun, Shen Chen, Taiping Yao, Xiaoshuai Sun, Shouhong Ding, Rongrong Ji
  • for: 本文旨在提出一种新的人脸伪造检测方法,以应对深度伪造带来的安全、隐私与信任问题。
  • methods: 提出视觉-语言人脸伪造检测(VLFFD)范式,使用细粒度的句子级提示作为标注;VLFFD 先通过提示伪造图像生成器(PFIG)生成混合伪造图像,再将细粒度混合数据与粗粒度原始数据在粗-细协同训练框架(C2F)中联合训练。
  • results: 实验表明,所提方法能在多个具有挑战性的基准上提升现有检测模型的性能。
    Abstract Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We argue that such supervisions lack semantic information and interpretability. To address this issues, in this paper, we propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfakes datasets, VLFFD first generates the mixed forgery image with corresponding fine-grained prompts via Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and coarse-grained original data and is jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain more generalization and interpretability. The experiments show the proposed method improves the existing detection models on several challenging benchmarks.
    摘要 深度伪造(Deepfakes)是一类高度逼真的人脸篡改技术,会对安全、隐私和信任造成严重威胁。现有方法大多将该任务视为二分类问题,使用数字标签或掩码信号训练检测模型。我们认为这类监督信号缺乏语义信息和可解释性。为了解决这些问题,本文提出了一种新的范式,即视觉-语言人脸伪造检测(VLFFD),使用细粒度的句子级提示作为标注。由于现有深度伪造数据集没有文本标注,VLFFD 首先通过提示伪造图像生成器(PFIG)生成混合伪造图像及对应的细粒度提示;然后将细粒度混合数据与粗粒度原始数据在粗-细协同训练框架(C2F)中联合训练,使模型获得更强的泛化能力和可解释性。实验表明,所提方法能在多个具有挑战性的基准上提升现有检测模型的性能。

On Transferability of Driver Observation Models from Simulated to Real Environments in Autonomous Cars

  • paper_url: http://arxiv.org/abs/2307.16543
  • repo_url: None
  • paper_authors: Walter Morales-Alvarez, Novel Certad, Alina Roitberg, Rainer Stiefelhagen, Cristina Olaverri-Monreal
  • for: 这篇论文探讨了将模拟数据传递到实际驾驶场景中的可能性,尤其是在自动驾驶领域中,模拟数据frequently用于训练因为安全问题。
  • methods: 本文使用了真实自动驾驶条件下的数据采集,并采用了Inflated 3D ConvNet(I3D)模型和Gradient-weighted Class Activation Mapping(Grad-CAM)来进行详细的模型决策分析。
  • results: 虽然模拟器上的模型表现出色,但在实际驾驶场景下,其识别率降低到46.6%,并且不同的行为类型之间存在强烈的变化。这说明了模型传输性能的挑战,并促进了我们研究更加鲜明的驾驶者观察系统,能够满足实际驾驶场景中的需求。
    Abstract For driver observation frameworks, clean datasets collected in controlled simulated environments often serve as the initial training ground. Yet, when deployed under real driving conditions, such simulator-trained models quickly face the problem of distributional shifts brought about by changing illumination, car model, variations in subject appearances, sensor discrepancies, and other environmental alterations. This paper investigates the viability of transferring video-based driver observation models from simulation to real-world scenarios in autonomous vehicles, given the frequent use of simulation data in this domain due to safety issues. To achieve this, we record a dataset featuring actual autonomous driving conditions and involving seven participants engaged in highly distracting secondary activities. To enable direct SIM to REAL transfer, our dataset was designed in accordance with an existing large-scale simulator dataset used as the training source. We utilize the Inflated 3D ConvNet (I3D) model, a popular choice for driver observation, with Gradient-weighted Class Activation Mapping (Grad-CAM) for detailed analysis of model decision-making. Though the simulator-based model clearly surpasses the random baseline, its recognition quality diminishes, with average accuracy dropping from 85.7% to 46.6%. We also observe strong variations across different behavior classes. This underscores the challenges of model transferability, facilitating our research of more robust driver observation systems capable of dealing with real driving conditions.
    摘要 在驾驶员观测框架中,在受控模拟环境中采集的干净数据集通常被用作初始训练数据。然而,一旦部署到真实驾驶条件下,这类在模拟器上训练的模型会很快面临由光照变化、车型差异、被试外观变化、传感器差异及其他环境变化带来的分布偏移问题。鉴于该领域出于安全考虑经常使用模拟数据,本文研究了将基于视频的驾驶员观测模型从模拟环境迁移到自动驾驶真实场景的可行性。为此,我们录制了一个在真实自动驾驶条件下采集的数据集,共有 7 名参与者从事高度分散注意力的次要活动。为实现模拟到真实(SIM to REAL)的直接迁移,该数据集的设计与作为训练来源的现有大规模模拟器数据集保持一致。我们采用驾驶员观测中常用的 Inflated 3D ConvNet(I3D)模型,并结合梯度加权类激活映射(Grad-CAM)对模型决策进行细致分析。尽管基于模拟器的模型明显优于随机基线,但其识别质量有所下降,平均准确率从 85.7% 降至 46.6%,且不同行为类别之间差异显著。这凸显了模型可迁移性的挑战,也推动我们研究能够应对真实驾驶条件、更加鲁棒的驾驶员观测系统。

Echoes Beyond Points: Unleashing the Power of Raw Radar Data in Multi-modality Fusion

  • paper_url: http://arxiv.org/abs/2307.16532
  • repo_url: None
  • paper_authors: Yang Liu, Feng Wang, Naiyan Wang, Zhaoxiang Zhang
  • for: 提高雷达探测性能,使其与其他感知器进行深度融合。
  • methods: 跳过现有的雷达信号处理管道,直接将雷达原始数据与其他感知器进行融合。使用鸟瞰视图(BEV)查询和雷达谱特征来实现。
  • results: 与现有方法相比,方法可以更好地利用雷达回射信号中的距离和速度准确信息,并且与图像中的 semantics信息进行深度融合,在RADIal数据集上表现出优于所有exist方法,并且接近LiDAR的性能。
    Abstract Radar is ubiquitous in autonomous driving systems due to its low cost and good adaptability to bad weather. Nevertheless, the radar detection performance is usually inferior because its point cloud is sparse and not accurate due to the poor azimuth and elevation resolution. Moreover, point cloud generation algorithms already drop weak signals to reduce the false targets which may be suboptimal for the use of deep fusion. In this paper, we propose a novel method named EchoFusion to skip the existing radar signal processing pipeline and then incorporate the radar raw data with other sensors. Specifically, we first generate the Bird's Eye View (BEV) queries and then take corresponding spectrum features from radar to fuse with other sensors. By this approach, our method could utilize both rich and lossless distance and speed clues from radar echoes and rich semantic clues from images, making our method surpass all existing methods on the RADIal dataset, and approach the performance of LiDAR. Codes will be available upon acceptance.
    摘要 由于成本低且对恶劣天气适应性好,雷达在自动驾驶系统中无处不在。然而,雷达的检测性能通常较差,因为其方位角和俯仰角分辨率有限,点云稀疏且不够精确;此外,点云生成算法为减少虚假目标会丢弃弱回波信号,这对深度融合而言可能并非最优。本文提出一种名为 EchoFusion 的新方法,跳过现有的雷达信号处理流程,直接将雷达原始数据与其他传感器融合。具体而言,我们先生成鸟瞰图(BEV)查询,再从雷达中取出相应的频谱特征与其他传感器融合。通过这种方式,我们的方法既能利用雷达回波中丰富且无损的距离与速度线索,又能利用图像中丰富的语义线索,从而在 RADIal 数据集上超越所有现有方法,并接近 LiDAR 的性能。代码将在论文被接收后发布。

Deep Learning and Computer Vision for Glaucoma Detection: A Review

  • paper_url: http://arxiv.org/abs/2307.16528
  • repo_url: None
  • paper_authors: Mona Ashtari-Majlan, Mohammad Mahdi Dehshibi, David Masip
  • for: 这篇论文旨在综述人工智能在青光眼诊断中的应用,尤其是基于计算机视觉与深度学习的自动评估方法。
  • methods: 综述了近年来基于眼底照片、光学相干断层扫描(OCT)和视野图像的 AI 青光眼诊断研究,给出按架构范式组织的最新分类,并附上可用源代码链接以增强方法的可复现性。
  • results: 通过在广泛使用的公开数据集上进行严格基准测试,揭示了各方法在泛化能力、不确定性估计和多模态融合方面的性能差距,同时整理了关键数据集并指出规模、标注不一致和偏差等局限。
    Abstract Glaucoma is the leading cause of irreversible blindness worldwide and poses significant diagnostic challenges due to its reliance on subjective evaluation. However, recent advances in computer vision and deep learning have demonstrated the potential for automated assessment. In this paper, we survey recent studies on AI-based glaucoma diagnosis using fundus, optical coherence tomography, and visual field images, with a particular emphasis on deep learning-based methods. We provide an updated taxonomy that organizes methods into architectural paradigms and includes links to available source code to enhance the reproducibility of the methods. Through rigorous benchmarking on widely-used public datasets, we reveal performance gaps in generalizability, uncertainty estimation, and multimodal integration. Additionally, our survey curates key datasets while highlighting limitations such as scale, labeling inconsistencies, and bias. We outline open research challenges and detail promising directions for future studies. This survey is expected to be useful for both AI researchers seeking to translate advances into practice and ophthalmologists aiming to improve clinical workflows and diagnosis using the latest AI outcomes.
    摘要 青光眼是全球不可逆致盲的首要原因,且由于依赖主观评估,其诊断面临很大挑战。然而,计算机视觉与深度学习的最新进展已经展示了自动评估的潜力。本文综述了近年来基于眼底照片、光学相干断层扫描(OCT)和视野图像的 AI 青光眼诊断研究,重点关注基于深度学习的方法。我们给出了按架构范式组织的最新分类,并附上可用源代码链接,以提高方法的可复现性。通过在广泛使用的公开数据集上进行严格的基准测试,我们揭示了各方法在泛化能力、不确定性估计和多模态融合方面的性能差距。此外,本综述整理了关键数据集,指出了规模、标注不一致和偏差等局限,概述了尚未解决的研究挑战,并给出了未来研究的可行方向。本综述有望帮助希望将最新进展落地的 AI 研究者,以及希望借助最新 AI 成果改进临床流程与诊断的眼科医生。

Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

  • paper_url: http://arxiv.org/abs/2307.16509
  • repo_url: https://github.com/gallenszl/ucfnet
  • paper_authors: Zhelun Shen, Xibin Song, Yuchao Dai, Dingfu Zhou, Zhibo Rao, Liangjun Zhang
  • for: 提高深度匹配的跨 dataset 鲁棒性和泛化能力,尤其是在面临着域域异常和数据缺乏的情况下。
  • methods: 采用像素级 uncertainty 估计来自适应匹配空间,并通过权重学习来逐渐减少不可能的对应关系。另外,提出了基于 uncertainty 的 pseudo-标签方法,用于适应预训练模型到新域,并可以筛选高 uncertainty 像素的预测深度图并生成稀疏 yet 可靠的 pseudo-标签。
  • results: 实验表明,我们的方法在跨域、适应和共同泛化等方面具有强大的性能,并在 Robust Vision Challenge 2020 中获得了深度匹配任务的第一名。此外,我们的 uncertainty-based pseudo-标签还可以用于无监督的单目深度估计网络训练,并实现了与监督方法相当的性能。代码将在 https://github.com/gallenszl/UCFNet 上提供。
    Abstract Due to the domain differences and unbalanced disparity distribution across multiple datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. Such domain shift issue is usually addressed by substantial adaptation on costly target-domain ground-truth data, which cannot be easily obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ a pixel-level uncertainty estimation to adaptively adjust the next stage disparity searching space, in this way driving the network progressively prune out the space of unlikely correspondences. Then, to solve the limited ground truth data, an uncertainty-based pseudo-label is proposed to adapt the pre-trained model to the new domain, where pixel-level and area-level uncertainty estimation are proposed to filter out the high-uncertainty pixels of predicted disparity maps and generate sparse while reliable pseudo-labels to align the domain gap. Experimentally, our method shows strong cross-domain, adapt, and joint generalization and obtains \textbf{1st} place on the stereo task of Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieves comparable performance with the supervised methods. The code will be available at https://github.com/gallenszl/UCFNet.
    摘要 由于不同数据集之间存在域差异且视差分布不均衡,现有的立体匹配方法通常局限于特定数据集,难以泛化到其他数据集。这类域偏移问题通常需要依靠代价高昂的目标域真值数据进行大量适配,而这些数据在实际场景中难以获得。本文提出从不确定性估计入手来实现鲁棒的立体匹配。具体而言,为平衡视差分布,我们采用像素级不确定性估计自适应地调整下一阶段的视差搜索空间,从而驱动网络逐步剔除不太可能的对应关系。针对真值数据有限的问题,我们提出基于不确定性的伪标签,将预训练模型适配到新域:利用像素级和区域级不确定性估计过滤预测视差图中高不确定性的像素,生成稀疏而可靠的伪标签以弥合域间差距。实验表明,我们的方法具有很强的跨域、适配与联合泛化能力,并在 Robust Vision Challenge 2020 的立体匹配任务中获得第一名。此外,基于不确定性的伪标签还可用于以无监督方式训练单目深度估计网络,并取得与有监督方法相当的性能。代码将发布在 https://github.com/gallenszl/UCFNet。
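
The pseudo-labelling step described above can be sketched in a few lines: keep predicted disparities only where the estimated uncertainty is low, and supervise the adapted model on that sparse set. The threshold and loss choice below are illustrative assumptions.

```python
# Illustrative sketch of uncertainty-based pseudo-labelling for stereo adaptation:
# keep predicted disparities only where estimated uncertainty is low, yielding a
# sparse but reliable supervision signal for the new domain. The threshold and
# the uncertainty source are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def make_pseudo_labels(disparity, uncertainty, max_uncertainty=0.2):
    # disparity:   (B, H, W) predicted by the pre-trained model on target images
    # uncertainty: (B, H, W) per-pixel uncertainty in [0, 1]
    valid = uncertainty < max_uncertainty
    pseudo = torch.where(valid, disparity, torch.full_like(disparity, float("nan")))
    return pseudo, valid

def pseudo_label_loss(pred, pseudo, valid):
    # Smooth-L1 only on pixels that survived the uncertainty filter.
    if valid.sum() == 0:
        return pred.new_zeros(())
    return F.smooth_l1_loss(pred[valid], pseudo[valid])

disp = torch.rand(2, 32, 64) * 64
unc = torch.rand(2, 32, 64)
pseudo, valid = make_pseudo_labels(disp, unc)
student_pred = torch.rand(2, 32, 64) * 64
print(pseudo_label_loss(student_pred, pseudo, valid).item(), valid.float().mean().item())
```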

Towards General Low-Light Raw Noise Synthesis and Modeling

  • paper_url: http://arxiv.org/abs/2307.16508
  • repo_url: https://github.com/fengzhang427/LRD
  • paper_authors: Feng Zhang, Bin Xu, Zhiqiang Li, Xinran Liu, Qingbo Lu, Changxin Gao, Nong Sang
  • for: 提供一种基于物理和学习的低光照环境下隐藏噪声模型,以满足计算摄影和图像处理应用的需求。
  • methods: 通过将信号依赖和信号独立噪声分别用物理和学习模型来模拟,以实现一个通用的模型,可以同时学习不同ISO水平的噪声特征并对各种感知器进行泛化。
  • results: 对于低光照环境下的隐藏噪声,我们的方法可以具有高度的同准化能力,并且在不同感知器上进行了广泛的比较,结果表明我们的方法在噪声降减方面与状态之前的方法进行了比较。
    Abstract Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model. Specifically, we synthesize the signal-dependent and signal-independent noise in a physics- and learning-based manner, respectively. In this way, our method can be considered as a general model, that is, it can simultaneously learn different noise characteristics for different ISO levels and generalize to various sensors. Subsequently, we present an effective multi-scale discriminator termed Fourier transformer discriminator (FTD) to distinguish the noise distribution accurately. Additionally, we collect a new low-light raw denoising (LRD) dataset for training and benchmarking. Qualitative validation shows that the noise generated by our proposed noise model can be highly similar to the real noise in terms of distribution. Furthermore, extensive denoising experiments demonstrate that our method performs favorably against state-of-the-art methods on different sensors.
    摘要 低光照条件下原始噪声的建模与合成是计算摄影和图像处理应用的基础问题。尽管最近的工作大多采用基于物理的模型来合成噪声,但低光照条件下与信号无关的噪声远比这些模型所能描述的更复杂,且在不同相机传感器之间差异巨大。为解决这一问题,我们引入一种新的视角,用生成模型来合成与信号无关的噪声。具体来说,我们分别以基于物理和基于学习的方式合成信号相关与信号无关的噪声。这样,我们的方法可被视为一种通用模型,既能同时学习不同 ISO 水平下的噪声特性,又能泛化到各种传感器。随后,我们提出一种有效的多尺度判别器,即傅里叶变换判别器(FTD),以准确区分噪声分布。此外,我们还收集了一个新的低光照原始图像去噪(LRD)数据集,用于训练和基准测试。定性验证表明,所提噪声模型生成的噪声在分布上与真实噪声高度相似;大量去噪实验进一步表明,我们的方法在不同传感器上均优于当前最先进方法。
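
For the physics-based, signal-dependent half of such a noise model, a common formulation is Poisson shot noise plus Gaussian read noise; the sketch below shows that baseline only and does not reproduce the paper's learned signal-independent component. The gain and read-noise values are illustrative assumptions.

```python
# Minimal physics-based sketch of the *signal-dependent* half of low-light raw
# noise synthesis: Poisson shot noise plus Gaussian read noise. The learned,
# signal-independent component described in the paper is not modeled here, and
# the gain / read-noise numbers below are illustrative assumptions.
import torch

def add_shot_read_noise(clean_raw, gain=0.01, read_std=0.002):
    # clean_raw: (..., H, W) linear raw intensities in [0, 1]
    photons = torch.clamp(clean_raw / gain, min=0.0)   # expected photon count
    shot = torch.poisson(photons) * gain               # signal-dependent shot noise
    read = torch.randn_like(clean_raw) * read_std      # simple Gaussian read noise
    return torch.clamp(shot + read, 0.0, 1.0)

clean = torch.rand(1, 4, 128, 128) * 0.1   # dark packed-Bayer image (assumed layout)
noisy = add_shot_read_noise(clean)
print((noisy - clean).std().item())
```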

A hybrid approach for improving U-Net variants in medical image segmentation

  • paper_url: http://arxiv.org/abs/2307.16462
  • repo_url: None
  • paper_authors: Aitik Gupta, Dr. Joydip Dhar
  • for: 本研究旨在降低网络参数需求,同时在皮肤病变分割等医学图像分割任务(结合注意力机制与残差连接)上保持性能。
  • methods: 本研究采用深度可分离卷积,并结合注意力机制与残差连接,在降低网络参数需求的同时保持医学图像分割任务的性能。
  • results: 研究表明,使用深度可分离卷积并结合注意力机制与残差连接,可以在降低网络参数需求的同时保持医学图像分割任务的性能。
    Abstract Medical image segmentation is vital to the area of medical imaging because it enables professionals to more accurately examine and understand the information offered by different imaging modalities. The technique of splitting a medical image into various segments or regions of interest is known as medical image segmentation. The segmented images that are produced can be used for many different things, including diagnosis, surgery planning, and therapy evaluation. In initial phase of research, major focus has been given to review existing deep-learning approaches, including researches like MultiResUNet, Attention U-Net, classical U-Net, and other variants. The attention feature vectors or maps dynamically add important weights to critical information, and most of these variants use these to increase accuracy, but the network parameter requirements are somewhat more stringent. They face certain problems such as overfitting, as their number of trainable parameters is very high, and so is their inference time. Therefore, the aim of this research is to reduce the network parameter requirements using depthwise separable convolutions, while maintaining performance over some medical image segmentation tasks such as skin lesion segmentation using attention system and residual connections.
    摘要 医疗影像分割是医疗影像领域的关键技术,它使得专业人员可以更加准确地检查和理解不同的影像模式中提供的信息。这种技术的核心是将医疗影像分割成不同的区域或 interess 点。生成的分割图像可以用于诊断、手术规划和治疗评估等多种应用。在初期研究阶段,主要关注了现有的深度学习方法,包括MultiResUNet、Attention U-Net、传统的U-Net和其他变种。这些方法使用注意力特征向量或地图来动态添加重要权重,以提高准确性。然而,这些网络的参数需求较高,导致过拟合和执行时间较长。因此,本研究的目标是通过深度分割 convolution 来降低网络参数需求,保持一定的性能水平,而不是全面替换现有的深度学习方法。特别是在医疗影像分割任务中,如皮肤病变分割使用注意力系统和 residual connections。
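
A depthwise-separable convolution, the main parameter-saving ingredient discussed above, factorizes a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer. The block below is a generic PyTorch sketch (channel sizes are arbitrary), with a quick parameter-count comparison against a standard convolution.

```python
# A minimal depthwise-separable convolution block of the kind used to cut
# parameters in U-Net-style encoders/decoders. Channel sizes are arbitrary
# examples, not the authors' configuration.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

block = DepthwiseSeparableConv(64, 128)
std_params = sum(p.numel() for p in nn.Conv2d(64, 128, 3, padding=1, bias=False).parameters())
sep_params = sum(p.numel() for p in block.parameters())
print(std_params, sep_params)  # the separable version needs far fewer weights
```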

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

  • paper_url: http://arxiv.org/abs/2307.16449
  • repo_url: https://github.com/rese1f/MovieChat
  • paper_authors: Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
  • for: 用于建立一个可以处理长视频的视频理解系统,推翻特定预先定义的视觉任务的局限性。
  • methods: 使用视频基础模型和大语言模型,并开发了一种基于Atkinson-Shiffrin记忆模型的记忆机制,包括快速更新的短期记忆和可持续的长期记忆。使用Transformers中的 токен作为记忆载体。
  • results: 在长视频理解任务上取得了最先进(state-of-the-art)的性能。
    Abstract Recently, integrating video foundation models and large language models to build a video understanding system overcoming the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection are the remaining challenges. Inspired by Atkinson-Shiffrin memory model, we develop an memory mechanism including a rapidly updated short-term memory and a compact thus sustained long-term memory. We employ tokens in Transformers as the carriers of memory. MovieChat achieves state-of-the-art performace in long video understanding.
    摘要 最近,研究者尝试将视频基础模型与大语言模型结合,以构建能够超越特定预定义视觉任务限制的视频理解系统。然而,现有系统只能处理帧数很少的视频;对于长视频,计算复杂度、内存开销和长时序关联仍是待解决的挑战。受 Atkinson-Shiffrin 记忆模型启发,我们设计了一种记忆机制,包括快速更新的短期记忆和紧凑而持久的长期记忆,并使用 Transformer 中的 token 作为记忆载体。MovieChat 在长视频理解上取得了最先进的性能。
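
One simple way to realize a "compact thus sustained" long-term memory over Transformer tokens is to repeatedly merge the most similar adjacent tokens in the short-term buffer. The sketch below uses greedy cosine-similarity merging as an illustrative assumption; the paper's actual consolidation rule may differ.

```python
# Sketch of consolidating a short-term token buffer into a compact long-term
# memory by repeatedly averaging the most similar *adjacent* token pair.
# Greedy cosine-similarity merging is an assumption for illustration only.
import torch
import torch.nn.functional as F

def consolidate(tokens, target_len):
    # tokens: (N, D) frame tokens in temporal order; returns (target_len, D)
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        a, b = tokens[:-1], tokens[1:]
        sim = F.cosine_similarity(a, b, dim=-1)   # similarity of temporal neighbours
        i = int(torch.argmax(sim))                # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2.0
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]], dim=0)
    return tokens

short_term = torch.randn(64, 256)        # e.g. 64 visual tokens from recent frames
long_term = consolidate(short_term, 16)  # compact, sustained memory
print(long_term.shape)                   # torch.Size([16, 256])
```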

Interactive Neural Painting

  • paper_url: http://arxiv.org/abs/2307.16441
  • repo_url: https://github.com/pukacheen/MagicBrush
  • paper_authors: Elia Peruzzo, Willi Menapace, Vidit Goel, Federica Arrigoni, Hao Tang, Xingqian Xu, Arman Chopikyan, Nikita Orlov, Yuxiao Hu, Humphrey Shi, Nicu Sebe, Elisa Ricci
  • for: 这篇论文的目的是提出一种可以帮助用户创作的计算机机器人技术,帮助用户在绘画时提供下一步的笔触建议。
  • methods: 该方法基于一种 conditional transformer Variational AutoEncoder(VAE)架构,并在两个阶段中进行解码。
  • results: 我们的实验结果表明,我们的方法可以提供好的笔触建议,并与现有技术相比,表现更好。
    Abstract In the last few years, Neural Painting (NP) techniques became capable of producing extremely realistic artworks. This paper advances the state of the art in this emerging research domain by proposing the first approach for Interactive NP. Considering a setting where a user looks at a scene and tries to reproduce it on a painting, our objective is to develop a computational framework to assist the users creativity by suggesting the next strokes to paint, that can be possibly used to complete the artwork. To accomplish such a task, we propose I-Paint, a novel method based on a conditional transformer Variational AutoEncoder (VAE) architecture with a two-stage decoder. To evaluate the proposed approach and stimulate research in this area, we also introduce two novel datasets. Our experiments show that our approach provides good stroke suggestions and compares favorably to the state of the art. Additional details, code and examples are available at https://helia95.github.io/inp-website.
    摘要 近几年,神经绘画(Neural Painting, NP)技术已经能够生成极为逼真的艺术作品。本文提出了首个面向交互式神经绘画的方法,从而推进了这一新兴研究领域的前沿。设想用户观察一个场景并尝试将其绘制成画,我们的目标是构建一个计算框架,通过建议接下来可以绘制的笔画来辅助用户创作,并帮助其完成作品。为此,我们提出了 I-Paint,一种基于条件 Transformer 变分自编码器(VAE)并带有两阶段解码器的新方法。为评估所提方法并推动该领域的研究,我们还引入了两个新数据集。实验表明,我们的方法能够给出良好的笔画建议,并优于当前最先进方法。更多细节、代码和示例见 https://helia95.github.io/inp-website。

Towards Head Computed Tomography Image Reconstruction Standardization with Deep Learning Assisted Automatic Detection

  • paper_url: http://arxiv.org/abs/2307.16440
  • repo_url: None
  • paper_authors: Bowen Zheng, Chenxi Huang, Yuemei Luo
  • for: 提高头部Computed Tomography(CT)图像三维重建的精度和重复性,以便更加准确地诊断。
  • methods: 使用基于深度学习的目标检测算法,自动检测并评估眶耳线(orbitomeatal line)标志点,在三维重建之前自动重新格式化图像。
  • results: 比较了十种对象检测算法的精度、效率和Robustness,选择了轻量级的 YOLOv8,其 mAP 为 92.91%,并通过标准化重建结果的质量评估,证明方法的临床实用性和有效性。
    Abstract Three-dimensional (3D) reconstruction of head Computed Tomography (CT) images elucidates the intricate spatial relationships of tissue structures, thereby assisting in accurate diagnosis. Nonetheless, securing an optimal head CT scan without deviation is challenging in clinical settings, owing to poor positioning by technicians, patient's physical constraints, or CT scanner tilt angle restrictions. Manual formatting and reconstruction not only introduce subjectivity but also strain time and labor resources. To address these issues, we propose an efficient automatic head CT images 3D reconstruction method, improving accuracy and repeatability, as well as diminishing manual intervention. Our approach employs a deep learning-based object detection algorithm, identifying and evaluating orbitomeatal line landmarks to automatically reformat the images prior to reconstruction. Given the dearth of existing evaluations of object detection algorithms in the context of head CT images, we compared ten methods from both theoretical and experimental perspectives. By exploring their precision, efficiency, and robustness, we singled out the lightweight YOLOv8 as the aptest algorithm for our task, with an mAP of 92.91% and impressive robustness against class imbalance. Our qualitative evaluation of standardized reconstruction results demonstrates the clinical practicability and validity of our method.
    摘要 头部计算机断层扫描(CT)图像的三维重建能够清晰呈现组织结构之间复杂的空间关系,从而辅助精准诊断。然而,由于技师摆位不当、患者身体受限或 CT 机架倾角受限,在临床环境中获得无偏差的最佳头部 CT 扫描十分困难;手动格式化与重建不仅引入主观性,还耗费时间和人力。为解决这些问题,我们提出一种高效的头部 CT 图像自动三维重建方法,在提升精度与可重复性的同时减少人工干预。该方法使用基于深度学习的目标检测算法,识别并评估眶耳线标志点,在重建前自动对图像进行重新格式化。鉴于目前缺乏针对头部 CT 图像的目标检测算法评估,我们从理论和实验两个角度比较了十种方法;在考察其精度、效率和鲁棒性后,选定轻量级的 YOLOv8 作为最适合该任务的算法,其 mAP 达 92.91%,并对类别不平衡表现出良好的鲁棒性。对标准化重建结果的定性评估表明,该方法具有临床实用性和有效性。

Detecting Out-of-distribution Objects Using Neuron Activation Patterns

  • paper_url: http://arxiv.org/abs/2307.16433
  • repo_url: https://github.com/safednn-group/naptron
  • paper_authors: Bartłomiej Olber, Krystian Radlak, Krystian Chachuła, Jakub Łyskawa, Piotr Frątczak
  • for: 实时物类检测中的OOD检测问题
  • methods: 基于Neuron Activation PaTteRns的OOD检测方法
  • results: 在两个不同的OOD情况下和三种物类检测器上,我们的方法具有比顶对ID性能的优化和高准确率的OOD检测能力。
    Abstract Object detection is essential to many perception algorithms used in modern robotics applications. Unfortunately, the existing models share a tendency to assign high confidence scores for out-of-distribution (OOD) samples. Although OOD detection has been extensively studied in recent years by the computer vision (CV) community, most proposed solutions apply only to the image recognition task. Real-world applications such as perception in autonomous vehicles struggle with far more complex challenges than classification. In our work, we focus on the prevalent field of object detection, introducing Neuron Activation PaTteRns for out-of-distribution samples detection in Object detectioN (NAPTRON). Performed experiments show that our approach outperforms state-of-the-art methods, without the need to affect in-distribution (ID) performance. By evaluating the methods in two distinct OOD scenarios and three types of object detectors we have created the largest open-source benchmark for OOD object detection.
    摘要 目标检测是现代机器人应用中许多感知算法的基础。遗憾的是,现有模型往往会给分布外(OOD)样本分配很高的置信度分数。尽管近年来计算机视觉(CV)领域对 OOD 检测进行了广泛研究,但大多数解决方案只适用于图像识别任务;而自动驾驶感知等实际应用面临的挑战远比分类复杂。本文聚焦于目标检测这一重要领域,提出了基于神经元激活模式的 OOD 样本检测方法 NAPTRON。实验表明,我们的方法优于现有最佳方法,且无需牺牲分布内(ID)性能。通过在两种不同的 OOD 场景和三类目标检测器上评估这些方法,我们构建了目前最大的开源 OOD 目标检测基准。
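
A minimal way to operationalize "neuron activation patterns" for OOD scoring is to binarize a hidden-layer feature vector and measure the Hamming distance to patterns collected from in-distribution detections. The sketch below is a simplified illustration; the choice of layer, binarization and distance are assumptions, not NAPTRON's exact procedure.

```python
# Simplified sketch of scoring out-of-distribution detections with neuron
# activation patterns: binarize a hidden-layer activation vector, then compare
# it (Hamming distance) against patterns collected on in-distribution data.
import torch

def to_pattern(activations):
    # activations: (N, D) hidden features of accepted detections
    return (activations > 0).to(torch.uint8)   # binary "which neurons fired"

def ood_score(query_acts, id_patterns):
    # Higher score = farther from every known in-distribution pattern.
    q = to_pattern(query_acts).unsqueeze(1)     # (M, 1, D)
    p = id_patterns.unsqueeze(0)                # (1, N, D)
    hamming = (q != p).sum(dim=-1)              # (M, N) pairwise Hamming distances
    return hamming.min(dim=-1).values.float()   # distance to nearest ID pattern

id_features = torch.randn(500, 128)             # features from training-set detections
bank = to_pattern(id_features)
test_features = torch.randn(4, 128) + 3.0       # shifted features mimic OOD objects
print(ood_score(test_features, bank))
```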

High Dynamic Range Image Reconstruction via Deep Explicit Polynomial Curve Estimation

  • paper_url: http://arxiv.org/abs/2307.16426
  • repo_url: https://github.com/jqtangust/epce-hdr
  • paper_authors: Jiaqi Tang, Xiaogang Xu, Sixing Hu, Ying-Cong Chen
  • for: 解决限制了摄像机的镜头能力,数字图像通常有更窄的动态闪光范围,从而降低了图像的真实性。
  • methods: 提出了高动态范围(HDR)重建技术,以回归更好地表示实际场景。但是,由于不同的物理捕捉参数,图像和实际闪光范围之间的对应关系非常复杂,这使得HDR重建变得极其困难。现有的解决方案无法直接确定图像和实际闪光范围之间的对应关系,但这种关系在重建HDR图像时非常重要。
  • results: 我们提出了一种方法,可以直接估计图像和实际闪光范围之间的对应关系,并在一个网络中确定HDR图像。首先,根据闪光范围的特点,我们构建了一个模型,用多项式描述闪光范围的趋势。使用学习网络来估计这些系数。这个曲线会自动根据低动态范围图像的闪光空间自动调整,并重建实际的HDR图像。此外,由于现有的 dataset 没有提供图像和实际闪光范围之间的对应关系,我们构建了一个新的 dataset,包括 sintetic 和实际图像。广泛的实验显示,我们的方法可以在不同的闪光范围下进行极其好的普适性和高性能。
    Abstract Due to limited camera capacities, digital images usually have a narrower dynamic illumination range than real-world scene radiance. To resolve this problem, High Dynamic Range (HDR) reconstruction is proposed to recover the dynamic range to better represent real-world scenes. However, due to different physical imaging parameters, the tone-mapping functions between images and real radiance are highly diverse, which makes HDR reconstruction extremely challenging. Existing solutions can not explicitly clarify a corresponding relationship between the tone-mapping function and the generated HDR image, but this relationship is vital when guiding the reconstruction of HDR images. To address this problem, we propose a method to explicitly estimate the tone mapping function and its corresponding HDR image in one network. Firstly, based on the characteristics of the tone mapping function, we construct a model by a polynomial to describe the trend of the tone curve. To fit this curve, we use a learnable network to estimate the coefficients of the polynomial. This curve will be automatically adjusted according to the tone space of the Low Dynamic Range (LDR) image, and reconstruct the real HDR image. Besides, since all current datasets do not provide the corresponding relationship between the tone mapping function and the LDR image, we construct a new dataset with both synthetic and real images. Extensive experiments show that our method generalizes well under different tone-mapping functions and achieves SOTA performance.
    摘要 首先,基于色调映射函数的特性,我们用多项式构建模型来描述色调曲线的走势,并用一个可学习的网络来拟合该曲线、估计多项式系数。该曲线会根据低动态范围(LDR)图像的色调空间自动调整,进而重建出真实的 HDR 图像。此外,由于现有数据集均未提供色调映射函数与 LDR 图像之间的对应关系,我们构建了一个同时包含合成图像与真实图像的新数据集。大量实验表明,我们的方法在不同的色调映射函数下具有良好的泛化能力,并达到最先进的性能。
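
The explicit tone-curve idea above can be sketched directly: a small network predicts per-image polynomial coefficients, which are then applied pixel-wise to the LDR input. The polynomial degree and the coefficient head below are illustrative assumptions.

```python
# Sketch of the explicit polynomial tone-curve idea: a small network predicts
# per-image polynomial coefficients, which are applied pixel-wise to the LDR
# input to produce an HDR-like output. Degree and head size are assumptions.
import torch
import torch.nn as nn

class PolyCurveHead(nn.Module):
    def __init__(self, degree=4):
        super().__init__()
        self.degree = degree
        self.coeff_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(3, 32), nn.ReLU(),
            nn.Linear(32, degree + 1),   # one coefficient per power of the input
        )

    def forward(self, ldr):
        # ldr: (B, 3, H, W) in [0, 1]
        coeffs = self.coeff_net(ldr)                                              # (B, d+1)
        powers = torch.stack([ldr ** k for k in range(self.degree + 1)], dim=1)   # (B, d+1, 3, H, W)
        return (coeffs[:, :, None, None, None] * powers).sum(dim=1)               # (B, 3, H, W)

model = PolyCurveHead()
ldr = torch.rand(2, 3, 64, 64)
print(model(ldr).shape)  # torch.Size([2, 3, 64, 64])
```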

DRAW: Defending Camera-shooted RAW against Image Manipulation

  • paper_url: http://arxiv.org/abs/2307.16418
  • repo_url: None
  • paper_authors: Xiaoxiao Hu, Qichao Ying, Zhenxing Qian, Sheng Li, Xinpeng Zhang
  • for: 保护图像免受修改和欺诈。
  • methods: 利用多频部分融合网络(MPF-Net)和隐藏水印技术,将修改和欺诈信息写入RAW数据中。
  • results: 对多个著名RAW数据集进行了广泛的实验,并达到了高度的鲁棒性和精度。
    Abstract RAW files are the initial measurement of scene radiance widely used in most cameras, and the ubiquitously-used RGB images are converted from RAW data through Image Signal Processing (ISP) pipelines. Nowadays, digital images are risky of being nefariously manipulated. Inspired by the fact that innate immunity is the first line of body defense, we propose DRAW, a novel scheme of defending images against manipulation by protecting their sources, i.e., camera-shooted RAWs. Specifically, we design a lightweight Multi-frequency Partial Fusion Network (MPF-Net) friendly to devices with limited computing resources by frequency learning and partial feature fusion. It introduces invisible watermarks as protective signal into the RAW data. The protection capability can not only be transferred into the rendered RGB images regardless of the applied ISP pipeline, but also is resilient to post-processing operations such as blurring or compression. Once the image is manipulated, we can accurately identify the forged areas with a localization network. Extensive experiments on several famous RAW datasets, e.g., RAISE, FiveK and SIDD, indicate the effectiveness of our method. We hope that this technique can be used in future cameras as an option for image protection, which could effectively restrict image manipulation at the source.
    摘要 原始的RAW文件是现场辐射广泛使用的初始测量数据,而通用的RGB图像则是将RAW数据转换成through Image Signal Processing(ISP)管道。然而,在数字图像成为常用的现象以来,digital images have become susceptible to malicious manipulation. 以体内免疫系统为灵感,我们提出了一种新的图像防 manipulation 方案,即保护图像的来源,即摄像机拍摄的RAW数据。特别是,我们设计了一种轻量级的多频部分融合网络(MPF-Net),该网络适合具有有限的计算资源的设备,通过频率学习和部分特征融合来实现轻量级的性能。MPF-Net在RAW数据中引入不可见的水印,以保护图像免受辐射、压缩等后处理操作的攻击。如果图像被修改,我们可以使用一个本地化网络来准确地定位forge areas。我们在许多知名的RAW数据集,例如RAISE、FiveK和SIDD上进行了广泛的实验,结果表明我们的方法的有效性。我们希望这种技术可以在未来的摄像机中作为图像保护的选项,以防止图像修改在源头级别。

MRA-GNN: Minutiae Relation-Aware Model over Graph Neural Network for Fingerprint Embedding

  • paper_url: http://arxiv.org/abs/2307.16416
  • repo_url: None
  • paper_authors: Yapeng Su, Tong Zhao, Zicheng Zhang
  • for: 本研究旨在提高Automated Fingerprint Identification Systems中的指纹嵌入,使用Graph Neural Network (GNN)模型来利用指纹非结构数据,如指纹 topology和相关性,以提高嵌入的可识别性和稳定性。
  • methods: 我们提出了一种新的指纹嵌入方法,即基于图神经网络的细节点关系感知模型(MRA-GNN)。MRA-GNN 将指纹的细节点及其相对连接分别视为顶点和边,用细节点图和指纹图表示指纹的拓扑关系与相关性结构,并通过拓扑关系推理模块(TRM)和相关性感知模块(CAM)学习指纹嵌入。为缓解 GNN 模型中的过平滑问题,我们还在 MRA-GNN 中引入了前馈模块和图残差连接。
  • results: 我们的实验结果表明,MRA-GNN比前些state-of-the-art方法在多个指纹数据集上表现更好, indicating that our approach can effectively exploit the nonstructural information of fingerprints.
    Abstract Deep learning has achieved remarkable results in fingerprint embedding, which plays a critical role in modern Automated Fingerprint Identification Systems. However, previous works including CNN-based and Transformer-based approaches fail to exploit the nonstructural data, such as topology and correlation in fingerprints, which is essential to facilitate the identifiability and robustness of embedding. To address this challenge, we propose a novel paradigm for fingerprint embedding, called Minutiae Relation-Aware model over Graph Neural Network (MRA-GNN). Our proposed approach incorporates a GNN-based framework in fingerprint embedding to encode the topology and correlation of fingerprints into descriptive features, achieving fingerprint representation in the form of graph embedding. Specifically, we reinterpret fingerprint data and their relative connections as vertices and edges respectively, and introduce a minutia graph and fingerprint graph to represent the topological relations and correlation structures of fingerprints. We equip MRA-GNN with a Topological relation Reasoning Module (TRM) and Correlation-Aware Module (CAM) to learn the fingerprint embedding from these graphs successfully. To tackle the over-smoothing problem in GNN models, we incorporate Feed-Forward Module and graph residual connections into proposed modules. The experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on various fingerprint datasets, indicating the effectiveness of our approach in exploiting nonstructural information of fingerprints.
    摘要 深度学习在指纹嵌入中取得了杰出的成果,这对现代自动指纹识别系统plays a critical role。然而,过去的方法,包括CNN和Transformer的方法,失去了非结构数据,如指纹图形和指纹之间的相关性,这些数据对实现嵌入的可识别性和稳定性至关重要。为解决这个挑战,我们提出了一种新的嵌入模型,called Minutiae Relation-Aware model over Graph Neural Network (MRA-GNN)。我们的提议方法将指纹嵌入编码为图形特征,通过在GNN基础上的框架来表示指纹图形和相关性结构。 Specifically,我们将指纹数据和其相对连接看作为顶点和边分别,并将指纹图形和指纹相关结构表示为指纹图和指纹图。我们在MRA-GNN中引入Topological relation Reasoning Module (TRM)和Correlation-Aware Module (CAM)来学习指纹嵌入。为解决GNN模型中的过拟合问题,我们将Feed-Forward Module和图 residual connections incorporated into proposed modules。实验结果表明,我们的提议方法在多个指纹数据集上比州前方法表现出色,这表明我们的方法可以成功地利用指纹非结构数据。

DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2307.16415
  • repo_url: https://github.com/xiaojuntang22/iccv2023-ddgnet
  • paper_authors: Xiaojun Tang, Junsong Fan, Chuanchen Luo, Zhaoxiang Zhang, Man Zhang, Zongyuan Yang
  • for: 强度监督的时间动作地图标识 (WTAL) 是一个实用又挑战的任务。大规模数据集的存在使得大多数现有方法使用其他数据集中预训网络提取特征,但这些特征并不适合WTAL。
  • methods: 为了解决这个问题,研究人员设计了多个模组来强化特征,其中包括时间关联模组,对于本地化模组的性能提高。然而,所有的模组都忽略了ambiguous信息的不良影响,这会导致其他特征的减少可识别性。
  • results: 我们提出了Discriminability-Driven Graph Network (DDG-Net),它明确地表示ambiguous snippet和特征可识别的 snippet之间的关联,避免了ambiguous信息的传递,并提高了对单位特征的可识别性。此外,我们提出了特征一致损失,以防止特征的融合和导引几何网络生成更加特征化的表示。实验结果显示DDG-Net在THUMOS14和ActivityNet1.2标准 benchmark上实现了新的州Of-The-Art结果,证明了DDG-Net的效果。代码可以在 \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet} 上获得。
    Abstract Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Due to large-scale datasets, most existing methods use a network pretrained in other datasets to extract features, which are not suitable enough for WTAL. To address this problem, researchers design several modules for feature enhancement, which improve the performance of the localization module, especially modeling the temporal relationship between snippets. However, all of them neglect the adverse effects of ambiguous information, which would reduce the discriminability of others. Considering this phenomenon, we propose Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet}.
    摘要 《弱监督时间动作地标(WTAL)是一项实用又挑战性的任务。由于大规模数据集,大多数现有方法使用已经在其他数据集中预训练的网络提取特征,这些特征并不适合WTAL。为解决这个问题,研究人员设计了多个模块用于特征提高,这些模块可以提高地标模块的性能,特别是模elling时间关系 между片断。然而,所有这些方法均忽视了ambiguous信息的副作用,这会减少其他表示的可 distinguishability。 considering这种现象,我们提出了Discriminability-Driven Graph Network(DDG-Net),该网络Explicitly模型了ambiguous片断和 discriminative片断的Well-designed connections,防止了ambiguous信息的传递,并提高了片断级别表示的可 distinguishability。此外,我们还提出了特征一致损失,以防止特征的同化和驱动图 convolution网络生成更加 discriminative的表示。extensive experiments on THUMOS14和ActivityNet1.2 benchmarks表明DDG-Net的效果,创造了新的state-of-the-art result on both datasets。source code可以在 \url{https://github.com/XiaojunTang22/ICCV2023-DDGNet} 中找到。

RCS-YOLO: A Fast and High-Accuracy Object Detector for Brain Tumor Detection

  • paper_url: http://arxiv.org/abs/2307.16412
  • repo_url: https://github.com/mkang315/rcs-yolo
  • paper_authors: Ming Kang, Chee-Ming Ting, Fung Fung Ting, Raphaël C. -W. Phan
  • for: 脑肿瘤检测。
  • methods: 提出一种新的 YOLO 架构,即基于通道混洗的重参数化卷积 YOLO(RCS-YOLO);引入重参数化卷积(RCS)及其单次聚合模块(RCS-OSA),以提取更丰富的信息并降低耗时。
  • results: 在脑肿瘤数据集 Br35H 上,速度与精度均超过 YOLOv6、YOLOv7 和 YOLOv8;与 YOLOv7 相比,精度提高 2.6%,推理速度提升 60%(114.8 FPS),在脑肿瘤检测任务上达到最先进性能。
    Abstract With an excellent balance between speed and accuracy, cutting-edge YOLO frameworks have become one of the most efficient algorithms for object detection. However, the performance of using YOLO networks is scarcely investigated in brain tumor detection. We propose a novel YOLO architecture with Reparameterized Convolution based on channel Shuffle (RCS-YOLO). We present RCS and a One-Shot Aggregation of RCS (RCS-OSA), which link feature cascade and computation efficiency to extract richer information and reduce time consumption. Experimental results on the brain tumor dataset Br35H show that the proposed model surpasses YOLOv6, YOLOv7, and YOLOv8 in speed and accuracy. Notably, compared with YOLOv7, the precision of RCS-YOLO improves by 2.6%, and the inference speed by 60% at 114.8 images detected per second (FPS). Our proposed RCS-YOLO achieves state-of-the-art performance on the brain tumor detection task. The code is available at https://github.com/mkang315/RCS-YOLO.
    摘要 凭借速度与精度之间的出色平衡,先进的 YOLO 框架已成为最高效的目标检测算法之一。然而,YOLO 网络在脑肿瘤检测中的表现尚未得到充分研究。我们提出了一种新的 YOLO 架构,即基于通道混洗的重参数化卷积 YOLO(RCS-YOLO),并提出了重参数化卷积(RCS)及其单次聚合模块(RCS-OSA),将特征级联与计算效率联系起来,以提取更丰富的信息并降低耗时。在脑肿瘤数据集 Br35H 上的实验结果显示,所提模型在速度和精度上均超过 YOLOv6、YOLOv7 和 YOLOv8;与 YOLOv7 相比,RCS-YOLO 的精度提高 2.6%,推理速度提升 60%,达到每秒检测 114.8 张图像(FPS)。我们提出的 RCS-YOLO 在脑肿瘤检测任务上达到了最先进的性能。代码可在 https://github.com/mkang315/RCS-YOLO 获取。
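
The channel-shuffle operation at the heart of the RCS block can be written in a few lines of PyTorch: split channels into groups and interleave them so information mixes across groups. The group count below is an arbitrary example.

```python
# Minimal channel-shuffle operation of the kind used in shuffle-based blocks:
# split channels into groups and interleave them across groups.
import torch

def channel_shuffle(x, groups=2):
    b, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(b, groups, c // groups, h, w)   # (B, g, C/g, H, W)
    x = x.transpose(1, 2).contiguous()         # swap group and per-group channel dims
    return x.view(b, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# channels [0..3 | 4..7] become interleaved: [0, 4, 1, 5, 2, 6, 3, 7]
```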

HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.16410
  • repo_url: None
  • paper_authors: Minyi Zhao, Yi Xu, Bingjia Li, Jie Wang, Jihong Guan, Shuigeng Zhou
  • for: 提高文本识别率,solve the problem of low-resolution scene images affecting text recognition.
  • methods: 提出了一种新的STISR框架,called High-Resolution ENhancement(HiREN),which consists of two branches and a quality estimation module. The first branch is used to recover LR images, and the other is an HR quality enhancement branch that aims to generate HQ text images based on HR images.
  • results: 在TextZoom dataset上进行了广泛的实验,结果表明HiREN可以与大多数现有的STISR方法结合使用,并显著提高它们的性能。
    Abstract Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN) that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradation, and exploit the feedback from a recognizer and text-level annotations as weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress the erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.
    摘要 Scene文本图像超解像(STISR)是识别文本从低分辨率Scene图像前置处理技术中的重要一环。目前,许多方法已经被提出来提取高分辨率(HR)图像中特有的文本信息,以供STISR模型训练。然而,由于手动拍摄HR图像的因素(如摄影设备、 фокус和环境)的不可控,HR图像的质量无法保证,这会不可避免地影响STISR性能。 observe到HR图像质量问题,在这篇论文中,我们提出了一个新的想法,即首先提高HR图像的质量,然后使用提高后的HR图像作为STISR模型训练的超vision。具体来说,我们开发了一个新的STISR框架,called High-Resolution ENhancement(HiREN),它包括两个分支和一个质量评估模块。第一个分支是用于恢复低分辨率(LR)图像,另一个是一个HR质量提高分支,旨在基于HR图像生成高质量(HQ)文本图像,以供更准确的超vision。由于HR到HQ的降低可能有多种,而且没有像素级supervision的HR图像生成,我们设计了一个核心准导提高网络,用于处理多种降低,并利用recognizer和文本级别的回归信号作为弱supervision信号来训练HR增强分支。然后,我们使用质量评估模块评估HQ图像的质量,并将其用于抑制错误的超vision信息。extensive experiments show that HiREN can work well with most existing STISR methods and significantly boost their performances.
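
The quality-weighting idea above, where unreliable HQ targets contribute less to training, can be sketched as a per-image weighted loss. The exact weighting scheme below (batch-normalized quality scores) is an assumption for illustration.

```python
# Sketch of quality-weighted supervision: per-image losses against the enhanced
# HQ targets are down-weighted when the quality-estimation module deems the
# target unreliable. The weighting form is an illustrative assumption.
import torch
import torch.nn.functional as F

def quality_weighted_l1(sr_pred, hq_target, quality_score):
    # sr_pred, hq_target: (B, C, H, W); quality_score: (B,) in [0, 1]
    per_image = F.l1_loss(sr_pred, hq_target, reduction="none").mean(dim=(1, 2, 3))
    weights = quality_score / (quality_score.sum() + 1e-8)   # normalise over the batch
    return (weights * per_image).sum()

pred = torch.rand(4, 3, 32, 128)
target = torch.rand(4, 3, 32, 128)
quality = torch.tensor([0.9, 0.2, 0.7, 0.95])   # e.g. from a recognizer-based estimator
print(quality_weighted_l1(pred, target, quality).item())
```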

Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

  • paper_url: http://arxiv.org/abs/2307.16399
  • repo_url: None
  • paper_authors: Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin
  • for: 这个研究旨在生成具有具体类型和情感的图像或影片描述,以增加它们的吸引力和情感适宜性。
  • methods: 我们提出了一个名为FS-StyleCap的框架,它使用一个可条件预测语言模型和一个可视投影模组。我们采用了两步训练方案:首先,我们将训练一个类型抽象器,以生成不同类型的概率表示。然后,我们将这个抽象器免费化,让我们的预测器根据提取的类型特征和可视内容特征来生成具有欲要的类型的描述。
  • results: 我们的自动评估结果显示,我们的模型在几何扩展中表现出色,与已有的方法相比,并且在几何上具有较高的内在一致性。人工评价也证明了我们的模型可以处理多种类型。
    Abstract Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step training scheme proceeds as follows: first, we train a style extractor to generate style representations on an unlabeled text-only corpus. Then, we freeze the extractor and enable our decoder to generate stylized descriptions based on the extracted style vector and projected visual content vectors. During inference, our model can generate desired stylized captions by deriving the style representation from user-supplied examples. Our automatic evaluation results for few-shot sentimental visual captioning outperform state-of-the-art approaches and are comparable to models that are fully trained on labeled style corpora. Human evaluations further confirm our model s ability to handle multiple styles.
    摘要 风格化视觉描述旨在生成具有特定风格的图像或视频描述,使其更具吸引力且在情感上更为贴切。该任务的一大挑战是缺乏与视觉内容配对的风格化描述,因此现有工作大多采用不依赖平行数据集的无监督方法。然而,这些方法仍需使用足量带风格标签的样本进行训练,且生成的描述局限于预定义的风格。为突破这些限制,我们研究了少样本风格化视觉描述问题,其目标是在推理阶段仅以少量示例作为引导、无需额外训练,即可生成任意期望风格的描述。我们为此提出了名为 FS-StyleCap 的框架,它由条件编码器-解码器语言模型和视觉投影模块构成。我们采用两步训练方案:首先,在无标签的纯文本语料上训练一个风格提取器,用于生成风格表示;随后冻结该提取器,让解码器基于提取的风格向量和投影后的视觉内容向量生成风格化描述。在推理时,模型可以从用户提供的示例中得到风格表示,进而生成所需风格的描述。自动评估结果表明,我们的方法在少样本情感化视觉描述上优于现有最佳方法,并可与完全在带风格标签语料上训练的模型相媲美;人工评估进一步证实了模型处理多种风格的能力。

JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery

  • paper_url: http://arxiv.org/abs/2307.16377
  • repo_url: https://github.com/xljh0520/jotr
  • paper_authors: Jiahao Li, Zongxin Yang, Xiaohan Wang, Jianxin Ma, Chang Zhou, Yi Yang
  • for: 在遮挡条件下从单张图像恢复 3D 人体网格。
  • methods: 提出基于 Transformer 的 3D 关节对比学习框架(JOTR),包含编码器-解码器 Transformer 架构和一种新颖的 3D 关节对比学习方法。
  • results: 在遮挡专用基准和标准基准上均超越现有最佳方法,显著改善被遮挡人体的重建效果。
    Abstract In this study, we focus on the problem of 3D human mesh recovery from a single image under obscured conditions. Most state-of-the-art methods aim to improve 2D alignment technologies, such as spatial averaging and 2D joint sampling. However, they tend to neglect the crucial aspect of 3D alignment by improving 3D representations. Furthermore, recent methods struggle to separate the target human from occlusion or background in crowded scenes as they optimize the 3D space of target human with 3D joint coordinates as local supervision. To address these issues, a desirable method would involve a framework for fusing 2D and 3D features and a strategy for optimizing the 3D space globally. Therefore, this paper presents 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery. Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations for achieving 2D$\&$3D aligned results in a coarse-to-fine manner and a novel 3D joint contrastive learning approach for adding explicitly global supervision for the 3D feature space. The contrastive learning approach includes two contrastive losses: joint-to-joint contrast for enhancing the similarity of semantically similar voxels (i.e., human joints), and joint-to-non-joint contrast for ensuring discrimination from others (e.g., occlusions and background). Qualitative and quantitative analyses demonstrate that our method outperforms state-of-the-art competitors on both occlusion-specific and standard benchmarks, significantly improving the reconstruction of occluded humans.
    摘要 在这项研究中,我们关注在单张图像下面临遮盲的3D人体 mesh 恢复问题。大多数当前的方法都是提高2D对齐技术,如空间平均和2D关节采样。然而,它们往往忽略了3D对齐的重要性,而且最近的方法在拥挤场景中很难分别人体和遮挡物或背景,因为它们在3D空间中优化目标人体的3D关节坐标作为本地监督。为解决这些问题,一个理想的方法应该包括一个混合2D和3D特征的框架,以及一种全球化优化3D空间的策略。因此,这篇论文提出了基于转换器的3D JOint contrastive learning(JOTR)框架,用于处理遮盲3D人体 mesh 恢复问题。我们的方法包括一个编码器-解码器转换器架构,用于在粗细化到细化的方式下混合2D和3D表示,以及一种新的3D关节对比学习方法,用于在3D特征空间中添加显式全球化监督。对比学习方法包括两种对比损失:关节到关节对比,用于提高相似的人体关节之间的相似性,以及关节到非关节对比,用于确保与其他物体(例如遮挡物和背景)的区分。qualitative和quantitative分析表明,我们的方法在遮盲人体 benchmark 上表现出色,与当前的竞争对手相比,显著提高了遮盲人体的重建。
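
The two contrastive terms described above can be sketched as: pull features of the same joint together, and push joint features away from occlusion/background features. The InfoNCE-style formulation and temperature below are illustrative assumptions rather than the paper's exact losses.

```python
# Sketch of the two contrastive terms: joint-to-joint contrast pulls features of
# the same semantic joint together; joint-to-non-joint contrast pushes joint
# features away from occluder/background features. Formulation is assumed.
import torch
import torch.nn.functional as F

def joint_contrastive_losses(joint_feats, joint_ids, nonjoint_feats, tau=0.1):
    # joint_feats: (N, D) features assigned to joints; joint_ids: (N,) joint labels
    # nonjoint_feats: (M, D) features from occlusion / background regions
    j = F.normalize(joint_feats, dim=-1)
    n = F.normalize(nonjoint_feats, dim=-1)

    sim_jj = j @ j.t() / tau                                    # (N, N)
    same = (joint_ids[:, None] == joint_ids[None, :]).float()
    same.fill_diagonal_(0)                                      # exclude self-pairs
    # joint-to-joint: positives are other samples of the same joint
    log_prob = sim_jj - torch.logsumexp(sim_jj, dim=1, keepdim=True)
    jj_loss = -(same * log_prob).sum() / same.sum().clamp(min=1)

    # joint-to-non-joint: joints should be dissimilar to non-joint features
    jn_loss = (j @ n.t() / tau).exp().mean().log()
    return jj_loss, jn_loss

feats = torch.randn(24, 64)
ids = torch.randint(0, 14, (24,))    # 14 joints, several samples each
bg = torch.randn(40, 64)
print([l.item() for l in joint_contrastive_losses(feats, ids, bg)])
```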

MobileVidFactory: Automatic Diffusion-Based Social Media Video Generation for Mobile Devices from Text

  • paper_url: http://arxiv.org/abs/2307.16371
  • repo_url: None
  • paper_authors: Junchen Zhu, Huan Yang, Wenjing Wang, Huiguo He, Zixi Tuo, Yongsheng Yu, Wen-Huang Cheng, Lianli Gao, Jingkuan Song, Jianlong Fu, Jiebo Luo
  • for: 用于自动生成高质量垂直手机视频,让用户只需提供简单的文本来创建视频。
  • methods: 使用预训练的图像扩散模型,并对其进行修改,以生成高质量的开源频域垂直视频生成器。对音频,我们从我们的大型数据库中提取合适的背景声音。此外,我们还允许用户添加特定的屏幕文本来增强视觉表达,并选择自定义的语音。
  • results: 通过我们的系统,用户可以轻松地创建高质量的垂直手机视频,无需特殊的技术知识或专业技能。
    Abstract Videos for mobile devices become the most popular access to share and acquire information recently. For the convenience of users' creation, in this paper, we present a system, namely MobileVidFactory, to automatically generate vertical mobile videos where users only need to give simple texts mainly. Our system consists of two parts: basic and customized generation. In the basic generation, we take advantage of the pretrained image diffusion model, and adapt it to a high-quality open-domain vertical video generator for mobile devices. As for the audio, by retrieving from our big database, our system matches a suitable background sound for the video. Additionally to produce customized content, our system allows users to add specified screen texts to the video for enriching visual expression, and specify texts for automatic reading with optional voices as they like.
    摘要 mobile devices 的视频成为最近最受欢迎的信息分享和获取方式。为了便利用户创建,在这篇论文中,我们提出了一个系统,即 MobileVidFactory,可以自动生成高质量的垂直式移动视频,只需要用户提供简单的文本。我们的系统包括两部分:基本生成和个性化生成。在基本生成部分,我们利用预训练的图像扩散模型,并将其适应为高质量的开源频段 vertical video 生成器。对于音频,我们从我们大型数据库中提取合适的背景音乐,并将其匹配到视频中。此外,为了生成个性化内容,我们的系统允许用户添加自定义的屏幕文本,以激发视觉表达,并选择自定义的语音和读音。

Workshop on Document Intelligence Understanding

  • paper_url: http://arxiv.org/abs/2307.16369
  • repo_url: None
  • paper_authors: Soyeon Caren Han, Yihao Ding, Siwen Luo, Josiah Poon, HeeGuen Yoon, Zhe Huang, Paul Duuring, Eun Jung Holden
  • for: 本研讨会的目的是推动自动文档处理和理解技术的发展,以满足不同领域(如商业、法律和医学)中大量文档处理中的效率提升。
  • methods: 本研讨会将吸引来自文档智能和理解领域的研究人员和产业开发者,以推动多种文档类型的自动处理和理解技术的发展。
  • results: 本研讨会还发布了基于PDFVQA数据集的文档答案挑战,以测试提出的模型在全文档水平的结构和上下文理解能力。这种任务可以帮助提升文档理解步骤,从单页水平升级到全文档水平理解。
    Abstract Document understanding and information extraction include different tasks to understand a document and extract valuable information automatically. Recently, there has been a rising demand for developing document understanding among different domains, including business, law, and medicine, to boost the efficiency of work that is associated with a large number of documents. This workshop aims to bring together researchers and industry developers in the field of document intelligence and understanding diverse document types to boost automatic document processing and understanding techniques. We also released a data challenge on the recently introduced document-level VQA dataset, PDFVQA. The PDFVQA challenge examines the structural and contextual understandings of proposed models on the natural full document level of multiple consecutive document pages by including questions with a sequence of answers extracted from multi-pages of the full document. This task helps to boost the document understanding step from the single-page level to the full document level understanding.
    摘要 文档理解和信息提取包括不同任务来理解文档并自动提取有价值信息。最近,在不同领域,如商业、法律和医学等领域,有增加文档理解的需求,以提高大量文档相关的工作效率。这场工作室将帮助研究人员和行业开发人员在文档智能和理解多种文档类型中提高自动文档处理和理解技术。我们还发布了基于最近引入的文档级VQA数据集PDFVQA的数据挑战。PDFVQA挑战测试提出的模型在全文档水平上的结构和文本上下文理解能力,通过从多页全文档中提取多个答案序列来检验模型的文档理解能力。这个任务可以帮助提高文档理解的步骤,从单页水平提升到全文档水平的理解。

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

  • paper_url: http://arxiv.org/abs/2307.16368
  • repo_url: None
  • paper_authors: Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun
  • for: 本文关注长期动作预测(LTA)任务,即根据视频观察以动词-名词序列的形式预测行为者未来的行为,目标是改善人机交互。
  • methods: 提出名为 AntGPT 的两阶段框架,利用大语言模型(LLM)辅助 LTA:第一阶段识别观察视频中已完成的动作,第二阶段让 LLM 通过条件生成预测未来动作,或推断行为者的目标并规划整个过程。
  • results: 在 Ego4D LTA v1/v2、EPIC-Kitchens-55 和 EGTEA GAZE+ 等基准上取得最先进性能;定性分析表明 AntGPT 能成功推断目标,并进行以目标为条件的“反事实”预测。
    Abstract Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. It can help provide the prior knowledge on the possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all above benchmarks, and can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction via qualitative analysis. Code and model will be released at https://brown-palm.github.io/AntGPT
    摘要 如果知道演员当前动作(例如打蛋)之后通常会发生什么,我们能否更好地预测其未来动作(例如搅拌鸡蛋)?如果还知道演员的长期目标(例如做蛋炒饭)呢?长期动作预测(LTA)任务旨在从视频观察中以动词和名词序列的形式预测演员的未来行为,对人机交互至关重要。我们提出从两个角度来建模 LTA 任务:一是自下而上的方法,通过建模时间动态自回归地预测下一个动作;二是自上而下的方法,先推断演员的目标,再规划完成该目标所需的过程。我们假设在过程性文本数据(如菜谱、how-to)上预训练的大型语言模型(LLM)可以从这两个角度帮助 LTA:既能提供可能的下一步动作的先验知识,也能根据已观察到的部分过程推断目标。为利用 LLM,我们提出了两阶段框架 AntGPT:先识别视频中已完成的动作,再让 LLM 通过条件生成预测未来动作,或通过 chain-of-thought 提示推断目标并规划整个过程。在 Ego4D LTA v1 和 v2、EPIC-Kitchens-55 以及 EGTEA GAZE+ 上的实验结果表明了该方法的有效性:AntGPT 在上述所有基准上均取得了 state-of-the-art 性能,并能成功推断目标,从而在定性分析中实现目标条件下的"counterfactual"预测。代码和模型将在 https://brown-palm.github.io/AntGPT 发布。
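The following is a minimal sketch of the two-stage idea described in the abstract: recognized actions are turned into a prompt and a language model is asked for the next actions. The `llm_complete` callable, the prompt wording, and the `fake_llm` stand-in are assumptions for illustration; they are not the authors' actual interface or prompts.

```python
from typing import Callable, List

def anticipate_actions(observed_actions: List[str],
                       llm_complete: Callable[[str], str],
                       num_future: int = 3) -> List[str]:
    """Ask a language model for the next likely (verb, noun) actions,
    conditioned on the actions already recognized in the video."""
    prompt = (
        "Observed actions so far: " + ", ".join(observed_actions) + ".\n"
        f"Predict the next {num_future} actions as a comma-separated list "
        "of 'verb noun' pairs."
    )
    reply = llm_complete(prompt)
    return [a.strip() for a in reply.split(",")][:num_future]

# toy stand-in for a real LLM call, just to show the data flow
def fake_llm(prompt: str) -> str:
    return "crack egg, mix egg, pour oil"

print(anticipate_actions(["wash pan", "take egg"], fake_llm))
```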

Multi-modal Graph Neural Network for Early Diagnosis of Alzheimer’s Disease from sMRI and PET Scans

  • paper_url: http://arxiv.org/abs/2307.16366
  • repo_url: None
  • paper_authors: Yanteng Zhanga, Xiaohai He, Yi Hao Chan, Qizhi Teng, Jagath C. Rajapakse
  • for: 这篇论文旨在利用深度学习模型和多模态的 sMRI 与 PET 影像数据,实现阿尔茨海默病(AD)的早期诊断。
  • methods: 本文提出使用图神经网络(GNN)处理多模态的 sMRI 和 PET 影像数据,并将其与受试者的表型信息(phenotypic information)相结合,以提升 AD 诊断性能。
  • results: 实验结果显示,与单模态方法相比,本文提出的多模态方法可以提升 AD 诊断性能,而将多模态数据与表型信息结合则可进一步提高诊断精度。
    Abstract In recent years, deep learning models have been applied to neuroimaging data for early diagnosis of Alzheimer's disease (AD). Structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) images provide structural and functional information about the brain, respectively. Combining these features leads to improved performance than using a single modality alone in building predictive models for AD diagnosis. However, current multi-modal approaches in deep learning, based on sMRI and PET, are mostly limited to convolutional neural networks, which do not facilitate integration of both image and phenotypic information of subjects. We propose to use graph neural networks (GNN) that are designed to deal with problems in non-Euclidean domains. In this study, we demonstrate how brain networks can be created from sMRI or PET images and be used in a population graph framework that can combine phenotypic information with imaging features of these brain networks. Then, we present a multi-modal GNN framework where each modality has its own branch of GNN and a technique is proposed to combine the multi-modal data at both the level of node vectors and adjacency matrices. Finally, we perform late fusion to combine the preliminary decisions made in each branch and produce a final prediction. As multi-modality data becomes available, multi-source and multi-modal is the trend of AD diagnosis. We conducted explorative experiments based on multi-modal imaging data combined with non-imaging phenotypic information for AD diagnosis and analyzed the impact of phenotypic information on diagnostic performance. Results from experiments demonstrated that our proposed multi-modal approach improves performance for AD diagnosis, and this study also provides technical reference and support the need for multivariate multi-modal diagnosis methods.
    摘要 近年来,深度学习模型已被应用于神经影像数据以实现阿尔茨海默病(AD)的早期诊断。结构磁共振成像(sMRI)和正电子发射断层扫描(PET)图像分别提供了大脑的结构信息和功能信息,将两者的特征结合所构建的预测模型,其性能优于仅使用单一模态。然而,目前基于 sMRI 和 PET 的多模态深度学习方法大多局限于卷积神经网络,难以整合图像特征与受试者的表型信息。我们提出使用专为非欧几里得域问题设计的图神经网络(GNN)。在本研究中,我们展示了如何由 sMRI 或 PET 图像构建脑网络,并将其用于人群图(population graph)框架中,从而把表型信息与这些脑网络的影像特征相结合。随后,我们提出一种多模态 GNN 框架,其中每个模态拥有各自的 GNN 分支,并提出了在节点向量和邻接矩阵两个层面融合多模态数据的技术;最后通过晚期融合(late fusion)整合各分支的初步决策,得到最终预测。随着多模态数据日益普及,多源多模态是 AD 诊断的发展趋势。我们基于多模态影像数据与非影像表型信息开展了探索性实验,并分析了表型信息对诊断性能的影响。实验结果表明,所提出的多模态方法提升了 AD 诊断性能,同时也为多变量多模态诊断方法提供了技术参考并支持其必要性。
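Below is a small numpy sketch, under simplifying assumptions, of the two ingredients named in the abstract: one graph-convolution step applied per modality branch on a shared population graph, followed by late fusion by averaging the branch decisions. The weight matrices, feature sizes, and random toy graph are illustrative, not the paper's architecture.

```python
import numpy as np

def gcn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution step: add self-loops, symmetrically normalize the
    adjacency, aggregate neighbors, project, apply ReLU."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(norm_adj @ feats @ weight, 0.0)

def late_fusion_predict(branch_logits: list) -> np.ndarray:
    """Average the per-branch decisions (one simple form of late fusion)."""
    return np.mean(np.stack(branch_logits), axis=0)

# toy population graph: 6 subjects, 16-d features per modality, 2 classes
rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) > 0.5).astype(float)
adj = np.maximum(adj, adj.T)                            # make it symmetric
logits = []
for _ in ("sMRI", "PET"):                               # one GNN branch per modality
    x = rng.normal(size=(6, 16))
    h = gcn_layer(adj, x, rng.normal(size=(16, 8)))
    logits.append(h @ rng.normal(size=(8, 2)))          # branch classifier head
print(late_fusion_predict(logits).shape)                # (6, 2)
```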

Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples

  • paper_url: http://arxiv.org/abs/2307.16361
  • repo_url: https://github.com/qiufan319/benchmark_pc_attack
  • paper_authors: Qiufan Ji, Lin Wang, Cong Shi, Shengshan Hu, Yingying Chen, Lichao Sun
  • for: 本研究旨在提高深度神经网络(DNN)在 3D 点云识别任务中的鲁棒性,因为这些任务在面对对抗样本(adversarial examples)时容易受到威胁。
  • methods: 我们首先建立了一个全面、严格的点云对抗鲁棒性评估基准,以便更好地理解各种防御和攻击方法的效果;然后收集现有的点云对抗防御技巧,并进行广泛的系统性实验,以确定这些技巧的有效组合;最后提出一种考虑多种类型点云对抗样本的混合训练增强方法,并将其用于对抗训练,以提高鲁棒性。
  • results: 我们的方法在多种攻击下保持 83.45% 的平均准确率,表明其能够支撑鲁棒的学习器。代码库见:\url{https://github.com/qiufan319/benchmark_pc_attack.git}。
    Abstract Deep Neural Networks (DNNs) for 3D point cloud recognition are vulnerable to adversarial examples, threatening their practical deployment. Despite the many research endeavors have been made to tackle this issue in recent years, the diversity of adversarial examples on 3D point clouds makes them more challenging to defend against than those on 2D images. For examples, attackers can generate adversarial examples by adding, shifting, or removing points. Consequently, existing defense strategies are hard to counter unseen point cloud adversarial examples. In this paper, we first establish a comprehensive, and rigorous point cloud adversarial robustness benchmark to evaluate adversarial robustness, which can provide a detailed understanding of the effects of the defense and attack methods. We then collect existing defense tricks in point cloud adversarial defenses and then perform extensive and systematic experiments to identify an effective combination of these tricks. Furthermore, we propose a hybrid training augmentation methods that consider various types of point cloud adversarial examples to adversarial training, significantly improving the adversarial robustness. By combining these tricks, we construct a more robust defense framework achieving an average accuracy of 83.45\% against various attacks, demonstrating its capability to enabling robust learners. Our codebase are open-sourced on: \url{https://github.com/qiufan319/benchmark_pc_attack.git}.
    摘要 用于 3D 点云识别的深度神经网络(DNN)容易受到对抗样本的攻击,威胁其实际部署。尽管近年来已有大量研究致力于解决这一问题,但 3D 点云上对抗样本的多样性使其比 2D 图像更难防御:攻击者可以通过添加、移动或删除点来生成对抗样本,因此现有防御策略难以应对未见过的点云对抗样本。本文首先建立了一个全面而严格的点云对抗鲁棒性基准,用以评估对抗鲁棒性,从而深入理解各种防御与攻击方法的效果;随后收集现有的点云对抗防御技巧,并进行大量系统性实验,以确定这些技巧的有效组合;此外,我们提出一种考虑多种类型点云对抗样本的混合训练增强方法并将其用于对抗训练,显著提升了对抗鲁棒性。通过组合这些技巧,我们构建了一个更鲁棒的防御框架,在多种攻击下取得 83.45% 的平均准确率,表明其能够支撑鲁棒的学习器。我们的代码已开源:\url{https://github.com/qiufan319/benchmark_pc_attack.git}。
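The snippet below is a rough sketch of the three perturbation types the abstract mentions (adding, shifting, removing points), which a hybrid training-augmentation scheme would need to cover. It uses random perturbations with assumed `ratio` and `sigma` values purely for illustration; the paper's adversarial training presumably uses optimized attacks rather than random noise.

```python
import numpy as np

def augment_point_cloud(points: np.ndarray, mode: str, rng: np.random.Generator,
                        ratio: float = 0.05, sigma: float = 0.01) -> np.ndarray:
    """Perturb an (N, 3) point cloud in one of the three ways attackers commonly
    use (adding, shifting, or removing points), so adversarial training sees all of them."""
    n = points.shape[0]
    k = max(1, int(ratio * n))
    if mode == "shift":
        return points + rng.normal(scale=sigma, size=points.shape)
    if mode == "add":
        extra = points[rng.integers(0, n, size=k)] + rng.normal(scale=sigma, size=(k, 3))
        return np.concatenate([points, extra], axis=0)
    if mode == "drop":
        keep = rng.permutation(n)[: n - k]
        return points[keep]
    raise ValueError(f"unknown mode: {mode}")

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
batch = [augment_point_cloud(cloud, m, rng) for m in ("shift", "add", "drop")]
print([p.shape for p in batch])
```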

Cardiac MRI Orientation Recognition and Standardization using Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.00615
  • repo_url: https://github.com/rxzhen/mscmr-orient
  • paper_authors: Ruoxuan Zhen
  • for: 本研究旨在通过深度学习方法实现心脏 MRI 的方位识别与标准化,以提升医学影像处理任务的效果。
  • methods: 本文提出了一种基于深度神经网络的方位识别与标准化方法,并采用迁移学习策略以适配多种 MRI 序列和模态。
  • results: 经过广泛的实验,在 bSSFP、T2 和 LGE 三种模态上的验证准确率分别达到 100.0%、100.0% 和 99.4%,表明模型的鲁棒性和有效性。
    Abstract Orientation recognition and standardization play a crucial role in the effectiveness of medical image processing tasks. Deep learning-based methods have proven highly advantageous in orientation recognition and prediction tasks. In this paper, we address the challenge of imaging orientation in cardiac MRI and present a method that employs deep neural networks to categorize and standardize the orientation. To cater to multiple sequences and modalities of MRI, we propose a transfer learning strategy, enabling adaptation of our model from a single modality to diverse modalities. We conducted comprehensive experiments on CMR images from various modalities, including bSSFP, T2, and LGE. The validation accuracies achieved were 100.0\%, 100.0\%, and 99.4\%, confirming the robustness and effectiveness of our model. Our source code and network models are available at https://github.com/rxzhen/MSCMR-orient
    摘要 方位识别与标准化在医学图像处理任务中起着关键作用,而基于深度学习的方法在方位识别与预测任务中已展现出显著优势。本文针对心脏 MRI 成像方位这一挑战,提出了一种利用深度神经网络对方位进行分类与标准化的方法。为适配 MRI 的多种序列与模态,我们提出了迁移学习策略,使模型能够从单一模态推广到多种模态。我们在 bSSFP、T2 和 LGE 等多种模态的 CMR 图像上进行了全面实验,验证准确率分别达到 100.0%、100.0% 和 99.4%,证实了模型的鲁棒性和有效性。源代码和网络模型见 https://github.com/rxzhen/MSCMR-orient
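As a hedged illustration of the orientation-classification-with-transfer-learning idea, the sketch below freezes a small CNN backbone trained on one modality and re-initializes only the classification head for another. The network size, the assumed 8 orientation classes, and the freeze-the-backbone strategy are my assumptions, not the released model.

```python
import torch
import torch.nn as nn

class OrientationNet(nn.Module):
    """Tiny CNN that classifies an MRI slice into one of several in-plane orientations."""
    def __init__(self, num_orientations: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_orientations)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

def adapt_to_new_modality(model: OrientationNet) -> OrientationNet:
    """Transfer-learning step: freeze the backbone trained on one modality
    (e.g. bSSFP) and re-train only the classification head on another (e.g. LGE)."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    model.head.reset_parameters()
    return model

model = adapt_to_new_modality(OrientationNet())
logits = model(torch.randn(4, 1, 128, 128))   # batch of 4 single-channel slices
print(logits.shape)                            # torch.Size([4, 8])
```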

Self-Supervised Learning of Gait-Based Biomarkers

  • paper_url: http://arxiv.org/abs/2307.16321
  • repo_url: None
  • paper_authors: R. James Cotton, J. D. Peiffer, Kunal Shah, Allison DeLillo, Anthony Cimorelli, Shawana Anarwala, Kayan Abdou, Tasos Karakostas
  • for: 这个论文的目的是为了提取基于Markerless Motion Capture(MMC)的步态分析中最有价值的信息。
  • methods: 这篇论文使用自监督学习(SSL)技术,特别是对比学习和因果掩码(causal masking),来学习有用的步态表示。
  • results: 研究发现,在无标注步态数据上进行对比学习可以学到具有临床意义的步态表示,并有望用于诊断以及评估康复治疗的效果。
    Abstract Markerless motion capture (MMC) is revolutionizing gait analysis in clinical settings by making it more accessible, raising the question of how to extract the most clinically meaningful information from gait data. In multiple fields ranging from image processing to natural language processing, self-supervised learning (SSL) from large amounts of unannotated data produces very effective representations for downstream tasks. However, there has only been limited use of SSL to learn effective representations of gait and movement, and it has not been applied to gait analysis with MMC. One SSL objective that has not been applied to gait is contrastive learning, which finds representations that place similar samples closer together in the learned space. If the learned similarity metric captures clinically meaningful differences, this could produce a useful representation for many downstream clinical tasks. Contrastive learning can also be combined with causal masking to predict future timesteps, which is an appealing SSL objective given the dynamical nature of gait. We applied these techniques to gait analyses performed with MMC in a rehabilitation hospital from a diverse clinical population. We find that contrastive learning on unannotated gait data learns a representation that captures clinically meaningful information. We probe this learned representation using the framework of biomarkers and show it holds promise as both a diagnostic and response biomarker, by showing it can accurately classify diagnosis from gait and is responsive to inpatient therapy, respectively. We ultimately hope these learned representations will enable predictive and prognostic gait-based biomarkers that can facilitate precision rehabilitation through greater use of MMC to quantify movement in rehabilitation.
    摘要 无标记动作捕捉(MMC)让步态分析在临床环境中更易获得,正在革新临床步态分析,也随之带来一个问题:如何从步态数据中提取最具临床意义的信息。在从图像处理到自然语言处理的多个领域中,基于大量无标注数据的自监督学习(SSL)都能得到对下游任务非常有效的表示;然而,SSL 在步态与运动表示学习上的应用仍然有限,且尚未被用于基于 MMC 的步态分析。对比学习是尚未用于步态的 SSL 目标之一,它学习一种使相似样本在表示空间中彼此靠近的表示;如果学到的相似性度量能够刻画具有临床意义的差异,这将为许多下游临床任务提供有用的表示。对比学习还可以与因果掩码(causal masking)结合以预测未来时间步,考虑到步态的动态特性,这是一个颇具吸引力的 SSL 目标。我们将这些技术应用于一家康复医院中、来自多样化临床人群的 MMC 步态分析。结果表明,在无标注步态数据上进行对比学习能够学到刻画临床意义信息的表示。我们借助生物标志物的框架对该表示进行探查,显示其有望同时作为诊断型和疗效响应型生物标志物:它能够从步态中准确分类诊断,并对住院康复治疗产生响应。我们最终希望这些学习到的表示能够催生基于步态的预测性和预后性生物标志物,借助 MMC 更广泛地量化康复中的运动,从而促进精准康复。
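A minimal numpy sketch of the contrastive (InfoNCE) objective mentioned above is given below, treating `z1[i]` and `z2[i]` as two views of the same gait sequence. Batch size, embedding dimension, temperature, and how positive pairs are built are assumptions; the authors additionally combine this with causal masking, which is not shown.

```python
import numpy as np

def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """Contrastive (InfoNCE) loss between two batches of embeddings, where
    z1[i] and z2[i] come from the same gait sequence (positive pair) and all
    other rows are treated as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # positives sit on the diagonal

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(32, 64))   # embeddings of two views (e.g. two crops
emb_b = rng.normal(size=(32, 64))   # of the same walking bout)
print(info_nce_loss(emb_a, emb_b))
```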

Mask-guided Data Augmentation for Multiparametric MRI Generation with a Rare Hepatocellular Carcinoma

  • paper_url: http://arxiv.org/abs/2307.16314
  • repo_url: None
  • paper_authors: Karen Sanchez, Carlos Hinojosa, Kevin Arias, Henry Arguello, Denis Kouame, Olivier Meyrignac, Adrian Basarab
  • for: 这篇论文旨在通过数据增强提升深度学习模型在医学应用中的整体性能。
  • methods: 这篇论文使用名为 Pix2Pix 的生成式深度学习方法,将肝脏肿瘤掩膜和腹部轮廓作为输入,生成了共计 1,000 组多参数 MRI 合成三联图像及其对应的肿瘤掩膜。
  • results: 这篇论文得到的 Fréchet Inception Distance 分数为 86.55;该方法还是 2021 年数据增强挑战赛的获奖方案之一。
    Abstract Data augmentation is classically used to improve the overall performance of deep learning models. It is, however, challenging in the case of medical applications, and in particular for multiparametric datasets. For example, traditional geometric transformations used in several applications to generate synthetic images can modify in a non-realistic manner the patients' anatomy. Therefore, dedicated image generation techniques are necessary in the medical field to, for example, mimic a given pathology realistically. This paper introduces a new data augmentation architecture that generates synthetic multiparametric (T1 arterial, T1 portal, and T2) magnetic resonance images (MRI) of massive macrotrabecular subtype hepatocellular carcinoma with their corresponding tumor masks through a generative deep learning approach. The proposed architecture creates liver tumor masks and abdominal edges used as input in a Pix2Pix network for synthetic data creation. The method's efficiency is demonstrated by training it on a limited multiparametric dataset of MRI triplets from $89$ patients with liver lesions to generate $1,000$ synthetic triplets and their corresponding liver tumor masks. The resulting Frechet Inception Distance score was $86.55$. The proposed approach was among the winners of the 2021 data augmentation challenge organized by the French Society of Radiology.
    摘要 数据增强通常用于提升深度学习模型的整体性能,但在医学应用中、尤其是对多参数数据集而言颇具挑战。例如,许多应用中用于生成合成图像的传统几何变换可能以不符合实际的方式改变病人的解剖结构。因此,医学领域需要专门的图像生成技术,例如逼真地模拟特定病变。本文提出一种新的数据增强架构,通过生成式深度学习方法,生成大梁型(macrotrabecular-massive)肝细胞癌的多参数(动脉期 T1、门脉期 T1 和 T2)合成磁共振图像(MRI)及其对应的肿瘤掩膜。该架构先生成肝脏肿瘤掩膜和腹部轮廓,再将其作为 Pix2Pix 网络的输入用于合成数据生成。我们在来自 89 名肝脏病变患者的有限多参数 MRI 三联数据集上训练该方法,生成了 1,000 组合成三联图像及其对应的肝脏肿瘤掩膜,得到的 Fréchet Inception Distance 分数为 86.55。该方法是法国放射学会(French Society of Radiology)2021 年数据增强挑战赛的获奖方案之一。
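The sketch below only illustrates how the two conditioning inputs named in the abstract (tumor mask and abdominal edges) might be stacked into the input of a Pix2Pix-style generator. The channel order, the [-1, 1] scaling, and the toy shapes are assumptions, not the authors' preprocessing.

```python
import numpy as np

def build_condition(tumor_mask: np.ndarray, abdomen_edges: np.ndarray) -> np.ndarray:
    """Stack the binary liver-tumor mask and the abdominal edge map into one
    2-channel conditioning image for an image-to-image (Pix2Pix-style) generator."""
    assert tumor_mask.shape == abdomen_edges.shape
    cond = np.stack([tumor_mask, abdomen_edges], axis=0).astype(np.float32)
    return cond * 2.0 - 1.0   # scale {0, 1} maps to [-1, 1], as GANs usually expect

# toy 256x256 inputs: a square "tumor" and a rectangular "abdomen" contour
mask = np.zeros((256, 256)); mask[100:130, 120:150] = 1.0
edges = np.zeros((256, 256))
edges[20, 20:236] = edges[235, 20:236] = 1.0
edges[20:236, 20] = edges[20:236, 235] = 1.0
condition = build_condition(mask, edges)
print(condition.shape, condition.min(), condition.max())   # (2, 256, 256) -1.0 1.0
```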

Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2307.16309
  • repo_url: None
  • paper_authors: Wenqing Wang, Kaifeng Gao, Yawei Luo, Tao Jiang, Fei Gao, Jian Shao, Jianwen Sun, Jun Xiao
  • for: 提高视频内容中 predicate 的准确性,解决现有 VidSGG 方法在具有较少表达的 predicate 上表现不佳的问题。
  • methods: 提出了一种名为 Trico 的方法,通过挖掘三种互补的时空相关性来补充缺失的 predicate。
  • results: 在 VidVRD 和 VidOR 等最常用的 VidSGG 数据集上进行了广泛的实验,达到了 state-of-the-art 性能,特别是在尾部 predicate 上。
    Abstract Video-based scene graph generation (VidSGG) is an approach that aims to represent video content in a dynamic graph by identifying visual entities and their relationships. Due to the inherently biased distribution and missing annotations in the training data, current VidSGG methods have been found to perform poorly on less-represented predicates. In this paper, we propose an explicit solution to address this under-explored issue by supplementing missing predicates that should be appear in the ground-truth annotations. Dubbed Trico, our method seeks to supplement the missing predicates by exploring three complementary spatio-temporal correlations. Guided by these correlations, the missing labels can be effectively supplemented thus achieving an unbiased predicate predictions. We validate the effectiveness of Trico on the most widely used VidSGG datasets, i.e., VidVRD and VidOR. Extensive experiments demonstrate the state-of-the-art performance achieved by Trico, particularly on those tail predicates.
    摘要 基于视频的场景图生成(VidSGG)旨在通过识别视觉实体及其关系,将视频内容表示为动态图。由于训练数据分布天然偏斜且标注存在缺失,现有 VidSGG 方法在低频 predicate 上表现不佳。本文针对这一尚未充分探索的问题提出显式解决方案:补充本应出现在真值标注中却缺失的 predicate。我们的方法称为 Trico,通过挖掘三种互补的时空相关性来指导缺失标签的补充,从而实现无偏的 predicate 预测。我们在最常用的 VidSGG 数据集 VidVRD 和 VidOR 上验证了 Trico 的有效性,大量实验表明其达到了 state-of-the-art 性能,尤其是在尾部 predicate 上。
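As a loose, assumption-laden illustration of "supplementing missing predicates by exploiting correlations", the snippet below adds a predicate label whenever its conditional co-occurrence with an annotated predicate exceeds a threshold. The co-occurrence matrix, the threshold, and this simple heuristic are stand-ins; Trico's three spatio-temporal correlations are not specified in the abstract and are not reproduced here.

```python
import numpy as np

def supplement_predicates(labels: np.ndarray, cooccurrence: np.ndarray,
                          threshold: float = 0.7) -> np.ndarray:
    """Add plausible missing predicate labels to a multi-hot annotation:
    if predicate j strongly co-occurs with an already-annotated predicate i
    (P(j | i) >= threshold), mark j as positive too."""
    supplemented = labels.copy()
    for i in np.flatnonzero(labels):
        supplemented = np.maximum(
            supplemented, (cooccurrence[i] >= threshold).astype(labels.dtype))
    return supplemented

# toy setup: 5 predicate classes, conditional co-occurrence probabilities P(col | row)
cooc = np.array([
    [1.0, 0.9, 0.1, 0.0, 0.2],
    [0.8, 1.0, 0.0, 0.1, 0.0],
    [0.1, 0.0, 1.0, 0.75, 0.0],
    [0.0, 0.1, 0.7, 1.0, 0.0],
    [0.2, 0.0, 0.0, 0.0, 1.0],
])
annotation = np.array([1, 0, 0, 0, 0])           # only predicate 0 was labeled
print(supplement_predicates(annotation, cooc))   # predicate 1 gets supplemented
```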

Stylized Projected GAN: A Novel Architecture for Fast and Realistic Image Generation

  • paper_url: http://arxiv.org/abs/2307.16275
  • repo_url: None
  • paper_authors: Md Nurul Muttakin, Malik Shahid Sultan, Robert Hoehndorf, Hernando Ombao
  • for: 本研究使用生成对抗网络(GAN)生成数据,但在对抗设置下训练 GAN 是一项困难任务。
  • methods: 借鉴 Projected GAN,利用迁移学习将生成样本和真实样本投影到预训练的特征空间,以提高训练效率和稳定性。
  • results: 提出了一种优化的架构,即 Stylized Projected GAN,通过结合 StyleGAN 的映射网络和 FastGAN 的 Skip Layer Excitation,提升生成图像质量并缓解生成图像中的伪影。
    Abstract Generative Adversarial Networks are used for generating the data using a generator and a discriminator, GANs usually produce high-quality images, but training GANs in an adversarial setting is a difficult task. GANs require high computation power and hyper-parameter regularization for converging. Projected GANs tackle the training difficulty of GANs by using transfer learning to project the generated and real samples into a pre-trained feature space. Projected GANs improve the training time and convergence but produce artifacts in the generated images which reduce the quality of the generated samples, we propose an optimized architecture called Stylized Projected GANs which integrates the mapping network of the Style GANs with Skip Layer Excitation of Fast GAN. The integrated modules are incorporated within the generator architecture of the Fast GAN to mitigate the problem of artifacts in the generated images.
    摘要 生成对抗网络(GAN)通过生成器和判别器来生成数据,通常能产生高质量图像,但在对抗设置下训练 GAN 十分困难:GAN 需要大量计算资源以及超参数正则化才能收敛。Projected GAN 通过迁移学习将生成样本和真实样本投影到预训练的特征空间,从而缓解训练难度、缩短训练时间并加速收敛,但生成图像中会出现伪影,降低样本质量。我们提出一种优化的架构 Stylized Projected GANs,将 StyleGAN 的映射网络与 FastGAN 的 Skip Layer Excitation 相结合,并将这些模块集成到 FastGAN 的生成器架构中,以缓解生成图像中的伪影问题。
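Below is a sketch of a Skip-Layer-Excitation-style module in the spirit of FastGAN, which the proposed generator builds on: a low-resolution feature map gates the channels of a high-resolution one. Channel widths, the LeakyReLU slope, and how the StyleGAN mapping network would be attached are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SkipLayerExcitation(nn.Module):
    """Skip-Layer Excitation in the FastGAN style: a low-resolution feature map
    produces per-channel gates for a high-resolution one, giving the generator
    a cheap long-range skip connection."""
    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),
            nn.Conv2d(low_channels, high_channels, kernel_size=4),  # -> 1x1 spatial
            nn.LeakyReLU(0.1),
            nn.Conv2d(high_channels, high_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_high: torch.Tensor, feat_low: torch.Tensor) -> torch.Tensor:
        return feat_high * self.gate(feat_low)   # channel-wise gating

sle = SkipLayerExcitation(low_channels=256, high_channels=64)
high = torch.randn(2, 64, 128, 128)   # high-resolution generator feature
low = torch.randn(2, 256, 8, 8)       # low-resolution generator feature
print(sle(high, low).shape)           # torch.Size([2, 64, 128, 128])
```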

An objective validation of polyp and instrument segmentation methods in colonoscopy through Medico 2020 polyp segmentation and MedAI 2021 transparency challenges

  • paper_url: http://arxiv.org/abs/2307.16262
  • repo_url: https://github.com/georgebatch/kvasir-seg
  • paper_authors: Debesh Jha, Vanshali Sharma, Debapriya Banik, Debayan Bhattacharya, Kaushiki Roy, Steven A. Hicks, Nikhil Kumar Tomar, Vajira Thambawita, Adrian Krenzer, Ge-Peng Ji, Sahadev Poudel, George Batchkala, Saruar Alam, Awadelrahman M. A. Ahmed, Quoc-Huy Trinh, Zeshan Khan, Tien-Phat Nguyen, Shruti Shrestha, Sabari Nathan, Jeonghwan Gwak, Ritika K. Jha, Zheyuan Zhang, Alexander Schlaefer, Debotosh Bhattacharjee, M. K. Bhuyan, Pradip K. Das, Sravanthi Parsa, Sharib Ali, Michael A. Riegler, Pål Halvorsen, Ulas Bagci, Thomas De Lange
  • for: The paper promotes the development of efficient and transparent methods for automatic analysis of colonoscopy images, with the goal of improving the early detection of precancerous polyps.
  • methods: The paper combines deep learning techniques with transparency and interpretability analysis to evaluate the performance and credibility of various algorithms for polyp and surgical instrument segmentation.
  • results: The paper presents a comprehensive summary and analysis of the "Medico 2020" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions, highlights the strengths of the best-performing methods, discusses the possibility of translating such methods into the clinic, and encourages qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.
    Abstract Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed in private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, accessed each submission and evaluated the team based on open-source practices, failure case analysis, ablation studies, usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.
    摘要 结肠镜图像的自动分析一直是活跃的研究领域,其动力来自癌前息肉早期检出的重要性。然而,在实时检查中发现息肉可能很困难:内镜医生之间技能与经验的差异、注意力不集中以及疲劳等因素会导致较高的息肉漏检率。深度学习已成为应对这一挑战的有前景的方案,它可以实时协助内镜医生检测并分类被遗漏的息肉与异常。除算法准确率外,透明度与可解释性对于解释算法预测的缘由与方式同样至关重要。此外,多数算法基于私有数据、闭源或专有软件开发,方法缺乏可复现性。为推动高效且透明方法的发展,我们组织了 "Medico automatic polyp segmentation (Medico 2020)" 和 "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" 两项竞赛。本文对各参赛贡献进行了全面总结与分析,突出表现最佳方法的优势,并讨论了此类方法向临床转化的可能性。在透明度任务中,一个包含资深胃肠病专家在内的多学科团队审阅了每份提交,并从开源实践、失败案例分析、消融实验、评估的可用性与可理解性等方面对各团队进行评估,以深入了解模型用于临床部署的可信度。通过对竞赛的全面分析,我们不仅展示了息肉与手术器械分割方面的进展,也鼓励通过定性评估构建更透明、更易理解的基于 AI 的结肠镜系统。
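Since the abstract centers on benchmarking segmentation methods, the snippet below shows how the usual headline metrics for such challenges, Dice and IoU, can be computed for binary masks. Treating these as the challenges' official metrics is an assumption on my part.

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    """Dice similarity coefficient and intersection-over-union between two
    binary segmentation masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return float(dice), float(iou)

# toy example: two overlapping square "polyp" masks
a = np.zeros((64, 64), dtype=np.uint8); a[10:30, 10:30] = 1
b = np.zeros((64, 64), dtype=np.uint8); b[15:35, 15:35] = 1
print(dice_and_iou(a, b))
```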