cs.CV - 2023-09-13

Automated Assessment of Critical View of Safety in Laparoscopic Cholecystectomy

  • paper_url: http://arxiv.org/abs/2309.07330
  • repo_url: None
  • paper_authors: Yunfan Li, Himanshu Gupta, Haibin Ling, IV Ramakrishnan, Prateek Prasanna, Georgios Georgakis, Aaron Sasson
  • for: This study aims to develop deep-learning techniques that automatically assess the critical view of safety (CVS) in laparoscopic cholecystectomy.
  • methods: A two-stream semantic segmentation approach first generates two segmentation maps, which are fused; a region of interest is then estimated from the anatomical structures close to the gallbladder, and rule-based assessment determines whether each CVS criterion is satisfied.
  • results: The approach gains over 11.8% in mIoU on relevant classes compared with a single-model baseline, gains 1.84% in mIoU with the proposed Sobel loss compared with a Transformer-based baseline, and achieves up to 16% improvement on the individual CVS criteria and a 5% improvement (in balanced accuracy) on the overall CVS assessment.
    Abstract Cholecystectomy (gallbladder removal) is one of the most common procedures in the US, with more than 1.2M procedures annually. Compared with classical open cholecystectomy, laparoscopic cholecystectomy (LC) is associated with significantly shorter recovery period, and hence is the preferred method. However, LC is also associated with an increase in bile duct injuries (BDIs), resulting in significant morbidity and mortality. The primary cause of BDIs from LCs is misidentification of the cystic duct with the bile duct. Critical view of safety (CVS) is the most effective of safety protocols, which is said to be achieved during the surgery if certain criteria are met. However, due to suboptimal understanding and implementation of CVS, the BDI rates have remained stable over the last three decades. In this paper, we develop deep-learning techniques to automate the assessment of CVS in LCs. An innovative aspect of our research is on developing specialized learning techniques by incorporating domain knowledge to compensate for the limited training data available in practice. In particular, our CVS assessment process involves a fusion of two segmentation maps followed by an estimation of a certain region of interest based on anatomical structures close to the gallbladder, and then finally determination of each of the three CVS criteria via rule-based assessment of structural information. We achieved a gain of over 11.8% in mIoU on relevant classes with our two-stream semantic segmentation approach when compared to a single-model baseline, and 1.84% in mIoU with our proposed Sobel loss function when compared to a Transformer-based baseline model. For CVS criteria, we achieved up to 16% improvement and, for the overall CVS assessment, we achieved 5% improvement in balanced accuracy compared to DeepCVS under the same experiment settings.
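
A minimal sketch of the final, rule-based stage described above, assuming two per-class probability maps have already been produced by the two segmentation streams. The class names, fusion-by-averaging, and pixel-count thresholds are illustrative assumptions only, not the authors' actual rules.

```python
import numpy as np

# Hypothetical class indices for the fused segmentation map (illustrative only).
CLASSES = {"cystic_duct": 1, "cystic_artery": 2, "gallbladder": 3, "liver_bed": 4}

def fuse_maps(prob_a: np.ndarray, prob_b: np.ndarray) -> np.ndarray:
    """Fuse two per-class probability maps (C, H, W) by averaging, then take argmax."""
    return np.argmax((prob_a + prob_b) / 2.0, axis=0)

def assess_cvs(seg: np.ndarray, min_pixels: int = 200) -> dict:
    """Toy rule-based check of the three CVS criteria from a fused label map.

    The real criteria involve anatomical relationships; here each criterion is
    reduced to a simple visibility threshold purely to illustrate the rule-based
    structure of the pipeline.
    """
    visible = {name: int((seg == idx).sum()) >= min_pixels for name, idx in CLASSES.items()}
    criteria = {
        "two_structures_visible": visible["cystic_duct"] and visible["cystic_artery"],
        "hepatocystic_triangle_cleared": visible["liver_bed"],
        "lower_gallbladder_dissected": visible["gallbladder"] and visible["liver_bed"],
    }
    criteria["cvs_achieved"] = all(criteria.values())
    return criteria

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.random((5, 128, 128)), rng.random((5, 128, 128))
    print(assess_cvs(fuse_maps(a, b)))
```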

$\texttt{NePhi}$: Neural Deformation Fields for Approximately Diffeomorphic Medical Image Registration

  • paper_url: http://arxiv.org/abs/2309.07322
  • repo_url: None
  • paper_authors: Lin Tian, Soumyadip Sengupta, Hastings Greer, Raúl San José Estépar, Marc Niethammer
  • for: This work proposes a neural deformation model that yields approximately diffeomorphic transformations.
  • methods: The model represents deformations functionally, which reduces memory consumption during training and inference; this is particularly important for large volumetric registrations.
  • results: $\texttt{NePhi}$ is tested on two synthetic 2D datasets and on real 3D lung registration; it achieves accuracy similar to voxel-based representations in a single-resolution registration setting while using less memory and allowing faster instance optimization.
    Abstract This work proposes $\texttt{NePhi}$, a neural deformation model which results in approximately diffeomorphic transformations. In contrast to the predominant voxel-based approaches, $\texttt{NePhi}$ represents deformations functionally which allows for memory-efficient training and inference. This is of particular importance for large volumetric registrations. Further, while medical image registration approaches representing transformation maps via multi-layer perceptrons have been proposed, $\texttt{NePhi}$ facilitates both pairwise optimization-based registration $\textit{as well as}$ learning-based registration via predicted or optimized global and local latent codes. Lastly, as deformation regularity is a highly desirable property for most medical image registration tasks, $\texttt{NePhi}$ makes use of gradient inverse consistency regularization which empirically results in approximately diffeomorphic transformations. We show the performance of $\texttt{NePhi}$ on two 2D synthetic datasets as well as on real 3D lung registration. Our results show that $\texttt{NePhi}$ can achieve similar accuracies as voxel-based representations in a single-resolution registration setting while using less memory and allowing for faster instance-optimization.
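
A minimal sketch of what "representing deformations functionally" can look like: an MLP that maps a query coordinate plus a latent code to a warped coordinate, so memory does not scale with voxel count. Layer sizes and the latent dimension are assumptions; NePhi's gradient-inverse-consistency regularization is not reproduced here.

```python
import torch
import torch.nn as nn

class NeuralDeformationField(nn.Module):
    """MLP deformation field: (3D coordinate, latent code) -> warped 3D coordinate."""

    def __init__(self, latent_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) query points, z: (latent_dim,) global latent code.
        z = z.expand(coords.shape[0], -1)
        return coords + self.net(torch.cat([coords, z], dim=-1))  # coordinates plus displacement

field = NeuralDeformationField()
pts = torch.rand(1024, 3)   # arbitrary sample points, not a dense voxel grid
z = torch.zeros(64)         # latent code: predicted (learning-based) or optimized (pairwise)
print(field(pts, z).shape)  # torch.Size([1024, 3])
```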

Multi-Modal Hybrid Learning and Sequential Training for RGB-T Saliency Detection

  • paper_url: http://arxiv.org/abs/2309.07297
  • repo_url: None
  • paper_authors: Guangyu Ren, Jitesh Joshi, Youngjun Cho
  • for: This work aims to improve the accuracy of RGB-T saliency detection, addressing the limitations of existing methods that neglect the characteristics of cross-modal features and rely solely on network structures to fuse RGB and thermal features.
  • methods: A Multi-Modal Hybrid loss (MMHL) comprising supervised and self-supervised components is proposed: semantic features from different modalities are distinctly utilized in the supervised loss, while the self-supervised loss reduces the distance between RGB and thermal features. A Hybrid Fusion Module that considers both spatial and channel information is proposed to effectively fuse RGB and thermal features.
  • results: A sequential training strategy first trains on RGB images and then learns cross-modal features in a second stage, improving saliency detection performance without increasing computational overhead. Performance evaluation and ablation studies demonstrate superior performance compared with existing state-of-the-art methods.
    Abstract RGB-T saliency detection has emerged as an important computer vision task, identifying conspicuous objects in challenging scenes such as dark environments. However, existing methods neglect the characteristics of cross-modal features and rely solely on network structures to fuse RGB and thermal features. To address this, we first propose a Multi-Modal Hybrid loss (MMHL) that comprises supervised and self-supervised loss functions. The supervised loss component of MMHL distinctly utilizes semantic features from different modalities, while the self-supervised loss component reduces the distance between RGB and thermal features. We further consider both spatial and channel information during feature fusion and propose the Hybrid Fusion Module to effectively fuse RGB and thermal features. Lastly, instead of jointly training the network with cross-modal features, we implement a sequential training strategy which performs training only on RGB images in the first stage and then learns cross-modal features in the second stage. This training strategy improves saliency detection performance without computational overhead. Results from performance evaluation and ablation studies demonstrate the superior performance achieved by the proposed method compared with the existing state-of-the-art methods.
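
A small sketch of the two loss components described above, assuming per-modality saliency predictions and global feature embeddings are already available. The use of BCE for the supervised part and cosine distance for the self-supervised part are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multimodal_hybrid_loss(pred_rgb, pred_thermal, gt_saliency,
                           feat_rgb, feat_thermal, alpha: float = 0.5):
    """Supervised saliency loss per modality + self-supervised cross-modal alignment.

    pred_*: (B, 1, H, W) saliency logits, gt_saliency: (B, 1, H, W) in {0, 1},
    feat_*: (B, D) global features from each modality.
    """
    # Supervised part: each modality's semantic features are used distinctly.
    sup = (F.binary_cross_entropy_with_logits(pred_rgb, gt_saliency)
           + F.binary_cross_entropy_with_logits(pred_thermal, gt_saliency))
    # Self-supervised part: pull RGB and thermal features together (cosine distance).
    self_sup = (1.0 - F.cosine_similarity(feat_rgb, feat_thermal, dim=1)).mean()
    return sup + alpha * self_sup

B, D, H, W = 2, 256, 64, 64
loss = multimodal_hybrid_loss(torch.randn(B, 1, H, W), torch.randn(B, 1, H, W),
                              torch.randint(0, 2, (B, 1, H, W)).float(),
                              torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```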

GAN-based Algorithm for Efficient Image Inpainting

  • paper_url: http://arxiv.org/abs/2309.07293
  • repo_url: None
  • paper_authors: Zhengyang Han, Zehao Jiang, Yuan Ju
  • for: The COVID-19 pandemic has created a new challenge for facial recognition: faces covered by masks.
  • methods: Machine-learning-based image inpainting is used to complete the face originally covered by the mask. In particular, an autoencoder, which retains important general image features, is combined with the generative power of a generative adversarial network (GAN) in the form of context encoders.
  • results: Trained on 50,000 influencer face images, the model yields a solid result that still leaves room for improvement. The paper also discusses shortcomings of the model, possible improvements, areas of future study from an applicative perspective, and directions to further enhance and refine the model.
    Abstract The global pandemic caused by the spread of COVID-19 has posed a new challenge for facial recognition, as people have started to wear masks. Under such conditions, the authors consider utilizing machine learning in image inpainting to tackle the problem, by completing the possible face that is originally covered by the mask. In particular, the autoencoder has great potential for retaining important, general features of the image, complemented by the generative power of the generative adversarial network (GAN). The authors implement a combination of the two models, context encoders, explain how it combines the power of the two models, and train the model with 50,000 images of influencers' faces, yielding a solid result that still leaves room for improvement. Furthermore, the authors discuss some shortcomings of the model, their possible improvements, areas of study for future investigation from an applicative perspective, and directions to further enhance and refine the model.

Unbiased Face Synthesis With Diffusion Models: Are We There Yet?

  • paper_url: http://arxiv.org/abs/2309.07277
  • repo_url: None
  • paper_authors: Harrison Rosenberg, Shimaa Ahmed, Guruprasad V Ramesh, Ramya Korlakai Vinayak, Kassem Fawaz
  • for: This work investigates the efficacy and shortcomings of text-to-image diffusion models in the context of face generation.
  • methods: A combination of qualitative and quantitative measures, including embedding-based metrics and user studies, is used to audit the characteristics of generated face images conditioned on a set of social attributes.
  • results: Generated face images show limitations in faithfulness to the text prompt, demographic disparities, and distributional shifts. An analytical model is also presented that provides insight into how training data selection contributes to the performance of generative models.
    Abstract Text-to-image diffusion models have achieved widespread popularity due to their unprecedented image generation capability. In particular, their ability to synthesize and modify human faces has spurred research into using generated face images in both training data augmentation and model performance assessments. In this paper, we study the efficacy and shortcomings of generative models in the context of face generation. Utilizing a combination of qualitative and quantitative measures, including embedding-based metrics and user studies, we present a framework to audit the characteristics of generated faces conditioned on a set of social attributes. We applied our framework on faces generated through state-of-the-art text-to-image diffusion models. We identify several limitations of face image generation that include faithfulness to the text prompt, demographic disparities, and distributional shifts. Furthermore, we present an analytical model that provides insights into how training data selection contributes to the performance of generative models.

So you think you can track?

  • paper_url: http://arxiv.org/abs/2309.07268
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Derek Gloudemans, Gergely Zachár, Yanbing Wang, Junyi Ji, Matt Nice, Matt Bunting, William Barbour, Jonathan Sprinkle, Benedetto Piccoli, Maria Laura Delle Monache, Alexandre Bayen, Benjamin Seibold, Daniel B. Work
  • for: This paper provides a multi-camera tracking dataset for benchmarking the performance of tracking algorithms.
  • methods: Video was recorded by 234 overlapping HD cameras covering a 4.2-mile stretch of 8-10 lane interstate highway near Nashville, and GPS trajectories from 270 vehicle passes were combined with the video data to provide a set of ground-truth trajectories.
  • results: Initial benchmarking of tracking-by-detection algorithms on the video data yields a best HOTA of only 9.5% (best recall 75.9% at IoU 0.1, 47.9 average IDs per ground-truth object), indicating that the benchmarked trackers do not perform sufficiently well over the long temporal and spatial durations required for traffic scene understanding.
    Abstract This work introduces a multi-camera tracking dataset consisting of 234 hours of video data recorded concurrently from 234 overlapping HD cameras covering a 4.2 mile stretch of 8-10 lane interstate highway near Nashville, TN. The video is recorded during a period of high traffic density with 500+ objects typically visible within the scene and typical object longevities of 3-15 minutes. GPS trajectories from 270 vehicle passes through the scene are manually corrected in the video data to provide a set of ground-truth trajectories for recall-oriented tracking metrics, and object detections are provided for each camera in the scene (159 million total before cross-camera fusion). Initial benchmarking of tracking-by-detection algorithms is performed against the GPS trajectories, and a best HOTA of only 9.5% is obtained (best recall 75.9% at IOU 0.1, 47.9 average IDs per ground truth object), indicating the benchmarked trackers do not perform sufficiently well at the long temporal and spatial durations required for traffic scene understanding.

Automated segmentation of rheumatoid arthritis immunohistochemistry stained synovial tissue

  • paper_url: http://arxiv.org/abs/2309.07255
  • repo_url: https://github.com/amayags/ihc_synovium_segmentation
  • paper_authors: Amaya Gallagher-Syed, Abbas Khan, Felice Rivellese, Costantino Pitzalis, Myles J. Lewis, Gregory Slabaugh, Michael R. Barnes
  • for: This study develops a robust, repeatable automated segmentation algorithm to support researchers analysing synovial tissue samples from patients with this chronic autoimmune disease.
  • methods: A UNET is trained on R4RA, a hand-curated, multi-centre clinical dataset, so that it can handle multiple types of IHC staining and the variation present in data from different centres.
  • results: The model obtains a DICE score of 0.865 and successfully segments different types of IHC staining while handling variation in colour and intensity across clinical centres as well as common WSI artefacts. It can be used as the first step of an automated image analysis pipeline for IHC-stained synovial tissue, increasing speed, reproducibility and robustness.
    Abstract Rheumatoid Arthritis (RA) is a chronic, autoimmune disease which primarily affects the joint's synovial tissue. It is a highly heterogeneous disease, with wide cellular and molecular variability observed in synovial tissues. Over the last two decades, the methods available for their study have advanced considerably. In particular, Immunohistochemistry stains are well suited to highlighting the functional organisation of samples. Yet, analysis of IHC-stained synovial tissue samples is still overwhelmingly done manually and semi-quantitatively by expert pathologists. This is because in addition to the fragmented nature of IHC stained synovial tissue, there exist wide variations in intensity and colour, strong clinical centre batch effect, as well as the presence of many undesirable artefacts present in gigapixel Whole Slide Images (WSIs), such as water droplets, pen annotation, folded tissue, blurriness, etc. There is therefore a strong need for a robust, repeatable automated tissue segmentation algorithm which can cope with this variability and provide support to imaging pipelines. We train a UNET on a hand-curated, heterogeneous real-world multi-centre clinical dataset R4RA, which contains multiple types of IHC staining. The model obtains a DICE score of 0.865 and successfully segments different types of IHC staining, as well as dealing with variance in colours, intensity and common WSIs artefacts from the different clinical centres. It can be used as the first step in an automated image analysis pipeline for synovial tissue samples stained with IHC, increasing speed, reproducibility and robustness.
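
For reference, the DICE score reported above can be computed as follows for a binary tissue mask; the smoothing constant is a common convention, not taken from the paper.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: two overlapping square masks on a 100x100 tile.
a = np.zeros((100, 100)); a[20:60, 20:60] = 1
b = np.zeros((100, 100)); b[30:70, 30:70] = 1
print(dice_score(a, b))  # 2*900 / (1600 + 1600) ≈ 0.56
```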

Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

  • paper_url: http://arxiv.org/abs/2309.07254
  • repo_url: None
  • paper_authors: Chenghao Li, Dake Chen, Yuke Zhang, Peter A. Beerel
  • for: Mitigating the replication of training data by diffusion models in order to protect privacy.
  • methods: A generality score is introduced to measure the generality of training captions, and a large language model (LLM) is used to generalize the training captions; a dual fusion enhancement approach is then proposed to mitigate replication in diffusion models.
  • results: The proposed method reduces replication by 43.5% compared with the original diffusion model while maintaining the diversity and quality of generations.
    Abstract While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate' training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations.

  • paper_url: http://arxiv.org/abs/2309.07243
  • repo_url: None
  • paper_authors: Peter Hardy, Hansung Kim
  • for: recover 3D human poses from 2D kinematic skeletons
  • methods: lift-then-fill approach, custom sampling function, and independent lifting of skeleton parts
  • results: significantly more accurate results, improved stability and likelihood estimation, and consistent accuracy in scenarios without occlusion
    Abstract We present LInKs, a novel unsupervised learning method to recover 3D human poses from 2D kinematic skeletons obtained from a single image, even when occlusions are present. Our approach follows a unique two-step process, which involves first lifting the occluded 2D pose to the 3D domain, followed by filling in the occluded parts using the partially reconstructed 3D coordinates. This lift-then-fill approach leads to significantly more accurate results compared to models that complete the pose in 2D space alone. Additionally, we improve the stability and likelihood estimation of normalising flows through a custom sampling function replacing PCA dimensionality reduction previously used in prior work. Furthermore, we are the first to investigate if different parts of the 2D kinematic skeleton can be lifted independently which we find by itself reduces the error of current lifting approaches. We attribute this to the reduction of long-range keypoint correlations. In our detailed evaluation, we quantify the error under various realistic occlusion scenarios, showcasing the versatility and applicability of our model. Our results consistently demonstrate the superiority of handling all types of occlusions in 3D space when compared to others that complete the pose in 2D space. Our approach also exhibits consistent accuracy in scenarios without occlusion, as evidenced by a 7.9% reduction in reconstruction error compared to prior works on the Human3.6M dataset. Furthermore, our method excels in accurately retrieving complete 3D poses even in the presence of occlusions, making it highly applicable in situations where complete 2D pose information is unavailable.

Text-Guided Generation and Editing of Compositional 3D Avatars

  • paper_url: http://arxiv.org/abs/2309.07125
  • repo_url: https://github.com/HaoZhang990127/TECA
  • paper_authors: Hao Zhang, Yao Feng, Peter Kulits, Yandong Wen, Justus Thies, Michael J. Black
  • for: Generating high-quality 3D facial avatars, including hair and accessories, from a text description.
  • methods: A compositional model represents the head, face, and upper body with traditional 3D meshes and the hair, clothing, and accessories with neural radiance fields (NeRF), improving realism and enabling editing of the avatar's appearance.
  • results: The proposed Text-guided generation and Editing of Compositional Avatars (TECA) method generates more realistic 3D avatars from text descriptions than recent methods and supports editing of appearance features such as hairstyles, scarves, and other accessories.
    Abstract Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.

Tree-Structured Shading Decomposition

  • paper_url: http://arxiv.org/abs/2309.07122
  • repo_url: https://github.com/gcgeng/inv-shade-trees
  • paper_authors: Chen Geng, Hong-Xing Yu, Sharon Zhang, Maneesh Agrawala, Jiajun Wu
  • for: This work aims to infer a tree-structured shading representation of an object from a single image, enabling editing of object surface shading.
  • methods: A shade tree representation is proposed that combines basic shading nodes and compositing methods to factorize object surface shading; this representation allows novice users unfamiliar with the physical shading process to edit shading in an efficient and intuitive manner.
  • results: Experiments show that the hybrid inference approach effectively infers shade trees and applies across different kinds of images and appearance representations, enabling applications such as material editing, vectorized shading, and relighting.
    Abstract We study inferring a tree-structured representation from a single image for object shading. Prior work typically uses the parametric or measured representation to model shading, which is neither interpretable nor easily editable. We propose using the shade tree representation, which combines basic shading nodes and compositing methods to factorize object surface shading. The shade tree representation enables novice users who are unfamiliar with the physical shading process to edit object shading in an efficient and intuitive manner. A main challenge in inferring the shade tree is that the inference problem involves both the discrete tree structure and the continuous parameters of the tree nodes. We propose a hybrid approach to address this issue. We introduce an auto-regressive inference model to generate a rough estimation of the tree structure and node parameters, and then we fine-tune the inferred shade tree through an optimization algorithm. We show experiments on synthetic images, captured reflectance, real images, and non-realistic vector drawings, allowing downstream applications such as material editing, vectorized shading, and relighting. Project website: https://chen-geng.com/inv-shade-trees
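
A toy illustration of the shade tree idea: basic shading nodes combined by compositing operators into a tree that can be evaluated and edited node by node. The node types and the multiply/add compositing choices are generic examples, not the paper's exact node vocabulary.

```python
import numpy as np

class Node:
    def eval(self) -> np.ndarray:  # returns an (H, W, 3) shading image
        raise NotImplementedError

class Leaf(Node):
    """Basic shading node holding a constant color or a precomputed map (e.g. albedo)."""
    def __init__(self, image: np.ndarray):
        self.image = image
    def eval(self):
        return self.image

class Multiply(Node):  # compositing node: modulate one child by another
    def __init__(self, a: Node, b: Node):
        self.a, self.b = a, b
    def eval(self):
        return self.a.eval() * self.b.eval()

class Add(Node):       # compositing node: e.g. add a highlight layer on top
    def __init__(self, a: Node, b: Node):
        self.a, self.b = a, b
    def eval(self):
        return np.clip(self.a.eval() + self.b.eval(), 0.0, 1.0)

H = W = 8
albedo = Leaf(np.full((H, W, 3), [0.8, 0.2, 0.2]))                          # base color
diffuse = Leaf(np.linspace(0, 1, H * W).reshape(H, W, 1) * np.ones((H, W, 3)))
highlight = Leaf(np.zeros((H, W, 3)))
tree = Add(Multiply(albedo, diffuse), highlight)   # shading = albedo * diffuse + highlight
print(tree.eval().shape)                           # (8, 8, 3); editing a leaf re-shades the object
```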

PILOT: A Pre-Trained Model-Based Continual Learning Toolbox

  • paper_url: http://arxiv.org/abs/2309.07117
  • repo_url: https://github.com/sun-hailong/lamda-pilot
  • paper_authors: Hai-Long Sun, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
  • for: This work develops a pre-trained model-based continual learning toolbox (PILOT) so that models can adapt to the arrival of new data in real-world applications.
  • methods: PILOT implements several state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt, and also fits typical class-incremental learning algorithms (e.g., DER, FOSTER, and MEMO) into the pre-trained model context to evaluate their effectiveness.
  • results: PILOT performs well in practice, maintaining strong performance across different class-incremental learning tasks.
    Abstract While traditional machine learning can effectively tackle a wide range of problems, it primarily operates within a closed-world setting, which presents limitations when dealing with streaming data. As a solution, incremental learning emerges to address real-world scenarios involving new data's arrival. Recently, pre-training has made significant advancements and garnered the attention of numerous researchers. The strong performance of these pre-trained models (PTMs) presents a promising avenue for developing continual learning algorithms that can effectively adapt to real-world scenarios. Consequently, exploring the utilization of PTMs in incremental learning has become essential. This paper introduces a pre-trained model-based continual learning toolbox known as PILOT. On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt. On the other hand, PILOT also fits typical class-incremental learning algorithms (e.g., DER, FOSTER, and MEMO) within the context of pre-trained models to evaluate their effectiveness.

Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification

  • paper_url: http://arxiv.org/abs/2309.07115
  • repo_url: None
  • paper_authors: Anith Selvakumar, Homa Fashandi
  • for: This paper presents a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification.
  • methods: Multi-task learning techniques are explored to further boost the distance metric learning (DML) approach, showing that an auxiliary task with weak labels increases the compactness of the learned speaker representation. In addition, the authors extend the Generalized end-to-end loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space.
  • results: The network achieves state-of-the-art performance for open-set audio-visual speaker verification, reporting 0.244%, 0.252%, and 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which are, to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
    Abstract In this paper, we present a methodology for achieving robust multimodal person representations optimized for open-set audio-visual speaker verification. Distance Metric Learning (DML) approaches have typically dominated this problem space, owing to strong performance on new and unseen classes. In our work, we explored multitask learning techniques to further boost performance of the DML approach and show that an auxiliary task with weak labels can increase the compactness of the learned speaker representation. We also extend the Generalized end-to-end loss (GE2E) to multimodal inputs and demonstrate that it can achieve competitive performance in an audio-visual space. Finally, we introduce a non-synchronous audio-visual sampling random strategy during training time that has shown to improve generalization. Our network achieves state of the art performance for speaker verification, reporting 0.244%, 0.252%, 0.441% Equal Error Rate (EER) on the three official trial lists of VoxCeleb1-O/E/H, which is to our knowledge, the best published results on VoxCeleb1-E and VoxCeleb1-H.
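
For context, a compact sketch of the original (single-modality) softmax variant of the GE2E loss that the paper extends to multimodal inputs; embeddings are assumed to be arranged as (speakers, utterances, dim), and the multimodal extension itself is not reproduced.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb: torch.Tensor, w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """Generalized end-to-end (GE2E) softmax loss.

    emb: (N_spk, M_utt, D) embeddings. For the true speaker, the centroid
    excludes the utterance itself (leave-one-out).
    """
    n, m, _ = emb.shape
    emb = F.normalize(emb, dim=-1)
    centroids = F.normalize(emb.mean(dim=1), dim=-1)                              # (N, D)
    loo = F.normalize((emb.sum(dim=1, keepdim=True) - emb) / (m - 1), dim=-1)     # (N, M, D)

    sim = torch.einsum('nmd,kd->nmk', emb, centroids)    # cosine similarity to all centroids
    own = (emb * loo).sum(dim=-1)                         # similarity to own leave-one-out centroid
    idx = torch.arange(n)
    sim[idx, :, idx] = own                                # use LOO centroid for the true speaker
    logits = w * sim + b                                  # (N, M, N_spk); w, b are learnable in practice

    target = idx.unsqueeze(1).expand(n, m).reshape(-1)    # each utterance's speaker id
    return F.cross_entropy(logits.reshape(n * m, n), target)

emb = torch.randn(4, 5, 256)   # 4 speakers, 5 utterances each
print(ge2e_softmax_loss(emb).item())
```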

Contrastive Deep Encoding Enables Uncertainty-aware Machine-learning-assisted Histopathology

  • paper_url: http://arxiv.org/abs/2309.07113
  • repo_url: None
  • paper_authors: Nirhoshan Sivaroopan, Chamuditha Jayanga, Chalani Ekanayake, Hasindri Watawana, Jathurshan Pradeepkumar, Mithunjha Anandakumar, Ranga Rodrigo, Chamira U. S. Edussooriya, Dushan N. Wadduwage
  • for: This work aims to pre-train deep neural network models on large public-domain datasets so that they learn rich features from histopathology images.
  • methods: Large public-domain datasets are used for pre-training, followed by fine-tuning on a small fraction of annotated data. An uncertainty-aware loss function is also proposed to quantify the model's confidence during inference.
  • results: Pre-training followed by fine-tuning reaches state-of-the-art (SOTA) performance with only 1-10% of the annotations, and the uncertainty-aware loss helps experts select the best instances to label for further training.
    Abstract Deep neural network models can learn clinically relevant features from millions of histopathology images. However generating high-quality annotations to train such models for each hospital, each cancer type, and each diagnostic task is prohibitively laborious. On the other hand, terabytes of training data -- while lacking reliable annotations -- are readily available in the public domain in some cases. In this work, we explore how these large datasets can be consciously utilized to pre-train deep networks to encode informative representations. We then fine-tune our pre-trained models on a fraction of annotated training data to perform specific downstream tasks. We show that our approach can reach the state-of-the-art (SOTA) for patch-level classification with only 1-10% randomly selected annotations compared to other SOTA approaches. Moreover, we propose an uncertainty-aware loss function, to quantify the model confidence during inference. Quantified uncertainty helps experts select the best instances to label for further training. Our uncertainty-aware labeling reaches the SOTA with significantly fewer annotations compared to random labeling. Last, we demonstrate how our pre-trained encoders can surpass current SOTA for whole-slide image classification with weak supervision. Our work lays the foundation for data and task-agnostic pre-trained deep networks with quantified uncertainty.

Hardening RGB-D Object Recognition Systems against Adversarial Patch Attacks

  • paper_url: http://arxiv.org/abs/2309.07106
  • repo_url: None
  • paper_authors: Yang Zheng, Luca Demetrio, Antonio Emanuele Cinà, Xiaoyi Feng, Zhaoqiang Xia, Xiaoyue Jiang, Ambra Demontis, Battista Biggio, Fabio Roli
  • for: This work studies RGB-D object recognition systems, which improve predictive performance by fusing color and depth information.
  • methods: The robustness of RGB-D systems is examined by attacking them with adversarial examples, including adversarial patches.
  • results: RGB-D systems turn out to be roughly as vulnerable as RGB-only systems, even when the adversarial examples are generated by altering only the colors of the original images. A defense based on a detection mechanism is proposed that makes RGB-D systems more robust against adversarial examples.
    Abstract RGB-D object recognition systems improve their predictive performances by fusing color and depth information, outperforming neural network architectures that rely solely on colors. While RGB-D systems are expected to be more robust to adversarial examples than RGB-only systems, they have also been proven to be highly vulnerable. Their robustness is similar even when the adversarial examples are generated by altering only the original images' colors. Different works highlighted the vulnerability of RGB-D systems; however, there is a lacking of technical explanations for this weakness. Hence, in our work, we bridge this gap by investigating the learned deep representation of RGB-D systems, discovering that color features make the function learned by the network more complex and, thus, more sensitive to small perturbations. To mitigate this problem, we propose a defense based on a detection mechanism that makes RGB-D systems more robust against adversarial examples. We empirically show that this defense improves the performances of RGB-D systems against adversarial examples even when they are computed ad-hoc to circumvent this detection mechanism, and that is also more effective than adversarial training.

Polygon Intersection-over-Union Loss for Viewpoint-Agnostic Monocular 3D Vehicle Detection

  • paper_url: http://arxiv.org/abs/2309.07104
  • repo_url: None
  • paper_authors: Derek Gloudemans, Xinxuan Lu, Shepard Xia, Daniel B. Work
  • for: Improving the accuracy of viewpoint-agnostic monocular 3D object detection.
  • methods: A new polygon IoU loss (PIoU loss) is used in combination with the conventional L1 loss.
  • results: Tested on three state-of-the-art viewpoint-agnostic 3D detection models, the PIoU loss converges faster than L1 loss, and the combination yields higher accuracy (+1.64% AP70 for MonoCon on cars, +0.18% AP70 for RTM3D on cars, and +0.83%/+2.46% AP50/AP25 for MonoRCNN on cyclists).
    Abstract Monocular 3D object detection is a challenging task because depth information is difficult to obtain from 2D images. A subset of viewpoint-agnostic monocular 3D detection methods also do not explicitly leverage scene homography or geometry during training, meaning that a model trained thusly can detect objects in images from arbitrary viewpoints. Such works predict the projections of the 3D bounding boxes on the image plane to estimate the location of the 3D boxes, but these projections are not rectangular so the calculation of IoU between these projected polygons is not straightforward. This work proposes an efficient, fully differentiable algorithm for the calculation of IoU between two convex polygons, which can be utilized to compute the IoU between two 3D bounding box footprints viewed from an arbitrary angle. We test the performance of the proposed polygon IoU loss (PIoU loss) on three state-of-the-art viewpoint-agnostic 3D detection models. Experiments demonstrate that the proposed PIoU loss converges faster than L1 loss and that in 3D detection models, a combination of PIoU loss and L1 loss gives better results than L1 loss alone (+1.64% AP70 for MonoCon on cars, +0.18% AP70 for RTM3D on cars, and +0.83%/+2.46% AP50/AP25 for MonoRCNN on cyclists).
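
The quantity optimized by the PIoU loss is the IoU between two convex polygons (the image-plane footprints of projected 3D boxes). A non-differentiable reference computation using shapely is sketched below to illustrate what is being measured; the paper's contribution is a fully differentiable formulation of this same quantity, which is not reproduced here.

```python
from shapely.geometry import Polygon

def polygon_iou(pts_a, pts_b) -> float:
    """IoU between two convex polygons given as lists of (x, y) vertices."""
    a, b = Polygon(pts_a), Polygon(pts_b)
    inter = a.intersection(b).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0

# Two projected 3D-box footprints (quadrilaterals, generally not axis-aligned rectangles).
box_a = [(0, 0), (4, 1), (5, 4), (1, 3)]
box_b = [(2, 1), (6, 2), (6, 5), (2, 4)]
print(polygon_iou(box_a, box_b))
```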

RadarLCD: Learnable Radar-based Loop Closure Detection Pipeline

  • paper_url: http://arxiv.org/abs/2309.07094
  • repo_url: None
  • paper_authors: Mirko Usuelli, Matteo Frosi, Paolo Cudrano, Simone Mentasti, Matteo Matteucci
  • for: The paper is written for the task of Loop Closure Detection (LCD) in robotics and computer vision, and to address the challenges of integrating radar data for this task.
  • methods: The paper proposes a novel supervised deep learning pipeline called RadarLCD, which leverages a pre-trained HERO model to select key points crucial for LCD tasks and achieve better performance than state-of-the-art methods.
  • results: The paper evaluates RadarLCD on a variety of FMCW Radar dataset scenes and shows that it surpasses state-of-the-art systems in multiple aspects of Loop Closure Detection.
    Abstract Loop Closure Detection (LCD) is an essential task in robotics and computer vision, serving as a fundamental component for various applications across diverse domains. These applications encompass object recognition, image retrieval, and video analysis. LCD consists in identifying whether a robot has returned to a previously visited location, referred to as a loop, and then estimating the related roto-translation with respect to the analyzed location. Despite the numerous advantages of radar sensors, such as their ability to operate under diverse weather conditions and provide a wider range of view compared to other commonly used sensors (e.g., cameras or LiDARs), integrating radar data remains an arduous task due to intrinsic noise and distortion. To address this challenge, this research introduces RadarLCD, a novel supervised deep learning pipeline specifically designed for Loop Closure Detection using the FMCW Radar (Frequency Modulated Continuous Wave) sensor. RadarLCD, a learning-based LCD methodology explicitly designed for radar systems, makes a significant contribution by leveraging the pre-trained HERO (Hybrid Estimation Radar Odometry) model. Being originally developed for radar odometry, HERO's features are used to select key points crucial for LCD tasks. The methodology undergoes evaluation across a variety of FMCW Radar dataset scenes, and it is compared to state-of-the-art systems such as Scan Context for Place Recognition and ICP for Loop Closure. The results demonstrate that RadarLCD surpasses the alternatives in multiple aspects of Loop Closure Detection.

Developing a Novel Image Marker to Predict the Responses of Neoadjuvant Chemotherapy (NACT) for Ovarian Cancer Patients

  • paper_url: http://arxiv.org/abs/2309.07087
  • repo_url: None
  • paper_authors: Ke Zhang, Neman Abdoli, Patrik Gilley, Youkabed Sadri, Xuxin Chen, Theresa C. Thai, Lauren Dockery, Kathleen Moore, Robert S. Mannel, Yuchen Qiu
  • for: The goal of this study is to develop a novel image marker for accurate early prediction of the response to NACT.
  • methods: A total of 1373 radiomics features were computed to quantify tumor characteristics, grouped into three categories: geometric, intensity, and texture features. The features were optimized by a principal component analysis algorithm to generate a compact and informative feature cluster, which was used as the input to an SVM-based classifier that was developed and optimized to create a final marker indicating the likelihood that the patient responds to NACT treatment.
  • results: The new method yielded an AUC of 0.745 under the ROC curve, with an overall accuracy of 76.2%, a positive predictive value of 70%, and a negative predictive value of 78.1%.
    Abstract Objective: Neoadjuvant chemotherapy (NACT) is one kind of treatment for advanced stage ovarian cancer patients. However, due to the nature of tumor heterogeneity, the patients' responses to NACT varies significantly among different subgroups. To address this clinical challenge, the purpose of this study is to develop a novel image marker to achieve high accuracy response prediction of the NACT at an early stage. Methods: For this purpose, we first computed a total of 1373 radiomics features to quantify the tumor characteristics, which can be grouped into three categories: geometric, intensity, and texture features. Second, all these features were optimized by principal component analysis algorithm to generate a compact and informative feature cluster. Using this cluster as the input, an SVM based classifier was developed and optimized to create a final marker, indicating the likelihood of the patient being responsive to the NACT treatment. To validate this scheme, a total of 42 ovarian cancer patients were retrospectively collected. A nested leave-one-out cross-validation was adopted for model performance assessment. Results: The results demonstrate that the new method yielded an AUC (area under the ROC [receiver characteristic operation] curve) of 0.745. Meanwhile, the model achieved overall accuracy of 76.2%, positive predictive value of 70%, and negative predictive value of 78.1%. Conclusion: This study provides meaningful information for the development of radiomics based image markers in NACT response prediction.
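
A minimal sklearn sketch of the pipeline structure described above (radiomics features → PCA → SVM). Feature values and labels are random placeholders, the number of retained components is an assumption, and plain leave-one-out cross-validation is shown instead of the nested variant used in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 1373))   # 42 patients x 1373 radiomics features (placeholder values)
y = rng.integers(0, 2, size=42)   # 1 = responder to NACT, 0 = non-responder (placeholder labels)

model = make_pipeline(
    StandardScaler(),              # radiomics features live on very different scales
    PCA(n_components=10),          # compact, informative feature cluster (component count assumed)
    SVC(kernel="rbf"),             # final marker: predicted response to NACT
)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}")
```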

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.07084
  • repo_url: https://github.com/iranqin/supfusion
  • paper_authors: Yiran Qin, Chaoqun Wang, Zijian Kang, Ningning Ma, Zhen Li, Ruimao Zhang
  • for: This work proposes a novel training strategy, SupFusion, which provides auxiliary feature-level supervision for effective LiDAR-camera fusion and improves detection performance.
  • methods: A data enhancement method named Polar Sampling densifies sparse objects, and an assistant model is trained to generate high-quality features that serve as the supervision. A simple yet effective deep fusion module is also proposed that continuously improves the detector's capability.
  • results: The proposal achieves around 2% 3D mAP improvement on the KITTI benchmark across multiple LiDAR-camera 3D detectors.
    Abstract In this paper, we propose a novel training strategy called SupFusion, which provides an auxiliary feature level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which contiguously gains superior performance compared with previous fusion methods with SupFusion strategy. In such a manner, our proposal shares the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision which could boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion could continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module is plug-and-play, we make extensive experiments to demonstrate its effectiveness. Specifically, we gain around 2% 3D mAP improvements on KITTI benchmark based on multiple LiDAR-Camera 3D detectors.
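
A rough sketch of the auxiliary feature-level supervision idea: the fusion feature of the detector being trained is regressed toward the high-quality feature produced by the frozen assistant model. MSE is an assumed choice of mimicry loss; the assistant model, Polar Sampling, and the deep fusion module themselves are not reproduced.

```python
import torch
import torch.nn.functional as F

def supervised_fusion_loss(fusion_feat: torch.Tensor,
                           assistant_feat: torch.Tensor,
                           det_loss: torch.Tensor,
                           lam: float = 1.0) -> torch.Tensor:
    """Total loss = ordinary detection loss + feature-level supervision.

    fusion_feat:    (B, C, H, W) LiDAR-camera fusion feature from the detector being trained.
    assistant_feat: (B, C, H, W) high-quality feature from the frozen assistant model.
    """
    mimic = F.mse_loss(fusion_feat, assistant_feat.detach())  # assistant is not updated
    return det_loss + lam * mimic

feat = torch.randn(2, 64, 100, 100, requires_grad=True)
target_feat = torch.randn(2, 64, 100, 100)
print(supervised_fusion_loss(feat, target_feat, det_loss=torch.tensor(1.5)).item())
```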

FAIR: Frequency-aware Image Restoration for Industrial Visual Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.07068
  • repo_url: https://github.com/liutongkun/fair
  • paper_authors: Tongkun Liu, Bing Li, Xiao Du, Bingke Jiang, Leqi Geng, Feiyang Wang, Zhuo Zhao
  • for: This paper targets image-reconstruction-based anomaly detection models for industrial visual inspection, which typically suffer from a trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability.
  • methods: A new self-supervised task, Frequency-aware Image Restoration (FAIR), restores images from their high-frequency components, exploiting the distinct frequency biases of normal and abnormal reconstruction errors to reconstruct normal patterns precisely while mitigating unfavorable generalization to anomalies.
  • results: Using only a simple vanilla UNet, FAIR achieves state-of-the-art performance with higher efficiency on various defect detection datasets. Code is available at https://github.com/liutongkun/FAIR.
    Abstract Image reconstruction-based anomaly detection models are widely explored in industrial visual inspection. However, existing models usually suffer from the trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which damages the performance. In this paper, we find that the above trade-off can be better mitigated by leveraging the distinct frequency biases between normal and abnormal reconstruction errors. To this end, we propose Frequency-aware Image Restoration (FAIR), a novel self-supervised image restoration task that restores images from their high-frequency components. It enables precise reconstruction of normal patterns while mitigating unfavorable generalization to anomalies. Using only a simple vanilla UNet, FAIR achieves state-of-the-art performance with higher efficiency on various defect detection datasets. Code: https://github.com/liutongkun/FAIR.
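
A sketch of the self-supervised task set up by FAIR: the network only ever sees the high-frequency component of an image and is trained to restore the full image. A Gaussian blur is used here to split frequencies, and a single convolution stands in for the UNet; the exact high-pass construction in the paper may differ.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize: int = 11, sigma: float = 3.0) -> torch.Tensor:
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    return (k2d / k2d.sum()).view(1, 1, ksize, ksize)

def high_frequency(img: torch.Tensor, ksize: int = 11, sigma: float = 3.0) -> torch.Tensor:
    """High-frequency component = image minus its Gaussian-blurred (low-frequency) version."""
    c = img.shape[1]
    k = gaussian_kernel(ksize, sigma).repeat(c, 1, 1, 1).to(img)
    low = F.conv2d(img, k, padding=ksize // 2, groups=c)
    return img - low

# Training step sketch: restore the full image from its high frequencies.
img = torch.rand(4, 3, 256, 256)             # normal (defect-free) training images
net = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for the vanilla UNet used in the paper
restored = net(high_frequency(img))
loss = F.mse_loss(restored, img)
print(loss.item())
```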

Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring

  • paper_url: http://arxiv.org/abs/2309.07054
  • repo_url: https://github.com/shangwei5/stgtn
  • paper_authors: Dongwei Ren, Wei Shang, Yi Yang, Wangmeng Zuo
  • for: Video deblurring aims to recover consecutive sharp frames from a given blurry video.
  • methods: Features from neighboring frames and from detected sharp frames are aggregated with hybrid Transformers. A blur-aware detector is first trained to distinguish sharp from blurry frames; a window-based local Transformer then exploits features from neighboring frames, with cross attention aggregating features without explicit spatial alignment, and a global Transformer aggregates long-term sharp features from the detected sharp frames.
  • results: On benchmark datasets, the proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in both quantitative metrics and visual quality. Source code and trained models are available at https://github.com/shangwei5/STGTN.
    Abstract Video deblurring methods, aiming at recovering consecutive sharp frames from a given blurry video, usually assume that the input video suffers from consecutively blurry frames. However, in real-world blurry videos taken by modern imaging devices, sharp frames usually appear in the given video, thus making temporal long-term sharp features available for facilitating the restoration of a blurry frame. In this work, we propose a video deblurring method that leverages both neighboring frames and present sharp frames using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish between sharp and blurry frames. Then, a window-based local Transformer is employed for exploiting features from neighboring frames, where cross attention is beneficial for aggregating features from neighboring frames without explicit spatial alignment. To aggregate long-term sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can easily be extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality. The source code and trained models are available at https://github.com/shangwei5/STGTN.

Exploiting Multiple Priors for Neural 3D Indoor Reconstruction

  • paper_url: http://arxiv.org/abs/2309.07021
  • repo_url: None
  • paper_authors: Federico Lincetto, Gianluca Agresti, Mattia Rossi, Pietro Zanuttigh
  • for: Achieving high-quality 3D reconstruction of large indoor scenes.
  • methods: A neural implicit modeling method leverages multiple regularization strategies, relying only on images, to better reconstruct large indoor environments.
  • results: Experiments show that the approach produces state-of-the-art 3D reconstructions in challenging indoor scenarios.
    Abstract Neural implicit modeling permits to achieve impressive 3D reconstruction results on small objects, while it exhibits significant limitations in large indoor scenes. In this work, we propose a novel neural implicit modeling method that leverages multiple regularization strategies to achieve better reconstructions of large indoor environments, while relying only on images. A sparse but accurate depth prior is used to anchor the scene to the initial model. A dense but less accurate depth prior is also introduced, flexible enough to still let the model diverge from it to improve the estimated geometry. Then, a novel self-supervised strategy to regularize the estimated surface normals is presented. Finally, a learnable exposure compensation scheme permits to cope with challenging lighting conditions. Experimental results show that our approach produces state-of-the-art 3D reconstructions in challenging indoor scenarios.

Instance Adaptive Prototypical Contrastive Embedding for Generalized Zero Shot Learning

  • paper_url: http://arxiv.org/abs/2309.06987
  • repo_url: None
  • paper_authors: Riti Paul, Sahil Vora, Baoxin Li
  • for: This paper addresses sample classification in generalized zero-shot learning (GZSL), where unseen labels are not accessible during training.
  • methods: Contrastive-learning-based (instance-based) embedding is used in a generative network, leveraging the semantic relationship between data points. Existing embedding architectures have two limitations: (1) they do not consider fine-grained cluster structures, limiting the discriminability of the synthetic features' embeddings; and (2) restricted scaling mechanisms in existing contrastive embedding networks make optimization inflexible, leading to overlapped representations in the embedding space. To improve the quality of representations in the embedding space, a margin-based prototypical contrastive learning embedding network is proposed that reaps the benefits of prototype-data interaction (cluster quality enhancement) and implicit data-data interaction (fine-grained representations) while providing substantial cluster supervision to the embedding network and the generator. An instance-adaptive contrastive loss is also proposed that leads to generalized representations for unseen labels with an increased inter-class margin.
  • results: Comprehensive experimental evaluation on three benchmark datasets shows that the method outperforms the current state of the art and consistently achieves the best unseen performance in the GZSL setting.
    Abstract Generalized zero-shot learning(GZSL) aims to classify samples from seen and unseen labels, assuming unseen labels are not accessible during training. Recent advancements in GZSL have been expedited by incorporating contrastive-learning-based (instance-based) embedding in generative networks and leveraging the semantic relationship between data points. However, existing embedding architectures suffer from two limitations: (1) limited discriminability of synthetic features' embedding without considering fine-grained cluster structures; (2) inflexible optimization due to restricted scaling mechanisms on existing contrastive embedding networks, leading to overlapped representations in the embedding space. To enhance the quality of representations in the embedding space, as mentioned in (1), we propose a margin-based prototypical contrastive learning embedding network that reaps the benefits of prototype-data (cluster quality enhancement) and implicit data-data (fine-grained representations) interaction while providing substantial cluster supervision to the embedding network and the generator. To tackle (2), we propose an instance adaptive contrastive loss that leads to generalized representations for unseen labels with increased inter-class margin. Through comprehensive experimental evaluation, we show that our method can outperform the current state-of-the-art on three benchmark datasets. Our approach also consistently achieves the best unseen performance in the GZSL setting.
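
A sketch of a margin-based prototypical contrastive term of the kind described above: each embedded sample is pulled toward its class prototype and pushed away from other prototypes, with an additive margin on the positive similarity. Temperature, margin value, and the use of cosine similarity are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def margin_prototypical_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                         prototypes: torch.Tensor,
                                         margin: float = 0.2, tau: float = 0.1) -> torch.Tensor:
    """z: (B, D) embeddings, labels: (B,) class ids, prototypes: (C, D) class prototypes."""
    z = F.normalize(z, dim=1)
    p = F.normalize(prototypes, dim=1)
    sim = z @ p.t() / tau                                   # (B, C) scaled cosine similarities
    # Additive margin on the positive (own-prototype) logit enlarges inter-class separation.
    sim.scatter_(1, labels.unsqueeze(1), sim.gather(1, labels.unsqueeze(1)) - margin / tau)
    return F.cross_entropy(sim, labels)

B, C, D = 8, 5, 128
z = torch.randn(B, D)
labels = torch.randint(0, C, (B,))
prototypes = torch.randn(C, D)
print(margin_prototypical_contrastive_loss(z, labels, prototypes).item())
```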

Differentiable JPEG: The Devil is in the Details

  • paper_url: http://arxiv.org/abs/2309.06978
  • repo_url: https://github.com/necla-ml/diff-jpeg
  • paper_authors: Christoph Reich, Biplob Debnath, Deep Patel, Srimat Chakradhar
  • for: This paper provides a comprehensive review of existing differentiable JPEG approaches and addresses critical details that previous methods have missed.
  • methods: A novel differentiable JPEG approach is proposed that is differentiable w.r.t. the input image, the JPEG quality, the quantization tables, and the color conversion parameters.
  • results: The forward and backward performance of the proposed diff. JPEG approach is evaluated against existing methods, with extensive ablations of crucial design choices. The new approach resembles the (non-differentiable) reference implementation best, surpassing the recent-best differentiable approach by 3.47 dB (PSNR) on average and, for strong compression rates, improving PSNR by up to 9.51 dB.
    Abstract JPEG remains one of the most widespread lossy image coding methods. However, the non-differentiable nature of JPEG restricts the application in deep learning pipelines. Several differentiable approximations of JPEG have recently been proposed to address this issue. This paper conducts a comprehensive review of existing diff. JPEG approaches and identifies critical details that have been missed by previous methods. To this end, we propose a novel diff. JPEG approach, overcoming previous limitations. Our approach is differentiable w.r.t. the input image, the JPEG quality, the quantization tables, and the color conversion parameters. We evaluate the forward and backward performance of our diff. JPEG approach against existing methods. Additionally, extensive ablations are performed to evaluate crucial design choices. Our proposed diff. JPEG resembles the (non-diff.) reference implementation best, significantly surpassing the recent-best diff. approach by $3.47$dB (PSNR) on average. For strong compression rates, we can even improve PSNR by $9.51$dB. Strong adversarial attack results are yielded by our diff. JPEG, demonstrating the effective gradient approximation. Our code is available at https://github.com/necla-ml/Diff-JPEG.
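
One recurring obstacle in making JPEG differentiable is the rounding step in quantization. A common straight-through surrogate is sketched below for illustration; it is one standard workaround and not necessarily the approximation chosen in this paper.

```python
import torch

def ste_round(x: torch.Tensor) -> torch.Tensor:
    """Straight-through rounding: forward pass rounds, backward pass is identity."""
    return x + (torch.round(x) - x).detach()

def quantize_block(dct_coeffs: torch.Tensor, qtable: torch.Tensor) -> torch.Tensor:
    """Differentiable (de)quantization of an 8x8 DCT block with a JPEG quantization table."""
    q = ste_round(dct_coeffs / qtable)   # quantized coefficients (gradient flows to both inputs)
    return q * qtable                    # dequantized coefficients used for reconstruction

coeffs = (torch.randn(8, 8) * 50).requires_grad_()
qtable = torch.full((8, 8), 16.0, requires_grad=True)   # quality-dependent table; could also be learned
recon = quantize_block(coeffs, qtable)
recon.sum().backward()
print(coeffs.grad.abs().mean().item(), qtable.grad.abs().mean().item())
```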

Neural network-based coronary dominance classification of RCA angiograms

  • paper_url: http://arxiv.org/abs/2309.06958
  • repo_url: None
  • paper_authors: Ivan Kruzhilov, Egor Ikryannikov, Artem Shadrin, Ruslan Utegenov, Galina Zubkova, Ivan Bessonov
  • for: Studies a cardiac dominance classification algorithm based on right coronary artery (RCA) angiograms.
  • methods: ConvNeXt and Swin Transformer networks classify individual 2D frames, a majority vote over frames classifies each angiographic view, and an auxiliary network detects irrelevant images, which are excluded from the dataset.
  • results: Five-fold cross-validation gave macro recall = 93.1%, accuracy = 93.5%, and macro F1 = 89.2% for dominance classification. The model most often fails on RCA occlusion and on small vessel diameter combined with poor-quality angiographic views; in such cases dominance classification is complex and may require discussion among specialists to reach an accurate conclusion.
    Abstract Background. Cardiac dominance classification is essential for SYNTAX score estimation, which is a tool used to determine the complexity of coronary artery disease and guide patient selection toward optimal revascularization strategy. Objectives. Cardiac dominance classification algorithm based on the analysis of right coronary artery (RCA) angiograms using neural network Method. We employed convolutional neural network ConvNext and Swin transformer for 2D image (frames) classification, along with a majority vote for cardio angiographic view classification. An auxiliary network was also used to detect irrelevant images which were then excluded from the data set. Our data set consisted of 828 angiographic studies, 192 of them being patients with left dominance. Results. 5-fold cross validation gave the following dominance classification metrics (p=95%): macro recall=93.1%, accuracy=93.5%, macro F1=89.2%. The most common case in which the model regularly failed was RCA occlusion, as it requires utilization of LCA information. Another cause for false prediction is a small diameter combined with poor quality cardio angiographic view. In such cases, cardiac dominance classification can be complex and may require discussion among specialists to reach an accurate conclusion. Conclusion. The use of machine learning approaches to classify cardiac dominance based on RCA alone has been shown to be successful with satisfactory accuracy. However, for higher accuracy, it is necessary to utilize LCA information in the case of an occluded RCA and detect cases where there is high uncertainty.
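The view-level decision is a simple aggregation of frame-level predictions. A minimal sketch of the majority vote described in the abstract (labels and tie-breaking are illustrative):

```python
from collections import Counter

def classify_view(frame_predictions):
    """Aggregate per-frame dominance predictions ('left' / 'right') for one
    angiographic view by simple majority vote; ties fall back to the label
    Counter happens to return first."""
    return Counter(frame_predictions).most_common(1)[0][0]

# toy usage: 2D-frame classifier outputs for one study
frames = ["right", "right", "left", "right", "right"]
print(classify_view(frames))   # -> "right"
```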

TransNet: A Transfer Learning-Based Network for Human Action Recognition

  • paper_url: http://arxiv.org/abs/2309.06951
  • repo_url: None
  • paper_authors: K. Alomar, X. Cai
  • for: Human action recognition (HAR) is a high-level and significant research area in computer vision with ubiquitous applications.
  • methods: Proposes TransNet, a simple yet versatile and effective deep-learning architecture for HAR that decomposes complex 3D-CNNs into a 2D-CNN and a 1D-CNN, whose components extract spatial features and temporal patterns from videos, respectively.
  • results: Compared with current HAR models, TransNet shows superior performance in terms of flexibility, model complexity, training speed, and classification accuracy.
    Abstract Human action recognition (HAR) is a high-level and significant research area in computer vision due to its ubiquitous applications. The main limitations of the current HAR models are their complex structures and lengthy training time. In this paper, we propose a simple yet versatile and effective end-to-end deep learning architecture, coined as TransNet, for HAR. TransNet decomposes the complex 3D-CNNs into 2D- and 1D-CNNs, where the 2D- and 1D-CNN components extract spatial features and temporal patterns in videos, respectively. Benefiting from its concise architecture, TransNet is ideally compatible with any pretrained state-of-the-art 2D-CNN models in other fields, being transferred to serve the HAR task. In other words, it naturally leverages the power and success of transfer learning for HAR, bringing huge advantages in terms of efficiency and effectiveness. Extensive experimental results and the comparison with the state-of-the-art models demonstrate the superior performance of the proposed TransNet in HAR in terms of flexibility, model complexity, training speed and classification accuracy.
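The central idea, decomposing a 3D-CNN into a per-frame 2D-CNN followed by a temporal 1D-CNN, can be sketched in a few lines of PyTorch. The layer sizes below are illustrative stand-ins, not TransNet's actual architecture; in practice the 2D part would be a pretrained state-of-the-art backbone:

```python
import torch
import torch.nn as nn

class TwoPlusOneD(nn.Module):
    """Sketch of the 2D-CNN + 1D-CNN decomposition: a (transferable) 2D backbone
    extracts per-frame spatial features, a 1D conv models their temporal pattern."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.spatial = nn.Sequential(                 # stand-in for any 2D backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.temporal = nn.Sequential(                # 1D conv over the time axis
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(feat_dim, num_classes))

    def forward(self, video):                         # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        f = self.spatial(video.flatten(0, 1))         # (B*T, feat_dim)
        f = f.view(b, t, -1).transpose(1, 2)          # (B, feat_dim, T)
        return self.temporal(f)

logits = TwoPlusOneD()(torch.randn(2, 8, 3, 64, 64))  # (2, 10)
```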

Limited-Angle Tomography Reconstruction via Deep End-To-End Learning on Synthetic Data

  • paper_url: http://arxiv.org/abs/2309.06948
  • repo_url: https://github.com/99991/htc2022-tud-hhu-version-1
  • paper_authors: Thomas Germer, Jan Robine, Sebastian Konietzny, Stefan Harmeling, Tobias Uelwer
  • for: Solving the limited-angle tomography reconstruction problem.
  • methods: A deep neural network trained end-to-end on a large amount of carefully crafted synthetic data.
  • results: Reconstructs tomographic images from sinograms covering only 30° or 40°, and won first place in the Helsinki Tomography Challenge 2022.
    Abstract Computed tomography (CT) has become an essential part of modern science and medicine. A CT scanner consists of an X-ray source that is spun around an object of interest. On the opposite end of the X-ray source, a detector captures X-rays that are not absorbed by the object. The reconstruction of an image is a linear inverse problem, which is usually solved by filtered back projection. However, when the number of measurements is small, the reconstruction problem is ill-posed. This is for example the case when the X-ray source is not spun completely around the object, but rather irradiates the object only from a limited angle. To tackle this problem, we present a deep neural network that is trained on a large amount of carefully-crafted synthetic data and can perform limited-angle tomography reconstruction even for only 30{\deg} or 40{\deg} sinograms. With our approach we won the first place in the Helsinki Tomography Challenge 2022.
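The data setting can be reproduced with standard tooling. A small sketch (using scikit-image; the angular range and phantom are illustrative) generates a 30° limited-angle sinogram and shows why classical filtered back projection struggles, which is the gap the learned reconstruction is trained on synthetic data to close:

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

# Simulate a limited-angle acquisition: project the phantom only over a 30° arc.
phantom = resize(shepp_logan_phantom(), (128, 128))
angles = np.linspace(0.0, 30.0, 60, endpoint=False)   # limited 30-degree arc
sinogram = radon(phantom, theta=angles)                # (detector, n_angles)

# Filtered back projection from so few angles is badly ill-posed; a learned
# reconstruction network replaces this step in the approach described above.
fbp = iradon(sinogram, theta=angles, filter_name="ramp")
print(sinogram.shape, fbp.shape)
```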

DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.06933
  • repo_url: None
  • paper_authors: Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, Kibeom Hong
  • for: Explores the application of large-scale text-to-image models to the art domain, where text prompts alone struggle to express an artwork's unique characteristics.
  • methods: Proposes DreamStyler, a framework for artistic image synthesis that optimizes a multi-stage textual embedding with a context-aware text prompt and handles both text-to-image synthesis and style transfer.
  • results: Experiments show superior performance across multiple scenarios, with both text descriptions and style references, suggesting promising potential for artistic product creation.
    Abstract Recent progresses in large-scale text-to-image models have yielded remarkable accomplishments, finding various applications in art domain. However, expressing unique characteristics of an artwork (e.g. brushwork, colortone, or composition) with text prompts alone may encounter limitations due to the inherent constraints of verbal description. To this end, we introduce DreamStyler, a novel framework designed for artistic image synthesis, proficient in both text-to-image synthesis and style transfer. DreamStyler optimizes a multi-stage textual embedding with a context-aware text prompt, resulting in prominent image quality. In addition, with content and style guidance, DreamStyler exhibits flexibility to accommodate a range of style references. Experimental results demonstrate its superior performance across multiple scenarios, suggesting its promising potential in artistic product creation.

Contrast-Phys+: Unsupervised and Weakly-supervised Video-based Remote Physiological Measurement via Spatiotemporal Contrast

  • paper_url: http://arxiv.org/abs/2309.06924
  • repo_url: None
  • paper_authors: Zhaodong Sun, Xiaobai Li
  • for: Proposes an unsupervised and weakly-supervised method for remote physiological measurement, using facial videos to measure the blood volume change (rPPG) signal.
  • methods: A 3DCNN generates multiple spatiotemporal rPPG signals, prior knowledge of rPPG is incorporated into a contrastive loss, and ground-truth signals, when partially available, are folded into the contrastive learning as well.
  • results: Evaluated on five public datasets, Contrast-Phys+ outperforms state-of-the-art supervised methods even with partially available or misaligned GT labels, or no labels at all, while offering computational efficiency, noise robustness, and good generalization.
    Abstract Video-based remote physiological measurement utilizes facial videos to measure the blood volume change signal, which is also called remote photoplethysmography (rPPG). Supervised methods for rPPG measurements have been shown to achieve good performance. However, the drawback of these methods is that they require facial videos with ground truth (GT) physiological signals, which are often costly and difficult to obtain. In this paper, we propose Contrast-Phys+, a method that can be trained in both unsupervised and weakly-supervised settings. We employ a 3DCNN model to generate multiple spatiotemporal rPPG signals and incorporate prior knowledge of rPPG into a contrastive loss function. We further incorporate the GT signals into contrastive learning to adapt to partial or misaligned labels. The contrastive loss encourages rPPG/GT signals from the same video to be grouped together, while pushing those from different videos apart. We evaluate our methods on five publicly available datasets that include both RGB and Near-infrared videos. Contrast-Phys+ outperforms the state-of-the-art supervised methods, even when using partially available or misaligned GT signals, or no labels at all. Additionally, we highlight the advantages of our methods in terms of computational efficiency, noise robustness, and generalization.
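The spatiotemporal contrast idea can be illustrated with a small, hypothetical loss (a simplified stand-in, not the paper's exact formulation): rPPG clips drawn from the same video should have similar power spectra, clips from different videos should not.

```python
import torch
import torch.nn.functional as F

def psd(x):
    """Normalized power spectral density of a batch of 1-D signals (B, T)."""
    p = torch.fft.rfft(x, dim=-1).abs() ** 2
    return p / (p.sum(dim=-1, keepdim=True) + 1e-8)

def contrast_phys_style_loss(sig_a, sig_b, tau=0.1):
    """Hypothetical sketch of spatiotemporal contrast: rows of sig_a and sig_b
    are rPPG clips from the *same* video (positives); all other pairings are
    negatives. sig_a, sig_b: (B, T) signals, one pair per video in the batch."""
    pa, pb = psd(sig_a), psd(sig_b)
    sim = F.cosine_similarity(pa.unsqueeze(1), pb.unsqueeze(0), dim=-1) / tau  # (B, B)
    targets = torch.arange(sig_a.size(0))
    # InfoNCE over spectra: the matching video is the positive, others negatives
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

loss = contrast_phys_style_loss(torch.randn(4, 300), torch.randn(4, 300))
```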

Hydra: Multi-head Low-rank Adaptation for Parameter Efficient Fine-tuning

  • paper_url: http://arxiv.org/abs/2309.06922
  • repo_url: None
  • paper_authors: Sanghyeon Kim, Hyunmo Yang, Younghyun Kim, Youngjoon Hong, Eunbyung Park
  • for: Investigates a more general adapter module, built from parallel and sequential branches, for parameter-efficient fine-tuning of large-scale foundation models.
  • methods: Proposes Hydra, which combines a parallel branch (learning novel features) and a sequential branch (learning general features) to integrate their capabilities, and explicitly leverages the pre-trained weights through a linear combination of pre-trained features for better generalization.
  • results: Extensive experiments, including comparisons and ablation studies, demonstrate the efficiency and superior performance of Hydra over single-branch methods across a variety of applications. Code is available at \url{https://github.com/extremebird/Hydra}.
    Abstract The recent surge in large-scale foundation models has spurred the development of efficient methods for adapting these models to various downstream tasks. Low-rank adaptation methods, such as LoRA, have gained significant attention due to their outstanding parameter efficiency and no additional inference latency. This paper investigates a more general form of adapter module based on the analysis that parallel and sequential adaptation branches learn novel and general features during fine-tuning, respectively. The proposed method, named Hydra, due to its multi-head computational branches, combines parallel and sequential branch to integrate capabilities, which is more expressive than existing single branch methods and enables the exploration of a broader range of optimal points in the fine-tuning process. In addition, the proposed adaptation method explicitly leverages the pre-trained weights by performing a linear combination of the pre-trained features. It allows the learned features to have better generalization performance across diverse downstream tasks. Furthermore, we perform a comprehensive analysis of the characteristics of each adaptation branch with empirical evidence. Through an extensive range of experiments, encompassing comparisons and ablation studies, we substantiate the efficiency and demonstrate the superior performance of Hydra. This comprehensive evaluation underscores the potential impact and effectiveness of Hydra in a variety of applications. Our code is available on \url{https://github.com/extremebird/Hydra}
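A minimal sketch of the parallel-plus-sequential idea around one frozen linear layer (dimensions, initialization, and naming are illustrative assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class HydraStyleAdapter(nn.Module):
    """Combines a parallel and a sequential low-rank branch around a frozen
    pre-trained linear layer; both branches start as zero so training begins
    from the pre-trained behaviour."""
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.frozen = linear
        for p in self.frozen.parameters():
            p.requires_grad = False
        d_in, d_out = linear.in_features, linear.out_features
        # parallel branch: adds new features alongside the frozen projection
        self.par_down = nn.Linear(d_in, rank, bias=False)
        self.par_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.par_up.weight)
        # sequential branch: linearly re-combines the frozen (pre-trained) features
        self.seq_down = nn.Linear(d_out, rank, bias=False)
        self.seq_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.seq_up.weight)

    def forward(self, x):
        h = self.frozen(x)
        h = h + self.seq_up(self.seq_down(h))     # sequential: refine pre-trained features
        return h + self.par_up(self.par_down(x))  # parallel: add novel features

out = HydraStyleAdapter(nn.Linear(64, 32))(torch.randn(4, 64))
```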

CCSPNet-Joint: Efficient Joint Training Method for Traffic Sign Detection Under Extreme Conditions

  • paper_url: http://arxiv.org/abs/2309.06902
  • repo_url: https://github.com/haoqinhong/ccspnet-joint
  • paper_authors: Haoqin Hong, Yue Zhou, Xiangyu Shu, Xiangfang Hu
  • for: traffic sign detection in extreme conditions such as fog, rain, and motion blur
  • methods: CCSPNet, an efficient feature extraction module based on Transformers and CNNs, and joint training model CCSPNet-Joint
  • results: state-of-the-art performance in traffic sign detection under extreme conditions, with a 5.32% improvement in precision and an 18.09% improvement in mAP@.5 compared to end-to-end methods
    Abstract Traffic sign detection is an important research direction in intelligent driving. Unfortunately, existing methods often overlook extreme conditions such as fog, rain, and motion blur. Moreover, the end-to-end training strategy for image denoising and object detection models fails to utilize inter-model information effectively. To address these issues, we propose CCSPNet, an efficient feature extraction module based on Transformers and CNNs, which effectively leverages contextual information, achieves faster inference speed and provides stronger feature enhancement capabilities. Furthermore, we establish the correlation between object detection and image denoising tasks and propose a joint training model, CCSPNet-Joint, to improve data efficiency and generalization. Finally, to validate our approach, we create the CCTSDB-AUG dataset for traffic sign detection in extreme scenarios. Extensive experiments have shown that CCSPNet achieves state-of-the-art performance in traffic sign detection under extreme conditions. Compared to end-to-end methods, CCSPNet-Joint achieves a 5.32% improvement in precision and an 18.09% improvement in mAP@.5.
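The joint-training idea, letting detection gradients flow into the restoration model instead of training the two stages separately, can be sketched as follows (the stand-in modules and loss weights are illustrative, not CCSPNet's actual components):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules; a real pipeline would use a CCSPNet-style restorer and a YOLO head.
denoiser = nn.Conv2d(3, 3, 3, padding=1)
detector = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 5))

def joint_loss(degraded, clean, labels, lam=1.0):
    """Joint training sketch: the detector sees the *restored* image, so the
    detection loss also back-propagates into the restoration model, which is
    the inter-model coupling a separately trained pipeline misses."""
    restored = denoiser(degraded)
    loss_restore = F.l1_loss(restored, clean)                  # restoration term
    loss_detect = F.cross_entropy(detector(restored), labels)  # detection term
    return loss_restore + lam * loss_detect

loss = joint_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64), torch.tensor([1, 3]))
loss.backward()
```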

MagiCapture: High-Resolution Multi-Concept Portrait Customization

  • paper_url: http://arxiv.org/abs/2309.06895
  • repo_url: None
  • paper_authors: Junha Hyung, Jaeyo Shin, Jaegul Choo
  • for: Personalizing large-scale text-to-image models, including Stable Diffusion, to generate high-fidelity portrait images.
  • methods: A personalization method that integrates subject and style concepts from just a few reference images to generate high-resolution portraits, using a novel Attention Refocusing loss and auxiliary priors for robust learning in a weakly supervised setting.
  • results: MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and also generalizes to non-human objects.
    Abstract Large-scale text-to-image models including Stable Diffusion are capable of generating high-fidelity photorealistic portrait images. There is an active research area dedicated to personalizing these models, aiming to synthesize specific subjects or styles using provided sets of reference images. However, despite the plausible results from these personalization methods, they tend to produce images that often fall short of realism and are not yet on a commercially viable level. This is particularly noticeable in portrait image generation, where any unnatural artifact in human faces is easily discernible due to our inherent human bias. To address this, we introduce MagiCapture, a personalization method for integrating subject and style concepts to generate high-resolution portrait images using just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge with this task is the absence of ground truth for the composed concepts, leading to a reduction in the quality of the final output and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning within this weakly supervised learning setting. Our pipeline also includes additional post-processing steps to ensure the creation of highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to other non-human objects.

Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?

  • paper_url: http://arxiv.org/abs/2309.06891
  • repo_url: https://github.com/billpsomas/simpool
  • paper_authors: Bill Psomas, Ioannis Kakogeorgiou, Konstantinos Karantzalos, Yannis Avrithis
  • for: This paper aims to improve the performance of both convolutional and transformer encoders by developing a generic pooling framework and proposing a simple attention-based pooling mechanism called SimPool.
  • methods: The paper uses a combination of theoretical analysis and experimental evaluation to compare the properties of different pooling methods and derive the SimPool mechanism. The authors also propose a simple attention mechanism that can be used as a replacement for the default pooling method in both convolutional and transformer encoders.
  • results: The paper shows that SimPool improves performance on pre-training and downstream tasks, and provides attention maps that delineate object boundaries in all cases, whether supervised or self-supervised. The authors claim that SimPool is “universal” because it can be used with any type of supervision or attention mechanism, and it provides attention maps of at least as good quality as self-supervised methods without explicit losses or modifying the architecture.
    Abstract Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then we formulate a number of existing methods as instantiations. By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism as a replacement of the default one for both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as self-supervised, without explicit losses or modifying the architecture. Code at: https://github.com/billpsomas/simpool.
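As a rough illustration of attention-based pooling of this kind (a simplified stand-in, not the exact SimPool layer), the global-average vector can act as a query over the patch tokens, and the resulting attention map doubles as a localization map:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttentionPooling(nn.Module):
    """Minimal attention-based pooling: the GAP vector queries the spatial/patch
    tokens; the attention weights can be inspected as object-boundary maps."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)

    def forward(self, tokens):                                   # tokens: (B, N, D)
        query = self.q(tokens.mean(dim=1, keepdim=True))         # (B, 1, D) GAP query
        keys = self.k(tokens)                                     # (B, N, D)
        attn = F.softmax(query @ keys.transpose(1, 2) / keys.size(-1) ** 0.5, dim=-1)
        pooled = (attn @ tokens).squeeze(1)                       # (B, D) pooled feature
        return pooled, attn.squeeze(1)                            # (B, N) attention map

pooled, attn_map = SimpleAttentionPooling(96)(torch.randn(2, 196, 96))
```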

Manufacturing Quality Control with Autoencoder-Based Defect Localization and Unsupervised Class Selection

  • paper_url: http://arxiv.org/abs/2309.06884
  • repo_url: None
  • paper_authors: Devang Mehta, Noah Klarmann
  • for: This paper aims to improve visual defect localization in manufacturing industries using a defect localizing autoencoder with unsupervised class selection.
  • methods: Features extracted from a pre-trained VGG-16 network are clustered with k-means to select the most relevant classes of defects, and the selected classes are augmented with natural wild textures to simulate artificial defects.
  • results: The method precisely and accurately localizes quality defects on melamine-faced boards for the furniture industry, and incorporating artificial defects into the training data shows significant potential for practical implementation in real-world quality control scenarios.
    Abstract Manufacturing industries require efficient and voluminous production of high-quality finished goods. In the context of Industry 4.0, visual anomaly detection poses an optimistic solution for automatically controlling product quality with high precision. Automation based on computer vision poses a promising solution to prevent bottlenecks at the product quality checkpoint. We considered recent advancements in machine learning to improve visual defect localization, but challenges persist in obtaining a balanced feature set and database of the wide variety of defects occurring in the production line. This paper proposes a defect localizing autoencoder with unsupervised class selection by clustering with k-means the features extracted from a pre-trained VGG-16 network. The selected classes of defects are augmented with natural wild textures to simulate artificial defects. The study demonstrates the effectiveness of the defect localizing autoencoder with unsupervised class selection for improving defect detection in manufacturing industries. The proposed methodology shows promising results with precise and accurate localization of quality defects on melamine-faced boards for the furniture industry. Incorporating artificial defects into the training data shows significant potential for practical implementation in real-world quality control scenarios.
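The localization step reduces to a per-pixel reconstruction error map from an autoencoder trained on normal (and synthetically defected) samples. A toy sketch with an illustrative architecture much smaller than one used in practice:

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Toy convolutional autoencoder; a real defect-localizing AE would be deeper
    and trained on defect-free boards plus synthetically defected augmentations."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        return self.dec(self.enc(x))

def defect_heatmap(model, image):
    """Localize defects as per-pixel reconstruction error: regions the AE cannot
    reproduce (because they never appear in normal training data) light up."""
    with torch.no_grad():
        recon = model(image)
    return (image - recon).abs().mean(dim=1)          # (B, H, W) anomaly map

heat = defect_heatmap(TinyAE(), torch.rand(1, 3, 64, 64))
```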

ProMap: Datasets for Product Mapping in E-commerce

  • paper_url: http://arxiv.org/abs/2309.06882
  • repo_url: None
  • paper_authors: Kateřina Macková, Martin Pilát
  • for: Two datasets for deciding whether listings from two different e-shops describe the same product.
  • methods: The datasets contain both images and textual descriptions of the products, including their specifications, making them among the most complete datasets for product mapping.
  • results: The datasets provide a golden standard for product mapping that fills gaps in existing datasets and can be used to train and evaluate machine-learning models for product matching.
    Abstract The goal of product mapping is to decide, whether two listings from two different e-shops describe the same products. Existing datasets of matching and non-matching pairs of products, however, often suffer from incomplete product information or contain only very distant non-matching products. Therefore, while predictive models trained on these datasets achieve good results on them, in practice, they are unusable as they cannot distinguish very similar but non-matching pairs of products. This paper introduces two new datasets for product mapping: ProMapCz consisting of 1,495 Czech product pairs and ProMapEn consisting of 1,555 English product pairs of matching and non-matching products manually scraped from two pairs of e-shops. The datasets contain both images and textual descriptions of the products, including their specifications, making them one of the most complete datasets for product mapping. Additionally, the non-matching products were selected in two phases, creating two types of non-matches -- close non-matches and medium non-matches. Even the medium non-matches are pairs of products that are much more similar than non-matches in other datasets -- for example, they still need to have the same brand and similar name and price. After simple data preprocessing, several machine learning algorithms were trained on these and two the other datasets to demonstrate the complexity and completeness of ProMap datasets. ProMap datasets are presented as a golden standard for further research of product mapping filling the gaps in existing ones.

Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

  • paper_url: http://arxiv.org/abs/2309.06877
  • repo_url: https://github.com/yyyooooo/dmi
  • paper_authors: Zhenguang Liu, Xinyang Yu, Ruili Wang, Shuai Ye, Zhe Ma, Jianfeng Dong, Sifeng He, Feng Qian, Xiaobo Zhang, Roger Zimmermann, Lei Yang
  • for: Improving the accuracy of video copyright-infringement detection, to protect the interests and enthusiasm of video creators.
  • methods: Two contributions: (1) disentangling the original high-dimensional feature into multiple exclusive lower-dimensional sub-features, removing redundant information; (2) learning an auxiliary feature on top of the disentangled sub-features to enhance them, guided by an analysis of the mutual information between labels and features.
  • results: Achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and sets a new state of the art on the VCSL benchmark. Code and models are released at https://github.com/yyyooooo/DMI/.
    Abstract The self-media era provides us tremendous high quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features. In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple sub-features, explicitly disentangling the feature into exclusive lower-dimensional components. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyzed the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature. Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset. Our code and model have been released at https://github.com/yyyooooo/DMI/, hoping to contribute to the community.

UniBrain: Universal Brain MRI Diagnosis with Hierarchical Knowledge-enhanced Pre-training

  • paper_url: http://arxiv.org/abs/2309.06828
  • repo_url: https://github.com/ljy19970415/unibrain
  • paper_authors: Jiayu Lei, Lisong Dai, Haoyun Jiang, Chaoyi Wu, Xiaoman Zhang, Yao Zhang, Jiangchao Yao, Weidi Xie, Yanyong Zhang, Yuehua Li, Ya Zhang, Yanfeng Wang
  • for: Proposes a more effective and scalable paradigm for universal brain MRI diagnosis, trained on large-scale data.
  • methods: A hierarchical knowledge-enhanced pre-training framework, UniBrain, built on 24,770 imaging-report pairs from routine diagnostics, with a hierarchical alignment mechanism that exploits report information at different granularities to strengthen feature-learning efficiency.
  • results: Validated on three real-world datasets with severe class imbalance and on the public BraTS2019 dataset, UniBrain consistently outperforms state-of-the-art diagnostic methods by a large margin and is comparable to expert radiologists on certain disease types.
    Abstract Magnetic resonance imaging~(MRI) have played a crucial role in brain disease diagnosis, with which a range of computer-aided artificial intelligence methods have been proposed. However, the early explorations usually focus on the limited types of brain diseases in one study and train the model on the data in a small scale, yielding the bottleneck of generalization. Towards a more effective and scalable paradigm, we propose a hierarchical knowledge-enhanced pre-training framework for the universal brain MRI diagnosis, termed as UniBrain. Specifically, UniBrain leverages a large-scale dataset of 24,770 imaging-report pairs from routine diagnostics. Different from previous pre-training techniques for the unitary vision or textual feature, or with the brute-force alignment between vision and language information, we leverage the unique characteristic of report information in different granularity to build a hierarchical alignment mechanism, which strengthens the efficiency in feature learning. Our UniBrain is validated on three real world datasets with severe class imbalance and the public BraTS2019 dataset. It not only consistently outperforms all state-of-the-art diagnostic methods by a large margin and provides a superior grounding performance but also shows comparable performance compared to expert radiologists on certain disease types.

Topology-inspired Cross-domain Network for Developmental Cervical Stenosis Quantification

  • paper_url: http://arxiv.org/abs/2309.06825
  • repo_url: None
  • paper_authors: Zhenxi Zhang, Yanyang Wang, Yao Wu, Weifei Wu
  • for: Accurate quantification of Developmental Canal Stenosis (DCS) for cervical spondylosis screening.
  • methods: A deep keypoint-localization network with cross-domain constraints between the coordinate and image domains: a keypoint-edge constraint module restricts keypoints to the vertebral edges, and a reparameterization module constrains weakly connected structures in the image-domain heatmaps.
  • results: The proposed Topology-inspired Cross-domain Network (TCN) better suppresses keypoint distortion and weakly connected structures, and shows superior accuracy and generalizability over competing localization methods across distinct quantification tasks.
    Abstract Developmental Canal Stenosis (DCS) quantification is crucial in cervical spondylosis screening. Compared with quantifying DCS manually, a more efficient and time-saving manner is provided by deep keypoint localization networks, which can be implemented in either the coordinate or the image domain. However, the vertebral visualization features often lead to abnormal topological structures during keypoint localization, including keypoint distortion with edges and weakly connected structures, which cannot be fully suppressed in either the coordinate or image domain alone. To overcome this limitation, a keypoint-edge and a reparameterization modules are utilized to restrict these abnormal structures in a cross-domain manner. The keypoint-edge constraint module restricts the keypoints on the edges of vertebrae, which ensures that the distribution pattern of keypoint coordinates is consistent with those for DCS quantification. And the reparameterization module constrains the weakly connected structures in image-domain heatmaps with coordinates combined. Moreover, the cross-domain network improves spatial generalization by utilizing heatmaps and incorporating coordinates for accurate localization, which avoids the trade-off between these two properties in an individual domain. Comprehensive results of distinct quantification tasks show the superiority and generability of the proposed Topology-inspired Cross-domain Network (TCN) compared with other competing localization methods.

Tracking Particles Ejected From Active Asteroid Bennu With Event-Based Vision

  • paper_url: http://arxiv.org/abs/2309.06819
  • repo_url: None
  • paper_authors: Loïc J. Azzalini, Dario Izzo
  • for: Detecting and tracking particles ejected in the vicinity of small solar system bodies, to guarantee spacecraft safety and support scientific observation.
  • methods: An event-based camera, rather than a standard frame-based camera, is proposed for detecting and tracking centimetre-sized particles; the particle ejection episodes reported by the OSIRIS-REx mission are reconstructed in a photorealistic scene generator to simulate event-based observations.
  • results: The resulting spatiotemporal event streams could increase the scientific return of similar time-constrained missions, complement existing onboard imaging techniques, and support future work on event-based multi-object tracking.
    Abstract Early detection and tracking of ejecta in the vicinity of small solar system bodies is crucial to guarantee spacecraft safety and support scientific observation. During the visit of active asteroid Bennu, the OSIRIS-REx spacecraft relied on the analysis of images captured by onboard navigation cameras to detect particle ejection events, which ultimately became one of the mission's scientific highlights. To increase the scientific return of similar time-constrained missions, this work proposes an event-based solution that is dedicated to the detection and tracking of centimetre-sized particles. Unlike a standard frame-based camera, the pixels of an event-based camera independently trigger events indicating whether the scene brightness has increased or decreased at that time and location in the sensor plane. As a result of the sparse and asynchronous spatiotemporal output, event cameras combine very high dynamic range and temporal resolution with low-power consumption, which could complement existing onboard imaging techniques. This paper motivates the use of a scientific event camera by reconstructing the particle ejection episodes reported by the OSIRIS-REx mission in a photorealistic scene generator and in turn, simulating event-based observations. The resulting streams of spatiotemporal data support future work on event-based multi-object tracking.
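Event-based sensing can be approximated from ordinary frames for intuition. The sketch below is a naive simulation (the threshold, noise-free assumption, and toy data are illustrative; purpose-built simulators are far more faithful): a pixel emits an event whenever its log-brightness changes by more than a contrast threshold.

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.15):
    """Naive event-camera simulation: emit (t, y, x, polarity) whenever the
    log-brightness at a pixel moves by more than `threshold` since the last
    event at that pixel."""
    log_ref = np.log(frames[0] + 1e-6)                 # per-pixel reference level
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_f = np.log(frame + 1e-6)
        diff = log_f - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, y, x, 1 if diff[y, x] > 0 else -1))
            log_ref[y, x] = log_f[y, x]                # reset reference where an event fired
    return events

# toy usage: a bright "particle" moving across a dark background
frames = np.zeros((5, 32, 32), dtype=np.float32)
for i in range(5):
    frames[i, 16, 5 + 4 * i] = 1.0
events = frames_to_events(frames, timestamps=np.arange(5) * 0.01)
print(len(events), events[:3])
```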

TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

  • paper_url: http://arxiv.org/abs/2309.06809
  • repo_url: None
  • paper_authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Horst Possegger, Rogerio Feris, Horst Bischof
  • for: Improving the visual recognition performance of vision-and-language models (VLMs) such as CLIP by better fitting the data distributions of downstream tasks.
  • methods: Text-only training of the VLM on data generated by large language models (LLMs), with targeted prompting that takes the specifics of the downstream task into account when sampling text.
  • results: Compared with the SOTA text-only VLM training approach, up to 8.4% improvement in (cross-)domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification over strong baselines.
    Abstract Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines.
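The text-only adaptation idea relies on CLIP's shared text-image embedding space: a classifier fitted on embeddings of LLM-generated, task-targeted sentences can later be applied to image embeddings. A hedged sketch using the OpenAI CLIP package (the generated prompts and the linear-probe setup are illustrative assumptions, not the paper's exact procedure):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated, task-targeted training sentences per class.
generated = {
    "sketch of a dog": ["a rough pencil sketch of a dog", "a hand-drawn dog outline"],
    "sketch of a cat": ["a rough pencil sketch of a cat", "a hand-drawn cat outline"],
}
classes = list(generated)
texts, labels = [], []
for idx, cls in enumerate(classes):
    texts += generated[cls]
    labels += [idx] * len(generated[cls])

# Train a linear probe on *text* embeddings only; because CLIP aligns text and
# image spaces, the same probe can later score image embeddings.
with torch.no_grad():
    feats = model.encode_text(clip.tokenize(texts).to(device)).float()
feats = feats / feats.norm(dim=-1, keepdim=True)

probe = torch.nn.Linear(feats.size(-1), len(classes))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(feats), torch.tensor(labels))
    loss.backward()
    opt.step()
```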

Dynamic NeRFs for Soccer Scenes

  • paper_url: http://arxiv.org/abs/2309.06802
  • repo_url: https://github.com/iSach/SoccerNeRFs
  • paper_authors: Sacha Lewin, Maxime Vandegar, Thomas Hoyoux, Olivier Barnich, Gilles Louppe
  • for: Tackles the long-standing novel view synthesis problem for sports broadcasting, in particular high-quality synthetic replays of soccer scenes.
  • methods: Dynamic neural radiance fields (NeRFs), i.e., neural models purposed to reconstruct general dynamic content, evaluated on composed synthetic soccer environments captured by multiple static cameras.
  • results: Although this approach cannot yet fully meet the quality requirements of the target broadcast application, it suggests promising avenues toward a cost-efficient, automatic solution; the dataset and code are publicly available.
    Abstract The long-standing problem of novel view synthesis has many applications, notably in sports broadcasting. Photorealistic novel view synthesis of soccer actions, in particular, is of enormous interest to the broadcast industry. Yet only a few industrial solutions have been proposed, and even fewer that achieve near-broadcast quality of the synthetic replays. Except for their setup of multiple static cameras around the playfield, the best proprietary systems disclose close to no information about their inner workings. Leveraging multiple static cameras for such a task indeed presents a challenge rarely tackled in the literature, for a lack of public datasets: the reconstruction of a large-scale, mostly static environment, with small, fast-moving elements. Recently, the emergence of neural radiance fields has induced stunning progress in many novel view synthesis applications, leveraging deep learning principles to produce photorealistic results in the most challenging settings. In this work, we investigate the feasibility of basing a solution to the task on dynamic NeRFs, i.e., neural models purposed to reconstruct general dynamic content. We compose synthetic soccer environments and conduct multiple experiments using them, identifying key components that help reconstruct soccer scenes with dynamic NeRFs. We show that, although this approach cannot fully meet the quality requirements for the target application, it suggests promising avenues toward a cost-efficient, automatic solution. We also make our work dataset and code publicly available, with the goal to encourage further efforts from the research community on the task of novel view synthesis for dynamic soccer scenes. For code, data, and video results, please see https://soccernerfs.isach.be.

Motion-Bias-Free Feature-Based SLAM

  • paper_url: http://arxiv.org/abs/2309.06792
  • repo_url: None
  • paper_authors: Alejandro Fontan, Javier Civera, Michael Milford
  • for: Safe deployment of SLAM in unstructured real-world environments requires key properties that conventional benchmarks do not cover, among them consistency between forward and reverse traverses of the same route.
  • methods: Several contributions to feature-based SLAM pipelines, implemented in ORB-SLAM2, that remedy the motion bias between forward and reverse directions of travel.
  • results: In a comprehensive evaluation across four datasets, the contributions substantially reduce the bias between forward and backward motion while also improving the aggregated trajectory error; removing SLAM motion bias matters for the wide range of robotics and computer vision applications where performance consistency is important.
    Abstract For SLAM to be safely deployed in unstructured real world environments, it must possess several key properties that are not encompassed by conventional benchmarks. In this paper we show that SLAM commutativity, that is, consistency in trajectory estimates on forward and reverse traverses of the same route, is a significant issue for the state of the art. Current pipelines show a significant bias between forward and reverse directions of travel, that is in addition inconsistent regarding which direction of travel exhibits better performance. In this paper we propose several contributions to feature-based SLAM pipelines that remedies the motion bias problem. In a comprehensive evaluation across four datasets, we show that our contributions implemented in ORB-SLAM2 substantially reduce the bias between forward and backward motion and additionally improve the aggregated trajectory error. Removing the SLAM motion bias has significant relevance for the wide range of robotics and computer vision applications where performance consistency is important.

Remote Sensing Object Detection Meets Deep Learning: A Meta-review of Challenges and Advances

  • paper_url: http://arxiv.org/abs/2309.06751
  • repo_url: None
  • paper_authors: Xiangrong Zhang, Tianyang Zhang, Guanchun Wang, Peng Zhu, Xu Tang, Xiuping Jia, Licheng Jiao
  • for: A comprehensive review of recent achievements in deep-learning-based remote sensing object detection (RSOD), covering more than 300 papers.
  • methods: Identifies five main challenges in RSOD -- multi-scale, rotated, weak, tiny, and limited-supervision object detection -- and systematically reviews the corresponding methods in a hierarchical division.
  • results: Also reviews the widely used benchmark datasets, evaluation metrics, and application scenarios of RSOD, and provides directions for future research.
    Abstract Remote sensing object detection (RSOD), one of the most fundamental and challenging tasks in the remote sensing field, has received longstanding attention. In recent years, deep learning techniques have demonstrated robust feature representation capabilities and led to a big leap in the development of RSOD techniques. In this era of rapid technical evolution, this review aims to present a comprehensive review of the recent achievements in deep learning based RSOD methods. More than 300 papers are covered in this review. We identify five main challenges in RSOD, including multi-scale object detection, rotated object detection, weak object detection, tiny object detection, and object detection with limited supervision, and systematically review the corresponding methods developed in a hierarchical division manner. We also review the widely used benchmark datasets and evaluation metrics within the field of RSOD, as well as the application scenarios for RSOD. Future research directions are provided for further promoting the research in RSOD.

MFL-YOLO: An Object Detection Model for Damaged Traffic Signs

  • paper_url: http://arxiv.org/abs/2309.06750
  • repo_url: None
  • paper_authors: Tengyang Chen, Jiangtao Ren
  • for: Proposes an improved object detection method based on YOLOv5s, named MFL-YOLO, for detecting damaged traffic signs.
  • methods: A simple cross-level loss function gives each level of the model its own role, so the model learns more diverse and fine-grained features; GSConv and VoVGSCSP replace the traditional convolution and CSP in the neck of YOLOv5s to reduce scale and computational complexity.
  • results: Compared with YOLOv5s, MFL-YOLO improves F1-score by 4.3 and mAP by 5.1 while reducing FLOPs by 8.9%; further experiments on CCTSDB2021 and TT100K validate its generalization.
    Abstract Traffic signs are important facilities to ensure traffic safety and smooth flow, but may be damaged due to many reasons, which poses a great safety hazard. Therefore, it is important to study a method to detect damaged traffic signs. Existing object detection techniques for damaged traffic signs are still absent. Since damaged traffic signs are closer in appearance to normal ones, it is difficult to capture the detailed local damage features of damaged traffic signs using traditional object detection methods. In this paper, we propose an improved object detection method based on YOLOv5s, namely MFL-YOLO (Mutual Feature Levels Loss enhanced YOLO). We designed a simple cross-level loss function so that each level of the model has its own role, which is beneficial for the model to be able to learn more diverse features and improve the fine granularity. The method can be applied as a plug-and-play module and it does not increase the structural complexity or the computational complexity while improving the accuracy. We also replaced the traditional convolution and CSP with the GSConv and VoVGSCSP in the neck of YOLOv5s to reduce the scale and computational complexity. Compared with YOLOv5s, our MFL-YOLO improves 4.3 and 5.1 in F1 scores and mAP, while reducing the FLOPs by 8.9%. The Grad-CAM heat map visualization shows that our model can better focus on the local details of the damaged traffic signs. In addition, we also conducted experiments on CCTSDB2021 and TT100K to further validate the generalization of our model.

Integrating GAN and Texture Synthesis for Enhanced Road Damage Detection

  • paper_url: http://arxiv.org/abs/2309.06747
  • repo_url: None
  • paper_authors: Tengyang Chen, Jiangtao Ren
  • for: Improving road damage detection accuracy to ensure safe driving and prolong road durability.
  • methods: A GAN generates damage with diverse shapes, texture synthesis extracts road textures, and the two are mixed with different weights to control the severity of the synthesized damage, which is then embedded back into the original images via Poisson blending; structural similarity drives automated sample selection during embedding.
  • results: The method eliminates manual labor and improves mAP by 4.1% and F1-score by 4.5%.
    Abstract In the domain of traffic safety and road maintenance, precise detection of road damage is crucial for ensuring safe driving and prolonging road durability. However, current methods often fall short due to limited data. Prior attempts have used Generative Adversarial Networks to generate damage with diverse shapes and manually integrate it into appropriate positions. However, the problem has not been well explored and is faced with two challenges. First, they only enrich the location and shape of damage while neglect the diversity of severity levels, and the realism still needs further improvement. Second, they require a significant amount of manual effort. To address these challenges, we propose an innovative approach. In addition to using GAN to generate damage with various shapes, we further employ texture synthesis techniques to extract road textures. These two elements are then mixed with different weights, allowing us to control the severity of the synthesized damage, which are then embedded back into the original images via Poisson blending. Our method ensures both richness of damage severity and a better alignment with the background. To save labor costs, we leverage structural similarity for automated sample selection during embedding. Each augmented data of an original image contains versions with varying severity levels. We implement a straightforward screening strategy to mitigate distribution drift. Experiments are conducted on a public road damage dataset. The proposed method not only eliminates the need for manual labor but also achieves remarkable enhancements, improving the mAP by 4.1% and the F1-score by 4.5%.
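The augmentation pipeline, mixing generated damage with synthesized road texture to set severity and then Poisson-blending the result into the scene, can be sketched with OpenCV (stand-in images and parameters; the real pipeline adds SSIM-based sample selection):

```python
import cv2
import numpy as np

def blend_damage(road_img, damage_patch, texture_patch, severity, center):
    """Mix a generated damage patch with a road-texture patch (higher `severity`
    keeps more damage), then embed it seamlessly with Poisson blending.
    All images are uint8 BGR; damage_patch and texture_patch share one size."""
    mixed = cv2.addWeighted(damage_patch, severity, texture_patch, 1.0 - severity, 0)
    mask = np.zeros(mixed.shape[:2], dtype=np.uint8)
    mask[2:-2, 2:-2] = 255                       # leave a small border for the solver
    return cv2.seamlessClone(mixed, road_img, mask, center, cv2.NORMAL_CLONE)

# toy usage with stand-in images
road = np.full((256, 256, 3), 120, dtype=np.uint8)
damage = np.random.randint(0, 60, (64, 64, 3), dtype=np.uint8)
texture = np.full((64, 64, 3), 120, dtype=np.uint8)
augmented = blend_damage(road, damage, texture, severity=0.7, center=(128, 128))
```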

VEATIC: Video-based Emotion and Affect Tracking in Context Dataset

  • paper_url: http://arxiv.org/abs/2309.06745
  • repo_url: None
  • paper_authors: Zhihang Ren, Jefferson Ortega, Yifan Wang, Zhimin Chen, Yunhui Guo, Stella X. Yu, David Whitney
  • for: Provides a new large-scale dataset for better understanding the mechanisms of human affect recognition and for training models that generalize to common cases.
  • methods: VEATIC contains 124 clips from Hollywood movies, documentaries, and home videos, with continuous valence and arousal ratings of every frame obtained via real-time annotation; a new computer vision task is proposed to infer the affect of a selected character from both context and character information in each frame, together with a simple baseline model.
  • results: Experiments show that models pre-trained on VEATIC compete with models trained on other similar datasets, indicating the generalizability of VEATIC.
    Abstract Human affect recognition has been a significant topic in psychophysics and computer vision. However, the currently published datasets have many limitations. For example, most datasets contain frames that contain only information about facial expressions. Due to the limitations of previous datasets, it is very hard to either understand the mechanisms for affect recognition of humans or generalize well on common cases for computer vision models trained on those datasets. In this work, we introduce a brand new large dataset, the Video-based Emotion and Affect Tracking in Context Dataset (VEATIC), that can conquer the limitations of the previous datasets. VEATIC has 124 video clips from Hollywood movies, documentaries, and home videos with continuous valence and arousal ratings of each frame via real-time annotation. Along with the dataset, we propose a new computer vision task to infer the affect of the selected character via both context and character information in each video frame. Additionally, we propose a simple model to benchmark this new computer vision task. We also compare the performance of the pretrained model using our dataset with other similar datasets. Experiments show the competing results of our pretrained model via VEATIC, indicating the generalizability of VEATIC. Our dataset is available at https://veatic.github.io.

MTD: Multi-Timestep Detector for Delayed Streaming Perception

  • paper_url: http://arxiv.org/abs/2309.06742
  • repo_url: https://github.com/yulin1004/mtd
  • paper_authors: Yihui Huang, Ningjiang Chen
  • for: Autonomous driving systems require real-time environmental perception, but hardware limitations and high temperatures inevitably cause delays, offsetting the model output from the world state.
  • methods: Proposes the Multi-Timestep Detector (MTD), an end-to-end detector that uses dynamic routing for multi-branch future prediction so the model can resist delay fluctuations; a Delay Analysis Module (DAM) continuously monitors the inference stack and estimates the delay trend, and a Timestep Branch Module (TBM) with static and adaptive flows predicts specific timesteps according to that trend.
  • results: Evaluated on the Argoverse-HD dataset, the method achieves state-of-the-art performance across various delay settings.
    Abstract Autonomous driving systems require real-time environmental perception to ensure user safety and experience. Streaming perception is a task of reporting the current state of the world, which is used to evaluate the delay and accuracy of autonomous driving systems. In real-world applications, factors such as hardware limitations and high temperatures inevitably cause delays in autonomous driving systems, resulting in the offset between the model output and the world state. In order to solve this problem, this paper propose the Multi- Timestep Detector (MTD), an end-to-end detector which uses dynamic routing for multi-branch future prediction, giving model the ability to resist delay fluctuations. A Delay Analysis Module (DAM) is proposed to optimize the existing delay sensing method, continuously monitoring the model inference stack and calculating the delay trend. Moreover, a novel Timestep Branch Module (TBM) is constructed, which includes static flow and adaptive flow to adaptively predict specific timesteps according to the delay trend. The proposed method has been evaluated on the Argoverse-HD dataset, and the experimental results show that it has achieved state-of-the-art performance across various delay settings.

GelFlow: Self-supervised Learning of Optical Flow for Vision-Based Tactile Sensor Displacement Measurement

  • paper_url: http://arxiv.org/abs/2309.06735
  • repo_url: None
  • paper_authors: Zhiyuan Zhang, Hua Yang, Zhouping Yin
  • for: To support more dexterous robot finger manipulation, using the high-resolution multi-modality information that vision-based tactile sensors can acquire.
  • methods: A self-supervised optical flow method based on deep learning is used to address the accuracy limitations of existing optical flow methods.
  • results: The proposed deep-learning-based optical flow method achieves high-accuracy displacement measurement, outperforming traditional and deep-learning optical flow baselines.
    Abstract High-resolution multi-modality information acquired by vision-based tactile sensors can support more dexterous manipulations for robot fingers. Optical flow is low-level information directly obtained by vision-based tactile sensors, which can be transformed into other modalities like force, geometry and depth. Current vision-tactile sensors employ optical flow methods from OpenCV to estimate the deformation of markers in gels. However, these methods are not precise enough to accurately measure the displacement of markers during large elastic deformation of the gel, which can significantly impact the accuracy of downstream tasks. This study proposes a self-supervised optical flow method based on deep learning to achieve high accuracy in displacement measurement for vision-based tactile sensors. The proposed method employs a coarse-to-fine strategy to handle large deformations by constructing a multi-scale feature pyramid from the input image. To better deal with the elastic deformation caused by the gel, the Helmholtz velocity decomposition constraint combined with the elastic deformation constraint is adopted to address the distortion rate and area change rate, respectively. A local flow fusion module is designed to smooth the optical flow, taking into account the prior knowledge of the blurred effect of gel deformation. We trained the proposed self-supervised network using an open-source dataset and compared it with traditional and deep learning-based optical flow methods. The results show that the proposed method achieved the highest displacement measurement accuracy, thereby demonstrating its potential for enabling more precise measurement of downstream tasks using vision-based tactile sensors.
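
A hedged sketch of a self-supervised objective in this spirit: photometric reconstruction via flow warping, plus divergence and curl penalties as simple stand-ins for the Helmholtz / elastic-deformation constraints. The exact loss weights and terms here are assumptions, not the paper's formulation.

```python
# Sketch (not GelFlow's exact losses): warp frame1 by the predicted flow for a
# photometric term, and regularize the divergence and curl of the flow field.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img)        # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0              # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(img, grid_n, align_corners=True)

def flow_div_curl(flow):
    u, v = flow[:, 0], flow[:, 1]
    du_dx = u[:, :-1, 1:] - u[:, :-1, :-1]
    du_dy = u[:, 1:, :-1] - u[:, :-1, :-1]
    dv_dx = v[:, :-1, 1:] - v[:, :-1, :-1]
    dv_dy = v[:, 1:, :-1] - v[:, :-1, :-1]
    return (du_dx + dv_dy).abs().mean(), (dv_dx - du_dy).abs().mean()

def self_supervised_loss(frame0, frame1, flow, w_div=0.1, w_curl=0.1):
    photo = (warp(frame1, flow) - frame0).abs().mean()
    div, curl = flow_div_curl(flow)
    return photo + w_div * div + w_curl * curl

frame0, frame1 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
flow = torch.zeros(2, 2, 64, 64, requires_grad=True)
print(self_supervised_loss(frame0, frame1, flow).item())
```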

Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer

  • paper_url: http://arxiv.org/abs/2309.07929
  • repo_url: None
  • paper_authors: Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, Xi Li
  • for: To address data scarcity and varying data distributions in Audio-Visual Localization and Segmentation under zero-shot and few-shot settings, improving model generalization.
  • methods: An encoder-prompt-decoder paradigm is proposed: a Semantic-aware Audio Prompt (SAP) is first constructed to help the visual foundation model attend to sounding objects while shrinking the semantic gap between the visual and audio modalities, and a Correlation Adapter (ColA) is then developed to keep training effort minimal while preserving the knowledge of the visual foundation model.
  • results: Extensive experiments show that, compared with other fusion-based methods, the approach performs better on unseen classes and in cross-dataset settings, indicating that it generalizes better to unseen data.
    Abstract Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects, meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model. By equipping with these means, extensive experiments demonstrate that this new paradigm outperforms other fusion-based methods in both the unseen class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios.
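
A minimal sketch of the prompt-plus-adapter idea around a frozen visual block; the module names, prompt count, and low-rank adapter form are my assumptions, not the authors' implementation.

```python
# Sketch (assumption, not the paper's code): audio features are projected into
# prompt tokens prepended to a frozen visual encoder's token sequence, and a
# small low-rank residual adapter is the only trainable part besides the prompt.
import torch
import torch.nn as nn

class AudioPrompt(nn.Module):
    def __init__(self, audio_dim=128, vis_dim=768, num_prompts=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, vis_dim * num_prompts)
        self.num_prompts, self.vis_dim = num_prompts, vis_dim

    def forward(self, audio_feat):                    # (B, audio_dim)
        return self.proj(audio_feat).view(-1, self.num_prompts, self.vis_dim)

class LowRankAdapter(nn.Module):
    def __init__(self, dim=768, rank=16):
        super().__init__()
        self.down, self.up = nn.Linear(dim, rank), nn.Linear(rank, dim)

    def forward(self, x):                             # residual adaptation
        return x + self.up(torch.relu(self.down(x)))

frozen_block = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
for p in frozen_block.parameters():                   # visual foundation block stays frozen
    p.requires_grad_(False)

prompt, adapter = AudioPrompt(), LowRankAdapter()
audio_feat = torch.randn(2, 128)
vis_tokens = torch.randn(2, 196, 768)                 # patch tokens from a frozen ViT
tokens = torch.cat([prompt(audio_feat), vis_tokens], dim=1)
out = adapter(frozen_block(tokens))
print(out.shape)                                      # torch.Size([2, 200, 768])
```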

Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2309.06728
  • repo_url: None
  • paper_authors: Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Xiatian Zhu
  • for: To propose an unsupervised audio-visual segmentation method that avoids tedious task-specific annotation and training in practical applications.
  • methods: The method builds on a Cross-Modality Semantic Filtering (CMSF) strategy that leverages off-the-shelf multi-modal foundation models (e.g., detection [1], open-world segmentation [2], and multi-modal alignment [3]) to accurately associate audio-mask pairs.
  • results: Experiments show that the method performs well compared with existing supervised approaches in complex scenarios, especially when multiple sounding objects overlap.
    Abstract Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level. Existing AVS methods require fine-grained annotations of audio-mask pairs in supervised learning fashion. This limits their scalability since it is time consuming and tedious to acquire such cross-modality pixel level labels. To overcome this obstacle, in this work we introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training. For tackling this newly proposed problem, we formulate a novel Cross-Modality Semantic Filtering (CMSF) approach to accurately associate the underlying audio-mask pairs by leveraging the off-the-shelf multi-modal foundation models (e.g., detection [1], open-world segmentation [2] and multi-modal alignment [3]). Guiding the proposal generation by either audio or visual cues, we design two training-free variants: AT-GDINO-SAM and OWOD-BIND. Extensive experiments on the AVS-Bench dataset show that our unsupervised approach can perform well in comparison to prior art supervised counterparts across complex scenarios with multiple auditory objects. Particularly, in situations where existing supervised AVS methods struggle with overlapping foreground objects, our models still excel in accurately segmenting overlapped auditory objects. Our code will be publicly released.
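
A training-free pipeline sketch of the general idea. The functions `propose_masks`, `embed_image_region`, and `embed_audio` are hypothetical placeholders for off-the-shelf foundation models (an open-world segmenter and a multi-modal alignment model); none of these names come from the paper or from a real library.

```python
# Sketch under stated assumptions: score mask proposals against the audio clip
# in a shared embedding space and keep the ones above a similarity threshold.
import numpy as np

def select_sounding_masks(image, audio, propose_masks, embed_image_region,
                          embed_audio, sim_threshold=0.25):
    """Keep the mask proposals whose region embedding aligns with the audio clip."""
    masks = propose_masks(image)                      # list of binary HxW arrays
    audio_emb = embed_audio(audio)                    # (D,)
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    selected = []
    for mask in masks:
        region_emb = embed_image_region(image, mask)  # (D,)
        region_emb = region_emb / np.linalg.norm(region_emb)
        if float(region_emb @ audio_emb) >= sim_threshold:
            selected.append(mask)
    return selected

# Toy stand-ins so the sketch runs end to end.
rng = np.random.default_rng(0)
image, audio = rng.random((64, 64, 3)), rng.random(16000)
propose = lambda img: [rng.random((64, 64)) > 0.5 for _ in range(3)]
embed_region = lambda img, m: rng.random(32)
embed_clip = lambda a: rng.random(32)
print(len(select_sounding_masks(image, audio, propose, embed_region, embed_clip)))
```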

Deep Nonparametric Convexified Filtering for Computational Photography, Image Synthesis and Adversarial Defense

  • paper_url: http://arxiv.org/abs/2309.06724
  • repo_url: None
  • paper_authors: Jianqiao Wangni
  • for: To provide a general computational photography framework that recovers the real scene from imperfect images via Deep Nonparametric Convexified Filtering (DNCF).
  • methods: A nonparametric deep network is used to mimic the physical equations behind image formation, such as denoising, super-resolution, inpainting, and flash. DNCF has no parameterization dependent on training data and therefore has strong generalization and robustness to adversarial image manipulation.
  • results: During inference, the network parameters are encouraged to be nonnegative, creating a bi-convex function of the input and parameters that suits second-order optimization and yields a 10x acceleration over Deep Image Prior. With these tools, experiments show that DNCF can defend image classification networks against adversarial attack algorithms in real time.
    Abstract We aim to provide a general framework for computational photography that recovers the real scene from imperfect images, via Deep Nonparametric Convexified Filtering (DNCF). It consists of a nonparametric deep network that resembles the physical equations behind image formation, such as denoising, super-resolution, inpainting, and flash. DNCF has no parameterization dependent on training data, and therefore has strong generalization and robustness to adversarial image manipulation. During inference, we also encourage the network parameters to be nonnegative and create a bi-convex function of the input and parameters; this suits second-order optimization algorithms under limited running time, giving a 10X acceleration over Deep Image Prior. With these tools, we empirically verify its capability to defend image classification deep networks against adversarial attack algorithms in real time.
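
An illustrative sketch of the nonnegativity idea only: fit a tiny image-to-image network to a single degraded observation and project its weights onto the nonnegative orthant after each step. The network, optimizer, and projection schedule are my assumptions; this is not the DNCF algorithm.

```python
# Sketch (assumptions throughout): single-image fitting with a nonnegative
# parameter projection, echoing the constraint that makes the objective
# bi-convex in the input and parameters.
import torch
import torch.nn as nn

torch.manual_seed(0)
clean = torch.rand(1, 1, 32, 32)
noisy = clean + 0.1 * torch.randn_like(clean)

net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, 3, padding=1),
)
opt = torch.optim.SGD(net.parameters(), lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = ((net(noisy) - noisy) ** 2).mean()   # data-fit term on the observation
    loss.backward()
    opt.step()
    with torch.no_grad():
        for p in net.parameters():
            p.clamp_(min=0.0)                   # projection: keep parameters nonnegative

with torch.no_grad():
    psnr = -10 * torch.log10(((net(noisy) - clean) ** 2).mean())
print(f"PSNR vs. clean: {psnr.item():.2f} dB")
```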

Deep Attentive Time Warping

  • paper_url: http://arxiv.org/abs/2309.06720
  • repo_url: https://github.com/matsuo-shinnosuke/deep-attentive-time-warping
  • paper_authors: Shinnosuke Matsuo, Xiaomeng Wu, Gantugs Atarsaikhan, Akisato Kimura, Kunio Kashino, Brian Kenji Iwana, Seiichi Uchida
  • for: To better handle nonlinear time distortion in time series classification, this paper proposes a neural-network-based, task-adaptive time warping mechanism.
  • methods: An attention model, called the bipartite attention model, is used to build an explicit time warping mechanism, and the model is trained with metric learning.
  • results: Compared with DTW and other learnable models, the proposed model shows superior effectiveness and state-of-the-art performance on online signature verification.
    Abstract Similarity measures for time series are important problems for time series classification. To handle the nonlinear time distortions, Dynamic Time Warping (DTW) has been widely used. However, DTW is not learnable and suffers from a trade-off between robustness against time distortion and discriminative power. In this paper, we propose a neural network model for task-adaptive time warping. Specifically, we use the attention model, called the bipartite attention model, to develop an explicit time warping mechanism with greater distortion invariance. Unlike other learnable models using DTW for warping, our model predicts all local correspondences between two time series and is trained based on metric learning, which enables it to learn the optimal data-dependent warping for the target task. We also propose to induce pre-training of our model by DTW to improve the discriminative power. Extensive experiments demonstrate the superior effectiveness of our model over DTW and its state-of-the-art performance in online signature verification.
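
A hedged sketch of the underlying idea: a cross-attention matrix between two sequences acts as a soft warping, and the dissimilarity is the attention-weighted feature distance, trained with a triplet-style metric loss. The encoder, the specific loss, and all sizes here are assumptions, not the authors' bipartite attention model.

```python
# Sketch (not the paper's architecture): soft correspondence via cross-attention
# plus a triplet margin loss as one possible metric-learning objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftWarpDistance(nn.Module):
    def __init__(self, in_dim=2, dim=32):
        super().__init__()
        self.enc = nn.GRU(in_dim, dim, batch_first=True)

    def forward(self, a, b):                     # a: (B,Ta,in_dim), b: (B,Tb,in_dim)
        ha, _ = self.enc(a)
        hb, _ = self.enc(b)
        attn = torch.softmax(ha @ hb.transpose(1, 2) / ha.size(-1) ** 0.5, dim=-1)
        warped_b = attn @ hb                     # each step of `a` gets a soft match in `b`
        return ((ha - warped_b) ** 2).mean(dim=(1, 2))

model = SoftWarpDistance()
anchor, positive, negative = (torch.randn(4, 50, 2) for _ in range(3))
d_pos, d_neg = model(anchor, positive), model(anchor, negative)
loss = F.relu(d_pos - d_neg + 1.0).mean()        # triplet-style metric learning loss
print(d_pos.shape, loss.item())
```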

MPI-Flow: Learning Realistic Optical Flow with Multiplane Images

  • paper_url: http://arxiv.org/abs/2309.06714
  • repo_url: https://github.com/sharpiless/mpi-flow
  • paper_authors: Yingping Liang, Jiaming Liu, Debing Zhang, Ying Fu
  • for: This work aims to improve the practicality of learning-based optical flow models by converting real-world images into realistic optical flow datasets.
  • methods: A layered depth representation, the multiplane image (MPI), is used to create highly realistic novel-view images, and the optical flow of each plane is computed from the camera matrix and plane depths. An independent object motion module is also developed to decouple camera motion from dynamic object motion.
  • results: Experiments show that the method performs best on real-world datasets and achieves state-of-the-art results in both unsupervised and supervised training. Code will be released at: https://github.com/Sharpiless/MPI-Flow.
    Abstract The accuracy of learning-based optical flow estimation models heavily relies on the realism of the training datasets. Current approaches for generating such datasets either employ synthetic data or generate images with limited realism. However, the domain gap of these data with real-world scenes constrains the generalization of the trained model to real-world applications. To address this issue, we investigate generating realistic optical flow datasets from real-world images. Firstly, to generate highly realistic new images, we construct a layered depth representation, known as multiplane images (MPI), from single-view images. This allows us to generate novel view images that are highly realistic. To generate optical flow maps that correspond accurately to the new image, we calculate the optical flows of each plane using the camera matrix and plane depths. We then project these layered optical flows into the output optical flow map with volume rendering. Secondly, to ensure the realism of motion, we present an independent object motion module that can separate the camera and dynamic object motion in MPI. This module addresses the deficiency in MPI-based single-view methods, where optical flow is generated only by camera motion and does not account for any object movement. We additionally devise a depth-aware inpainting module to merge new images with dynamic objects and address unnatural motion occlusions. We show the superior performance of our method through extensive experiments on real-world datasets. Moreover, our approach achieves state-of-the-art performance in both unsupervised and supervised training of learning-based models. The code will be made publicly available at: \url{https://github.com/Sharpiless/MPI-Flow}.
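
A worked sketch of the per-plane flow computation described above, using the standard plane-induced homography; this is my own derivation for illustration, not code from the repository, and the intrinsics and camera motion below are made up.

```python
# For a fronto-parallel MPI plane at depth d, camera motion (R, t) with
# intrinsics K induces H = K (R - t n^T / d) K^{-1}; the flow at pixel x is the
# homography-warped position minus x.
import numpy as np

def plane_flow(K, R, t, depth, height, width):
    n = np.array([0.0, 0.0, 1.0])                        # fronto-parallel plane normal
    H = K @ (R - np.outer(t, n) / depth) @ np.linalg.inv(K)
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    warped = H @ pix
    warped = warped[:2] / warped[2:]                      # perspective division
    return (warped - pix[:2]).T.reshape(height, width, 2)

K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.05, 0.0, 0.0])                            # small lateral camera translation
for d in [1.0, 2.0, 4.0]:                                 # nearer planes move more
    print(d, plane_flow(K, R, t, d, 64, 64)[32, 32])
```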

Transparent Object Tracking with Enhanced Fusion Module

  • paper_url: http://arxiv.org/abs/2309.06701
  • repo_url: https://github.com/kalyan0510/totem
  • paper_authors: Kalyan Garigapati, Erik Blasch, Jie Wei, Haibin Ling
  • for: To improve tracking of transparent objects in robotic tasks, where the adaptive and reflective texture of such objects degrades the performance of traditional tracking algorithms.
  • methods: A new feature fusion technique integrates transparency information into a fixed feature space so it can be used by a broader range of trackers. The fusion module consists of a transformer encoder and an MLP module, using key-query-based transformations to embed transparency information into the tracking pipeline.
  • results: A new tracker architecture built on this fusion technique achieves superior transparent object tracking results, with performance competitive with existing trackers on the TOTB benchmark.
    Abstract Accurate tracking of transparent objects, such as glasses, plays a critical role in many robotic tasks such as robot-assisted living. Due to the adaptive and often reflective texture of such objects, traditional tracking algorithms that rely on general-purpose learned features suffer from reduced performance. Recent research has proposed to instill transparency awareness into existing general object trackers by fusing purpose-built features. However, with the existing fusion techniques, the addition of new features causes a change in the latent space, making it impossible to incorporate transparency awareness into trackers with fixed latent spaces. For example, many current transformer-based trackers are fully pre-trained and are sensitive to any latent space perturbations. In this paper, we present a new feature fusion technique that integrates transparency information into a fixed feature space, enabling its use in a broader range of trackers. Our proposed fusion module, composed of a transformer encoder and an MLP module, leverages key query-based transformations to embed the transparency information into the tracking pipeline. We also present a new two-step training strategy for our fusion module to effectively merge transparency features. We propose a new tracker architecture that uses our fusion techniques to achieve superior results for transparent object tracking. Our proposed method achieves competitive results with state-of-the-art trackers on TOTB, which is the largest transparent object tracking benchmark recently released. Our results and code implementation will be made publicly available at https://github.com/kalyan0510/TOTEM.
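
A sketch of one way such a fusion module could be wired; the module name, feature sizes, and the cross-attention-plus-MLP layout are assumptions rather than the paper's exact design.

```python
# Sketch (assumptions, not the TOTEM code): cross-attention lets the tracker's
# fixed visual features query transparency features, and a residual MLP maps
# the result back into the same latent space so a pre-trained head can use it.
import torch
import torch.nn as nn

class TransparencyFusion(nn.Module):
    def __init__(self, dim=256, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, vis_tokens, transp_tokens):
        # queries come from the tracker's own features; keys/values carry transparency cues
        attn_out, _ = self.cross_attn(vis_tokens, transp_tokens, transp_tokens)
        fused = self.norm1(vis_tokens + attn_out)      # residual keeps the original space
        return self.norm2(fused + self.mlp(fused))

fusion = TransparencyFusion()
vis_tokens = torch.randn(2, 400, 256)                  # e.g. 20x20 search-region tokens
transp_tokens = torch.randn(2, 400, 256)               # transparency-aware features
print(fusion(vis_tokens, transp_tokens).shape)         # torch.Size([2, 400, 256])
```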

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

  • paper_url: http://arxiv.org/abs/2309.06680
  • repo_url: https://github.com/palaashagrawal/stupd
  • paper_authors: Palaash Agrawal, Haidi Azaman, Cheston Tan
  • for: This paper aims to improve the ability of computer vision models to perform spatial reasoning and understand temporal relations in visual scenes.
  • methods: The authors propose a large-scale video dataset called STUPD, which includes 150K visual depictions of static and dynamic spatial relationships derived from prepositions of the English language, as well as 50K visual depictions of temporal relations.
  • results: The authors show that pretraining models on the STUPD dataset leads to an increase in performance on real-world datasets (ImageNet-VidVRD and Spatial Senses) compared to other pretraining datasets.
    Abstract Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step in order to bridge visual and language models. However, current state-of-the-art computer vision models still lack the ability to perform spatial reasoning well. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD) -- a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images), consisting of 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also propose 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no dataset exists that represents temporal relations through visual settings. In this dataset, we also provide 3D information about object interactions such as frame-wise coordinates, and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better in visual relationship detection in real-world settings. We demonstrate an increase in the performance of various models over 2 real-world datasets (ImageNet-VidVRD and Spatial Senses) when pretrained on the STUPD dataset, in comparison to other pretraining datasets.

ShaDocFormer: A Shadow-attentive Threshold Detector with Cascaded Fusion Refiner for document shadow removal

  • paper_url: http://arxiv.org/abs/2309.06670
  • repo_url: None
  • paper_authors: Weiwen Chen, Shenghong Luo, Xuhang Chen, Zinuo Li, Shuqiang Wang, Chi-Man Pun
  • for: To remove the document shadows that appear when capturing documents with handheld devices, in order to improve readability.
  • methods: ShaDocFormer, a Transformer-based architecture, combines traditional methods with deep learning for document shadow removal. It comprises two components: a Shadow-attentive Threshold Detector (STD) and a Cascaded Fusion Refiner (CFR). The STD module applies a traditional thresholding technique and uses the Transformer's attention mechanism to gather global information for accurate shadow mask detection. The CFR module uses a cascaded, aggregative structure for coarse-to-fine restoration of the whole image.
  • results: Experiments show that ShaDocFormer outperforms current state-of-the-art methods both qualitatively and quantitatively.
    Abstract Document shadow is a common issue that arises when capturing documents using mobile devices, and it significantly impacts readability. Current methods encounter various challenges including inaccurate detection of shadow masks and estimation of illumination. In this paper, we propose ShaDocFormer, a Transformer-based architecture that integrates traditional methodologies and deep learning techniques to tackle the problem of document shadow removal. The ShaDocFormer architecture comprises two components: the Shadow-attentive Threshold Detector (STD) and the Cascaded Fusion Refiner (CFR). The STD module employs a traditional thresholding technique and leverages the attention mechanism of the Transformer to gather global information, thereby enabling precise detection of shadow masks. The cascaded and aggregative structure of the CFR module facilitates a coarse-to-fine restoration process for the entire image. As a result, ShaDocFormer excels in accurately detecting and capturing variations in both shadow and illumination, thereby enabling effective removal of shadows. Extensive experiments demonstrate that ShaDocFormer outperforms current state-of-the-art methods in both qualitative and quantitative measurements.
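
A minimal sketch of the "traditional thresholding" half of the STD idea, using a global Otsu threshold on luminance to get a coarse shadow mask. This is my own simplification for illustration; the paper couples thresholding with Transformer attention and a learned refiner.

```python
# Sketch: Otsu threshold on luminance as a coarse, training-free shadow mask.
import numpy as np

def otsu_threshold(gray):
    """gray: uint8 array. Returns the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                      # class-0 probability up to each level
    mu = np.cumsum(prob * np.arange(256))        # class-0 mean times omega
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def coarse_shadow_mask(rgb):
    gray = (0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]).astype(np.uint8)
    return gray < otsu_threshold(gray)           # darker-than-threshold pixels as shadow

doc = np.full((128, 128, 3), 220, dtype=np.uint8)   # bright page
doc[40:90, 30:100] -= 80                             # synthetic shadow region
mask = coarse_shadow_mask(doc)
print(mask.mean())                                   # fraction of pixels flagged as shadow
```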

LCReg: Long-Tailed Image Classification with Latent Categories based Recognition

  • paper_url: http://arxiv.org/abs/2309.07186
  • repo_url: None
  • paper_authors: Weide Liu, Zhonghua Wu, Yiming Wang, Henghui Ding, Fayao Liu, Jie Lin, Guosheng Lin
  • for: long-tailed image recognition
  • methods: Class-agnostic latent feature learning combined with semantic data augmentation is used to improve feature representations.
  • results: Extensive experiments on five long-tailed image recognition datasets show significant improvements over the baselines.
    Abstract In this work, we tackle the challenging problem of long-tailed image recognition. Previous long-tailed recognition approaches mainly focus on data augmentation or re-balancing strategies for the tail classes to give them more attention during model training. However, these methods are limited by the small number of training images for the tail classes, which results in poor feature representations. To address this issue, we propose the Latent Categories based long-tail Recognition (LCReg) method. Our hypothesis is that common latent features shared by head and tail classes can be used to improve feature representation. Specifically, we learn a set of class-agnostic latent features shared by both head and tail classes, and then use semantic data augmentation on the latent features to implicitly increase the diversity of the training sample. We conduct extensive experiments on five long-tailed image recognition datasets, and the results show that our proposed method significantly improves the baselines.
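
A rough sketch of the two ideas named in the abstract; the prototype-based head and the Gaussian feature perturbation are my own stand-ins, not the paper's exact formulation.

```python
# Sketch (assumptions): features are re-expressed over class-agnostic latent
# prototypes shared by head and tail classes, and training features are
# perturbed with class-conditional noise as a stand-in for implicit semantic
# data augmentation.
import torch
import torch.nn as nn

class LatentCategoryHead(nn.Module):
    def __init__(self, feat_dim=512, num_latents=64, num_classes=100):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, feat_dim))
        self.classifier = nn.Linear(num_latents, num_classes)

    def forward(self, feat):                              # feat: (B, feat_dim)
        sim = feat @ self.latents.t() / feat.size(-1) ** 0.5
        return self.classifier(torch.softmax(sim, dim=-1))

def semantic_augment(feat, labels, class_std, strength=0.5):
    """Perturb each feature with noise scaled by its class's feature std."""
    return feat + strength * class_std[labels] * torch.randn_like(feat)

head = LatentCategoryHead()
feat = torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))
class_std = torch.rand(100, 512)      # per-class feature std (estimated online in practice)
logits = head(semantic_augment(feat, labels, class_std))
print(logits.shape)                   # torch.Size([8, 100])
```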

Generalizable Neural Fields as Partially Observed Neural Processes

  • paper_url: http://arxiv.org/abs/2309.06660
  • repo_url: None
  • paper_authors: Jeffrey Gu, Kuan-Chieh Wang, Serena Yeung
  • for: Neural fields, which represent a signal as a function parameterized by a neural network, are a promising alternative to traditional discrete vector- or grid-based representations, with better scalability, continuity, and differentiability.
  • methods: A new paradigm is proposed that views the large-scale training of neural representations as part of a partially observed neural process framework, and leverages neural process algorithms to solve this task.
  • results: The approach outperforms existing gradient-based meta-learning methods and hypernetwork approaches, and makes better use of shared information and structure among signals.
    Abstract Neural fields, which represent signals as a function parameterized by a neural network, are a promising alternative to traditional discrete vector or grid-based representations. Compared to discrete representations, neural representations both scale well with increasing resolution, are continuous, and can be many-times differentiable. However, given a dataset of signals that we would like to represent, having to optimize a separate neural field for each signal is inefficient, and cannot capitalize on shared information or structures among signals. Existing generalization methods view this as a meta-learning problem and employ gradient-based meta-learning to learn an initialization which is then fine-tuned with test-time optimization, or learn hypernetworks to produce the weights of a neural field. We instead propose a new paradigm that views the large-scale training of neural representations as a part of a partially-observed neural process framework, and leverage neural process algorithms to solve this task. We demonstrate that this approach outperforms both state-of-the-art gradient-based meta-learning approaches and hypernetwork approaches.
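
A minimal conditional-neural-process-style sketch of the general idea: context (coordinate, value) pairs from a partially observed signal are pooled into a representation that conditions the field decoder at query coordinates. This illustrates the neural process framing, not the paper's model.

```python
# Sketch (assumptions): a tiny CNP-style field predictor for 2D signals.
import torch
import torch.nn as nn

class TinyCNPField(nn.Module):
    def __init__(self, coord_dim=2, val_dim=3, rep_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(coord_dim + val_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, rep_dim))
        self.decoder = nn.Sequential(
            nn.Linear(coord_dim + rep_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, val_dim))

    def forward(self, ctx_xy, ctx_val, query_xy):
        rep = self.encoder(torch.cat([ctx_xy, ctx_val], dim=-1)).mean(dim=1)  # permutation-invariant pooling
        rep = rep.unsqueeze(1).expand(-1, query_xy.size(1), -1)
        return self.decoder(torch.cat([query_xy, rep], dim=-1))

model = TinyCNPField()
ctx_xy, ctx_val = torch.rand(4, 100, 2), torch.rand(4, 100, 3)   # observed pixels of 4 signals
query_xy = torch.rand(4, 500, 2)                                  # coordinates to predict
print(model(ctx_xy, ctx_val, query_xy).shape)                     # torch.Size([4, 500, 3])
```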

Event-Driven Imaging in Turbid Media: A Confluence of Optoelectronics and Neuromorphic Computation

  • paper_url: http://arxiv.org/abs/2309.06652
  • repo_url: None
  • paper_authors: Ning Zhang, Timothy Shea, Arto Nurmikko
  • for: This paper explores an optical-computational method for unveiling images of targets that are hard to discern in dense, turbid media.
  • methods: The new method is inspired by human vision: diffuse photons are first converted into spike trains by a dynamic vision sensor, and image reconstruction is then performed with a neuromorphic computing model.
  • results: The authors reconstruct different MNIST characters and image sets under turbidity conditions where the original images are unintelligible, and the reconstructed content can be clearly and quantifiably identified.
    Abstract In this paper a new optical-computational method is introduced to unveil images of targets whose visibility is severely obscured by light scattering in dense, turbid media. The targets of interest are taken to be dynamic in that their optical properties are time-varying whether stationary in space or moving. The scheme, to our knowledge the first of its kind, is human vision inspired whereby diffuse photons collected from the turbid medium are first transformed to spike trains by a dynamic vision sensor as in the retina, and image reconstruction is then performed by a neuromorphic computing approach mimicking the brain. We combine benchtop experimental data in both reflection (backscattering) and transmission geometries with support from physics-based simulations to develop a neuromorphic computational model and then apply this for image reconstruction of different MNIST characters and image sets by a dedicated deep spiking neural network algorithm. Image reconstruction is achieved under conditions of turbidity where an original image is unintelligible to the human eye or a digital video camera, yet clearly and quantifiable identifiable when using the new neuromorphic computational approach.