cs.CV - 2023-11-01

A Call to Arms: AI Should be Critical for Social Media Analysis of Conflict Zones

  • paper_url: http://arxiv.org/abs/2311.00810
  • repo_url: None
  • paper_authors: Afia Abedin, Abdul Bais, Cody Buntain, Laura Courchesne, Brian McQuinn, Matthew E. Taylor, Muhib Ullah
  • for: This work aims to use computer vision to track the weapon systems and armed-group insignias of the different types of armed forces in the Ukraine conflict, and how these weapon systems are distributed through the conflict.
  • methods: Computer vision is used to identify and track weapon systems and armed groups in the Ukraine conflict.
  • results: Such a system could track which weapon systems are used by the different types of armed actors and how they are distributed through networks of armed units, and could ultimately be used to follow the conflict in real time, including where humanitarian and medical aid is most needed.
    Abstract The massive proliferation of social media data represents a transformative moment in conflict studies. This data can provide unique insights into the spread and use of weaponry, but the scale and types of data are problematic for traditional open-source intelligence. This paper presents preliminary, transdisciplinary work using computer vision to identify specific weapon systems and the insignias of the armed groups using them. There is potential to not only track how weapons are distributed through networks of armed units but also to track which types of weapons are being used by the different types of state and non-state military actors in Ukraine. Such a system could ultimately be used to understand conflicts in real-time, including where humanitarian and medical aid is most needed. We believe that using AI to help automate such processes should be a high-priority goal for our community, with near-term real-world payoffs.

VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.00807
  • repo_url: None
  • paper_authors: Suraj Jyothi Unni, Raha Moraffah, Huan Liu
  • for: The paper proposes a multi-modal benchmark dataset for evaluating the robustness of visual question answering (VQA) models under distribution shifts.
  • methods: A shift-induced pipeline generates multi-modal distribution shifts, and existing VQA models are evaluated on the resulting data.
  • results: Experiments show that the VQA-GEN dataset exposes the vulnerability of existing VQA models to joint multi-modal distribution shifts, and models trained on VQA-GEN improve in both cross-domain and in-domain performance. The paper also analyzes how much each shift technique in the pipeline contributes to model generalization.
    Abstract Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities. However, their real-world applicability is hindered by a lack of comprehensive benchmark datasets. Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts while VQA being a multi-modal task contains shifts across both visual and textual domains. We propose VQA-GEN, the first ever multi-modal benchmark dataset for distribution shift generated through a shift induced pipeline. Experiments demonstrate that the VQA-GEN dataset exposes the vulnerability of existing methods to joint multi-modal distribution shifts, validating that comprehensive multi-modal shifts are critical for robust VQA generalization. Models trained on VQA-GEN exhibit improved cross-domain and in-domain performance, confirming the value of VQA-GEN. Further, we analyze the importance of each shift technique of our pipeline contributing to the generalization of the model.

Automatic counting of planting microsites via local visual detection and global count estimation

  • paper_url: http://arxiv.org/abs/2311.00796
  • repo_url: None
  • paper_authors: Ahmed Zgaren, Wassim Bouachir, Nizar Bouguila
  • for: The paper targets automatic estimation of the number of planting mounds on a planting block.
  • methods: Using computer vision and machine learning, the counting task is cast as a supervised learning problem with two prediction models: a local detection model first detects visible mounds from deep features, and a global prediction function then provides the final estimate from block-level features.
  • results: On a purpose-built UAV dataset, experiments show the proposed method outperforms manual counting in precision while substantially reducing time and cost.
    Abstract In forest industry, mechanical site preparation by mounding is widely used prior to planting operations. One of the main problems when planning planting operations is the difficulty in estimating the number of mounds present on a planting block, as their number may greatly vary depending on site characteristics. This estimation is often carried out through field surveys by several forestry workers. However, this procedure is prone to error and slowness. Motivated by recent advances in UAV imagery and artificial intelligence, we propose a fully automated framework to estimate the number of mounds on a planting block. Using computer vision and machine learning, we formulate the counting task as a supervised learning problem using two prediction models. A local detection model is firstly used to detect visible mounds based on deep features, while a global prediction function is subsequently applied to provide a final estimation based on block-level features. To evaluate the proposed method, we constructed a challenging UAV dataset representing several plantation blocks with different characteristics. The performed experiments demonstrated the robustness of the proposed method, which outperforms manual methods in precision, while significantly reducing time and cost.
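To make the two-stage formulation above concrete, here is a minimal sketch of the local-detection-plus-global-estimation idea: a placeholder detector counts visible mounds per UAV tile, and a block-level regressor maps the raw count and block features to the final estimate. The detector stub, the feature vector, and the random-forest regressor are illustrative assumptions, not the authors' models.

```python
# Hypothetical sketch: local detection followed by a global count correction.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def detect_mounds(tile):
    """Placeholder for the deep local detector; would return detections for one tile."""
    return []  # e.g., bounding boxes predicted by a trained CNN detector

def estimate_block_count(tiles, block_features, global_model):
    raw_count = sum(len(detect_mounds(t)) for t in tiles)
    # The global model maps (raw detector count, block-level features) to the final estimate.
    x = np.concatenate([[raw_count], block_features]).reshape(1, -1)
    return float(global_model.predict(x)[0])

# Fit the block-level correction on blocks with known ground-truth counts (synthetic here).
X_train = np.random.rand(50, 5)                 # [raw_count, feat1..feat4] per training block
y_train = np.random.randint(500, 3000, 50)      # ground-truth mound counts
global_model = RandomForestRegressor().fit(X_train, y_train)

print(estimate_block_count(tiles=[None] * 10,
                           block_features=np.random.rand(4),
                           global_model=global_model))
```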

What User Behaviors Make the Differences During the Process of Visual Analytics?

  • paper_url: http://arxiv.org/abs/2311.00690
  • repo_url: None
  • paper_authors: Shahin Doroudian, Zekun Wu, Aidong Lu
  • for: The study aims to improve the understanding of the visual analytics process so as to advance visual designs and interaction functions.
  • methods: User behaviors are collected and analyzed with time-series classification methods to understand how users act during visual analytics.
  • results: The study finds that user behaviors can be distinguished during the visual analytics process and that there is a potentially strong association between users' physical behaviors and the visualization tasks they perform. It also demonstrates an automatic way to study sensemaking without tedious manual annotation.
    Abstract The understanding of visual analytics process can benefit visualization researchers from multiple aspects, including improving visual designs and developing advanced interaction functions. However, the log files of user behaviors are still hard to analyze due to the complexity of sensemaking and our lack of knowledge on the related user behaviors. This work presents a study on a comprehensive data collection of user behaviors, and our analysis approach with time-series classification methods. We have chosen a classical visualization application, Covid-19 data analysis, with common analysis tasks covering geo-spatial, time-series and multi-attributes. Our user study collects user behaviors on a diverse set of visualization tasks with two comparable systems, desktop and immersive visualizations. We summarize the classification results with three time-series machine learning algorithms at two scales, and explore the influences of behavior features. Our results reveal that user behaviors can be distinguished during the process of visual analytics and there is a potentially strong association between the physical behaviors of users and the visualization tasks they perform. We also demonstrate the usage of our models by interpreting open sessions of visual analytics, which provides an automatic way to study sensemaking without tedious manual annotations.
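As a rough illustration of the analysis setup, the sketch below treats behavior logs as multivariate time series and classifies the visualization task being performed. The synthetic data, window length, and the k-NN-on-flattened-windows baseline are assumptions for illustration only; the paper evaluates three dedicated time-series algorithms at two scales.

```python
# Hypothetical sketch: classify visualization tasks from windowed behavior logs.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(120, 50, 6)            # 120 sessions x 50 time steps x 6 behavior channels
y = np.random.randint(0, 3, 120)          # task label: geo-spatial / time-series / multi-attribute

X_flat = X.reshape(len(X), -1)            # simplest baseline: flatten each window
X_tr, X_te, y_tr, y_te = train_test_split(X_flat, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("task classification accuracy:", clf.score(X_te, y_te))
```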

Collaboration in Immersive Environments: Challenges and Solutions

  • paper_url: http://arxiv.org/abs/2311.00689
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Shahin Doroudian, Zachary Wartell
  • for: This paper provides an overview of the current state of research on collaboration in immersive environments, including Virtual Reality (VR) and Augmented Reality (AR) settings.
  • methods: The paper discusses the different types of immersive environments, including VR and AR, and the different forms of collaboration that can occur in these environments.
  • results: The paper highlights the challenges and limitations of collaboration in immersive environments, such as the lack of physical cues, cost and usability, and the need for further research in this area.
    Abstract Virtual Reality (VR) and Augmented Reality (AR) tools have been applied in all engineering fields in order to avoid the use of physical prototypes, to train in high-risk situations, and to interpret real or simulated results. In order to complete a shared task or assign tasks to the agents in such immersive environments, collaboration or Shared Cooperative Activities are a necessity. Collaboration in immersive environments is an emerging field of research that aims to study and enhance the ways in which people interact and work together in Virtual and Augmented Reality settings. Collaboration in immersive environments is a complex process that involves different factors such as communication, coordination, and social presence. This paper provides an overview of the current state of research on collaboration in immersive environments. It discusses the different types of immersive environments, including VR and AR, and the different forms of collaboration that can occur in these environments. The paper also highlights the challenges and limitations of collaboration in immersive environments, such as the lack of physical cues, cost and usability and the need for further research in this area. Overall, collaboration in immersive environments is a promising field with a wide range of potential applications, from education to industry, and it can benefit both individuals and groups by enhancing their ability to work together effectively.

ProcSim: Proxy-based Confidence for Robust Similarity Learning

  • paper_url: http://arxiv.org/abs/2311.00668
  • repo_url: None
  • paper_authors: Oriol Barbany, Xiaofan Lin, Muhammet Bastan, Arnab Dhua
  • for: Learning an embedding space in which distances between inputs correlate strongly with their inherent semantic similarity.
  • methods: The ProcSim framework assigns each sample a confidence score computed from the normalized distance to its class representative.
  • results: Experiments show the proposed method achieves state-of-the-art performance on DML benchmark datasets injected with uniform noise and with the proposed semantically coherent noise.
    Abstract Deep Metric Learning (DML) methods aim at learning an embedding space in which distances are closely related to the inherent semantic similarity of the inputs. Previous studies have shown that popular benchmark datasets often contain numerous wrong labels, and DML methods are susceptible to them. Intending to study the effect of realistic noise, we create an ontology of the classes in a dataset and use it to simulate semantically coherent labeling mistakes. To train robust DML models, we propose ProcSim, a simple framework that assigns a confidence score to each sample using the normalized distance to its class representative. The experimental results show that the proposed method achieves state-of-the-art performance on the DML benchmark datasets injected with uniform and the proposed semantically coherent noise.
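A minimal sketch of the confidence idea described above: score each sample by the distance to its own class representative, normalized here by the distance to the nearest proxy. The exact normalization used by ProcSim may differ; `proxy_confidence` and the toy data are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_confidence(embeddings, labels, proxies):
    """Confidence per sample from the normalized distance to its class representative.

    embeddings: (N, D) sample embeddings, labels: (N,) class ids, proxies: (C, D).
    Returns (N,) scores in (0, 1]; lower values suggest a likely mislabeled sample.
    """
    dists = torch.cdist(embeddings, proxies)            # (N, C) distances to all proxies
    own = dists[torch.arange(len(labels)), labels]      # distance to the own-class proxy
    nearest = dists.min(dim=1).values                   # distance to the closest proxy overall
    return (nearest / own.clamp(min=1e-8)).clamp(max=1.0)

# Toy usage: 8 samples, 3 classes, 16-dim embeddings; use the scores as loss weights.
emb = F.normalize(torch.randn(8, 16), dim=1)
prox = F.normalize(torch.randn(3, 16), dim=1)
lbl = torch.randint(0, 3, (8,))
weights = proxy_confidence(emb, lbl, prox)
```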

TPSeNCE: Towards Artifact-Free Realistic Rain Generation for Deraining and Object Detection in Rain

  • paper_url: http://arxiv.org/abs/2311.00660
  • repo_url: https://github.com/shenzheng2000/tpsence
  • paper_authors: Shen Zheng, Changjie Lu, Srinivasa G. Narasimhan
  • for: The paper proposes an unpaired image-to-image translation framework that generates more realistic rainy images while reducing artifacts and distortions during rain generation.
  • methods: A Triangular Probability Similarity (TPS) constraint guides generated images toward clear and rainy images in the discriminator manifold, minimizing artifacts and distortions. A Semantic Noise Contrastive Estimation (SeNCE) strategy then reweights the pushing force of negative samples according to the semantic similarity between clear and rainy images and the feature similarity between anchor and negative samples.
  • results: Experiments demonstrate realistic rain generation with minimal artifacts and distortions, which benefits image deraining and object detection in rain. The method can also generate realistic snowy and night images, underscoring its broader applicability. Code is available at https://github.com/ShenZheng2000/TPSeNCE.
    Abstract Rain generation algorithms have the potential to improve the generalization of deraining methods and scene understanding in rainy conditions. However, in practice, they produce artifacts and distortions and struggle to control the amount of rain generated due to a lack of proper constraints. In this paper, we propose an unpaired image-to-image translation framework for generating realistic rainy images. We first introduce a Triangular Probability Similarity (TPS) constraint to guide the generated images toward clear and rainy images in the discriminator manifold, thereby minimizing artifacts and distortions during rain generation. Unlike conventional contrastive learning approaches, which indiscriminately push negative samples away from the anchors, we propose a Semantic Noise Contrastive Estimation (SeNCE) strategy and reassess the pushing force of negative samples based on the semantic similarity between the clear and the rainy images and the feature similarity between the anchor and the negative samples. Experiments demonstrate realistic rain generation with minimal artifacts and distortions, which benefits image deraining and object detection in rain. Furthermore, the method can be used to generate realistic snowy and night images, underscoring its potential for broader applicability. Code is available at https://github.com/ShenZheng2000/TPSeNCE.

De-Diffusion Makes Text a Strong Cross-Modal Interface

  • paper_url: http://arxiv.org/abs/2311.00618
  • repo_url: None
  • paper_authors: Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
  • for: The paper proposes a new text-based cross-modal interface that represents an image as text, giving image-language interaction the interpretability and flexibility of natural language.
  • methods: An autoencoder is trained whose encoder transforms the input image into text, which a fixed pre-trained text-to-image diffusion decoder then uses to reconstruct the original image; the process is termed De-Diffusion.
  • results: Experiments show De-Diffusion text represents images precisely and comprehensively, so it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model provides transferable prompts for different text-to-image tools and achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
    Abstract We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, our approach represents an image as text, from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input -- a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images, such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools, and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.

Occluded Person Re-Identification with Deep Learning: A Survey and Perspectives

  • paper_url: http://arxiv.org/abs/2311.00603
  • repo_url: None
  • paper_authors: Enhao Ning, Changshuo Wang, Huang Zhangc, Xin Ning, Prayag Tiwari
  • for: The survey reviews occluded person re-identification (Re-ID) techniques with the goal of improving the reliability and accuracy of person Re-ID systems.
  • methods: Deep learning-based occluded person Re-ID methods are systematically classified, compared and analyzed, and perspectives on future development are offered.
  • results: The review identifies state-of-the-art occluded person Re-ID methods and provides a systematic evaluation and comparison of them.
    Abstract Person re-identification (Re-ID) technology plays an increasingly crucial role in intelligent surveillance systems. Widespread occlusion significantly impacts the performance of person Re-ID. Occluded person Re-ID refers to a pedestrian matching method that deals with challenges such as pedestrian information loss, noise interference, and perspective misalignment. It has garnered extensive attention from researchers. Over the past few years, several occlusion-solving person Re-ID methods have been proposed, tackling various sub-problems arising from occlusion. However, there is a lack of comprehensive studies that compare, summarize, and evaluate the potential of occluded person Re-ID methods in detail. In this review, we start by providing a detailed overview of the datasets and evaluation scheme used for occluded person Re-ID. Next, we scientifically classify and analyze existing deep learning-based occluded person Re-ID methods from various perspectives, summarizing them concisely. Furthermore, we conduct a systematic comparison among these methods, identify the state-of-the-art approaches, and present an outlook on the future development of occluded person Re-ID.

PAUMER: Patch Pausing Transformer for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.00586
  • repo_url: None
  • paper_authors: Evann Courdier, Prabhu Teja Sivaprasad, François Fleuret
  • for: Improving the efficiency of segmentation transformers by spending different amounts of computation on different parts of the image.
  • methods: The entropy of predictions computed from intermediate activations is used as the criterion for pausing computation on patches before the final decoder.
  • results: On the Cityscapes and ADE20K segmentation datasets, the method runs with about 50% higher throughput at mIoU drops of about 0.65% and 4.6%, respectively.
    Abstract We study the problem of improving the efficiency of segmentation transformers by using disparate amounts of computation for different parts of the image. Our method, PAUMER, accomplishes this by pausing computation for patches that are deemed to not need any more computation before the final decoder. We use the entropy of predictions computed from intermediate activations as the pausing criterion, and find this aligns well with semantics of the image. Our method has a unique advantage that a single network trained with the proposed strategy can be effortlessly adapted at inference to various run-time requirements by modulating its pausing parameters. On two standard segmentation datasets, Cityscapes and ADE20K, we show that our method operates with about a $50\%$ higher throughput with an mIoU drop of about $0.65\%$ and $4.6\%$ respectively.
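The pausing criterion can be sketched in a few lines: compute the entropy of per-patch predictions from an auxiliary head on intermediate activations, and stop processing patches whose entropy falls below a threshold. The threshold value, the tensor shapes, and the assumption that low-entropy (confident) patches are the ones paused are illustrative.

```python
import torch

def pause_mask(patch_logits, tau=1.5):
    """Return a boolean mask of patches that should keep being processed.

    patch_logits: (B, N, K) per-patch class logits from an intermediate prediction head.
    Patches whose prediction entropy is below tau are paused before the final decoder.
    """
    probs = patch_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)   # (B, N)
    return entropy > tau

# Toy usage: 196 patch tokens, 19 classes (e.g., Cityscapes).
tokens = torch.randn(1, 196, 256)
logits = torch.randn(1, 196, 19)
keep = pause_mask(logits, tau=1.5)
active_tokens = tokens[keep]                  # only these go through the remaining blocks
print("fraction of patches still computed:", keep.float().mean().item())
```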

A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images

  • paper_url: http://arxiv.org/abs/2311.00567
  • repo_url: None
  • paper_authors: Ni Yao, Hang Hu, Kaicong Chen, Chen Zhao, Yuan Guo, Boya Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Weihua Zhou, Li Tian
  • for: The study aims to use a deep learning model to help radiologists differentiate the pathological subtypes of renal cell carcinoma preoperatively from CT images, improving diagnostic accuracy and efficiency.
  • methods: A deep learning model incorporating uncertainty estimation is trained and evaluated with five-fold cross-validation to improve accuracy and reliability.
  • results: In five-fold cross-validation the model achieved an AUC of 0.868 (95% CI: 0.826-0.923), and an AUC of 0.856 (95% CI: 0.838-0.882) on the external validation set, indicating good accuracy and reliability for preoperative subtype differentiation of RCC.
    Abstract Objectives To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods Data from 668 consecutive patients, pathologically proven RCC, were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors. Clinical relevance statement Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.
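A minimal sketch of the evaluation protocol described above: stratified five-fold cross-validation with one-vs-rest AUC over the three subtypes, using predictive entropy as a simple stand-in for the paper's uncertainty estimate. The synthetic features, labels, and logistic-regression classifier are placeholders, not the authors' CT pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = np.random.rand(200, 64)                    # hypothetical CT-derived features
y = np.random.randint(0, 3, 200)               # 0 = ccRCC, 1 = pRCC, 2 = chRCC

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, te) in enumerate(skf.split(X, y)):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    proba = clf.predict_proba(X[te])                          # (n, 3) subtype probabilities
    auc = roc_auc_score(y[te], proba, multi_class="ovr")      # one-vs-rest AUC
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)    # per-case uncertainty proxy
    print(f"fold {fold}: AUC={auc:.3f}, mean uncertainty={entropy.mean():.3f}")
```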

CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders

  • paper_url: http://arxiv.org/abs/2311.00566
  • repo_url: https://github.com/antofuller/croma
  • paper_authors: Anthony Fuller, Koreen Millard, James R. Green
  • for: The work develops a self-supervised learning framework that learns rich unimodal and multimodal remote sensing representations.
  • methods: The framework combines contrastive and reconstruction self-supervised objectives: masked-out multispectral optical and synthetic aperture radar samples, aligned in space and time, are encoded separately for cross-modal contrastive learning, while a fused encoder produces joint multimodal encodings used to predict the masked patches.
  • results: The models extrapolate effectively to test-time images up to 17.6x larger and are evaluated on four classification benchmarks, including fine-tuning, linear and nonlinear probing, kNN classification and K-means clustering.
    Abstract A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples -- aligned in space and time -- and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6x larger at test-time. CROMA outperforms the current SoTA multispectral model, evaluated on: four classification benchmarks -- finetuning (avg. 1.8%), linear (avg. 2.4%) and nonlinear (avg. 1.4%) probing, kNN classification (avg. 3.5%), and K-means clustering (avg. 8.4%); and three segmentation benchmarks (avg. 6.4%). CROMA's rich, optionally multimodal representations can be widely leveraged across remote sensing applications.

MNN: Mixed Nearest-Neighbors for Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.00562
  • repo_url: https://github.com/pc-cp/mnn
  • paper_authors: Chen Peng, Xianzhong Long, Yun Li
  • for: Improving the performance and training efficiency of self-supervised learning.
  • methods: A simple self-supervised framework, Mixed Nearest-Neighbors for Self-Supervised Learning (MNN), based on weighted fusion of neighbor samples and image mixture operations.
  • results: MNN shows exceptional generalization performance and training efficiency on four benchmark datasets.
    Abstract In contrastive self-supervised learning, positive samples are typically drawn from the same image but in different augmented views, resulting in a relatively limited source of positive samples. An effective way to alleviate this problem is to incorporate the relationship between samples, which involves including the top-k nearest neighbors of positive samples in the framework. However, the problem of false neighbors (i.e., neighbors that do not belong to the same category as the positive sample) is an objective but often overlooked challenge due to the query of neighbor samples without human supervision. In this paper, we present a simple Self-supervised learning framework called Mixed Nearest-Neighbors for Self-Supervised Learning (MNN). MNN optimizes the influence of neighbor samples on the semantics of positive samples through an intuitive weighting approach and image mixture operations. The results of our study demonstrate that MNN exhibits exceptional generalization performance and training efficiency on four benchmark datasets.
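One plausible reading of the weighting-plus-mixture idea is sketched below: retrieve the top-k neighbors of an anchor from a memory queue, soften their influence with similarity-based weights, and mix the nearest neighbor's image with the anchor image. The retrieval, weighting, and mixing details are assumptions made for illustration; MNN's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def mixed_nearest_neighbors(anchor_feat, anchor_img, queue_feat, queue_imgs, k=4, alpha=0.5):
    """Pick top-k queue neighbors of the anchor, weight them by cosine similarity
    (down-weighting likely false neighbors), and mix the best neighbor's image
    with the anchor image to form an extra positive view."""
    sims = F.normalize(anchor_feat, dim=-1) @ F.normalize(queue_feat, dim=-1).T   # (1, Q)
    vals, idx = sims.topk(k, dim=-1)
    weights = vals.softmax(dim=-1)                       # per-neighbor loss weights
    mixed = alpha * anchor_img + (1 - alpha) * queue_imgs[idx[0, 0]]
    return idx, weights, mixed

# Toy usage: one anchor, a queue of 256 candidates, 32x32 RGB images.
a_feat, a_img = torch.randn(1, 128), torch.rand(3, 32, 32)
q_feat, q_imgs = torch.randn(256, 128), torch.rand(256, 3, 32, 32)
neighbor_idx, neighbor_weights, mixed_view = mixed_nearest_neighbors(a_feat, a_img, q_feat, q_imgs)
```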

ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab

  • paper_url: http://arxiv.org/abs/2311.00556
  • repo_url: None
  • paper_authors: Jieming Cui, Ziren Gong, Baoxiong Jia, Siyuan Huang, Zilong Zheng, Jianzhu Ma, Yixin Zhu
  • for: Addressing the difficulty of reproducing research results in molecular biology.
  • methods: Modern intelligent monitoring systems are investigated for activity understanding in BioLab settings.
  • results: The work provides a comprehensive multimodal dataset (ProBio) and two challenging benchmarks (transparent solution tracking and multimodal action recognition) for studying modern AI techniques in molecular biology.
    Abstract The challenge of replicating research results has posed a significant impediment to the field of molecular biology. The advent of modern intelligent systems has led to notable progress in various domains. Consequently, we embarked on an investigation of intelligent monitoring systems as a means of tackling the issue of the reproducibility crisis. Specifically, we first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. This dataset comprises fine-grained hierarchical annotations intended for the purpose of studying activity understanding in BioLab. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings. Finally, we provide a thorough experimental evaluation of contemporary video understanding models and highlight their limitations in this specialized domain to identify potential avenues for future research. We hope ProBio with associated benchmarks may garner increased focus on modern AI techniques in the realm of molecular biology.

Continual atlas-based segmentation of prostate MRI

  • paper_url: http://arxiv.org/abs/2311.00548
  • repo_url: https://github.com/meclabtuda/atlas-replay
  • paper_authors: Amin Ranem, Camila González, Daniel Pinto dos Santos, Andreas Michael Bucher, Ahmed Ezzat Othman, Anirban Mukhopadhyay
  • for: The paper addresses the failure of continual learning (CL) methods designed for natural image classification when applied to medical image segmentation.
  • methods: An atlas-based segmentation approach leverages domain knowledge about the region of interest to produce semantically coherent predictions, combined with privacy-preserving prototypes so that knowledge is maintained as the training distribution changes.
  • results: Atlas Replay produces high-quality segmentation masks across seven public prostate segmentation datasets, maintains knowledge over time, and generalizes robustly to yet-unseen domains, unlike end-to-end segmentation methods.
    Abstract Continual learning (CL) methods designed for natural image classification often fail to reach basic quality standards for medical image segmentation. Atlas-based segmentation, a well-established approach in medical imaging, incorporates domain knowledge on the region of interest, leading to semantically coherent predictions. This is especially promising for CL, as it allows us to leverage structural information and strike an optimal balance between model rigidity and plasticity over time. When combined with privacy-preserving prototypes, this process offers the advantages of rehearsal-based CL without compromising patient privacy. We propose Atlas Replay, an atlas-based segmentation approach that uses prototypes to generate high-quality segmentation masks through image registration that maintain consistency even as the training distribution changes. We explore how our proposed method performs compared to state-of-the-art CL methods in terms of knowledge transferability across seven publicly available prostate segmentation datasets. Prostate segmentation plays a vital role in diagnosing prostate cancer, however, it poses challenges due to substantial anatomical variations, benign structural differences in older age groups, and fluctuating acquisition parameters. Our results show that Atlas Replay is both robust and generalizes well to yet-unseen domains while being able to maintain knowledge, unlike end-to-end segmentation methods. Our code base is available under https://github.com/MECLabTUDA/Atlas-Replay.

Improving Cardiovascular Disease Prediction Through Comparative Analysis of Machine Learning Models: A Case Study on Myocardial Infarction

  • paper_url: http://arxiv.org/abs/2311.00517
  • repo_url: None
  • paper_authors: Jonayet Miah, Duc M Ca, Md Abu Sayed, Ehsanur Rashid Lipu, Fuad Mahmud, S M Yasir Arafat
  • for: The study aims to predict cardiovascular disease, a challenging task in medical research.
  • methods: Six machine learning models are compared: Logistic Regression, Support Vector Machine, Decision Tree, Bagging, XGBoost, and LightGBM.
  • results: XGBoost performs best with 92.72% accuracy, and Logistic Regression, Support Vector Machine and LightGBM also reach relatively high accuracy, indicating that advanced machine learning techniques can improve the precision of cardiovascular disease prediction.
    Abstract Cardiovascular disease remains a leading cause of mortality in the contemporary world. Its association with smoking, elevated blood pressure, and cholesterol levels underscores the significance of these risk factors. This study addresses the challenge of predicting myocardial illness, a formidable task in medical research. Accurate predictions are pivotal for refining healthcare strategies. This investigation conducts a comparative analysis of six distinct machine learning models: Logistic Regression, Support Vector Machine, Decision Tree, Bagging, XGBoost, and LightGBM. The attained outcomes exhibit promise, with accuracy rates as follows: Logistic Regression (81.00%), Support Vector Machine (75.01%), XGBoost (92.72%), LightGBM (90.60%), Decision Tree (82.30%), and Bagging (83.01%). Notably, XGBoost emerges as the top-performing model. These findings underscore its potential to enhance predictive precision for coronary infarction. As the prevalence of cardiovascular risk factors persists, incorporating advanced machine learning techniques holds the potential to refine proactive medical interventions.
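A minimal sketch of the six-model comparison, assuming a tabular dataset with a binary `target` column; the file name, preprocessing, and hyperparameters are placeholders and do not reproduce the paper's setup or its reported accuracies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

df = pd.read_csv("heart.csv")                        # hypothetical dataset file
X, y = df.drop(columns=["target"]), df["target"]     # hypothetical feature/label columns
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Bagging": BaggingClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "LightGBM": LGBMClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```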

Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features

  • paper_url: http://arxiv.org/abs/2311.00489
  • repo_url: None
  • paper_authors: Daniel Neururer, Volker Dellwo, Thilo Stadelmann
  • for: The study investigates what makes deep neural networks successful at automatic speaker recognition and to what extent they model supra-segmental temporal (SST) features.
  • methods: A novel test quantifies how much of the performance of state-of-the-art speaker recognition networks can be explained by modeling SST, and several means of forcing the networks to focus more on SST are proposed and evaluated.
  • results: A variety of CNN- and RNN-based architectures for speaker recognition are found not to model SST to any sufficient degree, even when forced. The results provide a basis for future research into better exploiting the full speech signal and give insights into the inner workings of such networks.
    Abstract While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.

DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Macular Hole Reconstruction with Stochastic Retinal Defect Augmentation and Dynamic Weight Composition

  • paper_url: http://arxiv.org/abs/2311.00483
  • repo_url: https://github.com/iipl-hangzhoudianziuniversity/defn-pytorch
  • paper_authors: Xingru Huang, Yihao Guo, Jian Huang, Zhi Li, Tianyun Zhang, Kunyan Cai, Gaopeng Huang, Wenhao Chen, Zhaoyang Xu, Liangqiong Qu, Ji Hu, Tinyu Wang, Shaowei Jiang, Chenggang Yan, Yaoqi Sun, Xin Ye, Yaqi Wang
  • for: The paper provides a deep learning-based 3D reconstruction method to aid the diagnosis and treatment of macular holes.
  • methods: A 3D segmentation network, DEFN, integrates three novel modules: Fourier Group Harmonics (FuGH), Simplified 3D Spatial Attention (S3DSA) and a Harmonic Squeeze-and-Excitation Module (HSE). The paper additionally proposes a data augmentation method, Stochastic Retinal Defect Injection (SRDI), and a network optimization strategy, DynamicWeightCompose (DWC).
  • results: DEFN performs best against 13 baselines and provides precise 3D retinal reconstruction with quantitative metrics, offering ophthalmologists new tools for diagnostic and therapeutic decision-making in difficult-to-treat macular degeneration.
    Abstract The spatial and quantitative parameters of macular holes are vital for diagnosis, surgical choices, and post-op monitoring. Macular hole diagnosis and treatment rely heavily on spatial and quantitative data, yet the scarcity of such data has impeded the progress of deep learning techniques for effective segmentation and real-time 3D reconstruction. To address this challenge, we assembled the world's largest macular hole dataset, Retinal OCTfor Macular Hole Enhancement (ROME-3914), and a Comprehensive Archive for Retinal Segmentation (CARS-30k), both expertly annotated. In addition, we developed an innovative 3D segmentation network, the Dual-Encoder FuGH Network (DEFN), which integrates three innovative modules: Fourier Group Harmonics (FuGH), Simplified 3D Spatial Attention (S3DSA) and Harmonic Squeeze-and-Excitation Module (HSE). These three modules synergistically filter noise, reduce computational complexity, emphasize detailed features, and enhance the network's representation ability. We also proposed a novel data augmentation method, Stochastic Retinal Defect Injection (SRDI), and a network optimization strategy DynamicWeightCompose (DWC), to further improve the performance of DEFN. Compared with 13 baselines, our DEFN shows the best performance. We also offer precise 3D retinal reconstruction and quantitative metrics, bringing revolutionary diagnostic and therapeutic decision-making tools for ophthalmologists, and is expected to completely reshape the diagnosis and treatment patterns of difficult-to-treat macular degeneration. The source code is publicly available at: https://github.com/IIPL-HangzhouDianUniversity/DEFN-Pytorch.

Group Distributionally Robust Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2311.00476
  • repo_url: None
  • paper_authors: Konstantinos Vilouras, Xiao Liu, Pedro Sanchez, Alison Q. O’Neil, Sotirios A. Tsaftaris
  • for: The paper addresses sub-population shift in medical imaging analysis, where data acquired from different hospitals or scanners leaves some groups underrepresented in the training set.
  • methods: Inspired by distributionally robust optimization (DRO), a group-aware distillation loss updates a set of group weights based on the per-group losses at each iteration.
  • results: The method, GroupDistil, is validated on two benchmark datasets (natural images and cardiac MRI) and shows consistent improvement in worst-group accuracy.
    Abstract Knowledge distillation enables fast and effective transfer of features learned from a bigger model to a smaller one. However, distillation objectives are susceptible to sub-population shifts, a common scenario in medical imaging analysis which refers to groups/domains of data that are underrepresented in the training set. For instance, training models on health data acquired from multiple scanners or hospitals can yield subpar performance for minority groups. In this paper, inspired by distributionally robust optimization (DRO) techniques, we address this shortcoming by proposing a group-aware distillation loss. During optimization, a set of weights is updated based on the per-group losses at a given iteration. This way, our method can dynamically focus on groups that have low performance during training. We empirically validate our method, GroupDistil on two benchmark datasets (natural images and cardiac MRIs) and show consistent improvement in terms of worst-group accuracy.
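A minimal sketch of a group-aware distillation loss, assuming a GroupDRO-style exponentiated update of the group weights from per-group KD losses; the temperature, step size, and exact update rule are illustrative and may differ from the paper.

```python
import torch
import torch.nn.functional as F

class GroupAwareDistillLoss:
    """Knowledge-distillation loss whose group weights track per-group losses."""

    def __init__(self, num_groups, temperature=4.0, eta=0.1):
        self.w = torch.ones(num_groups) / num_groups    # dynamic group weights
        self.T, self.eta = temperature, eta

    def __call__(self, student_logits, teacher_logits, group_ids):
        # Per-sample KL between softened teacher and student distributions.
        kd = F.kl_div(
            F.log_softmax(student_logits / self.T, dim=1),
            F.softmax(teacher_logits / self.T, dim=1),
            reduction="none",
        ).sum(dim=1) * (self.T ** 2)

        # Aggregate per group, then up-weight groups that currently do worst.
        group_losses = torch.stack([
            kd[group_ids == g].mean() if (group_ids == g).any() else kd.new_zeros(())
            for g in range(len(self.w))
        ])
        self.w = self.w * torch.exp(self.eta * group_losses.detach())
        self.w = self.w / self.w.sum()
        return (self.w * group_losses).sum()

# Toy usage: 3 groups (e.g., scanners), 10 classes.
loss_fn = GroupAwareDistillLoss(num_groups=3)
student = torch.randn(32, 10, requires_grad=True)
teacher = torch.randn(32, 10)
groups = torch.randint(0, 3, (32,))
loss_fn(student, teacher, groups).backward()
```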

PET Tracer Conversion among Brain PET via Variable Augmented Invertible Network

  • paper_url: http://arxiv.org/abs/2311.00735
  • repo_url: None
  • paper_authors: Bohui Shen, Wei Zhang, Xubiao Liu, Pengfei Yu, Shirui Jiang, Xinchong Shi, Xiangsong Zhang, Xiaoyu Zhou, Weirui Zhang, Bingxuan Li, Qiegen Liu
  • for: For brain disease diagnosis and brain science research.
  • methods: A deep learning tracer conversion invertible neural network (TC-INN) maps FDG PET images to DOPA PET images.
  • results: The method achieves image mapping from FDG to DOPA images, providing additional diagnostic information.
    Abstract Positron emission tomography (PET), as an imaging technique with high biochemical sensitivity, has been widely used in the diagnosis of encephalopathy and in brain science research. Since different tracers present different effects on the same focal area, the choice of tracers is getting more significant for PET imaging. Nowadays, with the wide application of PET imaging in neuropsychiatric treatment, 6-18F-fluoro-3, 4-dihydroxy-L-phenylalanine (DOPA) has been found to be more effective than 18F-labeled fluorine-2-deoxyglucose (FDG) in this field. However, due to the complexity of its preparation and other limitations, DOPA is far less widely used than FDG. To address this issue, a tracer conversion invertible neural network (TC-INN) for image projection is developed to map FDG images to DOPA images through deep learning. More diagnostic information is obtained by generating PET images from FDG to DOPA. Specifically, the proposed TC-INN consists of two separate phases, one for training the traceable data, the other for re-building the new data. The reference DOPA PET image is used as the learning target for the corresponding network during the training process of tracer conversion. Meanwhile, the invertible network iteratively estimates the resultant DOPA PET data and compares it to the reference DOPA PET data. Notably, the reversible model employed variable enhancement techniques to achieve better power generation. Moreover, image registration needs to be performed before training due to the angular deviation of the acquired FDG and DOPA data information. Experimental results show generative ability in mapping between FDG images and DOPA images. It demonstrates great potential for PET image conversion in the case of limited tracer applications.

Single-view 3D Scene Reconstruction with High-fidelity Shape and Texture

  • paper_url: http://arxiv.org/abs/2311.00457
  • repo_url: https://github.com/DaLi-Jack/SSR-code
  • paper_authors: Yixin Chen, Junfeng Ni, Nan Jiang, Yaowei Zhang, Yixin Zhu, Siyuan Huang
  • for: The work aims to improve the level of detail in scene reconstruction from single-view images, benefiting applications such as holistic scene understanding and 3D scene editing.
  • methods: The method builds on Single-view neural implicit Shape and Radiance field (SSR) representations, leveraging explicit 3D shape supervision and volume rendering of color, depth, and surface normal images to recover high-fidelity object shapes and textures.
  • results: The method improves fine-grained textured object reconstruction, supports rendering images from novel viewpoints, and allows composing object-level representations into flexible scene representations for holistic scene understanding and 3D scene editing.
    Abstract Reconstructing detailed 3D scenes from single-view images remains a challenging task due to limitations in existing approaches, which primarily focus on geometric shape recovery, overlooking object appearances and fine shape details. To address these challenges, we propose a novel framework for simultaneous high-fidelity recovery of object shapes and textures from single-view images. Our approach utilizes the proposed Single-view neural implicit Shape and Radiance field (SSR) representations to leverage both explicit 3D shape supervision and volume rendering of color, depth, and surface normal images. To overcome shape-appearance ambiguity under partial observations, we introduce a two-stage learning curriculum incorporating both 3D and 2D supervisions. A distinctive feature of our framework is its ability to generate fine-grained textured meshes while seamlessly integrating rendering capabilities into the single-view 3D reconstruction model. This integration enables not only improved textured 3D object reconstruction by 27.7% and 11.6% on the 3D-FRONT and Pix3D datasets, respectively, but also supports the rendering of images from novel viewpoints. Beyond individual objects, our approach facilitates composing object-level representations into flexible scene representations, thereby enabling applications such as holistic scene understanding and 3D scene editing. We conduct extensive experiments to demonstrate the effectiveness of our method.

Progressive Recurrent Network for Shadow Removal

  • paper_url: http://arxiv.org/abs/2311.00455
  • repo_url: None
  • paper_authors: Yonghui Wang, Wengang Zhou, Hao Feng, Li Li, Houqiang Li
  • for: removes shadows from images in a coarse-to-fine fashion
  • methods: Progressive Recurrent Network (PRNet) with shadow feature extraction and progressive shadow removal
  • results: superior performance in removing shadows compared to existing deep learning-based approaches, with 29% fewer network parameters.
    Abstract Single-image shadow removal is a significant task that is still unresolved. Most existing deep learning-based approaches attempt to remove the shadow directly, which cannot deal with the shadow well. To handle this issue, we consider removing the shadow in a coarse-to-fine fashion and propose a simple but effective Progressive Recurrent Network (PRNet). The network aims to remove the shadow progressively, enabling us to flexibly adjust the number of iterations to strike a balance between performance and time. Our network comprises two parts: shadow feature extraction and progressive shadow removal. Specifically, the first part is a shallow ResNet which constructs the representations of the input shadow image on its original size, preventing the loss of high-frequency details caused by the downsampling operation. The second part has two critical components: the re-integration module and the update module. The proposed re-integration module can fully use the outputs of the previous iteration, providing input for the update module for further shadow removal. In this way, the proposed PRNet makes the whole process more concise and uses only 29% of the network parameters compared with the best published method. Extensive experiments on the three benchmarks, ISTD, ISTD+, and SRD, demonstrate that our method can effectively remove shadows and achieve superior performance.
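A minimal sketch of the coarse-to-fine idea: a shallow feature extractor runs once at full resolution, and a shared update block is applied for an adjustable number of iterations, each one re-integrating the previous output with the shadow features. The layer sizes and the residual update are assumptions, not the published PRNet architecture.

```python
import torch
import torch.nn as nn

class ProgressiveRemoval(nn.Module):
    """Toy progressive shadow remover: extract features once, refine iteratively."""

    def __init__(self, channels=32, iterations=4):
        super().__init__()
        self.extract = nn.Conv2d(3, channels, 3, padding=1)     # stand-in for the shallow ResNet
        self.update = nn.Sequential(
            nn.Conv2d(channels + 3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )
        self.iterations = iterations                            # adjustable at inference time

    def forward(self, shadow_img):
        feats = self.extract(shadow_img)
        out = shadow_img
        for _ in range(self.iterations):
            # Re-integrate the previous output with the shadow features, then update it.
            out = out + self.update(torch.cat([feats, out], dim=1))
        return out

model = ProgressiveRemoval(iterations=4)
restored = model(torch.rand(1, 3, 128, 128))
```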

CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.00453
  • repo_url: None
  • paper_authors: Xuhai Chen, Jiangning Zhang, Guanzhong Tian, Haoyang He, Wuhao Zhang, Yabiao Wang, Chengjie Wang, Yunsheng Wu, Yong Liu
  • for: The paper studies zero-shot anomaly detection (AD), a valuable yet under-studied task that performs AD without any reference images of the test objects.
  • methods: A language-guided strategy and a simple yet effective architecture, CLIP-AD, leverage the zero-shot classification capability of the large vision-language model CLIP. Because directly computing text/image feature similarity yields opposite predictions and irrelevant highlights, a Staged Dual-Path model (SDP) makes effective use of features from different levels and applies architecture and feature surgery to address these issues.
  • results: Experiments show SDP outperforms the state of the art on VisA by +1.0/+1.2 in classification/segmentation F1 scores, while SDP+ achieves +1.9/+11.7 improvements.
    Abstract This paper considers zero-shot Anomaly Detection (AD), a valuable yet under-studied task, which performs AD without any reference images of the test objects. Specifically, we employ a language-guided strategy and propose a simple-yet-effective architecture CLIP-AD, leveraging the superior zero-shot classification capabilities of the large vision-language model CLIP. A natural idea for anomaly segmentation is to directly calculate the similarity between text/image features, but we observe opposite predictions and irrelevant highlights in the results. Inspired by the phenomena, we introduce a Staged Dual-Path model (SDP) that effectively uses features from various levels and applies architecture and feature surgery to address these issues. Furthermore, delving beyond surface phenomena, we identify the problem arising from misalignment of text/image features in the joint embedding space. Thus, we introduce a fine-tuning strategy by adding linear layers and construct an extended model SDP+, further enhancing the performance. Abundant experiments demonstrate the effectiveness of our approach, e.g., on VisA, SDP outperforms SOTA by +1.0/+1.2 in classification/segmentation F1 scores, while SDP+ achieves +1.9/+11.7 improvements.
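A minimal sketch of the underlying language-guided zero-shot scoring: compare CLIP image features against "normal" and "anomalous" text prompts and read the anomaly score off the softmax. The prompt wording and the test image path are assumptions; CLIP-AD's staged dual-path model and feature surgery are not reproduced here.

```python
import torch
import clip                     # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a flawless object", "a photo of a damaged object"]   # illustrative prompts
text = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("test_object.png")).unsqueeze(0).to(device)   # hypothetical image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

anomaly_score = sims[0, 1].item()       # probability mass on the "damaged" prompt
print(f"anomaly score: {anomaly_score:.3f}")
```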

On Manipulating Scene Text in the Wild with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.00734
  • repo_url: None
  • paper_authors: Joshua Santoso, Christian Simon, Williem Pao
  • for: The paper proposes a diffusion-based scene text manipulation method to improve the precision and stability of image editing.
  • methods: Two adaptation strategies, one-shot style adaptation and text-recognition guidance, are used with a diffusion model to replace the text in an image.
  • results: Extensive comparisons and ablation studies on several scene text datasets demonstrate the method's effectiveness, and it synthesizes scene text with competitive Optical Character Recognition (OCR) accuracy.
    Abstract Diffusion models have gained attention for image editing yielding impressive results in text-to-image tasks. On the downside, one might notice that generated images of stable diffusion models suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation e.g., scene text editing. As a desired result, the model must show the capability to replace the text on the source image to the target text while preserving the details e.g., color, font size, and background. To leverage the potential of diffusion models, in this work, we introduce Diffusion-BasEd Scene Text manipulation Network so-called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-arts on various scene text datasets, then provide extensive ablation studies for each granularity to analyze our performance gain. Also, we demonstrate the effectiveness of our proposed method to synthesize scene text indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on COCO-text and ICDAR2013 datasets for character-level evaluation.

Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion

  • paper_url: http://arxiv.org/abs/2311.00436
  • repo_url: None
  • paper_authors: Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, Fei-Yue Wang
  • for: The work aims to improve the accuracy of traffic object detection and to provide effective detection under varying illumination conditions.
  • methods: Bio-inspired event cameras and a Structure-aware Fusion Network (SFNet) are used; cross-modality fusion compensates for the information lost in frames, yielding illumination-robust representations for traffic object detection.
  • results: Experiments show that SFNet overcomes the perceptual limits of conventional cameras and outperforms the frame-based method by 8.0% in mAP50 and 5.9% in mAP50:95.
    Abstract Traffic object detection under variable illumination is challenging due to the information loss caused by the limited dynamic range of conventional frame-based cameras. To address this issue, we introduce bio-inspired event cameras and propose a novel Structure-aware Fusion Network (SFNet) that extracts sharp and complete object structures from the event stream to compensate for the lost information in images through cross-modality fusion, enabling the network to obtain illumination-robust representations for traffic object detection. Specifically, to mitigate the sparsity or blurriness issues arising from diverse motion states of traffic objects in fixed-interval event sampling methods, we propose the Reliable Structure Generation Network (RSGNet) to generate Speed Invariant Frames (SIF), ensuring the integrity and sharpness of object structures. Next, we design a novel Adaptive Feature Complement Module (AFCM) which guides the adaptive fusion of two modality features to compensate for the information loss in the images by perceiving the global lightness distribution of the images, thereby generating illumination-robust representations. Finally, considering the lack of large-scale and high-quality annotations in the existing event-based object detection datasets, we build a DSEC-Det dataset, which consists of 53 sequences with 63,931 images and more than 208,000 labels for 8 classes. Extensive experimental results demonstrate that our proposed SFNet can overcome the perceptual boundaries of conventional cameras and outperform the frame-based method by 8.0% in mAP50 and 5.9% in mAP50:95. Our code and dataset will be available at https://github.com/YN-Yang/SFNet.
    摘要 在多变光照条件下检测交通目标具有挑战性,因为传统基于帧的相机动态范围有限,会导致信息损失。为解决这一问题,我们引入生物启发的事件相机,并提出一种新的结构感知融合网络(SFNet),以从事件流中提取锐利且完整的目标结构,并通过跨模态融合补偿图像中丢失的信息,使网络获得对光照鲁棒的目标检测表征。具体而言,我们提出了可靠结构生成网络(RSGNet),以生成速度不变帧(SIF),避免固定间隔事件采样下交通目标不同运动状态所导致的稀疏或模糊问题。随后,我们设计了一种自适应特征补充模块(AFCM),通过感知图像的全局亮度分布来引导两种模态特征的自适应融合,以补偿图像中的信息损失,从而生成对光照鲁棒的表征。最后,鉴于现有基于事件的目标检测数据集缺乏大规模、高质量的标注,我们构建了DSEC-Det数据集,该数据集包含53个序列、63,931张图像以及8个类别的超过208,000个标注。大量实验结果表明,所提出的SFNet能够超越传统相机的感知边界,在mAP50和mAP50:95上分别比基于帧的方法高出8.0%和5.9%。我们的代码和数据集将在https://github.com/YN-Yang/SFNet上提供。
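
As a rough illustration of the adaptive, lightness-conditioned fusion idea above, the sketch below gates image and event features by the global lightness of the input frame. The module name, shapes, and gating form are assumptions for illustration, not the authors' AFCM implementation.

```python
import torch
import torch.nn as nn

class LightnessGatedFusion(nn.Module):
    """Toy adaptive fusion: blend image and event features with a weight
    predicted from the global lightness of the input frame."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(1, channels), nn.Sigmoid())

    def forward(self, img_feat, evt_feat, image):
        lightness = image.mean(dim=(1, 2, 3)).unsqueeze(1)      # (B, 1), global lightness
        w = self.gate(lightness).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return w * img_feat + (1.0 - w) * evt_feat

x_img = torch.rand(2, 3, 64, 64)        # RGB frames
f_img = torch.randn(2, 16, 32, 32)      # image-branch features
f_evt = torch.randn(2, 16, 32, 32)      # event-branch features
print(LightnessGatedFusion(16)(f_img, f_evt, x_img).shape)  # torch.Size([2, 16, 32, 32])
```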

Event-based Background-Oriented Schlieren

  • paper_url: http://arxiv.org/abs/2311.00434
  • repo_url: https://github.com/tub-rip/event_based_bos
  • paper_authors: Shintaro Shiba, Friedhelm Hamann, Yoshimitsu Aoki, Guillermo Gallego
  • For: This paper explores the use of event cameras for schlieren imaging, a technique used to observe the flow of transparent media such as air or water.
  • Methods: The paper combines event data and frames to perceive air convection, formulating the problem as a variational optimization problem.
  • Results: The proposed method obtains results on par with existing frame-based optical flow techniques, works under dark conditions where frame-based schlieren fails, and additionally enables slow-motion analysis.
    Abstract Schlieren imaging is an optical technique to observe the flow of transparent media, such as air or water, without any particle seeding. However, conventional frame-based techniques require both high spatial and temporal resolution cameras, which impose bright illumination and expensive computation limitations. Event cameras offer potential advantages (high dynamic range, high temporal resolution, and data efficiency) to overcome such limitations due to their bio-inspired sensing principle. This paper presents a novel technique for perceiving air convection using events and frames by providing the first theoretical analysis that connects event data and schlieren. We formulate the problem as a variational optimization one combining the linearized event generation model with a physically-motivated parameterization that estimates the temporal derivative of the air density. The experiments with accurately aligned frame- and event camera data reveal that the proposed method enables event cameras to obtain on par results with existing frame-based optical flow techniques. Moreover, the proposed method works under dark conditions where frame-based schlieren fails, and also enables slow-motion analysis by leveraging the event camera's advantages. Our work pioneers and opens a new stack of event camera applications, as we publish the source code as well as the first schlieren dataset with high-quality frame and event data. https://github.com/tub-rip/event_based_bos
    摘要 《Schlieren成像技术是一种用于观察透明媒体流动的光学技术,无需添加任何粒子杂料。然而,传统的帧基技术需要高分辨率和高时间分辨率的摄像机,这会带来严重的照明和计算限制。事件摄像机具有优势(高动态范围、高时间分辨率和数据效率),这些优势可以让我们超越传统的约束。本文提出了一种使用事件和帧来实现空气擦拂的观察方法,并提供了首次对事件数据和约束之间的连接的理论分析。我们将问题定义为一种变分优化问题,将线性化事件生成模型与物理上有理的参数化联合起来,以估计空气密度的时间导数。实验结果表明,我们提出的方法可以使事件摄像机与现有的帧基光学流动技术相当。此外,我们的方法在黑暗条件下也可以工作,并且可以利用事件摄像机的优势进行慢动作分析。我们的工作开启了新的事件摄像机应用领域,同时我们也发布了高质量帧和事件数据的首个Schlieren数据集。请参考我们的GitHub地址:https://github.com/tub-rip/event_based_bos。

Feature-oriented Deep Learning Framework for Pulmonary Cone-beam CT (CBCT) Enhancement with Multi-task Customized Perceptual Loss

  • paper_url: http://arxiv.org/abs/2311.00412
  • repo_url: https://github.com/zhujiarui42/cfp-loss
  • paper_authors: Jiarui Zhu, Werxing Chen, Hongfei Sun, Shaohua Zhi, Jing Qin, Jing Cai, Ge Ren
  • for: This paper aims to enhance the quality of cone-beam computed tomography (CBCT) images for cancer treatment planning by using a deep learning-based feature-oriented framework.
  • methods: The proposed framework consists of two main components: a multi-task learning feature-selection network (MTFS-Net) and a CBCT-to-CT translation network guided by feature-to-feature perceptual loss. The MTFS-Net customizes a perceptual loss function, while the CBCT-to-CT translation network uses advanced generative models such as U-Net, GAN, and CycleGAN.
  • results: The proposed framework can generate synthesized CT (sCT) images for the lung that have a high similarity to CT images, with an average SSIM index of 0.9869 and an average PSNR index of 39.9621. The sCT images also exhibit visually pleasing performance with effective artifacts suppression, noise reduction, and distinctive anatomical details preservation. The proposed framework outperforms state-of-the-art models for pulmonary CBCT enhancement.
    Abstract Cone-beam computed tomography (CBCT) is routinely collected during image-guided radiation therapy (IGRT) to provide updated patient anatomy information for cancer treatments. However, CBCT images often suffer from streaking artifacts and noise caused by under-rate sampling projections and low-dose exposure, resulting in low clarity and information loss. While recent deep learning-based CBCT enhancement methods have shown promising results in suppressing artifacts, they have limited performance on preserving anatomical details since conventional pixel-to-pixel loss functions are incapable of describing detailed anatomy. To address this issue, we propose a novel feature-oriented deep learning framework that translates low-quality CBCT images into high-quality CT-like imaging via a multi-task customized feature-to-feature perceptual loss function. The framework comprises two main components: a multi-task learning feature-selection network(MTFS-Net) for customizing the perceptual loss function; and a CBCT-to-CT translation network guided by feature-to-feature perceptual loss, which uses advanced generative models such as U-Net, GAN and CycleGAN. Our experiments showed that the proposed framework can generate synthesized CT (sCT) images for the lung that achieved a high similarity to CT images, with an average SSIM index of 0.9869 and an average PSNR index of 39.9621. The sCT images also achieved visually pleasing performance with effective artifacts suppression, noise reduction, and distinctive anatomical details preservation. Our experiment results indicate that the proposed framework outperforms the state-of-the-art models for pulmonary CBCT enhancement. This framework holds great promise for generating high-quality anatomical imaging from CBCT that is suitable for various clinical applications.
    摘要 锥形束计算机断层扫描(CBCT)通常在影像引导放射治疗(IGRT)过程中常规采集,为癌症治疗提供更新的患者解剖信息。然而,CBCT图像常因投影采样不足和低剂量曝光而出现条纹伪影和噪声,导致清晰度下降和信息损失。尽管最近基于深度学习的CBCT增强方法在抑制伪影方面表现良好,但由于传统的逐像素损失函数无法刻画精细的解剖结构,它们在保留解剖细节方面表现有限。为解决这一问题,我们提出了一种新的面向特征的深度学习框架,通过多任务定制的特征对特征感知损失函数,将低质量CBCT图像转换为高质量的类CT影像。该框架包括两个主要组成部分:用于定制感知损失函数的多任务特征选择网络(MTFS-Net),以及由特征对特征感知损失引导的CBCT到CT翻译网络,后者使用U-Net、GAN和CycleGAN等先进生成模型。实验结果表明,所提框架可以为肺部生成与CT高度相似的合成CT(sCT)图像,平均SSIM为0.9869,平均PSNR为39.9621。sCT图像在视觉上也令人满意,能有效抑制伪影、降低噪声并保留独特的解剖细节。实验结果表明,所提框架在肺部CBCT增强任务上优于现有最先进模型。该框架有望从CBCT生成适用于多种临床应用的高质量解剖影像。
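
The sketch below shows the general shape of such a feature-to-feature perceptual loss: images are compared in the feature space of a frozen extractor rather than pixel by pixel. The stand-in extractor here is hypothetical; the paper instead customizes this loss with task-driven features from its MTFS-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePerceptualLoss(nn.Module):
    """Compare two images in the feature space of a frozen extractor."""
    def __init__(self, extractor: nn.Module):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, pred, target):
        return F.l1_loss(self.extractor(pred), self.extractor(target))

# Placeholder extractor standing in for learned, task-specific feature maps.
extractor = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 16, 3, stride=2, padding=1))
loss_fn = FeaturePerceptualLoss(extractor)
sct, ct = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(loss_fn(sct, ct).item())
```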

Open-Set Face Recognition with Maximal Entropy and Objectosphere Loss

  • paper_url: http://arxiv.org/abs/2311.00400
  • repo_url: None
  • paper_authors: Rafael Henrique Vareto, Yu Linghu, Terrance E. Boult, William Robson Schwartz, Manuel Günther
  • for: 本研究针对开放集成识别问题,即在训练和投入阶段未见过的未知个体出现在运行阶段。
  • methods: 本文提出了一种嵌入式网络,该网络可以通过额外的负面图像和特定的成本函数(如物体镜像损失和提议的最大熵损失)得到改进。
  • results: 研究人员通过使用预训练的深度神经网络(DNN)作为特征提取器,然后使用嵌入式网络来替换预训练DNN的输出层,实现了在开放集成卷积上的出色表现。
    Abstract Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear on operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits from additional negative face images when combined with distinct cost functions, such as Objectosphere Loss (OS) and the proposed Maximal Entropy Loss (MEL). MEL modifies the traditional Cross-Entropy loss in favor of increasing the entropy for negative samples and attaches a penalty to known target classes in pursuance of gallery specialization. The proposed approach adopts pre-trained deep neural networks (DNNs) for face recognition as feature extractors. Then, the adapter network takes deep feature representations and acts as a substitute for the output layer of the pre-trained DNN in exchange for an agile domain adaptation. Promising results have been achieved following open-set protocols for three different datasets: LFW, IJB-C, and UCCS as well as state-of-the-art performance when supplementary negative data is properly selected to fine-tune the adapter network.
    摘要
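
A minimal sketch of the idea behind the proposed Maximal Entropy Loss, under the assumption that negative samples are pushed toward a uniform (maximum-entropy) prediction while known identities keep the usual cross-entropy term; the exact weighting and the gallery-specialization penalty described in the paper may differ.

```python
import torch
import torch.nn.functional as F

def maximal_entropy_style_loss(logits, labels, is_negative):
    """Sketch: cross-entropy for known identities plus a uniform-target term that
    pushes negative samples toward a maximum-entropy prediction (assumes both
    known and negative samples are present in the batch)."""
    log_p = F.log_softmax(logits, dim=1)
    known = ~is_negative
    loss_known = -log_p[known].gather(1, labels[known].unsqueeze(1)).mean()
    # Minimizing the mean negative log-probability over *all* classes is, up to a
    # constant, the KL divergence to the uniform distribution: entropy is maximized.
    loss_negative = -log_p[is_negative].mean()
    return loss_known + loss_negative

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
is_negative = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.bool)
print(maximal_entropy_style_loss(logits, labels, is_negative).item())
```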

Towards Omni-supervised Referring Expression Segmentation

  • paper_url: http://arxiv.org/abs/2311.00397
  • repo_url: https://github.com/nineblu/omni-res
  • paper_authors: Minglang Huang, Yiyi Zhou, Gen Luo, Guannan Jiang, Weilin Zhuang, Xiaoshuai Sun
  • for: 提高 Referring Expression Segmentation (RES) 的训练效率,充分利用无标注、全标注和弱标注等不同类型的数据进行高效的 RES 训练。
  • methods: 提出 Omni-supervised Referring Expression Segmentation (Omni-RES) 任务,并基于教师-学生学习方法,利用弱标签筛选和改进高质量的伪掩码(pseudo-mask),以提高 RES 性能。
  • results: 对多个 state-of-the-art RES 模型进行了广泛实验,证明了 Omni-RES 方法的有效性:仅使用 10% 的全标注数据,Omni-RES 即可帮助基础模型达到全监督性能,并大幅超越半监督方法,在 RefCOCO 和 RefCOCO+ 上分别提升 14.93% 和 14.95%。
    Abstract Referring Expression Segmentation (RES) is an emerging task in computer vision, which segments the target instances in images based on text descriptions. However, its development is plagued by the expensive segmentation labels. To address this issue, we propose a new learning task for RES called Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to make full use of unlabeled, fully labeled and weakly labeled data, e.g., referring points or grounding boxes, for efficient RES training. To accomplish this task, we also propose a novel yet strong baseline method for Omni-RES based on the recently popular teacher-student learning, where the weak labels are not directly transformed into supervision signals but used as a yardstick to select and refine high-quality pseudo-masks for teacher-student learning. To validate the proposed Omni-RES method, we apply it to a set of state-of-the-art RES models and conduct extensive experiments on a bunch of RES datasets. The experimental results yield the obvious merits of Omni-RES over the fully-supervised and semi-supervised training schemes. For instance, with only 10% fully labeled data, Omni-RES can help the base model achieve 100% fully supervised performance, and it also outperforms the semi-supervised alternative by a large margin, e.g., +14.93% on RefCOCO and +14.95% on RefCOCO+, respectively. More importantly, Omni-RES also enables the use of large-scale vision-language datasets like Visual Genome to facilitate low-cost RES training, and achieves new SOTA performance of RES, e.g., 80.66 on RefCOCO.
    摘要 “参考表达分割(RES)是计算机视觉领域的一种新趋势,它基于图像中的文本描述来分割目标实例。然而,其发展受到严重的分割标签成本的限制。为解决这个问题,我们提出了一种新的学习任务,即多种指导 Referring Expression Segmentation(Omni-RES),它通过利用无标签数据、全标签数据和弱标签数据,例如引用点或权重标签,进行高效的RES训练。为实现这个任务,我们还提出了一种新的基线方法,基于最近受欢迎的教师学生学习,其中弱标签不直接转化为监督信号,而是用于选择和改进高质量的 pseudo-mask。为验证我们的 Omni-RES 方法,我们对一些状态前的 RES 模型进行了广泛的实验,并在一些 RES 数据集上进行了测试。实验结果表明,Omni-RES 方法在比 Fully-supervised 和 Semi-supervised 训练方案更有优势。例如,只有 10% 全标签数据,Omni-RES 可以帮助基础模型达到全标签性能,并且也在 Semi-supervised 方案中高于差分较大,例如 RefCOCO 和 RefCOCO+ 上的 +14.93% 和 +14.95%,分别。此外,Omni-RES 还可以使用大规模的视觉语言,如 Visual Genome,来实现低成本的 RES 训练,并实现新的 SOTA 性能,例如 RefCOCO 上的 80.66%。”
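
One simple way to use a weak label as a "yardstick" for pseudo-masks, as described above, is to keep a teacher's mask only when its bounding box agrees with the weak grounding box. This is an illustrative criterion, not the paper's exact selection and refinement scheme.

```python
import numpy as np

def box_from_mask(mask):
    ys, xs = np.where(mask > 0)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=float)

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def keep_pseudo_mask(pseudo_mask, weak_box, thr=0.5):
    """Keep a teacher's pseudo-mask only if its box agrees with the weak box label."""
    if pseudo_mask.sum() == 0:
        return False
    return iou(box_from_mask(pseudo_mask), np.asarray(weak_box, dtype=float)) >= thr

mask = np.zeros((64, 64)); mask[10:30, 12:40] = 1
print(keep_pseudo_mask(mask, weak_box=[10, 8, 42, 32]))  # True: IoU ~0.67
```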

Fixation-based Self-calibration for Eye Tracking in VR Headsets

  • paper_url: http://arxiv.org/abs/2311.00391
  • repo_url: None
  • paper_authors: Ryusei Uramune, Sei Ikeda, Hiroki Ishizuka, Osamu Oshiro
  • for: 本研究旨在提出一种基于自由视点和物体表面注视点分布的自校准方法,用于虚拟现实(VR)头戴式设备中的眼动跟踪。
  • methods: 该方法假设用户视点可以自由移动,且在注视期间来自不同视点的注视点(PoR)集中分布在物体表面的一小块区域内。首先,将 I-VDT 算法扩展到三维场景,从未校准的视线方向时间序列中检测注视;然后,通过最小化 PoR 的散度指标来优化校准参数。
  • results: 该方法有望在无需显式用户校准、图像处理或标记替代物的情况下,估计从光轴到视轴的用户相关偏移。在 18 名参与者于两个遮挡较多的 VR 环境中行走的视线数据上,该方法的精度为 2.1$^\circ$,显著低于平均偏移;这是首个在三维环境中平均误差低于 3$^\circ$ 的自校准方法。此外,通过改进注视检测或优化算法,精度还可再提升最多 1.2$^\circ$。
    Abstract This study proposes a novel self-calibration method for eye tracking in a virtual reality (VR) headset. The proposed method is based on the assumptions that the user's viewpoint can freely move and that the points of regard (PoRs) from different viewpoints are distributed within a small area on an object surface during visual fixation. In the method, fixations are first detected from the time-series data of uncalibrated gaze directions using an extension of the I-VDT (velocity and dispersion threshold identification) algorithm to a three-dimensional (3D) scene. Then, the calibration parameters are optimized by minimizing the sum of a dispersion metrics of the PoRs. The proposed method can potentially identify the optimal calibration parameters representing the user-dependent offset from the optical axis to the visual axis without explicit user calibration, image processing, or marker-substitute objects. For the gaze data of 18 participants walking in two VR environments with many occlusions, the proposed method achieved an accuracy of 2.1$^\circ$, which was significantly lower than the average offset. Our method is the first self-calibration method with an average error lower than 3$^\circ$ in 3D environments. Further, the accuracy of the proposed method can be improved by up to 1.2$^\circ$ by refining the fixation detection or optimization algorithm.
    摘要 本研究提出了一种用于虚拟现实(VR)头戴式设备中眼动跟踪的新型自校准方法。该方法基于两个假设:用户视点可以自由移动,且在注视期间来自不同视点的注视点(PoR)集中分布在物体表面的一小块区域内。该方法首先将 I-VDT(速度与散度阈值识别)算法扩展到三维场景,从未校准的视线方向时间序列中检测注视;然后,通过最小化 PoR 的散度指标来优化校准参数。该方法有望在无需显式用户校准、图像处理或标记替代物的情况下,确定从光轴到视轴的用户相关偏移。对 18 名参与者在两个遮挡较多的 VR 环境中行走的视线数据进行分析,该方法的精度为 2.1°,显著低于平均偏移。我们的方法是首个在三维环境中平均误差低于 3° 的自校准方法。此外,通过改进注视检测或优化算法,精度还可再提升最多 1.2°。
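
For intuition, the sketch below runs a plain dispersion-threshold (I-DT-style) pass over a sequence of points of regard; the paper's I-VDT extension additionally applies a velocity threshold and operates on uncalibrated 3D gaze data. Window size and threshold are made-up values.

```python
import numpy as np

def detect_fixations(points, window=30, max_dispersion=0.05):
    """Dispersion-threshold fixation detection on a sequence of 3D points of regard.
    A simplified I-DT-style sketch; I-VDT also applies a velocity threshold."""
    fixations, i = [], 0
    while i + window <= len(points):
        w = points[i:i + window]
        dispersion = (w.max(axis=0) - w.min(axis=0)).sum()
        if dispersion <= max_dispersion:
            j = i + window
            while j < len(points):
                w2 = points[i:j + 1]
                if (w2.max(axis=0) - w2.min(axis=0)).sum() > max_dispersion:
                    break
                j += 1
            fixations.append((i, j))   # (start index, end index) of one fixation
            i = j
        else:
            i += 1
    return fixations

rng = np.random.default_rng(0)
por = np.cumsum(rng.normal(0, 0.001, size=(300, 3)), axis=0)  # slowly drifting gaze
print(detect_fixations(por)[:3])
```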

NeuralGF: Unsupervised Point Normal Estimation by Learning Neural Gradient Function

  • paper_url: http://arxiv.org/abs/2311.00389
  • repo_url: https://github.com/leoqli/neuralgf
  • paper_authors: Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han
  • for: 本研究旨在提出一种能够直接从点云数据中提取方向法的深度学习方法,而无需使用地面真实方向的监督。
  • methods: 我们引入了一种新的神经网络学习方法,该方法鼓励神经网络 fits 输入点云数据,并且在点上产生单位方向的梯度。我们还引入了损失函数,使得查询点能够逐渐到达移动目标点,并且在approximated surface上归一化。
  • results: 我们的方法可以更加准确地估计方向,并且能够抗抗噪、缺失和density变化。我们的result在广泛使用的benchmark上达到了最佳性能,超过了latest方法。
    Abstract Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing. The state-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision. However, normal supervision in benchmarks comes from synthetic shapes and is usually not available from real scans, thereby limiting the learned priors of these methods. In addition, normal orientation consistency across shapes remains difficult to achieve without a separate post-processing procedure. To resolve these issues, we propose a novel method for estimating oriented normals directly from point clouds without using ground truth normals as supervision. We achieve this by introducing a new paradigm for learning neural gradient functions, which encourages the neural network to fit the input point clouds and yield unit-norm gradients at the points. Specifically, we introduce loss functions to facilitate query points to iteratively reach the moving targets and aggregate onto the approximated surface, thereby learning a global surface representation of the data. Meanwhile, we incorporate gradients into the surface approximation to measure the minimum signed deviation of queries, resulting in a consistent gradient field associated with the surface. These techniques lead to our deep unsupervised oriented normal estimator that is robust to noise, outliers and density variations. Our excellent results on widely used benchmarks demonstrate that our method can learn more accurate normals for both unoriented and oriented normal estimation tasks than the latest methods. The source code and pre-trained model are publicly available at https://github.com/LeoQLi/NeuralGF.
    摘要 3D点云的法向量估计是三维几何处理中的基本任务。目前最先进的方法依赖于从法向量监督中学习的局部表面拟合先验。然而,基准数据中的法向量监督来自合成形状,真实扫描数据通常无法提供,因而限制了这些方法所学先验的泛化能力。此外,在缺乏额外后处理的情况下,不同形状之间的法向量方向一致性也难以实现。为解决这些问题,我们提出了一种无需真实法向量监督、直接从点云估计带方向法向量的新方法。我们引入了一种新的神经梯度函数学习范式,鼓励神经网络拟合输入点云,并在采样点处产生单位范数的梯度。具体而言,我们引入了损失函数,使查询点能够逐步逼近移动目标并聚合到近似表面上,从而学习数据的全局表面表示;同时,我们将梯度纳入表面近似中,以度量查询点的最小带符号偏差,从而获得与表面一致的梯度场。这些技术使我们的深度无监督带方向法向量估计器对噪声、离群点和密度变化具有鲁棒性。在广泛使用的基准上的出色结果表明,我们的方法在无方向和带方向法向量估计任务上都比最新方法更为准确。源代码和预训练模型可在 https://github.com/LeoQLi/NeuralGF 获取。
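
A minimal sketch of learning a neural gradient function without ground-truth normals: the network fits a scalar field whose zero level set passes through the points and whose gradients are pushed toward unit norm, so normalized gradients serve as oriented normal estimates. The loss weights are assumptions, and the paper's moving-target and query-aggregation terms are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Implicit scalar field f(x); its gradient field provides oriented normals.
f = nn.Sequential(nn.Linear(3, 64), nn.Softplus(),
                  nn.Linear(64, 64), nn.Softplus(),
                  nn.Linear(64, 1))

points = torch.rand(128, 3, requires_grad=True)              # surface samples
values = f(points)

grads = torch.autograd.grad(values.sum(), points, create_graph=True)[0]
normals = F.normalize(grads, dim=1)                          # oriented normal estimates

loss_fit = values.abs().mean()                               # pull the zero level set onto the points
loss_unit = ((grads.norm(dim=1) - 1.0) ** 2).mean()          # encourage unit-norm gradients
loss = loss_fit + 0.1 * loss_unit
loss.backward()
print(normals.shape, loss.item())
```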

Learning Cooperative Trajectory Representations for Motion Forecasting

  • paper_url: http://arxiv.org/abs/2311.00371
  • repo_url: https://github.com/air-thu/dair-v2x-seq
  • paper_authors: Hongzhi Ruan, Haibao Yu, Wenxian Yang, Siqi Fan, Yingjuan Tang, Zaiqing Nie
  • for: This paper focuses on motion forecasting for autonomous driving, specifically using cooperative information from infrastructure and other vehicles to enhance forecasting capabilities.
  • methods: The proposed method is called V2X-Graph, which is an interpretable and end-to-end learning framework that leverages cooperative motion and interaction contexts using an interpretable graph.
  • results: The paper demonstrates the effectiveness of V2X-Graph on the V2I motion forecasting dataset V2X-Seq, and also constructs a real-world V2X motion forecasting dataset V2X-Traj to further evaluate the method. The results show the advantage of the proposed method.
    Abstract Motion forecasting is an essential task for autonomous driving, and the effective information utilization from infrastructure and other vehicles can enhance motion forecasting capabilities. Existing research have primarily focused on leveraging single-frame cooperative information to enhance the limited perception capability of the ego vehicle, while underutilizing the motion and interaction information of traffic participants observed from cooperative devices. In this paper, we first propose the cooperative trajectory representations learning paradigm. Specifically, we present V2X-Graph, the first interpretable and end-to-end learning framework for cooperative motion forecasting. V2X-Graph employs an interpretable graph to fully leverage the cooperative motion and interaction contexts. Experimental results on the vehicle-to-infrastructure (V2I) motion forecasting dataset, V2X-Seq, demonstrate the effectiveness of V2X-Graph. To further evaluate on V2X scenario, we construct the first real-world vehicle-to-everything (V2X) motion forecasting dataset V2X-Traj, and the performance shows the advantage of our method. We hope both V2X-Graph and V2X-Traj can facilitate the further development of cooperative motion forecasting. Find project at https://github.com/AIR-THU/V2X-Graph, find data at https://github.com/AIR-THU/DAIR-V2X-Seq.
    摘要 运动预测是自动驾驶中的关键任务,有效利用来自基础设施和其他车辆的信息可以增强运动预测能力。现有研究主要利用单帧协同信息来弥补自车有限的感知能力,而未充分利用从协同设备观测到的交通参与者的运动和交互信息。在本文中,我们首先提出了协同轨迹表示学习范式。特别地,我们提出了V2X-Graph,这是第一个用于协同运动预测的可解释端到端学习框架。V2X-Graph使用可解释的图来充分利用协同运动和交互上下文。在车路协同(V2I)运动预测数据集V2X-Seq上的实验结果验证了V2X-Graph的有效性。为进一步在V2X场景下进行评估,我们构建了第一个真实世界的Vehicle-to-Everything(V2X)运动预测数据集V2X-Traj,其上的性能表现显示了我们方法的优势。我们希望V2X-Graph和V2X-Traj能够促进协同运动预测的进一步发展。项目见https://github.com/AIR-THU/V2X-Graph,数据见https://github.com/AIR-THU/DAIR-V2X-Seq。

LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

  • paper_url: http://arxiv.org/abs/2311.00353
  • repo_url: None
  • paper_authors: Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye Wang, Pengfei Yan
  • for: zero-shot video-to-video translation with temporal coherence
  • methods: incorporates warping operation in the latent space to constrain query tokens and improve temporal consistency
  • results: superior video-to-video translation with enhanced visual temporal coherence compared to previous methods
    Abstract Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across generated video frames by image diffusion models. Previous methods typically adopt cross-frame attention, \emph{i.e.,} sharing the \textit{key} and \textit{value} tokens across attentions of different frames, to encourage the temporal consistency. However, in those works, temporal inconsistency issue may not be thoroughly solved, rendering the fidelity of generated videos limited.%The current state of the art cross-frame attention method aims at maintaining fine-grained visual details across frames, but it is still challenged by the temporal coherence problem. In this paper, we find the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named \textit{LatentWarp}. Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space to constrain the query tokens. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of last frame to align with the current frame during the denoising process. As a result, the corresponding regions across the adjacent frames can share closely-related query tokens and attention outputs, which can further improve latent-level consistency to enhance visual temporal coherence of generated videos. Extensive experiment results demonstrate the superiority of \textit{LatentWarp} in achieving video-to-video translation with temporal coherence.
    摘要 利用生成能力的图像扩散模型可以提供无需预训练的视频到视频翻译的潜在潜力。关键在于如何在生成的视频帧中保持时间一致性。先前的方法通常采用cross-frame注意力,即在不同帧的注意力中共享键和值token,以促进时间一致性。然而,这些工作中可能并未完全解决时间不一致的问题,因此生成的视频的准确性有限。现状的潜在抑制方法是保持细腻的视觉细节的同时,仍然面临着时间协调问题。在这篇论文中,我们发现瓶颈在不受限制的查询token上,因此我们提出了一种新的无需预训练视频到视频翻译框架,名为LatentWarp。我们的方法简单:在生成过程中,基于原始视频中获得的光流,将生成的最后一帧的秘密特征截图进行扭曲,以使得当前帧和相邻帧的相关区域可以共享相似的查询token和注意力输出,从而进一步提高秘密水平的一致性,以提高生成的视频的视觉时间准确性。我们的实验结果表明,LatentWarp可以在无需预训练的情况下实现视频到视频翻译的时间准确性。
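
The core operation described above, warping the previous frame's latent with optical flow before denoising, can be sketched with grid_sample as below. The flow is assumed to already be at latent resolution, and how the warped latent then constrains the query tokens is left out of this sketch.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent_prev, flow):
    """Warp the previous frame's latent toward the current frame using optical flow.
    flow: (B, 2, H, W) displacement in pixels at latent resolution (assumed)."""
    b, _, h, w = latent_prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(latent_prev)  # (1, 2, H, W)
    coords = base + flow
    # Normalize to [-1, 1] for grid_sample, which expects (B, H, W, 2) as (x, y).
    coords[:, 0] = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)
    return F.grid_sample(latent_prev, grid, align_corners=True)

z_prev = torch.randn(1, 4, 32, 32)
flow = torch.zeros(1, 2, 32, 32)     # zero flow -> identity warp
print(torch.allclose(warp_with_flow(z_prev, flow), z_prev, atol=1e-5))
```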

Analyzing Head Orientation of Neurotypical and Autistic Individuals in Triadic Conversations

  • paper_url: http://arxiv.org/abs/2311.00343
  • repo_url: None
  • paper_authors: Onur N. Tepencelik, Wenchuan Wei, Pamela C. Cosman, Sujit Dey
  • for: 这个论文是用来估计人体和头部的方向偏移的系统。
  • methods: 这个系统使用了低分辨率点云数据来估计人体和头部的方向偏移。模型使用椭球适应和人工神经网络回归器来准确地估计人体和头部的方向偏移。
  • results: 模型的测试结果表明,与其他使用RGB摄像头的身体和头部偏移估计系统相比,使用LiDAR感知器保护用户隐私,并达到相似的准确性。此外,模型不需要在前方 Specified placement,并且可以在实际会话中准确地估计人体和头部的方向偏移。
    Abstract We propose a system that estimates people's body and head orientations using low-resolution point cloud data from two LiDAR sensors. Our models make accurate estimations in real-world conversation settings where the subject moves naturally with varying head and body poses. The body orientation estimation model uses ellipse fitting while the head orientation estimation model is a pipeline of geometric feature extraction and an ensemble of neural network regressors. Compared with other body and head orientation estimation systems using RGB cameras, our proposed system uses LiDAR sensors to preserve user privacy, while achieving comparable accuracy. Unlike other body/head orientation estimation systems, our sensors do not require a specified placement in front of the subject. Our models achieve a mean absolute estimation error of 5.2 degrees for body orientation and 13.7 degrees for head orientation. We use our models to quantify behavioral differences between neurotypical and autistic individuals in triadic conversations. Tests of significance show that people with autism spectrum disorder display significantly different behavior compared to neurotypical individuals in terms of distributing attention between participants in a conversation, suggesting that the approach could be a component of a behavioral analysis or coaching system.
    摘要 我们提出了一个系统,使用低分辨率点云数据从两个LiDAR感知器来估算人体和头部orientation。我们的模型在真实世界的对话 Setting中具有高精度,可以捕捉人体在自然的移动和变化pose下的orientation。bodyorientation estimation模型使用椭球拟合,而headorientation estimation模型则是一个包括几何特征提取和多个神经网络回归器的管道。与其他使用RGB摄像头的body和headorientation estimation系统相比,我们的提议的系统使用LiDAR感知器,保持用户隐私,同时实现相似的准确性。与其他body/headorientation estimation系统不同的是,我们的感知器不需要在前方的主题处置于特定的位置。我们的模型的 mean absolute estimation error为5.2度 дляbodyorientation和13.7度 дляheadorientation。我们使用我们的模型来量化 между neurtypical和autism Spectrum Disorder(ASD)个体在多人对话中的行为差异。测试显示,人类ASD Displayed significant differences in attention distribution between participants in a conversation compared to neurotypical individuals, suggesting that the approach could be a component of a behavioral analysis or coaching system.
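
As a toy version of the ellipse-fitting step for body orientation, one can run PCA on the horizontal projection of a torso point cloud and read the facing direction off the minor axis (up to a front/back ambiguity). This is an illustration only; the paper's pipeline also includes a neural-network regressor ensemble for head orientation.

```python
import numpy as np

def body_orientation_deg(points_xy):
    """Estimate a body orientation angle from the horizontal (x, y) projection of a
    torso point cloud by fitting the principal axes of an ellipse (PCA sketch)."""
    centered = points_xy - points_xy.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    minor = eigvecs[:, 0]                        # minor axis of the fitted ellipse
    # The facing direction is roughly along the minor axis (front/back ambiguous).
    return float(np.degrees(np.arctan2(minor[1], minor[0])) % 180.0)

rng = np.random.default_rng(1)
shoulders = rng.normal(0, [0.25, 0.05], size=(500, 2))   # wide in x, thin in y
print(round(body_orientation_deg(shoulders), 1))          # ~90 deg (facing +/- y)
```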

fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding

  • paper_url: http://arxiv.org/abs/2311.00342
  • repo_url: None
  • paper_authors: Xuelin Qian, Yun Wang, Jingyang Huo, Jianfeng Feng, Yanwei Fu
  • for: This paper aims to develop a novel approach for pre-training fMRI data, addressing the challenges of individual brain differences and improving the quality of brain activity decoding.
  • methods: The proposed approach, called fMRI-PTE, uses an auto-encoder to transform fMRI signals into unified 2D representations, leveraging a novel learning strategy and image generators to enhance the quality of reconstruction and facilitate downstream tasks.
  • results: The authors demonstrate the effectiveness of fMRI-PTE through extensive experiments, showing improved performance in brain activity decoding compared to traditional methods and offering a promising foundation for future research in this area.
    Abstract The exploration of brain activity and its decoding from fMRI data has been a longstanding pursuit, driven by its potential applications in brain-computer interfaces, medical diagnostics, and virtual reality. Previous approaches have primarily focused on individual subject analysis, highlighting the need for a more universal and adaptable framework, which is the core motivation behind our work. In this work, we propose fMRI-PTE, an innovative auto-encoder approach for fMRI pre-training, with a focus on addressing the challenges of varying fMRI data dimensions due to individual brain differences. Our approach involves transforming fMRI signals into unified 2D representations, ensuring consistency in dimensions and preserving distinct brain activity patterns. We introduce a novel learning strategy tailored for pre-training 2D fMRI images, enhancing the quality of reconstruction. fMRI-PTE's adaptability with image generators enables the generation of well-represented fMRI features, facilitating various downstream tasks, including within-subject and cross-subject brain activity decoding. Our contributions encompass introducing fMRI-PTE, innovative data transformation, efficient training, a novel learning strategy, and the universal applicability of our approach. Extensive experiments validate and support our claims, offering a promising foundation for further research in this domain.
    摘要 脑活动的探索及其从fMRI数据中的解码是一项长期的追求,其潜在应用包括脑机接口、医疗诊断和虚拟现实等。先前的方法主要集中在个体被试的分析上,这凸显了对更通用、更具适应性框架的需求,而这正是本工作的核心动机。在这项工作中,我们提出了fMRI-PTE,一种创新的自编码方法用于fMRI预训练,重点解决因个体大脑差异而导致的fMRI数据维度变化的挑战。我们的方法将fMRI信号转换成统一的2D表示形式,确保维度的一致性并保留独特的脑活动模式。我们引入了一种针对2D fMRI图像预训练的新学习策略,以提高重建质量。fMRI-PTE与图像生成器的良好适配使其能够生成表达充分的fMRI特征,促进多种下游任务,包括个体内和跨个体的脑活动解码。我们的贡献包括提出fMRI-PTE、创新的数据转换、高效的训练、新的学习策略以及方法的普适性。广泛的实验验证并支持了我们的论断,为该领域的进一步研究提供了可靠的基础。

Space Narrative: Generating Images and 3D Scenes of Chinese Garden from Text using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.00339
  • repo_url: None
  • paper_authors: Jiaxi Shi, Hao Hua
  • for: 这篇论文主要针对的是传统中国园林研究和修复的困难问题,即缺乏直接资料。
  • methods: 该论文提出了一种基于深度学习方法的园林图像生成方法,使用了文本描述和历史园林画作为数据集,并使用了LoRA进行精度调整。
  • results: 该论文通过使用文本描述和历史园林画的数据集,使用了深度学习方法生成了具有明朝风格的园林图像,并可以在Unity 3D 中实现三维展示。
    Abstract The consistent mapping from poems to paintings is essential for the research and restoration of traditional Chinese gardens. But the lack of firsthand material is a great challenge to the reconstruction work. In this paper, we propose a method to generate garden paintings based on text descriptions using a deep learning method. Our image-text pair dataset consists of more than one thousand Ming Dynasty Garden paintings and their inscriptions and postscripts. A latent text-to-image diffusion model learns the mapping from descriptive texts to garden paintings of the Ming Dynasty, and then the text description of Jichang Garden guides the model to generate new garden paintings. The cosine similarity between the guide text and the generated image is the evaluation criterion for the generated images. Our dataset is used to fine-tune the pre-trained diffusion model using Low-Rank Adaptation of Large Language Models (LoRA). We also transformed the generated images into a panorama and created a free-roam scene in Unity 3D. Our post-trained model is capable of generating garden images in the style of Ming Dynasty landscape paintings based on textual descriptions. The generated images are compatible with three-dimensional presentation in Unity 3D.
    摘要 “ Traditional Chinese gardens 的研究和修复受到描述与画作之间的一致性很大帮助。但是,由于缺乏直接证据,修复工作受到很大挑战。本文提出了一种基于深度学习方法生成园林画作的方法。我们的图文对照集包括明朝园林画作和其附注和词汇,共计超过一千个。我们使用文本描述和图像生成模型,将描述文本引导模型生成新的园林画作。我们使用 cosine 相似性作为生成图像的评价标准。我们的模型可以在 Unity 3D 中生成园林画作,并将其转换为漫游场景。我们的模型可以基于文本描述生成园林画作,并且可以与三维表现相匹配。”
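
The evaluation criterion above is the cosine similarity between the guiding text and the generated image. The paper does not specify here which encoder produces the embeddings; the sketch below uses CLIP purely as a plausible stand-in, with a placeholder image in place of a generated painting.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (512, 512))  # placeholder for a generated garden painting
text = "a Ming Dynasty garden with a pond, rockery and winding corridor"

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(float((img_emb * txt_emb).sum()))   # cosine similarity in [-1, 1]
```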

SDF4CHD: Generative Modeling of Cardiac Anatomies with Congenital Heart Defects

  • paper_url: http://arxiv.org/abs/2311.00332
  • repo_url: None
  • paper_authors: Fanwei Kong, Sascha Stocker, Perry S. Choi, Michael Ma, Daniel B. Ennis, Alison Marsden
  • for: This paper aims to improve the diagnosis and treatment planning of congenital heart disease (CHD) by generating virtual cardiac anatomies using deep learning (DL) methods.
  • methods: The proposed approach uses a type- and shape-disentangled generative model based on signed distance fields (SDF) to capture the wide spectrum of cardiac anatomies observed in different CHD types. The approach also learns invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes.
  • results: The proposed approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation, which can improve the diagnosis and treatment planning of CHD patients.
    Abstract Congenital heart disease (CHD) encompasses a spectrum of cardiovascular structural abnormalities, often requiring customized treatment plans for individual patients. Computational modeling and analysis of these unique cardiac anatomies can improve diagnosis and treatment planning and may ultimately lead to improved outcomes. Deep learning (DL) methods have demonstrated the potential to enable efficient treatment planning by automating cardiac segmentation and mesh construction for patients with normal cardiac anatomies. However, CHDs are often rare, making it challenging to acquire sufficiently large patient cohorts for training such DL models. Generative modeling of cardiac anatomies has the potential to fill this gap via the generation of virtual cohorts; however, prior approaches were largely designed for normal anatomies and cannot readily capture the significant topological variations seen in CHD patients. Therefore, we propose a type- and shape-disentangled generative approach suitable to capture the wide spectrum of cardiac anatomies observed in different CHD types and synthesize differently shaped cardiac anatomies that preserve the unique topology for specific CHD types. Our DL approach represents generic whole heart anatomies with CHD type-specific abnormalities implicitly using signed distance fields (SDF) based on CHD type diagnosis, which conveniently captures divergent anatomical variations across different types and represents meaningful intermediate CHD states. To capture the shape-specific variations, we then learn invertible deformations to morph the learned CHD type-specific anatomies and reconstruct patient-specific shapes. Our approach has the potential to augment the image-segmentation pairs for rarer CHD types for cardiac segmentation and generate cohorts of CHD cardiac meshes for computational simulation.
    摘要 Congenital heart disease (CHD) 包括一系列心血管结构畸形,经常需要根据各个患者的特殊情况制定个性化的治疗方案。计算模型和分析这些特殊的心血管结构可以提高诊断和治疗规划,并最终可能导致更好的结果。深度学习(DL)方法已经表明可以通过自动化心血管分 segmentation和心血管建模来提高诊断和治疗规划的效率。但是,CHD 的发生率很低,因此困难以获得足够的患者组合来训练这些 DL 模型。生成模型可以填补这一空白,通过生成虚拟患者组合来模拟不同类型的 CHD。但是,先前的方法主要针对正常的心血管结构,无法轻松地捕捉 CHD 患者的重要的拓扑变化。因此,我们提出了一种类型和形状分解的生成方法,可以 capture 不同类型的 CHD 的各种拓扑变化,并生成具有不同形状的心血管结构。我们的 DL 方法使用了签名距离场(SDF)基于 CHD 类型诊断,以便捕捉不同类型的 CHD 中的多样化拓扑变化。然后,我们学习了可逆变形,以将学习的 CHD 类型特定的心血管结构变换为患者特定的形状。我们的方法有望增加较少seen的 CHD 类型的图像分割对,以及生成不同类型的 CHD 心血管核心,以便计算模拟。

Enhancing Clustering Representations with Positive Proximity and Cluster Dispersion Learning

  • paper_url: http://arxiv.org/abs/2311.00731
  • repo_url: None
  • paper_authors: Abhishek Kumar, Dong-Gyu Lee
  • for: 这篇论文主要用于提出一种新的深度划分方法,即PIPCDR方法,用于解决现代深度划分中的问题。
  • methods: PIPCDR方法使用了一种新的正例邻近损失函数和一种减少划分趋势的补偿器,以兼顾两种方法的优点,并消除其缺点。
  • results: PIPCDR方法在一系列模拟和实际 datasets上表现出色,能够生成良好的划分结果,同时避免维度缩合和类别碰撞问题,提高划分精度。
    Abstract Contemporary deep clustering approaches often rely on either contrastive or non-contrastive techniques to acquire effective representations for clustering tasks. Contrastive methods leverage negative pairs to achieve homogenous representations but can introduce class collision issues, potentially compromising clustering performance. On the contrary, non-contrastive techniques prevent class collisions but may produce non-uniform representations that lead to clustering collapse. In this work, we propose a novel end-to-end deep clustering approach named PIPCDR, designed to harness the strengths of both approaches while mitigating their limitations. PIPCDR incorporates a positive instance proximity loss and a cluster dispersion regularizer. The positive instance proximity loss ensures alignment between augmented views of instances and their sampled neighbors, enhancing within-cluster compactness by selecting genuinely positive pairs within the embedding space. Meanwhile, the cluster dispersion regularizer maximizes inter-cluster distances while minimizing within-cluster compactness, promoting uniformity in the learned representations. PIPCDR excels in producing well-separated clusters, generating uniform representations, avoiding class collision issues, and enhancing within-cluster compactness. We extensively validate the effectiveness of PIPCDR within an end-to-end Majorize-Minimization framework, demonstrating its competitive performance on moderate-scale clustering benchmark datasets and establishing new state-of-the-art results on large-scale datasets.
    摘要 现代深度划分方法 oftentimes 依赖于 either 对比或非对比技术来获得有效的划分表示。对比方法 利用负对比来实现同一类型的表示,但可能会引入类冲突问题,可能会降低划分性能。然而,非对比技术 避免类冲突问题,但可能会生成不均匀的表示,导致划分崩溃。在这项工作中,我们提出了一种新的端到端深度划分方法,名为PIPCDR。PIPCDR 结合了正例邻接损失和群分散正则化。正例邻接损失 确保在扩展视图中的实例和其采样的邻居之间的Alignment,提高同一个分组内的紧凑性,通过选择真正的正例对在嵌入空间中进行选择。同时,群分散正则化 最大化间分组距离,最小化同一个分组内的紧凑性,使得学习的表示具有均匀性。PIPCDR 在生成均匀的分组、避免类冲突问题、提高同一个分组内的紧凑性和端到端 Majorize-Minimization 框架中展现出了竞争性的性能,并在大规模数据集上达到了新的国际纪录。
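
A rough sketch of the two ingredients named above: a positive-proximity term that pulls an instance toward its augmented view (or a sampled neighbor), and a dispersion regularizer that pushes cluster centroids apart. The pairing strategy, centroid computation, and loss weights here are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def positive_proximity_loss(z_a, z_b):
    """Pull two views of the same instance (or sampled neighbors) together."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    return (2 - 2 * (z_a * z_b).sum(dim=1)).mean()

def cluster_dispersion_regularizer(centroids):
    """Push cluster centroids apart to keep the learned representation uniform."""
    c = F.normalize(centroids, dim=1)
    sim = c @ c.t()
    off_diag = sim[~torch.eye(len(c), dtype=torch.bool)]
    return off_diag.mean()   # minimizing mean pairwise similarity spreads the clusters

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
centroids = torch.randn(10, 128)
loss = positive_proximity_loss(z1, z2) + 0.5 * cluster_dispersion_regularizer(centroids)
print(loss.item())
```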

Flooding Regularization for Stable Training of Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2311.00318
  • repo_url: None
  • paper_authors: Iu Yahiro, Takashi Ishida, Naoto Yokoya
  • for: 提高生成 adversarial network (GAN) 的稳定性。
  • methods: 直接对抗损失函数进行补偿,使用涌流法来防止批判器损失值过低。
  • results: 实验表明,涌流法可以稳定 GAN 的训练,并且可以与其他稳定技术结合使用。此外,研究还发现,对批判器损失值进行限制,可以使训练过程更加稳定,即使涌流水平较高。
    Abstract Generative Adversarial Networks (GANs) have shown remarkable performance in image generation. However, GAN training suffers from the problem of instability. One of the main approaches to address this problem is to modify the loss function, often using regularization terms in addition to changing the type of adversarial losses. This paper focuses on directly regularizing the adversarial loss function. We propose a method that applies flooding, an overfitting suppression method in supervised learning, to GANs to directly prevent the discriminator's loss from becoming excessively low. Flooding requires tuning the flood level, but when applied to GANs, we propose that the appropriate range of flood level settings is determined by the adversarial loss function, supported by theoretical analysis of GANs using the binary cross entropy loss. We experimentally verify that flooding stabilizes GAN training and can be combined with other stabilization techniques. We also reveal that by restricting the discriminator's loss to be no greater than flood level, the training proceeds stably even when the flood level is somewhat high.
    摘要
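
The flooding objective itself is a one-liner: the training loss is reflected around a flood level b, so gradient descent ascends once the loss drops below b. Below is a minimal sketch of applying it to a discriminator loss; the flood-level value is a made-up placeholder, whereas the paper derives an appropriate range from the adversarial loss function.

```python
import torch

def flooded(loss: torch.Tensor, flood_level: float) -> torch.Tensor:
    """Flooding: keep the training loss from sinking below the flood level b.
    Gradients point 'uphill' once the loss drops under b."""
    return (loss - flood_level).abs() + flood_level

# Hypothetical use inside a GAN step: apply flooding to the discriminator's loss.
d_loss = torch.tensor(0.05, requires_grad=True)   # stand-in for a BCE discriminator loss
b = 0.3                                           # flood level (illustrative value)
flooded(d_loss, b).backward()
print(d_loss.grad)                                # -1: below the flood level, ascend instead
```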

An Empirical Study of Frame Selection for Text-to-Video Retrieval

  • paper_url: http://arxiv.org/abs/2311.00298
  • repo_url: None
  • paper_authors: Mengxia Wu, Min Cao, Yang Bai, Ziyin Zeng, Chen Chen, Liqiang Nie, Min Zhang
  • for: 文本-视频重现(TVR)旨在在大量视频库中找到基于文本查询的最相关视频。
  • methods: exist 方法通常选择视频中的一 subset of frames来表示视频内容,以提高 TVR 的性能和效率。
  • results: 根据多个 TVR benchmark 的全面分析,我们证明了 proper frame 选择可以显著提高 TVR 的检索效率,而无需牺牲检索性能。
    Abstract Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of the video challenges the performance and efficiency of TVR. To handle the serialized video contexts, existing methods typically select a subset of frames within a video to represent the video content for TVR. How to select the most representative frames is a crucial issue, whereby the selected frames are required to not only retain the semantic information of the video but also promote retrieval efficiency by excluding temporally redundant frames. In this paper, we make the first empirical study of frame selection for TVR. We systemically classify existing frame selection methods into text-free and text-guided ones, under which we detailedly analyze six different frame selections in terms of effectiveness and efficiency. Among them, two frame selections are first developed in this paper. According to the comprehensive analysis on multiple TVR benchmarks, we empirically conclude that the TVR with proper frame selections can significantly improve the retrieval efficiency without sacrificing the retrieval performance.
    摘要
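
As a toy example of the text-guided family of frame selection methods analyzed in the paper, the sketch below scores per-frame features against the query embedding and keeps the top-k frames; the feature extractor and the value of k are assumptions.

```python
import torch
import torch.nn.functional as F

def select_frames(frame_feats, text_feat, k=4):
    """Text-guided frame selection sketch: keep the k frames most similar to the query."""
    f = F.normalize(frame_feats, dim=1)
    t = F.normalize(text_feat, dim=0)
    scores = f @ t
    return scores.topk(k).indices.sort().values   # keep temporal order

frames = torch.randn(32, 256)   # one feature vector per sampled frame
query = torch.randn(256)        # text query embedding
print(select_frames(frames, query, k=4))
```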

Graph Representation Learning for Infrared and Visible Image Fusion

  • paper_url: http://arxiv.org/abs/2311.00291
  • repo_url: None
  • paper_authors: Jing Li, Lu Bai, Bin Yang, Chang Li, Lingfei Ma, Edwin R. Hancock
  • for: 本研究的目的是提出一种基于图表示的多模态图像重构方法,以提高图像重构的精度和效率。
  • methods: 本研究使用图 convolutional neural networks (GCNs) 抽取非本地自相关特征 (NLss),通过图可以提供细致的结构来归一化特征并传递信息。
  • results: 实验结果表明,提出的方法可以更好地捕捉图像中的NLss,并且与传统方法相比,具有更高的重构精度和效率。
    Abstract Infrared and visible image fusion aims to extract complementary features to synthesize a single fused image. Many methods employ convolutional neural networks (CNNs) to extract local features due to its translation invariance and locality. However, CNNs fail to consider the image's non-local self-similarity (NLss), though it can expand the receptive field by pooling operations, it still inevitably leads to information loss. In addition, the transformer structure extracts long-range dependence by considering the correlativity among all image patches, leading to information redundancy of such transformer-based methods. However, graph representation is more flexible than grid (CNN) or sequence (transformer structure) representation to address irregular objects, and graph can also construct the relationships among the spatially repeatable details or texture with far-space distance. Therefore, to address the above issues, it is significant to convert images into the graph space and thus adopt graph convolutional networks (GCNs) to extract NLss. This is because the graph can provide a fine structure to aggregate features and propagate information across the nearest vertices without introducing redundant information. Concretely, we implement a cascaded NLss extraction pattern to extract NLss of intra- and inter-modal by exploring interactions of different image pixels in intra- and inter-image positional distance. We commence by preforming GCNs on each intra-modal to aggregate features and propagate information to extract independent intra-modal NLss. Then, GCNs are performed on the concatenate intra-modal NLss features of infrared and visible images, which can explore the cross-domain NLss of inter-modal to reconstruct the fused image. Ablation studies and extensive experiments illustrates the effectiveness and superiority of the proposed method on three datasets.
    摘要 infrared和可见图像融合的目标是提取 complementary 特征来生成合并的单独图像。许多方法使用卷积神经网络(CNN)提取本地特征,因为它们的翻译不变性和本地性。然而,CNN失去了图像的非本地自相似性(NLss),尽管它可以通过抽象操作扩大感知范围,但仍然会导致信息损失。此外,transformer结构可以捕捉图像的长距离相关性,但是它会导致图像的信息重复。然而,图像表示更加灵活于网格(CNN)或序列(transformer结构)表示,可以 Address Irregular Objects。因此,为了解决这些问题,需要将图像转换为图像空间,并采用图像卷积神经网络(GCNs)提取NLss。这是因为图像可以提供细致的结构来聚合特征和在最近邻居中传递信息,而无需引入重复信息。具体来说,我们实现了一种卷积NLss提取模式,通过探索不同图像像素之间的交互来提取NLss。我们开始是在每个intra-modal中使用GCNs来聚合特征和传递信息,以提取独立的intra-modalNLss。然后,我们在 concatenate intra-modalNLss特征上使用GCNs来探索跨频道NLss,以重构融合图像。aborlation studies和广泛的实验表明了我们提posed方法的有效性和优越性。
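
For reference, a single graph-convolution step of the kind the method builds on looks as follows, with image patches as nodes and an affinity matrix as edges; the adjacency construction and the cascaded intra-/inter-modal design of the paper are not reproduced here.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.shape[0], device=adj.device)
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.lin(norm @ h))

# Toy graph over image patches: nodes = patch features, edges = pairwise affinity.
h = torch.randn(16, 32)                        # 16 patches, 32-d features
adj = (torch.randn(16, 16) > 1.0).float()
adj = ((adj + adj.t()) > 0).float()            # symmetrize
print(GCNLayer(32, 64)(h, adj).shape)          # torch.Size([16, 64])
```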

Mixture-of-Experts for Open Set Domain Adaptation: A Dual-Space Detection Approach

  • paper_url: http://arxiv.org/abs/2311.00285
  • repo_url: None
  • paper_authors: Zhenbang Du, Jiayu An, Jiahao Hong, Dongrui Wu
  • for: 这篇论文的目标是开放集领域自适应(OSDA),即同时应对源域与目标域之间的分布和标签偏移,在对已知类进行准确分类的同时识别目标域中的未知类样本。
  • methods: 这篇论文使用了混合专家(MoE)的方法,将不同的专家处理不同的输入特征,从而生成不同的专家路由模式,以便更好地识别未知的类别样本。
  • results: 这篇论文的实验结果显示,使用了Graph Router来更好地利用图像组件之间的空间信息,可以更好地识别未知的类别样本,并且比较精准地预测未知类别的结果。
    Abstract Open Set Domain Adaptation (OSDA) aims to cope with the distribution and label shifts between the source and target domains simultaneously, performing accurate classification for known classes while identifying unknown class samples in the target domain. Most existing OSDA approaches, depending on the final image feature space of deep models, require manually-tuned thresholds, and may easily misclassify unknown samples as known classes. Mixture-of-Expert (MoE) could be a remedy. Within an MoE, different experts address different input features, producing unique expert routing patterns for different classes in a routing feature space. As a result, unknown class samples may also display different expert routing patterns to known classes. This paper proposes Dual-Space Detection, which exploits the inconsistencies between the image feature space and the routing feature space to detect unknown class samples without any threshold. Graph Router is further introduced to better make use of the spatial information among image patches. Experiments on three different datasets validated the effectiveness and superiority of our approach. The code will come soon.
    摘要
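
A minimal mixture-of-experts sketch showing the "routing feature space" the method exploits: besides the output features, the router's soft assignment over experts gives a second representation in which unknown-class samples may deviate from known-class routing patterns. The gating design and the dual-space decision rule here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Mixture-of-experts with a softmax router; the routing distribution itself
    can be inspected as a second 'space' for spotting unknown-class samples."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):
        route = torch.softmax(self.router(x), dim=-1)                    # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, D)
        y = (route.unsqueeze(-1) * expert_out).sum(dim=1)
        return y, route

moe = TinyMoE(dim=32)
y, route = moe(torch.randn(8, 32))
# A sample whose routing pattern differs strongly from every known-class routing
# prototype (e.g., by cosine distance) can be flagged as unknown.
print(y.shape, route.shape)
```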

TLMCM Network for Medical Image Hierarchical Multi-Label Classification

  • paper_url: http://arxiv.org/abs/2311.00282
  • repo_url: None
  • paper_authors: Meng Wu, Siyan Luo, Qiyu Wu, Wenbin Ouyang
  • for: 本研究旨在应对现代医疗领域医学影像层次多标签分类(MI-HMC)任务中的两大挑战:数据不均衡和层次约束。现有的解决方案通常依赖复杂的模型架构设计或领域特定的预处理,实现时需要较多的专业知识或工作量。
  • methods: 为解决这些限制,本研究提出了用于MI-HMC任务的传输学习与最大约束模块(TLMCM)网络。在Area Under the Average Precision and Recall Curve($AU\overline{(PRC)}$)指标上,TLMCM网络提供了一种超越现有方法的新途径。此外,本研究还提出了两种在MI-HMC任务中尚未被广泛探讨的新准确率指标:$EMR$和$HammingAccuracy$。
  • results: 实验结果表明,TLMCM网络在MI-HMC任务中可达到较高的多标签预测准确率(80%-90%),是医疗领域应用中有价值的贡献。
    Abstract Medical Image Hierarchical Multi-Label Classification (MI-HMC) is of paramount importance in modern healthcare, presenting two significant challenges: data imbalance and \textit{hierarchy constraint}. Existing solutions involve complex model architecture design or domain-specific preprocessing, demanding considerable expertise or effort in implementation. To address these limitations, this paper proposes Transfer Learning with Maximum Constraint Module (TLMCM) network for the MI-HMC task. The TLMCM network offers a novel approach to overcome the aforementioned challenges, outperforming existing methods based on the Area Under the Average Precision and Recall Curve($AU\overline{(PRC)}$) metric. In addition, this research proposes two novel accuracy metrics, $EMR$ and $HammingAccuracy$, which have not been extensively explored in the context of the MI-HMC task. Experimental results demonstrate that the TLMCM network achieves high multi-label prediction accuracy($80\%$-$90\%$) for MI-HMC tasks, making it a valuable contribution to healthcare domain applications.
    摘要 医疗图像层次多标签分类(MI-HMC)在现代医疗中具有重要意义,存在两个主要挑战:数据不均衡和层次约束。现有的解决方案包括复杂的模型建立设计或域pecific的预处理,需要较大的专业知识或努力进行实现。为了解决这些限制,本文提出了基于传输学习的最大约束模块(TLMCM)网络,用于MI-HMC任务。TLMCM网络提供了一种新的方法,超越现有方法,根据Area Under the Average Precision and Recall Curve($AU\overline{(PRC)}$)指标。此外,本研究还提出了两个新的准确率指标,$EMR$和$HammingAccuracy$,在MI-HMC任务中尚未得到广泛的探讨。实验结果表明,TLMCM网络在MI-HMC任务中实现了80%-90%的多标签预测率,使其成为医疗领域应用中的有价值贡献。
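
The two accuracy metrics mentioned above have standard multi-label definitions, sketched below; the paper's exact hierarchical variants may add constraints on parent-child label consistency.

```python
import numpy as np

def exact_match_ratio(y_true, y_pred):
    """EMR: fraction of samples whose entire label vector is predicted correctly."""
    return float((y_true == y_pred).all(axis=1).mean())

def hamming_accuracy(y_true, y_pred):
    """Per-label accuracy averaged over all labels and samples (1 - Hamming loss)."""
    return float((y_true == y_pred).mean())

y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
print(exact_match_ratio(y_true, y_pred))   # 2/3: two samples fully correct
print(hamming_accuracy(y_true, y_pred))    # 8/9: one wrong label out of nine
```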

OpenForest: A data catalogue for machine learning in forest monitoring

  • paper_url: http://arxiv.org/abs/2311.00277
  • repo_url: https://github.com/rolnicklab/openforest
  • paper_authors: Arthur Ouaknine, Teja Kattenborn, Etienne Laliberté, David Rolnick
  • for: 本研究旨在提供一个开放的数据集,以便应用机器学习方法进行大规模森林监测。
  • methods: 本研究使用了86个开放数据集,包括森林资源调查、地面、空中和卫星数据记录等,以探讨森林生态系统的变化。
  • results: 本研究提供了一个动态目录,名为OpenForest,用于收集和总结所有可用的开放数据集,以便推广大规模森林监测的研究。
    Abstract Forests play a crucial role in Earth's system processes and provide a suite of social and economic ecosystem services, but are significantly impacted by human activities, leading to a pronounced disruption of the equilibrium within ecosystems. Advancing forest monitoring worldwide offers advantages in mitigating human impacts and enhancing our comprehension of forest composition, alongside the effects of climate change. While statistical modeling has traditionally found applications in forest biology, recent strides in machine learning and computer vision have reached important milestones using remote sensing data, such as tree species identification, tree crown segmentation and forest biomass assessments. For this, the significance of open access data remains essential in enhancing such data-driven algorithms and methodologies. Here, we provide a comprehensive and extensive overview of 86 open access forest datasets across spatial scales, encompassing inventories, ground-based, aerial-based, satellite-based recordings, and country or world maps. These datasets are grouped in OpenForest, a dynamic catalogue open to contributions that strives to reference all available open access forest datasets. Moreover, in the context of these datasets, we aim to inspire research in machine learning applied to forest biology by establishing connections between contemporary topics, perspectives and challenges inherent in both domains. We hope to encourage collaborations among scientists, fostering the sharing and exploration of diverse datasets through the application of machine learning methods for large-scale forest monitoring. OpenForest is available at https://github.com/RolnickLab/OpenForest .
    摘要 森林扮演着地球系维程序中重要的角色,提供了一系列社会和经济生态系统服务,但是人类活动对森林产生了明显的干扰,导致生态系统内部的平衡状态出现了明显的异常。推进全球森林监测可以有助于减轻人类对森林的影响,提高我们对森林结构的理解,以及气候变化的影响。在森林生物学中,统计学模型传统上找到了应用,但现在机器学习和计算机视觉在使用遥感数据时已经取得了重要的进步,例如树种识别、树冠分割和森林生物质评估。为此,开放数据的重要性仍然是不可或缺的,以提高这些数据驱动的算法和方法。本文提供了86个开放访问森林数据集,覆盖不同的空间尺度,包括森林资源库、地面、航空、卫星记录等,并分类在OpenForest中,一个动态目录开放至贡献。此外,在这些数据集的背景下,我们希望通过与机器学习领域的连接来激发研究,探索森林生物学中的当代话题、视角和挑战,并促进科学家之间的合作,共享和探索多种数据集,通过机器学习方法进行大规模森林监测。OpenForest可以在https://github.com/RolnickLab/OpenForest上获取。

Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study

  • paper_url: http://arxiv.org/abs/2311.00265
  • repo_url: https://github.com/jongdory/aldm
  • paper_authors: Jonghun Kim, Hyunjin Park
  • for: This paper proposes a model for image-to-image translation in 3D medical images without patch cropping, which can be used for comprehensive evaluations in medical image analysis.
  • methods: The proposed model uses a latent diffusion model (LDM) with switchable blocks, specifically multiple switchable spatially adaptive normalization (MS-SPADE), to generate high-quality target modalities in 3D.
  • results: The model demonstrated successful image synthesis across different source-target modality scenarios and outperformed other models in quantitative evaluations tested on multi-modal brain magnetic resonance imaging datasets of four different modalities and an independent IXI dataset.
    Abstract Multi-modal images play a crucial role in comprehensive evaluations in medical image analysis providing complementary information for identifying clinically important biomarkers. However, in clinical practice, acquiring multiple modalities can be challenging due to reasons such as scan cost, limited scan time, and safety considerations. In this paper, we propose a model based on the latent diffusion model (LDM) that leverages switchable blocks for image-to-image translation in 3D medical images without patch cropping. The 3D LDM combined with conditioning using the target modality allows generating high-quality target modality in 3D overcoming the shortcoming of the missing out-of-slice information in 2D generation methods. The switchable block, noted as multiple switchable spatially adaptive normalization (MS-SPADE), dynamically transforms source latents to the desired style of the target latents to help with the diffusion process. The MS-SPADE block allows us to have one single model to tackle many translation tasks of one source modality to various targets removing the need for many translation models for different scenarios. Our model exhibited successful image synthesis across different source-target modality scenarios and surpassed other models in quantitative evaluations tested on multi-modal brain magnetic resonance imaging datasets of four different modalities and an independent IXI dataset. Our model demonstrated successful image synthesis across various modalities even allowing for one-to-many modality translations. Furthermore, it outperformed other one-to-one translation models in quantitative evaluations.
    摘要 多Modal图像在医学影像分析中扮演着关键性的角色,提供了补充信息,用于标识临床重要的生物标志物。然而,在临床实践中,获取多Modal的可能性受到了多种因素的限制,如扫描成本、扫描时间和安全考虑。在这篇论文中,我们提出了基于秘密扩散模型(LDM)的模型,使用可 switchable 块(MS-SPADE)来实现图像到图像翻译。3D LDM 与 Conditioning 使用目标模式Allow generating high-quality target modality in 3D, overcoming the shortcomings of missing out-of-slice information in 2D generation methods. MS-SPADE block dynamically transforms source latents to the desired style of the target latents to help with the diffusion process. Our model can handle many translation tasks of one source modality to various targets, eliminating the need for multiple translation models for different scenarios. We tested our model on multi-modal brain magnetic resonance imaging datasets of four different modalities and an independent IXI dataset, and it exhibited successful image synthesis across different source-target modality scenarios and outperformed other models in quantitative evaluations. Our model also demonstrated successful image synthesis across various modalities, even allowing for one-to-many modality translations, and outperformed other one-to-one translation models in quantitative evaluations.
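
The switchable-block idea can be pictured as a normalization layer whose scale and shift are predicted from a target-modality code, so one network serves many source-to-target pairs. The sketch below is closer to an adaptive instance-norm than the paper's MS-SPADE, which modulates features spatially; it is meant only to convey the conditioning mechanism, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

class SwitchableConditionalNorm(nn.Module):
    """Normalization whose scale/shift come from a target-modality style code,
    letting a single 3D network 'switch' between translation targets."""
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm3d(channels, affine=False)
        self.to_gamma = nn.Linear(style_dim, channels)
        self.to_beta = nn.Linear(style_dim, channels)

    def forward(self, x, style):
        gamma = self.to_gamma(style)[:, :, None, None, None]
        beta = self.to_beta(style)[:, :, None, None, None]
        return self.norm(x) * (1 + gamma) + beta

x = torch.randn(1, 8, 16, 16, 16)   # 3D latent feature map
style = torch.randn(1, 4)           # code identifying the target modality
print(SwitchableConditionalNorm(8, 4)(x, style).shape)
```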

Solutions to Elliptic and Parabolic Problems via Finite Difference Based Unsupervised Small Linear Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2311.00259
  • repo_url: None
  • paper_authors: Adrian Celaya, Keegan Kirk, David Fuentes, Beatrice Riviere
  • for: 解决Partial Differential Equations (PDEs)问题,尤其是使用深度学习和神经网络来解决PDEs。
  • methods: 提出了一种 Fully Unsupervised Approach,不需要训练数据或标注输入输出对。
  • results: 对一些选择的elliptic和parabolic问题进行比较,与finite difference方法相当准确。
    Abstract In recent years, there has been a growing interest in leveraging deep learning and neural networks to address scientific problems, particularly in solving partial differential equations (PDEs). However, current neural network-based PDE solvers often rely on extensive training data or labeled input-output pairs, making them prone to challenges in generalizing to out-of-distribution examples. To mitigate the generalization gap encountered by conventional neural network-based methods in estimating PDE solutions, we formulate a fully unsupervised approach, requiring no training data, to estimate finite difference solutions for PDEs directly via small convolutional neural networks. Our proposed algorithms demonstrate a comparable accuracy to the true solution for several selected elliptic and parabolic problems compared to the finite difference method.
    摘要 近年来,有越来越多的关注利用深度学习和神经网络解决科学问题,特别是解决部分偏微分方程(PDEs)。然而,现有的神经网络基于PDE解决方法通常需要大量的训练数据或标注输入输出对,使其容易遇到对不同示例的泛化问题。为了解决传统神经网络基于方法中的泛化差距,我们提出了一种完全无监督的方法,不需要任何训练数据,直接通过小型 convolutional neural networks 来估算部分偏微分解决方案。我们的提议的算法在选择的椭球和偏微分问题中与finite difference方法相比证明了相似的准确性。
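
A toy version of the unsupervised, finite-difference-based training signal: a small CNN maps the right-hand side to a candidate solution, and the loss is the squared residual of the 5-point Laplacian stencil plus a boundary penalty, with no labeled solutions involved. The architecture, the problem (-Δu = f with zero Dirichlet boundary on the unit square), and the step count are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 5-point Laplacian stencil applied as a fixed convolution.
lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.Tanh(),
                    nn.Conv2d(16, 16, 3, padding=1), nn.Tanh(),
                    nn.Conv2d(16, 1, 3, padding=1))       # small CNN mapping f -> u

n = 32
h = 1.0 / (n - 1)                     # grid spacing on the unit square
f = torch.ones(1, 1, n, n)            # right-hand side of -laplace(u) = f
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    u = net(f)
    residual = F.conv2d(u, lap) / h**2 + f[:, :, 1:-1, 1:-1]          # interior PDE residual
    boundary = torch.cat([u[..., 0, :], u[..., -1, :],
                          u[..., :, 0], u[..., :, -1]], dim=-1)        # Dirichlet u = 0
    loss = residual.pow(2).mean() + boundary.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(loss.item())
```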

RAUNE-Net: A Residual and Attention-Driven Underwater Image Enhancement Method

  • paper_url: http://arxiv.org/abs/2311.00246
  • repo_url: https://github.com/fansuregrin/raune-net
  • paper_authors: Wangzhen Peng, Chenghao Zhou, Runze Hu, Jingchao Cao, Yutao Liu
  • for: 提高水下图像的清晰度和质量
  • methods: 采用基于残差学习和注意力机制的深度网络,在网络瓶颈处对高级特征进行残差学习,并在下采样过程中引入两种注意力操作
  • results: 有效提升了水下图像的增强效果,在多种真实水下环境下均能获得稳定且令人满意的视觉质量
    Abstract Underwater image enhancement (UIE) poses challenges due to distinctive properties of the underwater environment, including low contrast, high turbidity, visual blurriness, and color distortion. In recent years, the application of deep learning has quietly revolutionized various areas of scientific research, including UIE. However, existing deep learning-based UIE methods generally suffer from issues of weak robustness and limited adaptability. In this paper, inspired by residual and attention mechanisms, we propose a more reliable and reasonable UIE network called RAUNE-Net by employing residual learning of high-level features at the network's bottle-neck and two aspects of attention manipulations in the down-sampling procedure. Furthermore, we collect and create two datasets specifically designed for evaluating UIE methods, which contains different types of underwater distortions and degradations. The experimental validation demonstrates that our method obtains promising objective performance and consistent visual results across various real-world underwater images compared to other eight UIE methods. Our example code and datasets are publicly available at https://github.com/fansuregrin/RAUNE-Net.
    摘要 水下图像增强(UIE)因水下环境的特殊性而充满挑战,包括低对比度、高浑浊度、视觉模糊和颜色失真。近年来,深度学习悄然革新了包括UIE在内的多个科研领域。然而,现有基于深度学习的UIE方法普遍存在鲁棒性弱、适应性有限的问题。受残差和注意力机制的启发,本文提出一种更可靠、更合理的UIE网络RAUNE-Net:在网络瓶颈处对高层特征进行残差学习,并在下采样过程中引入两方面的注意力操作。此外,我们收集并构建了两个专门用于评估UIE方法的数据集,涵盖不同类型的水下失真与退化。实验验证表明,与其他八种UIE方法相比,我们的方法在各种真实水下图像上取得了可观的客观性能和一致的视觉效果。示例代码和数据集已公开在 https://github.com/fansuregrin/RAUNE-Net。
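
The two ingredients emphasized above, residual learning of high-level features at the bottleneck and attention applied in the down-sampling path, can be sketched as follows. Generic channel (SE-style) and spatial attention modules are used as stand-ins because the exact attention designs of RAUNE-Net are not spelled out here; all module names are illustrative assumptions.

```python
# A minimal sketch (assumed modules) of a down-sampling stage with channel and spatial
# attention plus a bottleneck residual block, in the spirit of the RAUNE-Net description.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)              # reweight channels by global statistics

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))   # reweight spatial locations

class AttentiveDown(nn.Module):
    """Down-sampling stage with the two attention manipulations."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                  nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.ca, self.sa = ChannelAttention(c_out), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(self.conv(x)))

class BottleneckResidual(nn.Module):
    """Residual learning of high-level features at the network bottleneck."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

# Toy forward pass on a 256x256 image.
x = torch.randn(1, 3, 256, 256)
feats = AttentiveDown(3, 32)(x)
feats = AttentiveDown(32, 64)(feats)
print(BottleneckResidual(64)(feats).shape)  # torch.Size([1, 64, 64, 64])
```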

1DFormer: Learning 1D Landmark Representations via Transformer for Facial Landmark Tracking

  • paper_url: http://arxiv.org/abs/2311.00241
  • repo_url: None
  • paper_authors: Shi Yin, Shijie Huan, Defu Lian, Shangfei Wang, Jinshui Hu, Tao Guo, Bing Yin, Baocai Yin, Cong Liu
  • for: 该论文旨在提升人脸关键点跟踪(facial landmark tracking)的性能,并挖掘 1D landmark 表示的潜力。
  • methods: 该论文提出了一种基于 Transformer 架构的方法,名为 1DFormer;它通过在时间和空间维度进行 token 交互,捕捉 facial landmark 的动态和几何特征,并通过循环 token 混合机制与置信度增强的多头注意力机制来建模长期 landmark 动态。
  • results: 实验结果表明,1DFormer 能够建模 facial landmark 序列中的长程顺序模式以及人脸固有的结构特征,并在 facial landmark tracking 中达到最先进(state-of-the-art)的性能。
    Abstract Recently, heatmap regression methods based on 1D landmark representations have shown prominent performance on locating facial landmarks. However, previous methods ignored to make deep explorations on the good potentials of 1D landmark representations for sequential and structural modeling of multiple landmarks to track facial landmarks. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and the geometric patterns of landmarks via token communications in both temporal and spatial dimensions for facial landmark tracking. For temporal modeling, we propose a recurrent token mixing mechanism, an axis-landmark-positional embedding mechanism, as well as a confidence-enhanced multi-head attention mechanism to adaptively and robustly embed long-term landmark dynamics into their 1D representations; for structure modeling, we design intra-group and inter-group structure modeling mechanisms to encode the component-level as well as global-level facial structure patterns as a refinement for the 1D representations of landmarks through token communications in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and the TF databases show that 1DFormer successfully models the long-range sequential patterns as well as the inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking.
    摘要 近来,基于一维(1D)关键点表示的热图回归方法在人脸关键点定位上表现突出。然而,已有方法未能深入挖掘1D关键点表示在多关键点的序列建模与结构建模方面的潜力,以用于人脸关键点跟踪。为此,我们提出一种名为1DFormer的Transformer架构,通过在时间和空间两个维度上进行token通信,捕捉关键点的动态与几何模式,从而学习富含信息的1D关键点表示。在时间建模方面,我们提出循环token混合机制、轴-关键点位置嵌入机制以及置信度增强的多头注意力机制,自适应且鲁棒地将长期关键点动态嵌入其1D表示;在结构建模方面,我们设计了组内与组间结构建模机制,通过1D卷积层在空间维度上进行token通信,将部件级和全局级的人脸结构模式编码为对关键点1D表示的进一步细化。在300VW和TF数据库上的实验结果表明,1DFormer能够成功建模长程序列模式与人脸固有结构,学习富含信息的1D关键点序列表示,并在人脸关键点跟踪上达到最先进的性能。
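
As a rough illustration of the 1D-representation idea, the sketch below treats each landmark in each frame as a token built from its 1D x- and y-axis heatmap responses, mixes tokens across landmarks with self-attention (structure modeling), and carries them across frames with a recurrent update (temporal modeling). It is a heavily simplified stand-in: the axis-landmark-positional embedding, confidence-enhanced attention, and intra/inter-group structure modules of 1DFormer are not reproduced, and all names are assumptions.

```python
# A heavily simplified stand-in for the 1DFormer idea: per-frame landmark tokens from 1D
# heatmaps, spatial self-attention across landmarks, and a recurrent update across frames.
import torch
import torch.nn as nn

class Simple1DLandmarkTracker(nn.Module):
    def __init__(self, num_landmarks: int = 68, heatmap_len: int = 64, dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(2 * heatmap_len, dim)        # 1D x- and y-heatmaps -> token
        self.landmark_pos = nn.Parameter(torch.zeros(num_landmarks, dim))
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                   dim_feedforward=2 * dim,
                                                   batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=2)
        self.temporal = nn.GRUCell(dim, dim)                 # recurrent mixing over time
        self.head = nn.Linear(dim, 2)                        # token -> (x, y) coordinate

    def forward(self, heatmaps_1d: torch.Tensor) -> torch.Tensor:
        # heatmaps_1d: (B, T, L, 2 * heatmap_len) per-frame 1D landmark representations.
        B, T, L, _ = heatmaps_1d.shape
        state = torch.zeros(B * L, self.landmark_pos.shape[1], device=heatmaps_1d.device)
        coords = []
        for t in range(T):
            tokens = self.embed(heatmaps_1d[:, t]) + self.landmark_pos   # (B, L, D)
            tokens = self.spatial(tokens)                                # landmark interactions
            state = self.temporal(tokens.reshape(B * L, -1), state)      # carry over frames
            coords.append(self.head(state).reshape(B, L, 2))
        return torch.stack(coords, dim=1)                                # (B, T, L, 2)

# Toy run: 4-frame clip, 68 landmarks, 64-bin 1D heatmaps per axis.
model = Simple1DLandmarkTracker()
clip = torch.rand(2, 4, 68, 128)
print(model(clip).shape)  # torch.Size([2, 4, 68, 2])
```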

DINO-Mix: Enhancing Visual Place Recognition with Foundational Vision Model and Feature Mixing

  • paper_url: http://arxiv.org/abs/2311.00230
  • repo_url: None
  • paper_authors: Gaoshuang Huang, Yang Zhou, Xiaofei Hu, Chenglong Zhang, Luying Zhao, Wenjian Gan, Mingbo Hou
  • for: 本研究旨在提高现实世界中的视觉位置识别(VPR)技术的精度和可靠性,使其能够在复杂的环境下(包括光照变化、季节变化和遮挡物)提供高精度的位置识别结果。
  • methods: 本研究以DINOv2模型为骨干网络,经裁剪与微调后提取图像特征;提出了一种名为DINO-Mix的新型VPR架构,将基础视觉模型与特征聚合模块相结合,并采用基于MLP-Mixer的混合模块对图像特征进行聚合,得到全局鲁棒且可泛化的描述子,实现高精度的位置识别。
  • results: 在包含光照变化、季节变化和遮挡的测试集(Tokyo24/7、Nordland、SF-XL-Testv1)上,所提出的DINO-Mix架构的Top-1准确率分别达到91.75%、80.18%和82%;与现有最先进(SOTA)方法相比,平均精度提高了5.14%。
    Abstract Utilizing visual place recognition (VPR) technology to ascertain the geographical location of publicly available images is a pressing issue for real-world VPR applications. Although most current VPR methods achieve favorable results under ideal conditions, their performance in complex environments, characterized by lighting variations, seasonal changes, and occlusions caused by moving objects, is generally unsatisfactory. In this study, we utilize the DINOv2 model as the backbone network for trimming and fine-tuning to extract robust image features. We propose a novel VPR architecture called DINO-Mix, which combines a foundational vision model with feature aggregation. This architecture relies on the powerful image feature extraction capabilities of foundational vision models. We employ an MLP-Mixer-based mix module to aggregate image features, resulting in globally robust and generalizable descriptors that enable high-precision VPR. We experimentally demonstrate that the proposed DINO-Mix architecture significantly outperforms current state-of-the-art (SOTA) methods. In test sets having lighting variations, seasonal changes, and occlusions (Tokyo24/7, Nordland, SF-XL-Testv1), our proposed DINO-Mix architecture achieved Top-1 accuracy rates of 91.75%, 80.18%, and 82%, respectively. Compared with SOTA methods, our architecture exhibited an average accuracy improvement of 5.14%.
    摘要 利用视觉位置识别(VPR)技术确定公开图像的地理位置,是现实世界VPR应用中亟待解决的问题。尽管当前多数VPR方法在理想条件下能取得较好的效果,但在光照变化、季节更替以及运动物体遮挡等复杂环境中的表现通常并不理想。本研究以DINOv2模型为骨干网络,经裁剪与微调后用于提取鲁棒的图像特征,并提出一种名为DINO-Mix的新型VPR架构,将基础视觉模型与特征聚合相结合。该架构依托基础视觉模型强大的图像特征提取能力,采用基于MLP-Mixer的混合模块聚合图像特征,得到全局鲁棒且可泛化的描述子,从而实现高精度的VPR。实验表明,所提出的DINO-Mix架构显著优于当前最先进(SOTA)方法:在包含光照变化、季节变化和遮挡的测试集(Tokyo24/7、Nordland、SF-XL-Testv1)上,Top-1准确率分别达到91.75%、80.18%和82%;与SOTA方法相比,平均精度提高了5.14%。
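
A minimal sketch of the feature-mixing step described above: ViT patch tokens (in practice from a DINOv2 backbone) are passed through an MLP-Mixer-style block, pooled, projected, and L2-normalized into a global place descriptor. The mixer depth, token count, and output dimension are illustrative assumptions, and the commented torch.hub call is an assumption about the public DINOv2 entry point rather than the paper's exact pipeline.

```python
# A minimal sketch of MLP-Mixer-style aggregation of backbone patch tokens into a single
# L2-normalized place descriptor. Dimensions and depth are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixerBlock(nn.Module):
    """Token-mixing + channel-mixing MLPs over (tokens, channels)."""
    def __init__(self, num_tokens: int, dim: int, token_hidden: int = 256,
                 channel_hidden: int = 1024):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, token_hidden), nn.GELU(),
                                       nn.Linear(token_hidden, num_tokens))
        self.channel_mlp = nn.Sequential(nn.Linear(dim, channel_hidden), nn.GELU(),
                                         nn.Linear(channel_hidden, dim))

    def forward(self, x):                         # x: (B, N, D) patch tokens
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

class MixAggregator(nn.Module):
    def __init__(self, num_tokens: int, dim: int, out_dim: int = 512, depth: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(num_tokens, dim) for _ in range(depth)])
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, tokens):                    # tokens: (B, N, D)
        x = self.blocks(tokens).mean(dim=1)       # pool over patch tokens
        return F.normalize(self.proj(x), dim=-1)  # L2-normalized place descriptor

# In practice the tokens would come from a DINOv2 backbone, e.g. (assumed hub entry):
#   backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
#   tokens = backbone.forward_features(images)["x_norm_patchtokens"]
# Random tokens stand in here so the sketch runs without downloading weights.
tokens = torch.randn(2, 256, 768)                 # 2 images, 16x16 patch grid, ViT-B width
descriptor = MixAggregator(num_tokens=256, dim=768)(tokens)
print(descriptor.shape)                           # torch.Size([2, 512])
# Retrieval then matches places by cosine similarity between query and database descriptors.
```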