cs.CV - 2023-09-27

Diagnosis of Helicobacter pylori using AutoEncoders for the Detection of Anomalous Staining Patterns in Immunohistochemistry Images

  • paper_url: http://arxiv.org/abs/2309.16053
  • repo_url: None
  • paper_authors: Pau Cano, Álvaro Caravaca, Debora Gil, Eva Musulen
  • for: Detection of Helicobacter pylori, a bacterium classified since 1994 as a class 1 carcinogen to humans
  • methods: An autoencoder learns latent patterns of healthy tissue from immunohistochemistry images and detects H. pylori as an anomaly in image staining, without requiring image annotations
  • results: 91% accuracy, 86% sensitivity, 96% specificity, and 0.97 AUC in the detection of H. pylori
    Abstract This work addresses the detection of Helicobacter pylori, a bacterium classified since 1994 as class 1 carcinogen to humans. Owing to its high specificity and sensitivity, the preferred diagnosis technique is the analysis of histological images with immunohistochemical staining, a process in which certain stained antibodies bind to antigens of the biological element of interest. This analysis is a time demanding task, which is currently done by an expert pathologist who visually inspects the digitized samples. We propose to use autoencoders to learn latent patterns of healthy tissue and detect H. pylori as an anomaly in image staining. Unlike existing classification approaches, an autoencoder is able to learn patterns in an unsupervised manner (without the need of image annotations) with high performance. In particular, our model has an overall 91% accuracy with 86% sensitivity, 96% specificity and 0.97 AUC in the detection of H. pylori.
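A minimal sketch of the anomaly-detection idea behind this paper: train a convolutional autoencoder only on healthy-tissue patches and flag patches whose reconstruction error exceeds a threshold as containing H. pylori. The architecture, patch size, and threshold below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Small convolutional autoencoder for RGB tissue patches (illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model, patches):
    """Per-patch reconstruction error; high error suggests anomalous staining."""
    with torch.no_grad():
        recon = model(patches)
        return ((patches - recon) ** 2).mean(dim=(1, 2, 3))

# Usage sketch: train on healthy patches only, then threshold the score.
model = ConvAutoencoder()
test = torch.rand(4, 3, 64, 64)   # stand-in for patches to screen
scores = anomaly_scores(model, test)
threshold = 0.05                  # assumed; would be calibrated on validation data
flags = scores > threshold
```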

Handbook on Leveraging Lines for Two-View Relative Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.16040
  • repo_url: None
  • paper_authors: Petr Hruby, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, Daniel Barath
  • for: The paper proposes a method for estimating the relative pose between calibrated image pairs by jointly exploiting points, lines, and their coincidences in a hybrid manner.
  • methods: The method combines the advantages of all possible configurations where these data modalities can be used together, and reviews the minimal solvers available in the literature.
  • results: Experiments on various indoor and outdoor datasets show that the proposed approach outperforms point-based methods, improving AUC@10$^\circ$ by 1-7 points while running at comparable speeds.
    Abstract We propose an approach for estimating the relative pose between calibrated image pairs by jointly exploiting points, lines, and their coincidences in a hybrid manner. We investigate all possible configurations where these data modalities can be used together and review the minimal solvers available in the literature. Our hybrid framework combines the advantages of all configurations, enabling robust and accurate estimation in challenging environments. In addition, we design a method for jointly estimating multiple vanishing point correspondences in two images, and a bundle adjustment that considers all relevant data modalities. Experiments on various indoor and outdoor datasets show that our approach outperforms point-based methods, improving AUC@10$^\circ$ by 1-7 points while running at comparable speeds. The source code of the solvers and hybrid framework will be made public.

Q-REG: End-to-End Trainable Point Cloud Registration with Surface Curvature

  • paper_url: http://arxiv.org/abs/2309.16023
  • repo_url: None
  • paper_authors: Shengze Jin, Daniel Barath, Marc Pollefeys, Iro Armeni
  • for: End-to-end trainable point cloud registration that optimizes the pose error directly, rather than only the correspondence matching objective
  • methods: Q-REG uses rich geometric information (surface curvature) to estimate the rigid pose from a single correspondence, formalizing robust estimation as an exhaustive search so that correspondence matching and rigid pose estimation can be trained jointly end-to-end
  • results: Q-REG is agnostic to the correspondence matching method, provides consistent improvements both when used only at inference and in end-to-end training, and sets a new state of the art on the 3DMatch, KITTI, and ModelNet benchmarks
    Abstract Point cloud registration has seen recent success with several learning-based methods that focus on correspondence matching and, as such, optimize only for this objective. Following the learning step of correspondence matching, they evaluate the estimated rigid transformation with a RANSAC-like framework. While it is an indispensable component of these methods, it prevents a fully end-to-end training, leaving the objective to minimize the pose error nonserved. We present a novel solution, Q-REG, which utilizes rich geometric information to estimate the rigid pose from a single correspondence. Q-REG allows to formalize the robust estimation as an exhaustive search, hence enabling end-to-end training that optimizes over both objectives of correspondence matching and rigid pose estimation. We demonstrate in the experiments that Q-REG is agnostic to the correspondence matching method and provides consistent improvement both when used only in inference and in end-to-end training. It sets a new state-of-the-art on the 3DMatch, KITTI, and ModelNet benchmarks.

GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization

  • paper_url: http://arxiv.org/abs/2309.16020
  • repo_url: None
  • paper_authors: Vicente Vivanco Cepeda, Gaurav Kumar Nayak, Mubarak Shah
  • for: Pinpointing the precise GPS location of images taken anywhere on Earth (worldwide geo-localization)
  • methods: A CLIP-inspired image-to-GPS retrieval approach whose location encoder models the Earth as a continuous function, using random Fourier feature positional encoding and a hierarchical representation that captures information at varying resolutions, and aligns image features with their corresponding GPS locations
  • results: Extensive experiments and ablations on benchmark datasets demonstrate the method's effectiveness; it achieves competitive performance with just 20% of the training data, and geo-localization from text queries is demonstrated qualitatively via the CLIP backbone
    Abstract Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. However, their performance is limited by the predefined classes and often results in inaccurate localizations when an image's location significantly deviates from its class center. To overcome these limitations, we propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations. GeoCLIP's location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization. To the best of our knowledge, this is the first work employing GPS encoding for geo-localization. We demonstrate the efficacy of our method via extensive experiments and ablations on benchmark datasets. We achieve competitive performance with just 20% of training data, highlighting its effectiveness even in limited-data settings. Furthermore, we qualitatively demonstrate geo-localization using a text query by leveraging CLIP backbone of our image encoder.
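A minimal sketch of random Fourier feature positional encoding for GPS coordinates, the ingredient the location encoder builds on. The feature dimension, frequency scale, and use of plain (lat, lon) degrees are assumptions for illustration; the paper's hierarchical multi-resolution design is not reproduced here.

```python
import numpy as np

def random_fourier_features(coords, num_features=64, sigma=1.0, seed=0):
    """Map 2D GPS coordinates (lat, lon) to a smooth high-dimensional embedding.

    coords: array of shape (N, 2). Returns an array of shape (N, 2 * num_features).
    """
    rng = np.random.default_rng(seed)
    # Random projection directions; sigma controls the spatial frequency scale.
    W = rng.normal(scale=sigma, size=(2, num_features))
    proj = coords @ W  # (N, num_features)
    return np.concatenate([np.sin(2 * np.pi * proj), np.cos(2 * np.pi * proj)], axis=1)

# Usage sketch: embed a GPS location before feeding it to an MLP location encoder.
gps = np.array([[48.8566, 2.3522]])          # Paris (lat, lon), for illustration
embedding = random_fourier_features(gps)     # shape (1, 128)
```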

Assessment of Local Climate Zone Products via Simplified Classification Rule with 3D Building Maps

  • paper_url: http://arxiv.org/abs/2309.15978
  • repo_url: None
  • paper_authors: Hunsoo Song, Gaia Cervini, Jinha Jung
  • for: Assessing the performance of a global Local Climate Zone (LCZ) product on built-type classes in three major U.S. metropolitan areas
  • methods: A reference LCZ map is constructed with a simple rule-based classification method using high-resolution 3D building maps, and the global product is compared against it
  • results: The global LCZ product struggles to differentiate classes that demand precise building footprint information (Classes 6 and 9) and classes that require identifying subtle differences in building elevation (Classes 4-6); class distributions also skew differently across cities, suggesting a data distribution shift problem in the machine learning-based LCZ classifier
    Abstract This study assesses the performance of a global Local Climate Zone (LCZ) product. We examined the built-type classes of LCZs in three major metropolitan areas within the U.S. A reference LCZ was constructed using a simple rule-based method based on high-resolution 3D building maps. Our evaluation demonstrated that the global LCZ product struggles to differentiate classes that demand precise building footprint information (Classes 6 and 9), and classes that necessitate the identification of subtle differences in building elevation (Classes 4-6). Additionally, we identified inconsistent tendencies, where the distribution of classes skews differently across different cities, suggesting the presence of a data distribution shift problem in the machine learning-based LCZ classifier. Our findings shed light on the uncertainties in global LCZ maps, help identify the LCZ classes that are the most challenging to distinguish, and offer insight into future plans for LCZ development and validation.

Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields

  • paper_url: http://arxiv.org/abs/2309.15977
  • repo_url: None
  • paper_authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
  • for: Parameterizing audio scenes with neural fields to render more realistic room impulse responses (RIRs)
  • methods: The Neural Acoustic Context Field (NACF) leverages multiple acoustic contexts such as geometry, material properties, and spatial information, and adds a temporal correlation module and a multi-scale energy decay criterion tailored to the temporal un-smoothness and monotonic energy attenuation of RIRs
  • results: Experiments show that NACF outperforms existing field-based methods by a notable margin
    Abstract Room impulse response (RIR), which measures the sound propagation within an environment, is critical for synthesizing high-fidelity audio for a given environment. Some prior work has proposed representing RIR as a neural field function of the sound emitter and receiver positions. However, these methods do not sufficiently consider the acoustic properties of an audio scene, leading to unsatisfactory performance. This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene by leveraging multiple acoustic contexts, such as geometry, material property, and spatial information. Driven by the unique properties of RIR, i.e., temporal un-smoothness and monotonic energy attenuation, we design a temporal correlation module and multi-scale energy decay criterion. Experimental results show that NACF outperforms existing field-based methods by a notable margin. Please visit our project page for more qualitative results.

The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering

  • paper_url: http://arxiv.org/abs/2309.15954
  • repo_url: None
  • paper_authors: Haichao Yu, Yu Tian, Sateesh Kumar, Linjie Yang, Heng Wang
  • for: Evaluating and improving data filtering methods for pre-training foundation models, in the context of the DataComp benchmark
  • methods: A three-stage filtering strategy (single-modality filtering, cross-modality filtering, and data distribution alignment) that integrates existing methods with new solutions, such as computing the CLIP score on horizontally flipped images to mitigate interference from scene text, using vision and language models to retrieve training samples for target downstream tasks, and rebalancing the data distribution to allocate the computational budget more efficiently
  • results: Outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet
    Abstract The quality of pre-training data plays a critical role in the performance of foundation models. Popular foundation models often design their own recipe for data filtering, which makes it hard to analyze and compare different data filtering approaches. DataComp is a new benchmark dedicated to evaluating different methods for data filtering. This paper describes our learning and solution when participating in the DataComp challenge. Our filtering strategy includes three stages: single-modality filtering, cross-modality filtering, and data distribution alignment. We integrate existing methods and propose new solutions, such as computing CLIP score on horizontally flipped images to mitigate the interference of scene text, using vision and language models to retrieve training samples for target downstream tasks, rebalancing the data distribution to improve the efficiency of allocating the computational budget, etc. We slice and dice our design choices, provide in-depth analysis, and discuss open questions. Our approach outperforms the best method from the DataComp paper by over 4% on the average performance of 38 tasks and by over 2% on ImageNet.
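A minimal sketch of the flipped-image CLIP score idea: horizontally flipping an image destroys readable scene text, so scoring the flipped image reduces the chance that a high image-text similarity comes only from text rendered in the image. The encoder interfaces and the choice to keep the smaller of the two scores are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def flipped_clip_score(image, text_emb, image_encoder):
    """Cosine similarity between the caption embedding and the embedding of the
    horizontally flipped image (a proxy CLIP score that down-weights scene text).

    image: tensor of shape (3, H, W); text_emb: tensor of shape (D,);
    image_encoder: any callable mapping a (1, 3, H, W) batch to (1, D) features.
    """
    flipped = torch.flip(image, dims=[-1])            # horizontal flip
    img_emb = image_encoder(flipped.unsqueeze(0))[0]
    return F.cosine_similarity(img_emb, text_emb, dim=0)

def filtering_score(image, text_emb, image_encoder):
    """Assumed combination rule: keep the more conservative of the two scores."""
    original = F.cosine_similarity(
        image_encoder(image.unsqueeze(0))[0], text_emb, dim=0)
    flipped = flipped_clip_score(image, text_emb, image_encoder)
    return torch.minimum(original, flipped)
```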

AutoEncoding Tree for City Generation and Applications

  • paper_url: http://arxiv.org/abs/2309.15941
  • repo_url: None
  • paper_authors: Wenyu Han, Congcong Wen, Lazarus Chok, Yan Liang Tan, Sheung Lung Chan, Hang Zhao, Chen Feng
  • for: Generative modeling of cities, addressing the huge volume of spatial data and the scarcity of publicly available 3D real-world city datasets (over 3,000,000 geo-referenced objects are collected for New York, Zurich, Tokyo, Berlin, Boston, and other large cities)
  • methods: A Spatial-Geometric Distance (SGD) metric measures the similarity between building layouts; a binary tree is built over the raw building geometry based on SGD, and a tree-structured autoencoder (AETree) iteratively extracts and merges spatial information bottom-up, using LSTM cells to cope with long-range dependencies as the tree deepens, with the global representation decoded for reconstruction or generation
  • results: Experiments show AETree is effective for 2D and 3D city generation (evaluated with a new Overlapping Area Ratio metric), and the learned latent features can serve downstream urban planning applications
    Abstract City modeling and generation have attracted an increased interest in various applications, including gaming, urban planning, and autonomous driving. Unlike previous works focused on the generation of single objects or indoor scenes, the huge volumes of spatial data in cities pose a challenge to the generative models. Furthermore, few publicly available 3D real-world city datasets also hinder the development of methods for city generation. In this paper, we first collect over 3,000,000 geo-referenced objects for the city of New York, Zurich, Tokyo, Berlin, Boston and several other large cities. Based on this dataset, we propose AETree, a tree-structured auto-encoder neural network, for city generation. Specifically, we first propose a novel Spatial-Geometric Distance (SGD) metric to measure the similarity between building layouts and then construct a binary tree over the raw geometric data of building based on the SGD metric. Next, we present a tree-structured network whose encoder learns to extract and merge spatial information from bottom-up iteratively. The resulting global representation is reversely decoded for reconstruction or generation. To address the issue of long-dependency as the level of the tree increases, a Long Short-Term Memory (LSTM) Cell is employed as a basic network element of the proposed AETree. Moreover, we introduce a novel metric, Overlapping Area Ratio (OAR), to quantitatively evaluate the generation results. Experiments on the collected dataset demonstrate the effectiveness of the proposed model on 2D and 3D city generation. Furthermore, the latent features learned by AETree can serve downstream urban planning applications.
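A minimal sketch of building a binary tree over building layouts by greedily merging the two most similar nodes, in the spirit of the tree construction described above. The placeholder distance below is Euclidean on building centroids, standing in for the paper's Spatial-Geometric Distance, and the greedy pairing scheme is an assumption for illustration.

```python
import numpy as np

def build_binary_tree(items, distance):
    """Greedily merge the closest pair of nodes until one root remains.

    items: list of leaf payloads (e.g., building centroids).
    distance: callable taking two node representatives and returning a float.
    Returns nested tuples representing the binary tree.
    """
    nodes = [(item, np.asarray(item, dtype=float)) for item in items]  # (subtree, representative)
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = distance(nodes[i][1], nodes[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged_repr = (nodes[i][1] + nodes[j][1]) / 2.0   # parent representative: mean of children
        merged_tree = (nodes[i][0], nodes[j][0])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [(merged_tree, merged_repr)]
    return nodes[0][0]

# Usage sketch with 2D building centroids and Euclidean distance as a stand-in for SGD.
centroids = [(0.0, 0.0), (1.0, 0.2), (5.0, 5.0), (5.5, 4.8)]
tree = build_binary_tree(centroids, lambda a, b: float(np.linalg.norm(a - b)))
```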

Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

  • paper_url: http://arxiv.org/abs/2309.15940
  • repo_url: https://github.com/changhaonan/ovsg
  • paper_authors: Haonan Chang, Kowndinya Boyalakuntla, Shiyang Lu, Siwei Cai, Eric Jing, Shreesh Keskar, Shijie Geng, Adeeb Abbas, Lifeng Zhou, Kostas Bekris, Abdeslam Boularias
  • for: An Open-Vocabulary 3D Scene Graph (OVSG) for grounding entities such as object instances, agents, and regions with free-form text queries
  • methods: Context-aware entity localization from free-form, open-vocabulary queries (e.g., "pick up a cup on a kitchen table" or "navigate to a sofa on which someone is sitting"), in contrast to conventional semantic-based object localization
  • results: Comparative experiments on the ScanNet dataset and a self-collected dataset show the proposed approach significantly surpasses previous semantic-based localization techniques, and its practical use is demonstrated in real-world robot navigation and manipulation experiments
    Abstract We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as ``pick up a cup on a kitchen table" or ``navigate to a sofa on which someone is sitting". In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying. Through a series of comparative experiments using the ScanNet dataset and a self-collected dataset, we demonstrate that our proposed approach significantly surpasses the performance of previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments.

Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts

  • paper_url: http://arxiv.org/abs/2309.15915
  • repo_url: https://github.com/engindeniz/vitis
  • paper_authors: Deniz Engin, Yannis Avrithis
  • for: Adapting large-scale pretrained vision-language models to video question answering with limited data, where overfitting, catastrophic forgetting, and the cross-modal gap between vision and language are the main challenges
  • methods: A parameter-efficient method combining multimodal prompt learning with a transformer-based mapping network, while keeping the pretrained models frozen
  • results: Experiments on several video question answering benchmarks demonstrate superior performance and parameter efficiency in both zero-shot and few-shot settings; code is available at https://engindeniz.github.io/vitis
    Abstract Recent vision-language models are driven by large-scale pretrained models. However, adapting pretrained models on limited data presents challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language. We introduce a parameter-efficient method to address these challenges, combining multimodal prompt learning and a transformer-based mapping network, while keeping the pretrained models frozen. Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency on both zero-shot and few-shot settings. Our code is available at https://engindeniz.github.io/vitis.

Exploiting the Signal-Leak Bias in Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.15842
  • repo_url: None
  • paper_authors: Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta
  • for: Characterizing and exploiting the signal-leak bias in the inference pipeline of diffusion models to gain more control over the generated images
  • methods: The distribution of the signal leak is modeled in the spatial frequency and pixel domains, and a signal leak is included in the initial latent at inference time, without any additional training
  • results: Generated images better match expected results, with more varied brightness and closer adherence to a desired style or color; the bias is shown to be particularly significant when models are tuned to a specific style
    Abstract There is a bias in the inference pipeline of most diffusion models. This bias arises from a signal leak whose distribution deviates from the noise distribution, creating a discrepancy between training and inference processes. We demonstrate that this signal-leak bias is particularly significant when models are tuned to a specific style, causing sub-optimal style matching. Recent research tries to avoid the signal leakage during training. We instead show how we can exploit this signal-leak bias in existing diffusion models to allow more control over the generated images. This enables us to generate images with more varied brightness, and images that better match a desired style or color. By modeling the distribution of the signal leak in the spatial frequency and pixel domains, and including a signal leak in the initial latent, we generate images that better match expected results without any additional training.
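A minimal sketch of injecting a signal leak into the initial latent at inference time: instead of starting the reverse diffusion from pure Gaussian noise, a small low-frequency "leak" (here, a flat target color) is mixed in. The mixing rule, leak strength, and latent shape are assumptions for illustration, not the paper's fitted leak distribution.

```python
import torch

def initial_latent_with_leak(shape, target_color, leak_strength=0.1, generator=None):
    """Build the starting latent for the reverse diffusion process.

    shape: (batch, channels, height, width) of the latent.
    target_color: tensor of shape (channels,) giving the desired mean signal
        (e.g., an encoded color or the low-frequency content of a style image).
    """
    noise = torch.randn(shape, generator=generator)
    leak = target_color.view(1, -1, 1, 1).expand(shape)
    # Mix a small deterministic signal into the otherwise pure-noise latent.
    return noise + leak_strength * leak

# Usage sketch: bias generations toward a darker overall image (assumed 4-channel latent).
latent = initial_latent_with_leak((1, 4, 64, 64),
                                  target_color=torch.tensor([-1.0, 0.0, 0.0, 0.0]))
```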

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2309.15818
  • repo_url: https://github.com/showlab/show-1
  • paper_authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou
  • for: Text-to-video generation with Show-1, a hybrid model that marries pixel-based and latent-based video diffusion models (VDMs)
  • methods: A pixel-based VDM first produces a low-resolution video with strong text-video correlation; a novel expert translation method then employs a latent-based VDM to upsample the low-resolution video to high resolution
  • results: Compared to latent VDMs, Show-1 produces high-quality videos with precise text-video alignment; compared to pixel VDMs, it is much more efficient (15 GB vs. 72 GB of GPU memory during inference); the model is also validated on standard video generation benchmarks
    Abstract Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video of strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos of precise text-video alignment; Compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15G vs 72G). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.

Convolutional Networks with Oriented 1D Kernels

  • paper_url: http://arxiv.org/abs/2309.15812
  • repo_url: https://github.com/princeton-vl/oriented1d
  • paper_authors: Alexandre Kirchmeyer, Jia Deng
  • for: Asking whether a ConvNet can work without 2D convolutions
  • methods: ConvNets built entirely from 1D convolutions, where the key ingredient is oriented 1D kernels: 1D kernels oriented not just horizontally or vertically but also at other angles, backed by a highly optimized custom CUDA implementation for the depthwise setting
  • results: Oriented 1D convolutions can replace 2D convolutions with no loss in ImageNet accuracy and can augment existing architectures with large kernels, improving accuracy with minimal FLOPs increase; the CUDA kernel is faster than a native horizontal convolution at any angle
    Abstract In computer vision, 2D convolution is arguably the most important operation performed by a ConvNet. Unsurprisingly, it has been the focus of intense software and hardware optimization and enjoys highly efficient implementations. In this work, we ask an intriguing question: can we make a ConvNet work without 2D convolutions? Surprisingly, we find that the answer is yes -- we show that a ConvNet consisting entirely of 1D convolutions can do just as well as 2D on ImageNet classification. Specifically, we find that one key ingredient to a high-performing 1D ConvNet is oriented 1D kernels: 1D kernels that are oriented not just horizontally or vertically, but also at other angles. Our experiments show that oriented 1D convolutions can not only replace 2D convolutions but also augment existing architectures with large kernels, leading to improved accuracy with minimal FLOPs increase. A key contribution of this work is a highly-optimized custom CUDA implementation of oriented 1D kernels, specialized to the depthwise convolution setting. Our benchmarks demonstrate that our custom CUDA implementation almost perfectly realizes the theoretical advantage of 1D convolution: it is faster than a native horizontal convolution for any arbitrary angle. Code is available at https://github.com/princeton-vl/Oriented1D.
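A minimal sketch of one way to realize an oriented 1D kernel: embed a 1D set of weights along a discretized line at angle theta inside a 2D depthwise convolution, masking out all off-line positions. This masking construction and the parameter choices are illustrative assumptions and do not reflect the paper's optimized CUDA implementation.

```python
import math
import torch
import torch.nn as nn

class Oriented1DConv(nn.Module):
    """Depthwise convolution whose kernel is nonzero only along a line at angle theta."""
    def __init__(self, channels, kernel_size=7, theta_degrees=45.0):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        # Build a binary mask selecting the cells closest to the oriented line.
        mask = torch.zeros(kernel_size, kernel_size)
        c = kernel_size // 2
        theta = math.radians(theta_degrees)
        for i in range(-c, c + 1):
            row = c + round(i * math.sin(theta))
            col = c + round(i * math.cos(theta))
            mask[row, col] = 1.0
        self.register_buffer("mask", mask.view(1, 1, kernel_size, kernel_size))

    def forward(self, x):
        # Apply the mask at every forward pass so only the oriented line contributes.
        return nn.functional.conv2d(x, self.conv.weight * self.mask,
                                    padding=self.conv.padding, groups=self.conv.groups)

# Usage sketch: a 45-degree oriented 1D depthwise convolution on a feature map.
layer = Oriented1DConv(channels=16, kernel_size=7, theta_degrees=45.0)
out = layer(torch.randn(1, 16, 32, 32))
```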

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

  • paper_url: http://arxiv.org/abs/2309.15807
  • repo_url: None
  • paper_authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
  • for: Training text-to-image models on web-scale image-text pairs yields a wide range of visual concepts but not necessarily highly aesthetic images; the goal is aesthetic alignment after pre-training
  • methods: Quality-tuning: supervised fine-tuning of a latent diffusion model (pre-trained on 1.1 billion image-text pairs) with only a few thousand carefully selected, exceptionally visually appealing images
  • results: The resulting model, Emu, achieves a win rate of 82.9% against its pre-trained-only counterpart and is preferred over SDXLv1.0 68.4% and 71.3% of the time on visual appeal on PartiPrompts and an Open User Input benchmark; quality-tuning also generalizes to pixel diffusion and masked generative transformer models
    Abstract Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred $68.4\%$ and $71.3\%$ of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.

A Quantum-Classical Hybrid Block-Matching Algorithm in Noisy Environment using Dissimilarity Measure

  • paper_url: http://arxiv.org/abs/2309.15792
  • repo_url: None
  • paper_authors: M. Martínez-Felipe, J. Montiel-Pérez, V. Onofre-González, A. Maldonado-Romo, Ricky Young
  • for: Block matching: finding groups of similar image patches inside a search area, as needed in video compression, image clustering, vector quantization, and nonlocal noise reduction
  • methods: Classical preprocessing (Gaussian noise, image size reduction with a low-pass filter or domain transform) and a hierarchical search, with images encoded by a phase operator; a dissimilarity measure is built from phase image coding with the quantum Fourier transform and the swap test
  • results: Results are obtained with ideal and noisy simulations, and the swap test is also run on IBM and IonQ quantum devices
    Abstract A block-matching algorithm finds a group of similar image patches inside a search area. Similarity/dissimilarity measures can help to solve this problem. In different practical applications, finding groups of similar image blocks within an ample search area is often necessary, such as video compression, image clustering, vector quantization, and nonlocal noise reduction. In this work, classical image processing is performed using Gaussian noise and image size reduction with a fit of a Low-Pass Filter or Domain Transform. A hierarchical search technique is implemented to encode the images by phase operator. Using phase image coding with the quantum Fourier transform and the Swap test, we propose a dissimilarity measure. Results were obtained with perfect and noisy simulations and in the case of the Swap test with the IBM and Ionq quantum devices.
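A minimal sketch of the dissimilarity measure one can derive from a swap test: the probability of measuring the ancilla in |0> is (1 + |<psi|phi>|^2)/2, so 1 - |<psi|phi>|^2 can serve as a dissimilarity between two encoded image blocks. The simulation below computes these quantities classically from normalized state vectors; the paper's specific phase encoding of image blocks is not reproduced, and the two example states are placeholders.

```python
import numpy as np

def swap_test_dissimilarity(psi, phi):
    """Dissimilarity 1 - |<psi|phi>|^2 between two normalized quantum states.

    A swap test estimates this via P(ancilla = 0) = (1 + |<psi|phi>|^2) / 2.
    """
    psi = psi / np.linalg.norm(psi)
    phi = phi / np.linalg.norm(phi)
    overlap_sq = abs(np.vdot(psi, phi)) ** 2
    p_ancilla_zero = 0.5 * (1.0 + overlap_sq)   # what the swap test actually measures
    return 1.0 - (2.0 * p_ancilla_zero - 1.0)   # equals 1 - |<psi|phi>|^2

# Usage sketch: two 4-dimensional states standing in for phase-encoded image blocks.
block_a = np.exp(1j * np.array([0.0, 0.5, 1.0, 1.5]))
block_b = np.exp(1j * np.array([0.0, 0.6, 1.1, 1.4]))
print(swap_test_dissimilarity(block_a, block_b))
```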

Partial Transport for Point-Cloud Registration

  • paper_url: http://arxiv.org/abs/2309.15787
  • repo_url: None
  • paper_authors: Yikun Bai, Huy Tran, Steven B. Damelin, Soheil Kolouri
  • for: Non-rigid point-cloud registration under partial visibility (occlusions, sensor noise), a core problem in robotics, computer graphics, and medical imaging
  • methods: Point clouds are treated as empirical measures and registration is formulated with optimal transport and its unbalanced variants; a family of non-rigid registration methods is built on the optimal partial transport problem, and slicing via efficient solvers for the one-dimensional optimal partial transport problem yields significant computational speedups
  • results: The resulting methods are fast and robust, and their effectiveness is demonstrated against baselines on various 3D and 2D non-rigid registration problems where the source and target point clouds are corrupted by random noise
    Abstract Point cloud registration plays a crucial role in various fields, including robotics, computer graphics, and medical imaging. This process involves determining spatial relationships between different sets of points, typically within a 3D space. In real-world scenarios, complexities arise from non-rigid movements and partial visibility, such as occlusions or sensor noise, making non-rigid registration a challenging problem. Classic non-rigid registration methods are often computationally demanding, suffer from unstable performance, and, importantly, have limited theoretical guarantees. The optimal transport problem and its unbalanced variations (e.g., the optimal partial transport problem) have emerged as powerful tools for point-cloud registration, establishing a strong benchmark in this field. These methods view point clouds as empirical measures and provide a mathematically rigorous way to quantify the `correspondence' between (the transformed) source and target points. In this paper, we approach the point-cloud registration problem through the lens of optimal transport theory and first propose a comprehensive set of non-rigid registration methods based on the optimal partial transportation problem. Subsequently, leveraging the emerging work on efficient solutions to the one-dimensional optimal partial transport problem, we extend our proposed algorithms via slicing to gain significant computational efficiency, resulting in fast and robust non-rigid registration algorithms. We demonstrate the effectiveness of our proposed methods and compare them against baselines on various 3D and 2D non-rigid registration problems where the source and target point clouds are corrupted by random noise.

One For All: Video Conversation is Feasible Without Video Instruction Tuning

  • paper_url: http://arxiv.org/abs/2309.15785
  • repo_url: https://github.com/farewellthree/BT-Adapter
  • paper_authors: Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
  • for: Giving existing image-language conversation models strong video understanding without video instruction tuning and without excessive GPU cost
  • methods: The Branching Temporal Adapter (BT-Adapter) is a plug-and-use temporal modeling branch alongside the frozen pretrained visual encoder, trained once with an asymmetric token masking strategy and tailor-made training tasks, after which it can be integrated into any image conversation model built on the same CLIP
  • results: State-of-the-art zero-shot results on various video tasks with thousands of fewer GPU hours, better performance than current video chatbots without any video instruction tuning, and state-of-the-art video chatting results with video instruction tuning, outperforming previous SOTAs by a large margin
    Abstract The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while keeping the backbone frozen. Just pretrained once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands of fewer GPU hours. (2) better performance than current video chatbots without any video instruction tuning. (3) state-of-the-art results of video chatting using video instruction tuning, outperforming previous SOTAs by a large margin.

Joint-YODNet: A Light-weight Object Detector for UAVs to Achieve Above 100fps

  • paper_url: http://arxiv.org/abs/2309.15782
  • repo_url: None
  • paper_authors: Vipin Gautam, Shitala Prasad, Sharad Sinha
  • for: Improving the accuracy of small object detection in UAV imagery while running above 100 fps
  • methods: JointYODNet, a lightweight detector built around a joint loss function specifically designed for small objects; different formulations of the loss were evaluated to find the most effective one
  • results: The proposed joint loss outperforms existing methods, achieving a recall of 0.971, an F1 score of 0.975, and 98.6% mAP@.5 under varying environmental conditions and scales
    Abstract Small object detection via UAV (Unmanned Aerial Vehicle) images captured from drones and radar is a complex task with several formidable challenges. This domain encompasses numerous complexities that impede the accurate detection and localization of small objects. To address these challenges, we propose a novel method called JointYODNet for UAVs to detect small objects, leveraging a joint loss function specifically designed for this task. Our method revolves around the development of a joint loss function tailored to enhance the detection performance of small objects. Through extensive experimentation on a diverse dataset of UAV images captured under varying environmental conditions, we evaluated different variations of the loss function and determined the most effective formulation. The results demonstrate that our proposed joint loss function outperforms existing methods in accurately localizing small objects. Specifically, our method achieves a recall of 0.971, and a F1Score of 0.975, surpassing state-of-the-art techniques. Additionally, our method achieves a mAP@.5(%) of 98.6, indicating its robustness in detecting small objects across varying scales

AaP-ReID: Improved Attention-Aware Person Re-identification

  • paper_url: http://arxiv.org/abs/2309.15780
  • repo_url: None
  • paper_authors: Vipin Gautam, Shitala Prasad, Sharad Sinha
  • for: Person re-identification: identifying a specific individual within a gallery of images despite pose variations, illumination changes, occlusions, and confusing backgrounds
  • methods: AaP-ReID improves AlignedReID++ by incorporating channel-wise attention into a ResNet-based architecture through a Channel-Wise Attention Bottleneck (CWAbottleneck) block, which dynamically adjusts the importance of each channel in the feature maps to learn more discriminative features
  • results: Competitive rank-1 accuracies of 95.6% on Market-1501, 90.6% on DukeMTMC-reID, and 82.4% on CUHK03 against state-of-the-art person ReID methods
    Abstract Person re-identification (ReID) is a well-known problem in the field of computer vision. The primary objective is to identify a specific individual within a gallery of images. However, this task is challenging due to various factors, such as pose variations, illumination changes, obstructions, and the presence of confusing backgrounds. Existing ReID methods often fail to capture discriminative features (e.g., head, shoes, backpacks) and instead capture irrelevant features when the target is occluded. Motivated by the success of part-based and attention-based ReID methods, we improve AlignedReID++ and present AaP-ReID, a more effective method for person ReID that incorporates channel-wise attention into a ResNet-based architecture. Our method incorporates the Channel-Wise Attention Bottleneck (CWAbottleneck) block and can learn discriminating features by dynamically adjusting the importance of each channel in the feature maps. We evaluated AaP-ReID on three benchmark datasets: Market-1501, DukeMTMC-reID, and CUHK03. When compared with state-of-the-art person ReID methods, we achieve competitive results with rank-1 accuracies of 95.6% on Market-1501, 90.6% on DukeMTMC-reID, and 82.4% on CUHK03.
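A minimal sketch of channel-wise attention of the kind a CWAbottleneck builds on: a squeeze-and-excitation style gate that rescales each channel of a feature map by a learned importance weight. The reduction ratio and the exact placement inside a ResNet bottleneck are assumptions for illustration; this is not the paper's block.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gate (illustrative stand-in)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, 1),  # excitation MLP
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel importance in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)   # dynamically reweight each channel

# Usage sketch: rescale the channels of a ResNet-stage feature map.
features = torch.randn(2, 256, 32, 32)
attended = ChannelAttention(256)(features)
```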

Aperture Diffraction for Compact Snapshot Spectral Imaging

  • paper_url: http://arxiv.org/abs/2309.16372
  • repo_url: https://github.com/krito-ex/csst
  • paper_authors: Tao Lv, Hao Ye, Quan Yuan, Zhan Shi, Yibo Wang, Shuming Wang, Xun Cao
  • for: Compact, cost-effective snapshot spectral imaging with the Aperture Diffraction Imaging Spectrometer (ADIS), which consists only of an imaging lens with an ultra-thin orthogonal aperture mask and a mosaic filter sensor, adding no physical footprint over a common RGB camera
  • methods: A new optical design multiplexes each point in object space to discrete encoding locations on the mosaic filter sensor via diffraction-based spatial-spectral projection from the orthogonal mask; the Cascade Shift-Shuffle Spectral Transformer (CSST), designed with strong perception of the diffraction degeneration, solves the resulting sparsity-constrained inverse problem to reconstruct the spectral volume from heavily aliased 2D measurements
  • results: Single-exposure experiments, together with an analysis of the imaging optical theory and reconstruction algorithm, demonstrate sub-super-pixel spatial resolution and high spectral resolution imaging; code will be available at https://github.com/Krito-ex/CSST
    Abstract We demonstrate a compact, cost-effective snapshot spectral imaging system named Aperture Diffraction Imaging Spectrometer (ADIS), which consists only of an imaging lens with an ultra-thin orthogonal aperture mask and a mosaic filter sensor, requiring no additional physical footprint compared to common RGB cameras. Then we introduce a new optical design that each point in the object space is multiplexed to discrete encoding locations on the mosaic filter sensor by diffraction-based spatial-spectral projection engineering generated from the orthogonal mask. The orthogonal projection is uniformly accepted to obtain a weakly calibration-dependent data form to enhance modulation robustness. Meanwhile, the Cascade Shift-Shuffle Spectral Transformer (CSST) with strong perception of the diffraction degeneration is designed to solve a sparsity-constrained inverse problem, realizing the volume reconstruction from 2D measurements with Large amount of aliasing. Our system is evaluated by elaborating the imaging optical theory and reconstruction algorithm with demonstrating the experimental imaging under a single exposure. Ultimately, we achieve the sub-super-pixel spatial resolution and high spectral resolution imaging. The code will be available at: https://github.com/Krito-ex/CSST.

High Perceptual Quality Wireless Image Delivery with Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.15889
  • repo_url: None
  • paper_authors: Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz
  • for: Wireless image transmission over a noisy channel with deep learning-based joint source-channel coding (DeepJSCC) and a denoising diffusion probabilistic model (DDPM) at the receiver, targeting the perception-distortion trade-off in the practical finite block length regime
  • methods: The target image is decomposed via a range-null space decomposition; the range-space component is encoded and transmitted, and the DDPM progressively refines the null-space content at the receiver
  • results: Significant improvements in distortion and perceptual quality of reconstructed images compared to standard DeepJSCC and a state-of-the-art generative learning-based method; the source code will be released publicly
    Abstract We consider the image transmission problem over a noisy wireless channel via deep learning-based joint source-channel coding (DeepJSCC) along with a denoising diffusion probabilistic model (DDPM) at the receiver. Specifically, we are interested in the perception-distortion trade-off in the practical finite block length regime, in which separate source and channel coding can be highly suboptimal. We introduce a novel scheme that utilizes the range-null space decomposition of the target image. We transmit the range-space of the image after encoding and employ DDPM to progressively refine its null space contents. Through extensive experiments, we demonstrate significant improvements in distortion and perceptual quality of reconstructed images compared to standard DeepJSCC and the state-of-the-art generative learning-based method. We will publicly share our source code to facilitate further research and reproducibility.
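A minimal sketch of a range-null space decomposition for a linear measurement operator A: the component A⁺Ax is fully determined by the measurements, while (I - A⁺A)x lies in the null space and can be refined by a generative prior such as a DDPM. The small random operator below is a placeholder; the paper's learned encoder is not reproduced.

```python
import numpy as np

def range_null_decomposition(A, x):
    """Split x into its range-space and null-space components w.r.t. operator A."""
    A_pinv = np.linalg.pinv(A)
    x_range = A_pinv @ (A @ x)   # recoverable from the measurements A @ x
    x_null = x - x_range         # invisible to A; left for the generative model to fill in
    return x_range, x_null

# Usage sketch: a random 32x64 measurement operator standing in for the learned encoder.
rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))
x = rng.normal(size=64)
x_range, x_null = range_null_decomposition(A, x)
assert np.allclose(A @ x_null, 0.0, atol=1e-8)   # null-space part carries no measurement energy
```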

Rapid Network Adaptation: Learning to Adapt Neural Networks Using Test-Time Feedback

  • paper_url: http://arxiv.org/abs/2309.15762
  • repo_url: None
  • paper_authors: Teresa Yeo, Oğuzhan Fatih Kar, Zahra Sodagar, Amir Zamir
  • for: Adapting neural networks to distribution shifts at test time: rather than training-time robustness mechanisms that try to anticipate the shift, a closed-loop system uses a test-time feedback signal to adapt the network on the fly
  • methods: The loop is implemented with a learning-based function that acts as an amortized optimizer for the network, yielding Rapid Network Adaptation (RNA)
  • results: Across various adaptation signals, tasks (depth, optical flow, semantic segmentation, classification), datasets (Taskonomy, Replica, ScanNet, Hypersim, COCO, ImageNet), and distribution shifts (cross-dataset, 2D and 3D Common Corruptions), RNA is notably more flexible and orders of magnitude faster than the baselines
    Abstract We propose a method for adapting neural networks to distribution shifts at test-time. In contrast to training-time robustness mechanisms that attempt to anticipate and counter the shift, we create a closed-loop system and make use of a test-time feedback signal to adapt a network on the fly. We show that this loop can be effectively implemented using a learning-based function, which realizes an amortized optimizer for the network. This leads to an adaptation method, named Rapid Network Adaptation (RNA), that is notably more flexible and orders of magnitude faster than the baselines. Through a broad set of experiments using various adaptation signals and target tasks, we study the efficiency and flexibility of this method. We perform the evaluations using various datasets (Taskonomy, Replica, ScanNet, Hypersim, COCO, ImageNet), tasks (depth, optical flow, semantic segmentation, classification), and distribution shifts (Cross-datasets, 2D and 3D Common Corruptions) with promising results. We end with a discussion on general formulations for handling distribution shifts and our observations from comparing with similar approaches from other domains.

CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

  • paper_url: http://arxiv.org/abs/2309.15755
  • repo_url: None
  • paper_authors: Ao Wang, Hui Chen, Zijia Lin, Sicheng Zhao, Jungong Han, Guiguang Ding
  • for: Jointly compressing Vision Transformers (ViTs) for high accuracy, fast inference, and favorable transferability to downstream tasks that need the spatial structure of images, such as semantic segmentation
  • methods: An asymmetric token merging (ATME) strategy integrates neighboring tokens, compressing redundant token information while preserving spatial structure, and a consistent dynamic channel pruning (CDCP) strategy uniformly prunes unimportant channels in the multi-head self-attention modules of ViTs
  • results: State-of-the-art performance across various ViTs: pruned DeiT-Tiny and DeiT-Small achieve 1.7x and 1.9x speedups without accuracy drops on ImageNet, and up to 1.31x speedup with comparable mIoU on ADE20k segmentation
    Abstract Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks recently. However, their heavy computation costs remain daunting for resource-limited devices. Consequently, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. However, they generally sparsely drop redundant image tokens by token pruning or brutally remove channels by channel pruning, leading to a sub-optimal balance between model performance and inference speed. They are also disadvantageous in transferring compressed models to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose a joint compression method for ViTs that offers both high accuracy and fast inference speed, while also maintaining favorable transferability to downstream tasks (CAIT). Specifically, we introduce an asymmetric token merging (ATME) strategy to effectively integrate neighboring tokens. It can successfully compress redundant token information while preserving the spatial structure of images. We further employ a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in multi-head self-attention modules of ViTs can be pruned uniformly, greatly enhancing the model compression. Extensive experiments on benchmark datasets demonstrate that our proposed method can achieve state-of-the-art performance across various ViTs. For example, our pruned DeiT-Tiny and DeiT-Small achieve speedups of 1.7$\times$ and 1.9$\times$, respectively, without accuracy drops on ImageNet. On the ADE20k segmentation dataset, our method can enjoy up to 1.31$\times$ speedups with comparable mIoU. Our code will be publicly available.
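A minimal sketch of merging neighboring tokens while keeping the spatial grid: pairs of horizontally adjacent tokens are averaged, halving the token count but preserving a coherent (H, W/2) layout. Simple averaging of fixed pairs is an assumption for illustration; the paper's asymmetric merging rule is not reproduced.

```python
import torch

def merge_horizontal_neighbors(tokens, h, w):
    """Average each pair of horizontally adjacent tokens.

    tokens: (batch, h * w, dim) sequence laid out row-major on an h x w grid (w even).
    Returns a (batch, h * (w // 2), dim) sequence on an h x (w // 2) grid.
    """
    b, n, d = tokens.shape
    grid = tokens.view(b, h, w, d)
    merged = 0.5 * (grid[:, :, 0::2, :] + grid[:, :, 1::2, :])   # pairwise mean along width
    return merged.reshape(b, h * (w // 2), d)

# Usage sketch: halve the 196 patch tokens of a 14x14 ViT feature grid.
tokens = torch.randn(2, 14 * 14, 192)
merged = merge_horizontal_neighbors(tokens, h=14, w=14)   # -> (2, 98, 192)
```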

InfraParis: A multi-modal and multi-task autonomous driving dataset

  • paper_url: http://arxiv.org/abs/2309.15751
  • repo_url: None
  • paper_authors: Gianni Franchi, Marwane Hariat, Xuanlong Yu, Nacim Belkhir, Antoine Manzanera, David Filliat
  • for: A multi-modal dataset intended to improve the reliability and diversity of computer vision models for autonomous driving
  • methods: The InfraParis dataset supports semantic segmentation, object detection, and depth estimation across three modalities (RGB, depth, and infrared), and various state-of-the-art deep neural network baselines are assessed on it
  • results: Using the multi-modal dataset is found to improve model performance and to help models better handle new objects, noise, nighttime conditions, and diverse scenarios
    Abstract Current deep neural networks (DNNs) for autonomous driving computer vision are typically trained on specific datasets that only involve a single type of data and urban scenes. Consequently, these models struggle to handle new objects, noise, nighttime conditions, and diverse scenarios, which is essential for safety-critical applications. Despite ongoing efforts to enhance the resilience of computer vision DNNs, progress has been sluggish, partly due to the absence of benchmarks featuring multiple modalities. We introduce a novel and versatile dataset named InfraParis that supports multiple tasks across three modalities: RGB, depth, and infrared. We assess various state-of-the-art baseline techniques, encompassing models for the tasks of semantic segmentation, object detection, and depth estimation.

Automated CT Lung Cancer Screening Workflow using 3D Camera

  • paper_url: http://arxiv.org/abs/2309.15750
  • repo_url: None
  • paper_authors: Brian Teixeira, Vivek Singh, Birgi Tamersoy, Andreas Prokein, Ankur Kapoor
  • for: Eliminating the time-consuming scout scans needed in CT lung cancer screening to compute the dose profile and position the patient
  • methods: Patient scan range, isocenter, and Water Equivalent Diameter (WED) are estimated from 3D camera images with an implicit generative model trained on over 60,000 CT scans, together with a novel approach for updating the prediction using real-time scan data
  • results: On a test set of 110 pairs of depth data and CT scans, average errors are 5 mm for the isocenter, 13 mm for the scan range, and 10 mm and 16 mm for the AP and lateral WED respectively; the relative WED error of 4% is well within the IEC acceptance criterion of 10%
    Abstract Despite recent developments in CT planning that enabled automation in patient positioning, time-consuming scout scans are still needed to compute dose profile and ensure the patient is properly positioned. In this paper, we present a novel method which eliminates the need for scout scans in CT lung cancer screening by estimating patient scan range, isocenter, and Water Equivalent Diameter (WED) from 3D camera images. We achieve this task by training an implicit generative model on over 60,000 CT scans and introduce a novel approach for updating the prediction using real-time scan data. We demonstrate the effectiveness of our method on a testing set of 110 pairs of depth data and CT scan, resulting in an average error of 5mm in estimating the isocenter, 13mm in determining the scan range, 10mm and 16mm in estimating the AP and lateral WED respectively. The relative WED error of our method is 4%, which is well within the International Electrotechnical Commission (IEC) acceptance criteria of 10%.

Synthetic Latent Fingerprint Generation Using Style Transfer

  • paper_url: http://arxiv.org/abs/2309.15734
  • repo_url: None
  • paper_authors: Amol S. Joshi, Ali Dabouei, Nasser Nasrabadi, Jeremy Dawson
  • for: Generating realistic synthetic latent fingerprints to address limited data availability for training data-hungry neural networks
  • methods: A simple and effective approach based on style transfer and image blending applied to contact-based fingerprints
  • results: The generated latent fingerprints preserve the identity information of the input contact-based fingerprints while exhibiting the characteristics of real latent fingerprints, and show multiple qualities and styles, so several samples can be produced from a single fingerprint
    Abstract Limited data availability is a challenging problem in the latent fingerprint domain. Synthetically generated fingerprints are vital for training data-hungry neural network-based algorithms. Conventional methods distort clean fingerprints to generate synthetic latent fingerprints. We propose a simple and effective approach using style transfer and image blending to synthesize realistic latent fingerprints. Our evaluation criteria and experiments demonstrate that the generated synthetic latent fingerprints preserve the identity information from the input contact-based fingerprints while possessing similar characteristics as real latent fingerprints. Additionally, we show that the generated fingerprints exhibit several qualities and styles, suggesting that the proposed method can generate multiple samples from a single fingerprint.

Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation

  • paper_url: http://arxiv.org/abs/2309.15726
  • repo_url: None
  • paper_authors: Xin Yuan, Michael Maire
  • for: An unsupervised neural architecture that simultaneously learns to generate and segment images
  • methods: Learning is driven entirely by an unsupervised denoising diffusion objective, with no annotations or prior knowledge about regions; a computational bottleneck built into the architecture encourages the denoising network to partition the input into regions, denoise them in parallel, and combine the results
  • results: The trained model generates high-quality synthetic images and, by inspecting its internal predicted partitions, yields semantic segmentations; applied without fine-tuning, it segments real images by noising and then denoising them, achieving accurate unsupervised segmentation and high-quality generation across multiple datasets
    Abstract We develop a neural network architecture which, trained in an unsupervised manner as a denoising diffusion model, simultaneously learns to both generate and segment images. Learning is driven entirely by the denoising diffusion objective, without any annotation or prior knowledge about regions during training. A computational bottleneck, built into the neural architecture, encourages the denoising network to partition an input into regions, denoise them in parallel, and combine the results. Our trained model generates both synthetic images and, by simple examination of its internal predicted partitions, a semantic segmentation of those images. Without any finetuning, we directly apply our unsupervised model to the downstream task of segmenting real images via noising and subsequently denoising them. Experiments demonstrate that our model achieves accurate unsupervised image segmentation and high-quality synthetic image generation across multiple datasets.
    摘要 我们开发了一种神经网络架构,通过不经过监督的方式,同时学习生成和分割图像。在训练过程中,我们没有任何注释或区域知识, entirely driven by the denoising diffusion objective。我们的神经网络架构包含计算瓶颈,使得杂化网络partition输入图像,并在平行进行杂化和结合结果。我们训练的模型可以同时生成 sintetic 图像和通过内部预测的分区来实现图像 semantic segmentation。无需追究,我们直接将无监督模型应用于图像分割下游任务,通过噪音和杂化图像来进行预测。实验表明,我们的模型可以准确无监督地分割图像和生成高质量的 sintetic 图像,在多个 dataset 上表现出色。
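The "partition, denoise in parallel, combine" bottleneck can be pictured with a toy module that predicts K soft region masks together with K per-region outputs and combines them by a mask-weighted sum. This is only one possible reading of the described architecture, not the authors' network; layer sizes and the choice of K are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class FactorizedDenoiser(nn.Module):
    """Toy sketch: predict K soft region masks and K per-region denoising
    outputs, then combine them with a mask-weighted sum (an assumed reading
    of the 'partition -> denoise in parallel -> combine' bottleneck)."""
    def __init__(self, channels=3, K=4, width=32):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, K, 3, padding=1))
        self.region_heads = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(K))

    def forward(self, x_noisy):
        masks = self.mask_head(x_noisy).softmax(dim=1)              # B x K x H x W
        per_region = torch.stack(
            [head(x_noisy) for head in self.region_heads], dim=1)   # B x K x C x H x W
        denoised = (masks.unsqueeze(2) * per_region).sum(dim=1)     # B x C x H x W
        return denoised, masks   # masks double as an unsupervised segmentation

model = FactorizedDenoiser()
x = torch.randn(2, 3, 64, 64)
out, seg = model(x)
print(out.shape, seg.argmax(dim=1).shape)
```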

Physics-Based Rigid Body Object Tracking and Friction Filtering From RGB-D Videos

  • paper_url: http://arxiv.org/abs/2309.15703
  • repo_url: None
  • paper_authors: Rama Krishna Kandukuri, Michael Strecke, Joerg Stueckler
  • for: 这篇论文是为了解决物体间互动的理解问题,以便在增强现实和机器人领域中实现更加精确的模拟和控制。
  • methods: 该 paper 使用了一种新的方法,即使用可微的物理模拟来模型物体的互动,并通过扩展卡尔曼滤波来Track 3D 物体的位姿和物理性能。
  • results: 该 paper 的实验结果表明,该方法可以准确地滤波物体的位姿和动量,同时也可以估算物体的透抗率。 furthermore, 该 paper 还提供了一些实验结果,证明该方法在不同的滑动场景中的性能。
    Abstract Physics-based understanding of object interactions from sensory observations is an essential capability in augmented reality and robotics. It enables capturing the properties of a scene for simulation and control. In this paper, we propose a novel approach for real-to-sim which tracks rigid objects in 3D from RGB-D images and infers physical properties of the objects. We use a differentiable physics simulation as state-transition model in an Extended Kalman Filter which can model contact and friction for arbitrary mesh-based shapes and in this way estimate physically plausible trajectories. We demonstrate that our approach can filter position, orientation, velocities, and concurrently can estimate the coefficient of friction of the objects. We analyse our approach on various sliding scenarios in synthetic image sequences of single objects and colliding objects. We also demonstrate and evaluate our approach on a real-world dataset. We will make our novel benchmark datasets publicly available to foster future research in this novel problem setting and comparison with our method.
    摘要 物理基础的物体互动理解从感知观察中获得,是增强现实和机器人学中的一项关键能力,它可以捕捉场景的属性以进行模拟和控制。在这篇文章中,我们提出了一种新的 real-to-sim 方法,从 RGB-D 图像中追踪 3D 刚体物体,并推断物体的物理性质。我们使用可微分的物理模拟作为扩展卡尔曼滤波的状态转换模型,可以对任意网格形状的物体建模接触和摩擦,并以此估算物理上合理的运动轨迹。我们展示了该方法可以滤波物体的位置、方向与速度,并同时估算物体的摩擦系数。我们在单一物体与碰撞物体的多种滑动情境的合成图像序列上分析了该方法,并在真实世界数据集上进行了评估。我们将公开新的 benchmark 数据集,以促进这一新问题设定下的未来研究与方法比较。
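A minimal sketch of the underlying filtering idea: an Extended Kalman Filter whose state-transition model is a differentiable physics step, with the transition Jacobian obtained by automatic differentiation. The one-dimensional sliding-block "simulator", noise levels, and measurement model below are placeholders, not the paper's mesh-based contact simulation; extending the state with the friction coefficient would let the same filter estimate it, as the paper does.

```python
import torch
from torch.autograd.functional import jacobian

def physics_step(state):
    """Placeholder differentiable 'simulation': a 1D sliding block with
    Coulomb friction. state = [position, velocity]. A real implementation
    would be a mesh-based contact/friction simulator."""
    dt, mu, g = 0.01, 0.3, 9.81
    pos, vel = state[0], state[1]
    acc = -mu * g * torch.sign(vel)
    return torch.stack([pos + dt * vel, vel + dt * acc])

def ekf_step(x, P, z, Q, R, H):
    # Predict: propagate the state through the differentiable physics model.
    F = jacobian(physics_step, x)              # transition Jacobian via autograd
    x_pred = physics_step(x)
    P_pred = F @ P @ F.T + Q
    # Update with a position measurement z.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ torch.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (torch.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

x = torch.tensor([0.0, 1.0]); P = torch.eye(2) * 0.1
Q = torch.eye(2) * 1e-4;      R = torch.eye(1) * 1e-2
H = torch.tensor([[1.0, 0.0]])                 # we only observe position
x, P = ekf_step(x, P, torch.tensor([0.012]), Q, R, H)
print(x)
```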

SGRec3D: Self-Supervised 3D Scene Graph Learning via Object-Level Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2309.15702
  • repo_url: None
  • paper_authors: Sebastian Koch, Pedro Hermosilla, Narunas Vaskevicius, Mirco Colosi, Timo Ropinski
  • for: 提高3D场景理解的能力
  • methods: 使用自我超vised pre-training方法SGRec3D来预处理3D场景图
  • results: 比起其他点云基于预训练方法,SGRec3D在3D场景图预测中提高了表达能力,得到了SOTA的性能,并且只需使用10%的标注数据进行精度调整即可以超过同类模型。
    Abstract In the field of 3D scene understanding, 3D scene graphs have emerged as a new scene representation that combines geometric and semantic information about objects and their relationships. However, learning semantic 3D scene graphs in a fully supervised manner is inherently difficult as it requires not only object-level annotations but also relationship labels. While pre-training approaches have helped to boost the performance of many methods in various fields, pre-training for 3D scene graph prediction has received little attention. Furthermore, we find in this paper that classical contrastive point cloud-based pre-training approaches are ineffective for 3D scene graph learning. To this end, we present SGRec3D, a novel self-supervised pre-training method for 3D scene graph prediction. We propose to reconstruct the 3D input scene from a graph bottleneck as a pretext task. Pre-training SGRec3D does not require object relationship labels, making it possible to exploit large-scale 3D scene understanding datasets, which were off-limits for 3D scene graph learning before. Our experiments demonstrate that in contrast to recent point cloud-based pre-training approaches, our proposed pre-training improves the 3D scene graph prediction considerably, which results in SOTA performance, outperforming other 3D scene graph models by +10% on object prediction and +4% on relationship prediction. Additionally, we show that only using a small subset of 10% labeled data during fine-tuning is sufficient to outperform the same model without pre-training.
    摘要 在三维场景理解领域,三维场景图(3D scene graph)已成为一种新的场景表示方法,可以同时包含物体的几何和 semantic信息。然而,在完全监督的情况下学习 semantic 3D scene graph 是非常困难的,因为需要不仅物体级别的注释,还需要关系标签。而在多种领域中,预训练方法已经有所帮助提高性能,但是针对 3D scene graph 的预训练却受到了少量的关注。此外,我们在这篇论文中发现,经典的对比点云预训练方法对于 3D scene graph 学习是无效的。为此,我们提出了 SGRec3D,一种新的自我监督预训练方法 для 3D scene graph 预测。我们提议使用场景图瓶颈来重建输入场景,作为一种预text任务。预训练 SGRec3D 不需要物体关系标签,因此可以利用大规模的 3D scene understanding 数据集,这些数据集在之前是 3D scene graph 学习中的不可达。我们的实验结果表明,相比最近的点云预训练方法,我们提posed的预训练方法可以大幅提高 3D scene graph 预测性能,达到了最新的标准性能,在物体预测上超过了 +10%,在关系预测上超过了 +4%。此外,我们还证明了只使用 10% 的标注数据进行细化调教是足够的,可以超过同样的模型无预训练。

Physics Inspired Hybrid Attention for SAR Target Recognition

  • paper_url: http://arxiv.org/abs/2309.15697
  • repo_url: https://github.com/xai4sar/piha
  • paper_authors: Zhongling Huang, Chong Wu, Xiwen Yao, Zhicheng Zhao, Xiankai Huang, Junwei Han
  • for: The paper aims to improve the performance and physical interpretability of SAR target recognition by integrating physical models and deep neural networks (DNNs).
  • methods: It proposes a physics-inspired hybrid attention (PIHA) mechanism that uses the high-level semantics of physical information to activate and guide feature groups aware of the target's local semantics; PIHA can be integrated into arbitrary DNNs without modifying the original architecture.
  • results: The method outperforms other state-of-the-art approaches in 12 test scenarios with the same ASC parameters, remains effective for different kinds of physical information, and its robustness and generalizability are assessed with the once-for-all (OFA) evaluation protocol.
    Abstract There has been a recent emphasis on integrating physical models and deep neural networks (DNNs) for SAR target recognition, to improve performance and achieve a higher level of physical interpretability. The attributed scattering center (ASC) parameters garnered the most interest, being considered as additional input data or features for fusion in most methods. However, the performance greatly depends on the ASC optimization result, and the fusion strategy is not adaptable to different types of physical information. Meanwhile, the current evaluation scheme is inadequate to assess the model's robustness and generalizability. Thus, we propose a physics inspired hybrid attention (PIHA) mechanism and the once-for-all (OFA) evaluation protocol to address the above issues. PIHA leverages the high-level semantics of physical information to activate and guide the feature group aware of local semantics of target, so as to re-weight the feature importance based on knowledge prior. It is flexible and generally applicable to various physical models, and can be integrated into arbitrary DNNs without modifying the original architecture. The experiments involve a rigorous assessment using the proposed OFA, which entails training and validating a model on either sufficient or limited data and evaluating on multiple test sets with different data distributions. Our method outperforms other state-of-the-art approaches in 12 test scenarios with same ASC parameters. Moreover, we analyze the working mechanism of PIHA and evaluate various PIHA enabled DNNs. The experiments also show PIHA is effective for different physical information. The source code together with the adopted physical information is available at https://github.com/XAI4SAR.

A Unified View of Differentially Private Deep Generative Modeling

  • paper_url: http://arxiv.org/abs/2309.15696
  • repo_url: None
  • paper_authors: Dingfan Chen, Raouf Kerkouche, Mario Fritz
  • for: This paper aims to provide a unified view of various approaches for achieving privacy-preserving high-dimensional data generation through differentially private (DP) training of deep neural networks.
  • methods: The paper systematizes and jointly designs methods for different use cases, and discusses the strengths, limitations, and inherent correlations between different approaches.
  • results: The paper presents a novel unified view of privacy-preserving data generation methods, and provides potential paths forward for the field of DP data generation, with the aim of advancing privacy-preserving learning.
    Abstract The availability of rich and vast data sources has greatly advanced machine learning applications in various domains. However, data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing. Overcoming these obstacles in compliance with privacy considerations is key for technological progress in many real-world application scenarios that involve privacy sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released, enabling privacy-preserving downstream analysis and reproducible research in sensitive domains. In recent years, various approaches have been proposed for achieving privacy-preserving high-dimensional data generation by private training on top of deep neural networks. In this paper, we present a novel unified view that systematizes these approaches. Our view provides a joint design space for systematically deriving methods that cater to different use cases. We then discuss the strengths, limitations, and inherent correlations between different approaches, aiming to shed light on crucial aspects and inspire future research. We conclude by presenting potential paths forward for the field of DP data generation, with the aim of steering the community toward making the next important steps in advancing privacy-preserving learning.
    摘要 “由于丰富的数据源的可用性,机器学习应用在不同领域得到了很大的进步。然而,具有隐私问题的数据受到了严格的规定,这些规定 frequently prohibited data access 和 data sharing。为了遵循隐私考虑,在许多实际应用 scenario 中,技术进步是关键。具有隐私保证的数据发布(DP)提供了一个吸引人的解决方案,即仅发布了隐私检查的数据,允许隐私保证的下游分析和可重复性的研究。在过去几年,许多方法被提出供以实现隐私保证高维数据生成。在本文中,我们提出了一个新的统一的观点,它系统地探讨了不同的用案。我们的观点提供了一个共同的设计空间,可以系统地从数据生成中获得不同的方法。我们然后讨论了不同方法的优点、局限性和内在的相互关联性,以照明关键的问题和激励未来研究。我们结束时,提出了未来隐私保证数据生成领域的可能的进步之路,以导引社区做出下一步的进步。”
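Most DP training pipelines surveyed in such work build on the DP-SGD primitive: clip each per-sample gradient and add calibrated Gaussian noise before the parameter update. The sketch below spells this out manually for a toy model (production code would normally use a dedicated library and a privacy accountant); the learning rate, clipping norm, and noise multiplier are illustrative assumptions.

```python
import torch
import torch.nn as nn

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One DP-SGD step: clip each per-sample gradient to clip_norm, sum them,
    add Gaussian noise scaled by noise_mult * clip_norm, average, and update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):                     # per-sample gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale
    n = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noisy = s + torch.randn_like(s) * noise_mult * clip_norm
            p -= lr * noisy / n

model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
dp_sgd_step(model, nn.CrossEntropyLoss(), x, y)
```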

End-to-End Streaming Video Temporal Action Segmentation with Reinforce Learning

  • paper_url: http://arxiv.org/abs/2309.15683
  • repo_url: https://github.com/Thinksky5124/SVTAS
  • paper_authors: Wujun Wen, Jinrong Zhang, Shenglan Liu, Yunheng Li, Qifeng Li, Lin Feng
  • for: 本研究的目的是提出一种可以实时应用于长视频中的动作分类任务,以扩展现有的动作识别模型的应用场景。
  • methods: 该研究提出了一种结合流处理和强化学习的末端视频动作时间分 segmentation方法(SVTAS-RL),可以将动作识别任务视为动作分类 clustering 任务,并使用强化学习来缓解不一致的优化目标和方向问题。
  • results: 经过广泛的实验,SVTAS-RL 模型在多个数据集上达到了与现有模型相当的竞争性性能,并在ultra-long video dataset EGTEA 上表现出了更大的优势,这表明该方法可以取代现有的 TAS 模型,并且 SVTAS-RL 更适合长视频 TAS。
    Abstract Temporal Action Segmentation (TAS) from video is a kind of frame recognition task for long video with multiple action classes. As an video understanding task for long videos, current methods typically combine multi-modality action recognition models with temporal models to convert feature sequences to label sequences. This approach can only be applied to offline scenarios, which severely limits the TAS application. Therefore, this paper proposes an end-to-end Streaming Video Temporal Action Segmentation with Reinforce Learning (SVTAS-RL). The end-to-end SVTAS which regard TAS as an action segment clustering task can expand the application scenarios of TAS; and RL is used to alleviate the problem of inconsistent optimization objective and direction. Through extensive experiments, the SVTAS-RL model achieves a competitive performance to the state-of-the-art model of TAS on multiple datasets, and shows greater advantages on the ultra-long video dataset EGTEA. This indicates that our method can replace all current TAS models end-to-end and SVTAS-RL is more suitable for long video TAS. Code is availabel at https://github.com/Thinksky5124/SVTAS.
    摘要 视频时序动作分割(Temporal Action Segmentation, TAS)是一种针对包含多种动作类别的长视频的帧级识别任务。当前方法通常将多模态动作识别模型与时间模型组合,把特征序列转换为标签序列,但这种方法只适用于离线场景,严重限制了 TAS 的应用。因此,这篇论文提出了一种端到端的 Streaming Video Temporal Action Segmentation with Reinforce Learning (SVTAS-RL)。端到端的 SVTAS 将 TAS 视为动作段聚类任务,可以扩大 TAS 的应用场景;RL 则用于缓解优化目标与方向不一致的问题。经过广泛的实验,SVTAS-RL 模型在多个数据集上达到了与当前最优 TAS 模型相当的性能,并在超长视频数据集 EGTEA 上表现出更大的优势。这表明我们的方法可以端到端地替换现有 TAS 模型,且 SVTAS-RL 更适合长视频 TAS。代码可在 https://github.com/Thinksky5124/SVTAS 获取。

SJTU-TMQA: A quality assessment database for static mesh with texture map

  • paper_url: http://arxiv.org/abs/2309.15675
  • repo_url: None
  • paper_authors: Bingyang Cui, Qi Yang, Kaifa Yang, Yiling Xu, Xiaozhong Xu, Shan Liu
  • for: 这篇论文主要是为了评估纹理化网格质量的研究。
  • methods: 论文使用了21个参考网格和945个扭曲样本来创建大规模的纹理化网格质量评估数据库(SJTU-TMQA),并通过主观实验获得了意见分数(MOS)。
  • results: 研究显示了不同类型的扭曲对人类印象的影响,并评估了13种当前最佳对象度量的可靠性。结果显示这些度量之间的相关性为0.6级,表明需要更有效的对象度量。 SJTU-TMQA数据库可以在https://ccccby.github.io中下载。
    Abstract In recent years, static meshes with texture maps have become one of the most prevalent digital representations of 3D shapes in various applications, such as animation, gaming, medical imaging, and cultural heritage applications. However, little research has been done on the quality assessment of textured meshes, which hinders the development of quality-oriented applications, such as mesh compression and enhancement. In this paper, we create a large-scale textured mesh quality assessment database, namely SJTU-TMQA, which includes 21 reference meshes and 945 distorted samples. The meshes are rendered into processed video sequences and then conduct subjective experiments to obtain mean opinion scores (MOS). The diversity of content and accuracy of MOS has been shown to validate its heterogeneity and reliability. The impact of various types of distortion on human perception is demonstrated. 13 state-of-the-art objective metrics are evaluated on SJTU-TMQA. The results report the highest correlation of around 0.6, indicating the need for more effective objective metrics. The SJTU-TMQA is available at https://ccccby.github.io
    摘要 在最近的几年中,静止的矩阵图形在各种应用中变得非常普遍,如动画、游戏、医疗影像和文化遗产应用。然而,对纹理矩阵质量的研究很少,这限制了质量导向的应用,如矩阵压缩和提高。在这篇论文中,我们创建了一个大规模的纹理矩阵质量评估数据库,即上海交通大学纹理矩阵质量评价数据库(SJTU-TMQA),包括21个参考矩阵和945个扭曲样本。这些矩阵通过渲染而生成的处理视频序列,然后通过主观实验获得mean opinion score(MOS)。我们所得到的多样性和准确性已经被证明,以 validate its heterogeneity and reliability。我们还展示了不同类型的扭曲对人类的感知具有多大的影响。13种当前的对象度量被评估在SJTU-TMQA上,结果显示其相关性达0.6, indicating the need for more effective objective metrics。SJTU-TMQA可以在https://ccccby.github.io 上获取。
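Objective metrics are usually validated against such a database by correlating their scores with the MOS, e.g. via Pearson (PLCC) and Spearman (SROCC) correlation, which is presumably how the reported ~0.6 correlations were obtained. The numbers below are made up purely for illustration; studies often additionally fit a logistic mapping between metric and MOS before computing PLCC.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: subjective MOS and one objective metric's predictions
# for the same distorted meshes (values are invented for illustration).
mos    = np.array([4.5, 3.8, 2.1, 1.4, 3.0, 4.1, 2.7, 1.9])
metric = np.array([0.92, 0.85, 0.40, 0.22, 0.61, 0.88, 0.55, 0.35])

plcc, _ = pearsonr(metric, mos)    # linear correlation (prediction accuracy)
srocc, _ = spearmanr(metric, mos)  # rank correlation (prediction monotonicity)
print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}")
```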

Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing

  • paper_url: http://arxiv.org/abs/2309.15664
  • repo_url: https://github.com/wangkai930418/DPL
  • paper_authors: Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer
  • for: 这 paper 的目的是给用户提供精细的图像编辑功能,以便通过修改文本提示来控制生成的图像。
  • methods: 这 paper 使用了Diffusion模型,并提出了一种名为 Dynamic Prompt Learning(DPL)的新方法,以解决图像编辑时的偏差问题。
  • results: compared to existing methods, DPL 可以准确地编辑图像中的特定对象,而不会影响其他图像区域。这 paper 的实验结果表明,DPL 可以在多种图像场景中获得superior的结果, both quantitatively (CLIP score, Structure-Dist) 和 qualitatively (用户评价).
    Abstract Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects which have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose Dynamic Prompt Learning (DPL) to force cross-attention maps to focus on correct noun words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method DPL, based on the publicly available Stable Diffusion, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (on user-evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.
    摘要 大规模文本到图像生成模型已经是生成智能的一个重要发展,扩散模型表现出了从文本输入提示synthesize出实际的图像的惊人能力。图像编辑研究的目标是给用户控制生成图像的文本提示。现有的图像编辑技术容易导致不必要地修改图像背景或 Distractor 对象上的元素,这会导致图像 editing 失败。根据我们的实验结果,不准确的跨注意力地图是这个问题的根本原因。基于这一观察,我们提出了动态提示学习(DPL),强制跨注意力地图专注于正确的名词在文本输入中。通过更新文本输入中的动态token для名词,我们实现了细化的图像编辑,并避免了不必要地修改其他图像区域。我们的方法DPL,基于公共可用的稳定扩散,广泛评估了多种图像,并 consistently 获得了较好的数值(CLIP 分数、结构-分布)和质量(用户评估)评估结果。我们展示了对 Word-Swap、提示精度和注意力重新分配的提示编辑结果的改进,特别是在复杂多对象场景中。

Human Kinematics-inspired Skeleton-based Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.15662
  • repo_url: https://github.com/XiaoJian923/Kinematics-VAD
  • paper_authors: Jian Xiao, Tianyuan Liu, Genlin Ji
  • for: 本研究旨在探讨人体异常检测视频中的新方法,以及人体动态特征如何用于检测异常。
  • methods: 本研究提出了一种新的方法 called HKVAD (Human Kinematic-inspired Video Anomaly Detection),它利用人体动态特征来检测视频异常。该方法首先利用人体三维姿态数据,特别是跑步姿势、脚部位置和颈部位置的动态特征,然后使用流变模型来估算概率并检测异常。
  • results: 根据实验结果,HKVAD方法在两个公共数据集上(ShanghaiTech和UBnormal)获得了良好的结果,而且只需使用了 minimal computational resources。这表明该方法的有效性和潜在性。
    Abstract Previous approaches to detecting human anomalies in videos have typically relied on implicit modeling by directly applying the model to video or skeleton data, potentially resulting in inaccurate modeling of motion information. In this paper, we conduct an exploratory study and introduce a new idea called HKVAD (Human Kinematic-inspired Video Anomaly Detection) for video anomaly detection, which involves the explicit use of human kinematic features to detect anomalies. To validate the effectiveness and potential of this perspective, we propose a pilot method that leverages the kinematic features of the skeleton pose, with a specific focus on the walking stride, skeleton displacement at feet level, and neck level. Following this, the method employs a normalizing flow model to estimate density and detect anomalies based on the estimated density. Based on the number of kinematic features used, we have devised three straightforward variant methods and conducted experiments on two highly challenging public datasets, ShanghaiTech and UBnormal. Our method achieves good results with minimal computational resources, validating its effectiveness and potential.
    摘要 前一些视频人异常检测方法通常是通过直接应用模型到视频或skeleton数据来进行隐式模型,这可能导致动作信息的不准确模型化。在这篇论文中,我们进行了一项探索性研究,并提出了一种新的思路called HKVAD(人体骨征驱动的视频异常检测),这种方法利用人体骨征特征来检测异常。为了证明这种视角的有效性和潜力,我们提议了一种起点方法,该方法利用人体skeleton姿势中的步伐、脚部偏移量和 neck 水平位置的骨征特征。接着,该方法使用了一种归一化流模型来估计密度并检测异常基于估计的密度。根据使用的骨征特征数量,我们设计了三种简单的变体方法,并在两个非常困难的公共数据集上进行了实验,即ShanghaiTech 和 UBnormal。我们的方法在计算资源不多的情况下达到了良好的结果,这 validate了其有效性和潜力。
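As a rough illustration of the kinematic cues named above (walking stride, feet-level and neck-level displacement), the sketch below extracts simple per-frame features from a 3D pose sequence; the joint indices and exact feature definitions are assumptions, not the paper's. These features would then be modeled by a normalizing flow, with low estimated density flagged as anomalous.

```python
import numpy as np

def kinematic_features(poses, l_ankle=15, r_ankle=16, neck=1):
    """poses: (T, J, 3) array of 3D joint positions over T frames.
    Returns per-frame stride width, feet-level displacement, and neck-level
    displacement -- simple stand-ins for the cues described above.
    (Joint indices are hypothetical and depend on the pose format.)"""
    stride = np.linalg.norm(poses[:, l_ankle] - poses[:, r_ankle], axis=-1)
    feet_center = 0.5 * (poses[:, l_ankle] + poses[:, r_ankle])
    feet_disp = np.linalg.norm(np.diff(feet_center, axis=0), axis=-1)
    neck_disp = np.linalg.norm(np.diff(poses[:, neck], axis=0), axis=-1)
    # Align lengths (frame-to-frame displacements have T-1 entries).
    return np.stack([stride[1:], feet_disp, neck_disp], axis=-1)   # (T-1, 3)

poses = np.random.randn(30, 17, 3)     # fake 30-frame, 17-joint sequence
feats = kinematic_features(poses)
print(feats.shape)                      # (29, 3) -> fed to a normalizing flow
```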

FRS-Nets: Fourier Parameterized Rotation and Scale Equivariant Networks for Retinal Vessel Segmentation

  • paper_url: http://arxiv.org/abs/2309.15638
  • repo_url: None
  • paper_authors: Zihong Sun, Qi Xie, Deyu Meng
    for: 这篇论文主要目的是提出一种新的卷积操作符(FRS-Conv),以提高卷积神经网(CNNs)在血管分类中的精度和一致性。methods: 这篇论文使用了一种新的参数化方案,允许卷积 filters 进行高精度的旋转和缩放变换。它还提出了旋转和缩放对卷积映射的等调性数学表述。最后,它将这些表述与传统卷积映射相结合,实现了FRS-Conv。results: 这篇论文的实验结果显示,使用FRS-Conv可以实现血管分类中的高精度和一致性。它在三个公共数据集上进行了广泛的比较实验,包括内集和跨集设定。与相比方法相比,FRS-Nets 仅需13.9%的参数,却能够实现顶尖的性能,并且具有优秀的一致性和丰富的应用潜力。
    Abstract With translation equivariance, convolution neural networks (CNNs) have achieved great success in retinal vessel segmentation. However, some other symmetries of the vascular morphology are not characterized by CNNs, such as rotation and scale symmetries. To embed more equivariance into CNNs and achieve the accuracy requirement for retinal vessel segmentation, we construct a novel convolution operator (FRS-Conv), which is Fourier parameterized and equivariant to rotation and scaling. Specifically, we first adopt a new parameterization scheme, which enables convolutional filters to arbitrarily perform transformations with high accuracy. Secondly, we derive the formulations for the rotation and scale equivariant convolution mapping. Finally, we construct FRS-Conv following the proposed formulations and replace the traditional convolution filters in U-Net and Iter-Net with FRS-Conv (FRS-Nets). We faithfully reproduce all compared methods and conduct comprehensive experiments on three public datasets under both in-dataset and cross-dataset settings. With merely 13.9% parameters of corresponding baselines, FRS-Nets have achieved state-of-the-art performance and significantly outperform all compared methods. It demonstrates the remarkable accuracy, generalization, and clinical application potential of FRS-Nets.
    摘要 使用翻译等价性,图像卷积神经网络(CNN)在血管轮廓分割方面取得了很大的成功。然而,图像中的其他同质性,如旋转和缩放同质性,并没有被CNN表征出来。为了嵌入更多的等价性到CNN中,并达到血管轮廓分割的精度要求,我们构建了一种新型的卷积算子(FRS-Conv),该算子是快 Fourier 参数化的和旋转和缩放等价的。 Specifically,我们首先采用一种新的参数化方案,允许卷积滤波器通过高精度执行变换。其次,我们 derivate了旋转和缩放等价的卷积映射表达式。最后,我们根据提出的表达式构建FRS-Conv,并将传统卷积滤波器在U-Net和Iter-Net中取代。我们忠实地复制了所有相关的方法,并在三个公共数据集上进行了广泛的实验,包括在集合和跨集合设置下。只有13.9%的参数,FRS-Nets已经达到了同类方法的状态对比较好的性能,并显著超过了所有相关方法。这表明FRS-Nets具有出色的精度、普遍性和临床应用潜力。

Position and Orientation-Aware One-Shot Learning for Medical Action Recognition from Signal Data

  • paper_url: http://arxiv.org/abs/2309.15635
  • repo_url: None
  • paper_authors: Leiyu Xie, Yuxing Yang, Zeyu Fu, Syed Mohsen Naqvi
  • for: 这篇论文旨在提出一个基于信号数据的医疗动作识别框架,以提高医疗动作识别的精度和可靠性。
  • methods: 该框架包括两个阶段,每个阶段含有信号生成(SIG)、跨注意(CsA)、动态时间规整(DTW)模组,以及具有隐私保护的位置与方向特征的资讯融合。SIG 方法旨在将骨架资料转换为隐私保护的特征,以供训练。CsA 模组用于引导网络减少医疗动作识别偏差,并将注意力集中在每个动作所涉及的重要人体部位,以解决相似医疗动作的混淆问题。DTW 模组则用于最小化实例间的时间错位,以提高模型性能。
  • results: 实验结果显示,该提案的方法可以在NTU RGB+D 60、NTU RGB+D 120和PKU-MMD 等三个常用和知名的数据集上实现医疗动作识别的高精度和可靠性,并在不同的数据分配情况下协助优化医疗动作识别的性能,比如NTU RGB+D 60 的通用数据分配下提高了2.7%、NTU RGB+D 120 的通用数据分配下提高了6.2%、PKU-MMD 的通用数据分配下提高了4.1%。
    Abstract In this work, we propose a position and orientation-aware one-shot learning framework for medical action recognition from signal data. The proposed framework comprises two stages and each stage includes signal-level image generation (SIG), cross-attention (CsA), dynamic time warping (DTW) modules and the information fusion between the proposed privacy-preserved position and orientation features. The proposed SIG method aims to transform the raw skeleton data into privacy-preserved features for training. The CsA module is developed to guide the network in reducing medical action recognition bias and more focusing on important human body parts for each specific action, aimed at addressing similar medical action related issues. Moreover, the DTW module is employed to minimize temporal mismatching between instances and further improve model performance. Furthermore, the proposed privacy-preserved orientation-level features are utilized to assist the position-level features in both of the two stages for enhancing medical action recognition performance. Extensive experimental results on the widely-used and well-known NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets all demonstrate the effectiveness of the proposed method, which outperforms the other state-of-the-art methods with general dataset partitioning by 2.7%, 6.2% and 4.1%, respectively.
    摘要 在这个工作中,我们提出了一个位置和方向意识到一步学习框架 для医疗动作识别从信号数据。我们的框架包括两个阶段,每个阶段包括信号水平图生成(SIG)、交叉注意(CsA)、动态时间滤波(DTW)模块以及信号水平和方向级别特征的信息融合。我们的SIG方法旨在将原始骨架数据转换成隐私保护的特征进行训练。CsA模块是为了帮助网络减少医疗动作识别偏见,更关注每个特定动作中人体重要部分,以解决类似的医疗动作相关问题。此外,DTW模块用于最小化时间匹配错误,以提高模型性能。此外,我们的隐私保护方向级别特征被利用以帮助位置级别特征在两个阶段中提高医疗动作识别性能。我们的实验结果表明,我们的方法在 widely 使用和知名的 NTU RGB+D 60、NTU RGB+D 120 和 PKU-MMD 数据集上均达到了最高效果,与其他状态对比方法的总体数据分区优势为2.7%、6.2%和4.1%,分别。
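The DTW module referred to above minimizes temporal mismatch between instances; the classic dynamic-programming form of DTW is sketched below for two feature sequences of different lengths (a textbook version, not necessarily the variant used in the paper).

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two sequences of feature
    vectors a (n, d) and b (m, d); returns the accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

query   = np.sin(np.linspace(0, 3, 40))[:, None]   # toy 1-D "signals"
support = np.sin(np.linspace(0, 3, 55))[:, None]   # same shape, different speed
print(dtw_distance(query, support))                # small despite the length mismatch
```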

Neuromorphic Imaging and Classification with Graph Learning

  • paper_url: http://arxiv.org/abs/2309.15627
  • repo_url: None
  • paper_authors: Pei Zhang, Chutian Wang, Edmund Y. Lam
  • for: 该论文旨在开发一种基于神经元模型的图像摄像头,以便在各种EXTREME的照明条件下捕捉动态场景,并且减少运动模糊和提高细节表示。
  • methods: 该论文使用图像摄像头异步记录像素亮度变化,并生成稀热事件流。然后,使用图形变换器处理这些事件数据,以实现精准的神经元分类。
  • results: 对比传统方法,该论文的方法能够在具有限制的计算资源和事件数量的实际场景中,提供更好的结果,并且在EXTREME照明条件下捕捉动态场景中减少运动模糊和提高细节表示。
    Abstract Bio-inspired neuromorphic cameras asynchronously record pixel brightness changes and generate sparse event streams. They can capture dynamic scenes with little motion blur and more details in extreme illumination conditions. Due to the multidimensional address-event structure, most existing vision algorithms cannot properly handle asynchronous event streams. While several event representations and processing methods have been developed to address such an issue, they are typically driven by a large number of events, leading to substantial overheads in runtime and memory. In this paper, we propose a new graph representation of the event data and couple it with a Graph Transformer to perform accurate neuromorphic classification. Extensive experiments show that our approach leads to better results and excels at the challenging realistic situations where only a small number of events and limited computational resources are available, paving the way for neuromorphic applications embedded into mobile facilities.
    摘要 生物启发 neuromorphic 摄像头异步记录像素亮度变化,生成稀疏事件流。它们可以捕捉动态场景,具有少量运动模糊和更多细节在极端照明条件下。由于多维度地址事件结构,大多数现有视觉算法无法正确处理异步事件流。虽然一些事件表示和处理方法已经开发出来解决这个问题,但它们通常受到大量事件数量的限制,导致运行时间和内存占用增加很多。在这篇论文中,我们提出一种新的图表示法,将事件数据表示为图,并与图变换器结合,实现高准确的 neuromorphic 分类。广泛的实验表明,我们的方法在实际情况下表现更好,能够在只有少量事件和有限的计算资源的情况下取得更好的结果,为 neuromorphic 应用在移动设备中铺平道路。
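Before any graph transformer can be applied, the sparse event stream has to be turned into a graph. A common construction, sketched below under assumed field names and scaling, connects each event to its k nearest spatio-temporal neighbours; the paper's exact graph construction may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def events_to_graph(events, k=8, time_scale=128.0):
    """Build a k-NN graph over sparse events (x, y, t, polarity): nodes are
    events, edges connect spatio-temporal neighbours. The time axis is mapped
    onto a range comparable to the sensor's spatial extent (a free choice);
    the resulting edge_index can feed a graph transformer."""
    coords = np.stack([events["x"], events["y"], events["t"] * time_scale], axis=1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    _, idx = nn.kneighbors(coords)
    src = np.repeat(np.arange(len(coords)), k)
    dst = idx[:, 1:].ravel()                   # drop the self-neighbour in column 0
    node_feat = np.stack([events["x"], events["y"], events["t"], events["p"]], axis=1)
    return node_feat, np.stack([src, dst])     # (N, 4) node features, (2, N*k) edges

rng = np.random.default_rng(0)
events = {"x": rng.integers(0, 128, 500), "y": rng.integers(0, 128, 500),
          "t": np.sort(rng.random(500)), "p": rng.choice([-1, 1], 500)}
feats, edge_index = events_to_graph(events)
print(feats.shape, edge_index.shape)
```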

Leveraging Topology for Domain Adaptive Road Segmentation in Satellite and Aerial Imagery

  • paper_url: http://arxiv.org/abs/2309.15625
  • repo_url: None
  • paper_authors: Javed Iqbal, Aliza Masood, Waqas Sultani, Mohsen Ali
  • for: 本研究旨在提高遥感图像中道路分割的精度和一致性,以满足自动驾驶、城市规划和可持续发展等实际应用。
  • methods: 本研究提出了一种基于Topology的无监督领域适应方法,通过预测道路skeleton来强制道路分割预测和skeleton预测具有同 topological结构的约束。
  • results: 对 SpaceNet 和 DeepGlobe 数据集进行了广泛的实验,并证明了提出的方法在与现有状态的方法进行比较时具有显著的优势,具体的比较结果为:SpaceNet 到 DeepGlobe 的适应性提高6.6%, 6.7%, 9.8%。
    Abstract Getting precise aspects of road through segmentation from remote sensing imagery is useful for many real-world applications such as autonomous vehicles, urban development and planning, and achieving sustainable development goals. Roads are only a small part of the image, and their appearance, type, width, elevation, directions, etc. exhibit large variations across geographical areas. Furthermore, due to differences in urbanization styles, planning, and the natural environments; regions along the roads vary significantly. Due to these variations among the train and test domains, the road segmentation algorithms fail to generalize to new geographical locations. Unlike the generic domain alignment scenarios, road segmentation has no scene structure, and generic domain adaptation methods are unable to enforce topological properties like continuity, connectivity, smoothness, etc., thus resulting in degraded domain alignment. In this work, we propose a topology-aware unsupervised domain adaptation approach for road segmentation in remote sensing imagery. Specifically, we predict road skeleton, an auxiliary task to impose the topological constraints. To enforce consistent predictions of road and skeleton, especially in the unlabeled target domain, the conformity loss is defined across the skeleton prediction head and the road-segmentation head. Furthermore, for self-training, we filter out the noisy pseudo-labels by using a connectivity-based pseudo-labels refinement strategy, on both road and skeleton segmentation heads, thus avoiding holes and discontinuities. Extensive experiments on the benchmark datasets show the effectiveness of the proposed approach compared to existing state-of-the-art methods. Specifically, for SpaceNet to DeepGlobe adaptation, the proposed approach outperforms the competing methods by a minimum margin of 6.6%, 6.7%, and 9.8% in IoU, F1-score, and APLS, respectively.
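The road-skeleton auxiliary task can be pictured with a small consistency check between a predicted road mask and a predicted skeleton map, using morphological skeletonization. The snippet below is a non-differentiable toy analogue of the conformity idea, not the paper's loss; the threshold and the toy maps are assumptions.

```python
import numpy as np
from skimage.morphology import skeletonize

def skeleton_consistency(road_prob, skel_prob, thr=0.5):
    """Toy consistency score between a predicted road probability map and a
    predicted skeleton probability map: the skeleton of the thresholded road
    mask should be covered by the skeleton prediction. The conformity loss
    in the paper is a differentiable analogue of this idea."""
    road_skel = skeletonize(road_prob > thr)
    return skel_prob[road_skel].mean() if road_skel.any() else 0.0

road = np.zeros((64, 64)); road[30:33, :] = 0.95   # a thin horizontal "road"
skel = np.zeros((64, 64)); skel[30:33, :] = 0.9    # a centerline prediction on the same band
print(skeleton_consistency(road, skel))            # high when the two heads agree
```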

NoSENSE: Learned unrolled cardiac MRI reconstruction without explicit sensitivity maps

  • paper_url: http://arxiv.org/abs/2309.15608
  • repo_url: None
  • paper_authors: Felix Frederik Zimmermann, Andreas Kofler
  • for: 这个论文旨在提出一种基于深度卷积神经网络的加速心脏MRI多接收器磁共振图像重建方法,以避免许多现有的学习MR图像重建技术中的磁共振敏感度地图(CSM)估计。
  • methods: 该方法包括一系列新的学习图像和k空间块,以及共振磁场信息的共享和特征 Wisdom (FiLM)块,以及磁共振数据一致(DC)块。
  • results: 该方法在MICCAI STACOM CMRxRecon挑战中的笔轨和映射轨验证领导表中 achieved PSNR值为34.89和35.56,SSIM值为0.920和0.942,在4个不同的队伍中排名第4。
    Abstract We present a novel learned image reconstruction method for accelerated cardiac MRI with multiple receiver coils based on deep convolutional neural networks (CNNs) and algorithm unrolling. In contrast to many existing learned MR image reconstruction techniques that necessitate coil-sensitivity map (CSM) estimation as a distinct network component, our proposed approach avoids explicit CSM estimation. Instead, it implicitly captures and learns to exploit the inter-coil relationships of the images. Our method consists of a series of novel learned image and k-space blocks with shared latent information and adaptation to the acquisition parameters by feature-wise modulation (FiLM), as well as coil-wise data-consistency (DC) blocks. Our method achieved PSNR values of 34.89 and 35.56 and SSIM values of 0.920 and 0.942 in the cine track and mapping track validation leaderboard of the MICCAI STACOM CMRxRecon Challenge, respectively, ranking 4th among different teams at the time of writing. Code will be made available at https://github.com/fzimmermann89/CMRxRecon
    摘要 我们提出了一种新的学习Image重建方法,用于加速心脏MRI,基于深度卷积神经网络(CNNs)和算法膨胀。与许多现有的学习MR Image重建技术不同,我们的提议方法不需要显式的磁共振敏感度地图(CSM)估计。而是通过隐式地捕捉和利用图像间的相互关系,来避免直接估计CSM。我们的方法包括一系列新的学习图像和k空间块,共享缓存信息和适应到获取参数的特征 wise modulation(FiLM),以及磁共振数据一致(DC)块。我们的方法在MICCAI STACOM CMRxRecon Challenge的碰撞轨迹和映射轨迹验证领导борда上 achieved PSNR值为34.89和35.56,SSIM值为0.920和0.942,在不同团队中排名第四。代码将在https://github.com/fzimmermann89/CMRxRecon上提供。
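A central building block of unrolled reconstruction networks is the data-consistency (DC) step in k-space. The sketch below shows the single-coil, Cartesian special case (hard replacement of predicted k-space samples by the measured ones); the paper's DC blocks operate on multi-coil data without explicit sensitivity maps, so this is a simplified stand-in.

```python
import torch

def data_consistency(x, y, mask):
    """Hard data consistency for a single-coil, Cartesian-sampled image:
    replace predicted k-space values at sampled locations with the measured
    ones. x: complex image (H, W), y: measured k-space (H, W), mask: bool
    (H, W) of sampled locations."""
    k = torch.fft.fft2(x, norm="ortho")
    k = torch.where(mask, y, k)
    return torch.fft.ifft2(k, norm="ortho")

H, W = 64, 64
gt = torch.randn(H, W, dtype=torch.complex64)
mask = torch.rand(H, W) < 0.3                      # 30% random sampling
y = torch.fft.fft2(gt, norm="ortho") * mask        # measured k-space
x0 = torch.zeros(H, W, dtype=torch.complex64)      # placeholder network output
x1 = data_consistency(x0, y, mask)
print(torch.allclose(torch.fft.fft2(x1, norm="ortho")[mask], y[mask]))
```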

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2309.15596
  • repo_url: None
  • paper_authors: Shizhe Chen, Ricardo Garcia, Cordelia Schmid, Ivan Laptev
  • for: 本研究旨在提高机器人对自然语言指令的理解和执行 manipulation 任务能力。
  • methods: 提议一种基于 3D 点云的策略PolarNet,通过特制的点云输入、高效点云编码器和多模态 transformer 来学习 3D 点云表示并与语言指令集成 для行为预测。
  • results: PolarNet 在RLBench benchmark 上展现出了高效和数据效果,与当前状态的 2D 和 3D 方法相比,在单任务和多任务学习中均有出色的表现。在真实的机器人上也取得了可喜的结果。
    Abstract The ability for robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties in combining multi-view cameras and inferring precise 3D positions and relationships. To address these limitations, we propose a 3D point cloud based policy called PolarNet for language-guided manipulation. It leverages carefully designed point cloud inputs, efficient point cloud encoders, and multimodal transformers to learn 3D point cloud representations and integrate them with language instructions for action prediction. PolarNet is shown to be effective and data efficient in a variety of experiments conducted on the RLBench benchmark. It outperforms state-of-the-art 2D and 3D approaches in both single-task and multi-task learning. It also achieves promising results on a real robot.
    摘要 “机器人理解和执行基于自然语言指令的 manipulate 任务是机器人学的长期目标。现有主流方法 для语言导向 manipulate 使用 2D 图像表示,它们面临着组合多视图摄像头和推断精确的 3D 位置和关系的困难。为解决这些限制,我们提出了一种基于 3D 点云的策略called PolarNet,用于语言导向 manipulate。它利用了特制的点云输入、高效的点云编码器和多模态转换器来学习 3D 点云表示并与语言指令集成以进行动作预测。PolarNet 在RLBench benchmark 上进行了多种实验,并在单任务和多任务学习中超越了当前状态的 2D 和 3D 方法。它还在真实的机器人上实现了可靠的结果。”

Domain generalization across tumor types, laboratories, and species – insights from the 2022 edition of the Mitosis Domain Generalization Challenge

  • paper_url: http://arxiv.org/abs/2309.15589
  • repo_url: None
  • paper_authors: Marc Aubreville, Nikolas Stathonikos, Taryn A. Donovan, Robert Klopfleisch, Jonathan Ganz, Jonas Ammeling, Frauke Wilm, Mitko Veta, Samir Jabari, Markus Eckstein, Jonas Annuscheit, Christian Krumnow, Engin Bozaba, Sercan Cayir, Hongyan Gu, Xiang ‘Anthony’ Chen, Mostafa Jahanifar, Adam Shephard, Satoshi Kondo, Satoshi Kasai, Sujatha Kotte, VG Saipradeep, Maxime W. Lafarge, Viktor H. Koelzer, Ziyue Wang, Yongbing Zhang, Sen Yang, Xiyue Wang, Katharina Breininger, Christof A. Bertram
  • for: The paper addresses the recognition of mitotic figures in histologic tumor specimens, which is crucial for patient outcome assessment.
  • methods: It describes the 2022 challenge on Mitosis Domain Generalization (MIDOG 2022), which provided annotated histologic tumor images from six different domains and evaluated the mitotic figure detection approaches of nine participants on ten independent domains.
  • results: The top-performing team achieved an $F_1$ score of 0.764, showing that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines; against the immunohistochemistry-assisted reference standard, all methods showed reduced recall, with only minor changes in the participant ranking.
    Abstract Recognition of mitotic figures in histologic tumor specimens is highly relevant to patient outcome assessment. This task is challenging for algorithms and human experts alike, with deterioration of algorithmic performance under shifts in image representations. Considerable covariate shifts occur when assessment is performed on different tumor types, images are acquired using different digitization devices, or specimens are produced in different laboratories. This observation motivated the inception of the 2022 challenge on MItosis Domain Generalization (MIDOG 2022). The challenge provided annotated histologic tumor images from six different domains and evaluated the algorithmic approaches for mitotic figure detection provided by nine challenge participants on ten independent domains. Ground truth for mitotic figure detection was established in two ways: a three-expert consensus and an independent, immunohistochemistry-assisted set of labels. This work represents an overview of the challenge tasks, the algorithmic strategies employed by the participants, and potential factors contributing to their success. With an $F_1$ score of 0.764 for the top-performing team, we summarize that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines. When assessed against the immunohistochemistry-assisted reference standard, all methods resulted in reduced recall scores, but with only minor changes in the order of participants in the ranking.
    摘要 识别组织肿瘤样本中的有丝分裂象对患者预后评估非常重要。这项任务对算法和人类专家来说都是挑战,当图像表示发生偏移时,算法性能会下降。在不同的肿瘤类型、不同的数字化设备获取的图像以及不同实验室制备的样本之间,会出现相当大的 covariate shift。这一观察促成了 2022 年的 Mitosis Domain Generalization(MIDOG 2022)挑战赛。挑战赛提供了来自六个不同领域的标注组织肿瘤图像,并在十个独立领域上评估了九支参赛队伍的有丝分裂象检测算法。Ground truth 的确定方法包括三位专家共识以及独立的免疫组化辅助标注。本文概述了挑战任务、参赛者所采用的算法策略以及促成其成功的潜在因素。表现最佳的团队取得了 0.764 的 $F_1$ 分数,我们由此总结:借助当今基于深度学习的识别管线,跨多种肿瘤领域的泛化是可行的。在以免疫组化辅助标注为参考标准进行评估时,所有方法的召回率均有所下降,但参赛者排名仅有小幅变化。

LivDet2023 – Fingerprint Liveness Detection Competition: Advancing Generalization

  • paper_url: http://arxiv.org/abs/2309.15578
  • repo_url: None
  • paper_authors: Marco Micheletto, Roberto Casula, Giulia Orrù, Simone Carta, Sara Concas, Simone Maurizio La Cava, Julian Fierrez, Gian Luca Marcialis
  • for: 本研究旨在评估指纹识别系统中的生物特征检测技术,以提高系统的安全性和可靠性。
  • methods: 本研究使用了LivDet2023比赛提供的指纹识别数据集,并采用了多种生物特征检测技术,如指纹图像分析和Machine Learning算法,来检测指纹是否为真实的。
  • results: 研究发现,使用生物特征检测技术可以准确地检测指纹是否为真实的,并且可以在不同的环境和 Condition下提供高度的检测精度。
    Abstract The International Fingerprint Liveness Detection Competition (LivDet) is a biennial event that invites academic and industry participants to prove their advancements in Fingerprint Presentation Attack Detection (PAD). This edition, LivDet2023, proposed two challenges, Liveness Detection in Action and Fingerprint Representation, to evaluate the efficacy of PAD embedded in verification systems and the effectiveness and compactness of feature sets. A third, hidden challenge is the inclusion of two subsets in the training set whose sensor information is unknown, testing participants ability to generalize their models. Only bona fide fingerprint samples were provided to participants, and the competition reports and assesses the performance of their algorithms suffering from this limitation in data availability.
    摘要 国际生物指纹生活检测竞赛(LivDet)是一项每两年一度的活动,邀请学术和业界参与者展示他们在指纹攻击检测(PAD)领域的进步。本届LivDet2023中提出了两个挑战,生活检测在动作中和指纹表示,以评估参与者提供的验证系统中的PAD效果和特征集的效率和 компакт性。而隐藏的第三个挑战是在训练集中包含两个子集的感知信息不明确,测试参与者的模型是否能够泛化。只有真实的指纹样本被提供给参与者,竞赛报告和评估参与者的算法受到这种数据可用性的限制。

Learning Spatial-Temporal Regularized Tensor Sparse RPCA for Background Subtraction

  • paper_url: http://arxiv.org/abs/2309.15576
  • repo_url: None
  • paper_authors: Basit Alawode, Sajid Javed
  • for: 这个论文的目的是提出一种基于tensor robust principal component analysis的准确背景 subtractor,用于解决视觉Computer Vision中的背景 subtractor问题。
  • methods: 该论文使用的方法包括:Robust principal component analysis(RPCA)、tensor RPCA、spatial-temporal regularized tensor sparse RPCA、batch和online-based optimization方法。
  • results: 该论文的实验结果表明,提出的方法在六个公共的背景 subtractor数据集上显示出比较出色的性能,与一些现有的方法相比。
    Abstract Video background subtraction is one of the fundamental problems in computer vision that aims to segment all moving objects. Robust principal component analysis has been identified as a promising unsupervised paradigm for background subtraction tasks in the last decade thanks to its competitive performance in a number of benchmark datasets. Tensor robust principal component analysis variations have improved background subtraction performance further. However, because moving object pixels in the sparse component are treated independently and do not have to adhere to spatial-temporal structured-sparsity constraints, performance is reduced for sequences with dynamic backgrounds, camouflaged, and camera jitter problems. In this work, we present a spatial-temporal regularized tensor sparse RPCA algorithm for precise background subtraction. Within the sparse component, we impose spatial-temporal regularizations in the form of normalized graph-Laplacian matrices. To do this, we build two graphs, one across the input tensor spatial locations and the other across its frontal slices in the time domain. While maximizing the objective function, we compel the tensor sparse component to serve as the spatiotemporal eigenvectors of the graph-Laplacian matrices. The disconnected moving object pixels in the sparse component are preserved by the proposed graph-based regularizations since they both comprise of spatiotemporal subspace-based structure. Additionally, we propose a unique objective function that employs batch and online-based optimization methods to jointly maximize the background-foreground and spatial-temporal regularization components. Experiments are performed on six publicly available background subtraction datasets that demonstrate the superior performance of the proposed algorithm compared to several existing methods. Our source code will be available very soon.
    摘要 Background subtraction是计算机视觉中的基本问题之一,旨在 segmenting all moving objects。在过去的一个 décennial中, Robust Principal Component Analysis(RPCA)被认为是一种有竞争力的无监督方法 для背景 subtractio Task。然而,由于在 sparse component中的运动 объек pixel不受 spatial-temporal 结构约束,因此在静背景、潜藏和摄像机震动问题时,性能会降低。在这项工作中,我们提出了一种带有spatial-temporal regularization的tensor sparse RPCA算法,用于高精度的背景 subtractio。在 sparse component中,我们对图 Laplacian矩阵进行正规化,以便在图 Laplacian矩阵的eigenvectors中强制tensor sparse component服为spatiotemporal eigenvectors。此外,我们还提出了一种新的目标函数,该目标函数通过批量和在线优化方法来同时最大化背景-前景和spatial-temporal regularization组件。在六个公共available background subtractio dataset上进行了实验,并证明了我们提出的算法与现有方法相比具有更高的性能。我们的源代码即将公开。
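The spatial graph-Laplacian regularizer can be pictured as follows: build a k-NN graph over pixel locations and form its normalized Laplacian L, so that a trace term like tr(SᵀLS) penalizes sparse components that are not smooth over the graph. The construction below is generic (scikit-learn/SciPy); the paper builds one such graph over spatial locations and another over the frontal slices of the tensor.

```python
import numpy as np
from scipy.sparse import csgraph
from sklearn.neighbors import kneighbors_graph

def normalized_laplacian(coords, k=8):
    """Symmetric k-NN adjacency over spatial coordinates and its normalized
    graph Laplacian L = I - D^{-1/2} W D^{-1/2} (a generic construction)."""
    W = kneighbors_graph(coords, k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                        # symmetrize the adjacency
    return csgraph.laplacian(W, normed=True)

# 16x16 grid of pixel locations as a stand-in for the spatial graph.
ys, xs = np.mgrid[0:16, 0:16]
coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
L = normalized_laplacian(coords)
print(L.shape)   # (256, 256); tr(S^T L S) then penalizes non-smooth sparse components
```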

Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.15575
  • repo_url: https://github.com/bostoncake/c-visdit
  • paper_authors: Yizhe Xiong, Hui Chen, Zijia Lin, Sicheng Zhao, Guiguang Ding
  • for: 本研究 targets at few-shot unsupervised domain adaptation (FUDA) problem, where only a few labeled source samples are available, and aims to transfer knowledge from the source domain to the target domain without requiring abundant labeled data in the target domain.
  • methods: 本 paper proposes a novel Confidence-based Visual Dispersal Transfer learning method (C-VisDiT) for FUDA, which consists of a cross-domain visual dispersal strategy and an intra-domain visual dispersal strategy. The cross-domain strategy transfers only high-confidence source knowledge for model adaptation, while the intra-domain strategy guides the learning of hard target samples with easy ones.
  • results: 在Office-31, Office-Home, VisDA-C, 和DomainNet benchmark datasets上,C-VisDiT significantly outperforms state-of-the-art FUDA methods. The proposed method is able to transfer reliable source knowledge to the target domain and improve the classification performance of hard target samples.
    Abstract Unsupervised domain adaptation aims to transfer knowledge from a fully-labeled source domain to an unlabeled target domain. However, in real-world scenarios, providing abundant labeled data even in the source domain can be infeasible due to the difficulty and high expense of annotation. To address this issue, recent works consider the Few-shot Unsupervised Domain Adaptation (FUDA) where only a few source samples are labeled, and conduct knowledge transfer via self-supervised learning methods. Yet existing methods generally overlook that the sparse label setting hinders learning reliable source knowledge for transfer. Additionally, the learning difficulty difference in target samples is different but ignored, leaving hard target samples poorly classified. To tackle both deficiencies, in this paper, we propose a novel Confidence-based Visual Dispersal Transfer learning method (C-VisDiT) for FUDA. Specifically, C-VisDiT consists of a cross-domain visual dispersal strategy that transfers only high-confidence source knowledge for model adaptation and an intra-domain visual dispersal strategy that guides the learning of hard target samples with easy ones. We conduct extensive experiments on Office-31, Office-Home, VisDA-C, and DomainNet benchmark datasets and the results demonstrate that the proposed C-VisDiT significantly outperforms state-of-the-art FUDA methods. Our code is available at https://github.com/Bostoncake/C-VisDiT.
    摘要 无监督领域适应的目标是将完全标注的源领域中的知识迁移到无标注的目标领域。然而,在实际情况下,即使在源领域中,提供充足的标注数据也可能不可行,因为标注困难且成本高昂。为解决这个问题,最近的研究关注少样本无监督领域适应(FUDA),其中只有少量源样本被标注,并通过自监督学习方法进行知识迁移。然而,现有方法通常忽略了稀疏标注设置会阻碍学习可靠的源知识以供迁移,同时也忽略了目标样本之间学习难度的差异,导致困难目标样本分类效果较差。为解决这两个缺陷,本文提出了一种基于置信度的视觉分散迁移学习方法(C-VisDiT)用于 FUDA。具体来说,C-VisDiT 包括一种跨领域视觉分散策略,仅将高置信度的源知识用于模型适应,以及一种域内视觉分散策略,用易样本引导困难目标样本的学习。我们在 Office-31、Office-Home、VisDA-C 和 DomainNet 数据集上进行了广泛的实验,结果表明,所提出的 C-VisDiT 显著优于当前最佳的 FUDA 方法。我们的代码可以在 https://github.com/Bostoncake/C-VisDiT 找到。
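The "transfer only high-confidence knowledge" idea rests on a simple confidence filter over softmax predictions, sketched below; the threshold and the exact selection rule used by C-VisDiT are assumptions here.

```python
import torch

def confident_pseudo_labels(logits, threshold=0.9):
    """Keep only predictions whose softmax confidence exceeds `threshold`.
    Returns (indices, labels) of the retained samples -- a generic stand-in
    for confidence-based sample selection."""
    probs = logits.softmax(dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf >= threshold
    return keep.nonzero(as_tuple=True)[0], labels[keep]

logits = torch.randn(16, 10) * 3          # fake target-domain predictions
idx, pl = confident_pseudo_labels(logits)
print(len(idx), "of 16 samples kept as easy / high-confidence targets")
```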

The Maximum Cover with Rotating Field of View

  • paper_url: http://arxiv.org/abs/2309.15573
  • repo_url: https://github.com/ManojKumarPatnaik/Major-project-list
  • paper_authors: Igor Potapov, Jason Ralph, Theofilos Triommatis
  • for: maximize the visibility and limit the uncertainty in localization problems for a convex polygon $P$ and a static spotlight outside $P$.
  • methods: use a theoretical foundation for the analysis of the maximum cover with a rotating field of view, and express the function of the area $A_{\phi}(\theta)$ as various compositions of a function $A_{\theta}(\phi)$.
  • results: develop an algorithm that approximates the direction of the field of view with precision $\varepsilon$ and complexity $\mathcal{O}(n(\log{n}+(\log{\varepsilon})/\phi))$.
    Abstract Imagine a polygon-shaped platform $P$ and only one static spotlight outside $P$; which direction should the spotlight face to light most of $P$? This problem occurs in maximising the visibility, as well as in limiting the uncertainty in localisation problems. More formally, we define the following maximum cover problem: "Given a convex polygon $P$ and a Field Of View (FOV) with a given centre and inner angle $\phi$; find the direction (an angle of rotation $\theta$) of the FOV such that the intersection between the FOV and $P$ has the maximum area". In this paper, we provide the theoretical foundation for the analysis of the maximum cover with a rotating field of view. The main challenge is that the function of the area $A_{\phi}(\theta)$, with the angle of rotation $\theta$ and the fixed inner angle $\phi$, cannot be approximated directly. We found an alternative way to express it by various compositions of a function $A_{\theta}(\phi)$ (with a restricted inner angle $\phi$ and a fixed direction $\theta$). We show that $A_{\theta}(\phi)$ has an analytical solution in the special case of a two-sector intersection and later provides a constrictive solution for the original problem. Since the optimal solution is a real number, we develop an algorithm that approximates the direction of the field of view, with precision $\varepsilon$, and complexity $\mathcal{O}(n(\log{n}+(\log{\varepsilon})/\phi))$.
    摘要 想象一个 polygon 形式的平台 $P$ 和一个静止的外部灯光 ; 这个灯光应该朝向哪里来照亮 $P$ 最多呢?这个问题在最大可见性和局部化问题中都有出现。我们定义以下最大覆盖问题:“给定一个 convex polygon $P$ 和一个视野(Field Of View,FOV)的中心和内角 $\phi$;找出FOV 的方向(旋转角 $\theta$),使得 FOV 和 $P$ 的交集具有最大面积”。在这篇论文中,我们提供了对最大覆盖问题的理论基础的分析。主要挑战在于函数 $A_{\phi}(\theta)$ 的计算不能直接 aproximated。我们发现了一种代替的方法,通过不同的compositions来表示 $A_{\theta}(\phi)$ 。我们示示了在特殊情况下的两部分交集时,$A_{\theta}(\phi)$ 有分析解,并提供了一个压缩性的解决方案。由于优化解决方案是实数,我们开发了一个精度为 $\varepsilon$ 的搜索算法,复杂度为 $\mathcal{O}(n(\log{n}+(\log{\varepsilon})/\phi))$.
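A brute-force numerical baseline makes the objective $A_{\phi}(\theta)$ concrete: sample rotation angles, intersect the polygon with a wedge-shaped FOV, and keep the best direction. The sketch below uses shapely and a wedge truncated at a large radius as assumptions; the paper's algorithm reaches precision $\varepsilon$ analytically rather than by dense sampling.

```python
import numpy as np
from shapely.geometry import Polygon

def fov_wedge(center, theta, phi, radius=100.0, n=64):
    """Wedge-shaped field of view: apex at `center`, pointing direction
    `theta`, inner angle `phi`, truncated at a large radius."""
    angles = np.linspace(theta - phi / 2, theta + phi / 2, n)
    arc = [(center[0] + radius * np.cos(a), center[1] + radius * np.sin(a)) for a in angles]
    return Polygon([tuple(center)] + arc)

P = Polygon([(2, -1), (5, 0), (5, 3), (2, 4), (1, 1)])   # a convex platform
center, phi = (0.0, 0.0), np.deg2rad(40)                  # spotlight outside P

thetas = np.linspace(0, 2 * np.pi, 720, endpoint=False)
areas = [P.intersection(fov_wedge(center, t, phi)).area for t in thetas]
best = thetas[int(np.argmax(areas))]
print(f"best direction ~ {np.rad2deg(best):.1f} deg, covered area {max(areas):.2f}")
```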

HPL-ViT: A Unified Perception Framework for Heterogeneous Parallel LiDARs in V2V

  • paper_url: http://arxiv.org/abs/2309.15572
  • repo_url: https://github.com/NumtraCG/614ca2d5a2b781088de648b020210923-155728routingdatapipeline230921
  • paper_authors: Yuhang Liu, Boyi Sun, Yuke Li, Yuzheng Hu, Fei-Yue Wang
  • for: 这个论文的目的是发展下一代智能探测器,提出了一个新的框架,并在实验平台DAWN上建立了硬件实现。
  • methods: 这个论文使用了OpenCDA和RLS来建立一个多种探测器数据集OPV2V-HPL,并提出了一个具有域专特性抽象的HPL-ViT架构,用于稳定特征融合。
  • results: 实验结果显示,HPL-ViT在所有设定下均 achievement SOTA表现,并具有优秀的泛化能力。
    Abstract To develop the next generation of intelligent LiDARs, we propose a novel framework of parallel LiDARs and construct a hardware prototype in our experimental platform, DAWN (Digital Artificial World for Natural). It emphasizes the tight integration of physical and digital space in LiDAR systems, with networking being one of its supported core features. In the context of autonomous driving, V2V (Vehicle-to-Vehicle) technology enables efficient information sharing between different agents which significantly promotes the development of LiDAR networks. However, current research operates under an ideal situation where all vehicles are equipped with identical LiDAR, ignoring the diversity of LiDAR categories and operating frequencies. In this paper, we first utilize OpenCDA and RLS (Realistic LiDAR Simulation) to construct a novel heterogeneous LiDAR dataset named OPV2V-HPL. Additionally, we present HPL-ViT, a pioneering architecture designed for robust feature fusion in heterogeneous and dynamic scenarios. It uses a graph-attention Transformer to extract domain-specific features for each agent, coupled with a cross-attention mechanism for the final fusion. Extensive experiments on OPV2V-HPL demonstrate that HPL-ViT achieves SOTA (state-of-the-art) performance in all settings and exhibits outstanding generalization capabilities.
    摘要 要开发下一代智能激光仪,我们提出了一个新的框架──并行激光仪架构(Parallel LiDARs),并在我们的实验平台DAWN(数位人工世界)中实现了实验。这个框架强调物理和数位空间之间的紧密融合,并且支持网络作为核心功能。在自驾车领域,车辆之间的通信技术(V2V)可以实现车辆之间的有效信息交换,这有助于开发激光网络。然而,现有的研究假设所有车辆都采用同一款激光仪,忽略了激光仪的多标准和频率多标准。在这篇论文中,我们首先使用OpenCDA和RLS(现实激光仪 simulator)创建了一个独特的不同激光类型和频率的激光数据集名为OPV2V-HPL。此外,我们还提出了HPL-ViT,一个创新的架构,用于在多标准和动态enario中实现坚固的特征融合。它使用图形注意力Transformer提取特定领域的特征,并与交互式混合机制进行最终融合。实验结果显示,HPL-ViT在所有设定下实现了SOTA性能,并且具有卓越的普遍化能力。

Guided Frequency Loss for Image Restoration

  • paper_url: http://arxiv.org/abs/2309.15563
  • repo_url: None
  • paper_authors: Bilel Benjdira, Anas M. Ali, Anis Koubaa
  • for: 提高图像Restoration的效果
  • methods: 提出了一种名为Guided Frequency Loss(GFL)的损失函数,用于让模型同时学习图像的频谱内容和空间内容
  • results: GFL损失函数在Super Resolution和Denoising任务上实现了PSNR指标的提高,并且在SwinIR和SRGAN模型中提高了训练效果,特别是在受限数据上表现更佳
    Abstract Image Restoration has seen remarkable progress in recent years. Many generative models have been adapted to tackle the known restoration cases of images. However, the interest in benefiting from the frequency domain is not well explored despite its major factor in these particular cases of image synthesis. In this study, we propose the Guided Frequency Loss (GFL), which helps the model to learn in a balanced way the image's frequency content alongside the spatial content. It aggregates three major components that work in parallel to enhance learning efficiency; a Charbonnier component, a Laplacian Pyramid component, and a Gradual Frequency component. We tested GFL on the Super Resolution and the Denoising tasks. We used three different datasets and three different architectures for each of them. We found that the GFL loss improved the PSNR metric in most implemented experiments. Also, it improved the training of the Super Resolution models in both SwinIR and SRGAN. In addition, the utility of the GFL loss increased better on constrained data due to the less stochasticity in the high frequencies' components among samples.
    摘要 Image Restoration 在最近几年内取得了非常显著的进展。许多生成模型已经被应用于图像还原的常见情况中。然而,尽管频率域在这类图像生成任务中扮演着重要角色,利用频率域的潜力并没有得到充分的探索。在本研究中,我们提出了引导频率损失(GFL),它帮助模型以平衡的方式同时学习图像的频率内容和空间内容。GFL 损失包括三个并行工作以提高学习效率的主要组成部分:Charbonnier 组件、Laplacian Pyramid 组件和渐进频率组件。我们在 Super Resolution 和 Denoising 任务上测试了 GFL 损失,使用了三个不同的数据集和三种不同的架构。我们发现,GFL 损失在大多数实验中提高了 PSNR 指标,并且改善了 SwinIR 和 SRGAN 的 Super Resolution 模型训练。此外,由于样本间高频成分的随机性较低,GFL 损失在受限数据上的收益更为明显。
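A minimal sketch of the general idea — a Charbonnier spatial term combined with an FFT-magnitude frequency term — is shown below. It is not the authors' GFL implementation: the Laplacian-pyramid and gradual-frequency components are omitted, and the weights are illustrative only.

```python
import torch
import torch.fft


def charbonnier(pred, target, eps=1e-6):
    # Smooth L1-like penalty in the spatial domain.
    return torch.sqrt((pred - target) ** 2 + eps * eps).mean()


def frequency_l1(pred, target):
    # Compare magnitudes of the 2D Fourier transforms (frequency content).
    pred_fft = torch.fft.rfft2(pred, norm="ortho")
    target_fft = torch.fft.rfft2(target, norm="ortho")
    return (pred_fft.abs() - target_fft.abs()).abs().mean()


def guided_frequency_loss(pred, target, spatial_w=1.0, freq_w=0.1):
    # Balanced spatial + frequency objective (weights are made-up defaults).
    return spatial_w * charbonnier(pred, target) + freq_w * frequency_l1(pred, target)


if __name__ == "__main__":
    sr = torch.rand(2, 3, 64, 64, requires_grad=True)
    hr = torch.rand(2, 3, 64, 64)
    loss = guided_frequency_loss(sr, hr)
    loss.backward()
    print(float(loss))
```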

Learning from SAM: Harnessing a Segmentation Foundation Model for Sim2Real Domain Adaptation through Regularization

  • paper_url: http://arxiv.org/abs/2309.15562
  • repo_url: None
  • paper_authors: Mayara E. Bonani, Max Schwarz, Sven Behnke
  • for: 本研究旨在提高无监督预处理的预测性能,尤其是在机器人应用中,目标领域训练数据罕见而注解成本高昂。
  • methods: 本方法基于Segment Anything模型获取无标注目标域数据上的分割信息,并提出一种不变性-方差(invariance-variance)损失结构,以正则化目标域上的特征表示。
  • results: 在YCB-Video和HomebrewedDB等 datasets上,本方法的表现优于先前的方法,甚至在YCB-Video上超过了使用真注解的网络。
    Abstract Domain adaptation is especially important for robotics applications, where target domain training data is usually scarce and annotations are costly to obtain. We present a method for self-supervised domain adaptation for the scenario where annotated source domain data (e.g. from synthetic generation) is available, but the target domain data is completely unannotated. Our method targets the semantic segmentation task and leverages a segmentation foundation model (Segment Anything Model) to obtain segment information on unannotated data. We take inspiration from recent advances in unsupervised local feature learning and propose an invariance-variance loss structure over the detected segments for regularizing feature representations in the target domain. Crucially, this loss structure and network architecture can handle overlapping segments and oversegmentation as produced by Segment Anything. We demonstrate the advantage of our method on the challenging YCB-Video and HomebrewedDB datasets and show that it outperforms prior work and, on YCB-Video, even a network trained with real annotations.
    摘要 域适应(domain adaptation)对 robotics 应用尤其重要,因为目标域训练数据通常稀缺且标注成本高昂。我们提出了一种自监督域适应方法,适用于已有标注的源域数据(例如合成生成的数据)可用、而目标域数据完全无标注的场景。我们的方法面向语义分割任务,利用分割基础模型(Segment Anything Model)获取无标注数据上的分割信息。我们受最近无监督局部特征学习进展的启发,提出了一种基于检测到的分割区域的不变性-方差(invariance-variance)损失结构,用于规范目标域中的特征表示。关键的是,这种损失结构和网络架构可以处理 Segment Anything 产生的重叠分割和过分割。我们在 YCB-Video 和 HomebrewedDB 数据集上展示了方法的优势,它不仅超过先前工作,在 YCB-Video 上甚至超过了使用真实标注训练的网络。
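The invariance-variance regularization over detected segments can be approximated with VICReg-style terms computed on segment-pooled features, as in the sketch below. The paper's exact loss, its handling of overlapping segments, and all hyper-parameters here are assumptions.

```python
import torch
import torch.nn.functional as F


def segment_invariance_variance_loss(feat, masks, gamma=1.0, eps=1e-4):
    """feat: [C, H, W] dense features; masks: [S, H, W] boolean segment masks
    (e.g. produced by Segment Anything). Returns (invariance, variance) terms."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)                      # [C, HW]
    m = masks.reshape(masks.shape[0], H * W).float()   # [S, HW]
    counts = m.sum(dim=1).clamp(min=1.0)               # pixels per segment
    means = (m @ flat.t()) / counts[:, None]           # [S, C] segment means

    # Invariance: pixels of a segment should stay close to their segment mean.
    inv = 0.0
    for s in range(masks.shape[0]):
        idx = masks[s].reshape(-1)
        if idx.any():
            inv = inv + ((flat[:, idx].t() - means[s]) ** 2).mean()
    inv = inv / masks.shape[0]

    # Variance: keep segment means spread out (hinge on per-dimension std).
    std = torch.sqrt(means.var(dim=0) + eps)
    var = F.relu(gamma - std).mean()
    return inv, var


if __name__ == "__main__":
    feat = torch.randn(16, 32, 32, requires_grad=True)
    masks = torch.zeros(4, 32, 32, dtype=torch.bool)
    for s in range(4):
        masks[s, :, s * 8:(s + 1) * 8] = True  # toy, non-overlapping segments
    inv, var = segment_invariance_variance_loss(feat, masks)
    (inv + var).backward()
```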

Highly Efficient SNNs for High-speed Object Detection

  • paper_url: http://arxiv.org/abs/2309.15883
  • repo_url: None
  • paper_authors: Nemin Qiu, Zhiguo Li, Yuan Li, Chuang Zhu
  • for: 这个论文旨在提出一种高效的神经网络模型,用于快速的物体检测任务。
  • methods: 该论文使用卷积层与批归一化层折叠的量化训练方法建立了一个初始紧凑的ANN模型,并提出了一种尺度感知伪量化(scale-aware pseudoquantization)方案来保证从紧凑ANN到SNN转换的正确性。另外,它还提出了一种连续推理方案,使用 Feed-Forward Integrate-and-Fire(FewdIF)神经元来实现高速的物体检测。
  • results: 实验结果表明,该高效的神经网络模型可以在 GPU 上实现 118 倍的速度提升,只需要 1.5 MB 的参数进行物体检测任务。此外,在 FPGA 平台上,提出的模型可以实现 800+ FPS 的物体检测,并且具有极低的响应时间。
    Abstract The high biological properties and low energy consumption of Spiking Neural Networks (SNNs) have brought much attention in recent years. However, the converted SNNs generally need large time steps to achieve satisfactory performance, which will result in high inference latency and computational resources increase. In this work, we propose a highly efficient and fast SNN for object detection. First, we build an initial compact ANN by using quantization training method of convolution layer fold batch normalization layer and neural network modification. Second, we theoretically analyze how to obtain the low complexity SNN correctly. Then, we propose a scale-aware pseudoquantization scheme to guarantee the correctness of the compact ANN to SNN. Third, we propose a continuous inference scheme by using a Feed-Forward Integrate-and-Fire (FewdIF) neuron to realize high-speed object detection. Experimental results show that our efficient SNN can achieve 118X speedup on GPU with only 1.5MB parameters for object detection tasks. We further verify our SNN on FPGA platform and the proposed model can achieve 800+FPS object detection with extremely low latency.
    摘要 具有高生物相似性和低能耗的脉冲神经网络(SNN)在最近几年内受到了广泛关注。然而,转换得到的SNN通常需要大量的时间步长才能达到满意的性能,这会导致高推理延迟和计算资源增加。针对这一问题,我们提出了一种高效且快速的SNN用于对象检测。首先,我们使用卷积层折叠批归一化层的量化训练方法和网络结构修改,建立了一个初始紧凑的ANN。其次,我们从理论上分析了如何正确地获得低复杂度SNN。然后,我们提出了一种尺度感知伪量化(scale-aware pseudoquantization)方案,以确保紧凑ANN到SNN转换的正确性。第三,我们提出了一种连续推理方案,使用 Feed-Forward Integrate-and-Fire(FewdIF)神经元来实现高速对象检测。实验结果显示,我们的高效SNN在 GPU 上可以实现118倍的速度提升,对象检测任务只需1.5MB 的参数。我们进一步在 FPGA 平台上验证了该SNN,所提模型可以以极低延迟实现800+ FPS 的对象检测。
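The conv–BatchNorm folding step mentioned in the abstract is a standard transformation; a self-contained PyTorch sketch is shown below (the quantization and SNN-conversion steps themselves are not included).

```python
import copy
import torch
import torch.nn as nn


def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a conv whose weights absorb the following BatchNorm layer
    (standard folding applied before quantization-aware training)."""
    fused = copy.deepcopy(conv)
    w = conv.weight.clone()
    b = conv.bias.clone() if conv.bias is not None else torch.zeros(conv.out_channels)

    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-channel
    fused.weight = nn.Parameter(w * scale.reshape(-1, 1, 1, 1))
    fused.bias = nn.Parameter((b - bn.running_mean) * scale + bn.bias)
    return fused


if __name__ == "__main__":
    conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
    bn.eval()  # use running statistics, as at inference time
    x = torch.randn(1, 3, 16, 16)
    ref = bn(conv(x))
    out = fold_bn_into_conv(conv, bn)(x)
    print(torch.allclose(ref, out, atol=1e-5))  # True
```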

Learning Dense Flow Field for Highly-accurate Cross-view Camera Localization

  • paper_url: http://arxiv.org/abs/2309.15556
  • repo_url: None
  • paper_authors: Zhenbo Song, Xianghui Ze, Jianfeng Lu, Yujiao Shi
  • for: 本研究旨在解决基于卫星图像的地面图像三重自由度摄像机pose估算问题。
  • methods: 我们提出了一种新的端到端方法,通过学习地面与卫星图像对之间的稠密逐像素流场来计算摄像机pose。与现有方法不同,我们在像素级别构建特征度量,从而获得全图像监督,以学习不同视角下独特的几何配置和视觉外观。具体来说,我们使用两个不同的卷积网络分别提取地面和卫星特征。然后,我们在固定相机高度假设下将地面特征图投影到鸟瞰视图(BEV),实现初步的几何对齐。为了进一步建立BEV与卫星特征之间的内容关联,我们引入一个残差卷积块来细化投影后的BEV特征。随后,我们使用基于RAFT的流场解码器网络,在细化后的BEV特征图与卫星特征图之间估计稠密光流对应。在获得稠密流场对应后,我们通过最小二乘方法过滤匹配内点并回归地面摄像机pose。
  • results: 我们的方法与现有方法相比,在KITTI、FORD multi-AV、VIGOR和Oxford RobotCar等数据集上具有显著的改善。特别是,我们的方法可以将地面摄像机pose的 median localization error 降低89%、19%、80%和35%。
    Abstract This paper addresses the problem of estimating the 3-DoF camera pose for a ground-level image with respect to a satellite image that encompasses the local surroundings. We propose a novel end-to-end approach that leverages the learning of dense pixel-wise flow fields in pairs of ground and satellite images to calculate the camera pose. Our approach differs from existing methods by constructing the feature metric at the pixel level, enabling full-image supervision for learning distinctive geometric configurations and visual appearances across views. Specifically, our method employs two distinct convolution networks for ground and satellite feature extraction. Then, we project the ground feature map to the bird's eye view (BEV) using a fixed camera height assumption to achieve preliminary geometric alignment. To further establish content association between the BEV and satellite features, we introduce a residual convolution block to refine the projected BEV feature. Optical flow estimation is performed on the refined BEV feature map and the satellite feature map using flow decoder networks based on RAFT. After obtaining dense flow correspondences, we apply the least square method to filter matching inliers and regress the ground camera pose. Extensive experiments demonstrate significant improvements compared to state-of-the-art methods. Notably, our approach reduces the median localization error by 89%, 19%, 80% and 35% on the KITTI, Ford multi-AV, VIGOR and Oxford RobotCar datasets, respectively.
    摘要 这篇论文关注地面图像相对于覆盖其周边环境的卫星图像的3自由度摄像机pose估计问题。我们提出了一种新的端到端方法,通过学习地面与卫星图像对之间的稠密逐像素流场来计算摄像机pose。与现有方法不同,我们在像素级别构建特征度量,以实现全图像监督,学习不同视角下独特的几何配置和视觉外观。具体而言,我们使用两个不同的卷积网络分别提取地面和卫星特征。然后,我们在固定相机高度假设下将地面特征图投影到鸟瞰视图(BEV),实现初步的几何对齐。为进一步建立BEV与卫星特征之间的内容关联,我们引入一个残差卷积块来细化投影后的BEV特征。随后,我们使用基于RAFT的流场解码器在细化后的BEV特征图与卫星特征图之间进行光流估计。在获得稠密流场对应后,我们使用最小二乘法过滤匹配内点并回归地面摄像机pose。大量实验表明,我们的方法相比现有最优方法有显著提升:在KITTI、Ford multi-AV、VIGOR和Oxford RobotCar数据集上,中位定位误差分别降低了89%、19%、80%和35%。
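The last step — filtering inlier correspondences with least squares and regressing the 3-DoF ground pose — can be illustrated with a planar rigid fit over matched points. The numpy sketch below is a generic Kabsch-style solver with residual-based inlier filtering, not the paper's exact formulation.

```python
import numpy as np


def fit_rigid_2d(src, dst):
    """Least-squares 2D rotation + translation mapping src -> dst (Kabsch)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t


def pose_from_correspondences(src, dst, iters=3, thresh=2.0):
    """Alternate least-squares fitting and residual-based inlier filtering,
    returning (x, y, yaw) of the estimated 3-DoF pose plus the inlier mask."""
    inliers = np.ones(len(src), dtype=bool)
    for _ in range(iters):
        R, t = fit_rigid_2d(src[inliers], dst[inliers])
        resid = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = resid < thresh
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return t[0], t[1], yaw, inliers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(-50, 50, size=(200, 2))
    theta = np.deg2rad(5.0)
    R_gt = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    matched = pts @ R_gt.T + np.array([3.0, -1.5]) + rng.normal(0, 0.2, (200, 2))
    matched[:20] += 30.0                       # simulate outlier correspondences
    x, y, yaw, inl = pose_from_correspondences(pts, matched)
    print(round(x, 2), round(y, 2), round(np.rad2deg(yaw), 2), int(inl.sum()))
```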

Low Latency of object detection for spiking neural network

  • paper_url: http://arxiv.org/abs/2309.15555
  • repo_url: None
  • paper_authors: Nemin Qiu, Chuang Zhu
  • for: 本文旨在开发高精度低延迟的神经网络,特别适用于Edge AI应用。
  • methods: 本文使用了系统性的变换方法,从神经网络中提取了精度和速度两个维度的优势,并通过结构性的修改和量化纠正错误来提高准确率和速度。
  • results: 实验结果显示,提议方法在MS COCO、PASCAL VOC等难度较高的数据集上具有更高的准确率和更低的延迟,并且能够展示神经网络处理脉冲信号的优势。
    Abstract Spiking Neural Networks, as a third-generation neural network, are well-suited for edge AI applications due to their binary spike nature. However, when it comes to complex tasks like object detection, SNNs often require a substantial number of time steps to achieve high performance. This limitation significantly hampers the widespread adoption of SNNs in latency-sensitive edge devices. In this paper, our focus is on generating highly accurate and low-latency SNNs specifically for object detection. Firstly, we systematically derive the conversion between SNNs and ANNs and analyze how to improve the consistency between them: improving the spike firing rate and reducing the quantization error. Then we propose a structural replacement, quantization of ANN activation and residual fix to allevicate the disparity. We evaluate our method on challenging dataset MS COCO, PASCAL VOC and our spike dataset. The experimental results show that the proposed method achieves higher accuracy and lower latency compared to previous work Spiking-YOLO. The advantages of SNNs processing of spike signals are also demonstrated.
    摘要 脉冲神经网络(SNN)作为第三代神经网络,凭借其二值脉冲特性非常适合边缘智能应用。然而,当面对物体检测等复杂任务时,SNN往往需要较多的时间步长才能达到高性能,这一限制严重妨碍了SNN在延迟敏感的边缘设备中的普及。在这篇论文中,我们专注于生成高精度、低延迟、面向物体检测的SNN。首先,我们系统地推导了SNN与ANN之间的转换,并分析了如何提高二者的一致性:提高脉冲发放率并减少量化误差。然后,我们提出结构替换、ANN激活量化以及残差修正,以缓解二者之间的差异。我们在MS COCO、PASCAL VOC以及我们自建的脉冲数据集上进行了实验,结果表明,所提方法相比先前工作Spiking-YOLO具有更高的精度和更低的延迟。此外,SNN处理脉冲信号的优势也得到了展示。

Uncertainty Quantification via Neural Posterior Principal Components

  • paper_url: http://arxiv.org/abs/2309.15533
  • repo_url: None
  • paper_authors: Elias Nehme, Omer Yair, Tomer Michaeli
  • for: 这个论文的目的是提出一种能够在单个前向传播中预测 posterior 分布的主成分(PC),以便实现图像修复模型中的不确定性评估。
  • methods: 该方法基于 neural network 的卷积神经网络,可以在单个前向传播中预测 posterior 分布的主成分。可以选择使用预训练的模型,或者从 scratch 开始训练一个模型,以输出预测图像和 posterior 分布的主成分。
  • results: 该方法在多个图像修复问题中表现出色,例如噪声除除、图像缺失填充、超分辨率重建和生物图像转换等。与 posterior 抽样法相比,该方法可以实现更快速的uncertainty量化,并且可以提供更自然的不确定性方向。详细例子可以参考 https://eliasnehme.github.io/NPPC/
    Abstract Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. However, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Examples are available at https://eliasnehme.github.io/NPPC/
    摘要 不确定性量化对于在自动驾驶、生物成像等安全关键领域部署图像修复模型至关重要。迄今为止,不确定性可视化方法主要集中在逐像素估计上。然而,逐像素方差热图通常实用价值有限,因为它无法捕捉像素之间的强相关性。更自然的不确定性度量是后验分布主成分(PC)方向上的方差。理论上,可以对条件生成模型针对输入图像生成的样本应用PCA来计算这些主成分。然而,这需要在测试时生成大量样本,而使用当前最先进的扩散模型,这一过程非常缓慢。在这项工作中,我们提出了一种方法,只需神经网络的单次前向传播,即可预测任意输入图像的后验分布主成分。我们的方法既可以包裹一个以最小化均方误差(MSE)为目标训练的预训练模型,也可以从头训练,同时输出预测图像和后验主成分。我们在多种图像逆问题上展示了该方法,包括去噪、图像补全、超分辨率以及生物图像到图像转换。该方法能够可靠地给出随输入自适应的不确定性方向,在不确定性量化效果上可与后验采样方法相当,同时速度快若干个数量级。示例可见 https://eliasnehme.github.io/NPPC/ 。
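The sampling-based baseline that the abstract contrasts with — running PCA on samples drawn from a conditional generative model — looks roughly like the numpy sketch below. NPPC itself predicts these principal components in a single forward pass, which is not shown here.

```python
import numpy as np


def posterior_principal_components(samples, k=3):
    """samples: [N, H, W] posterior samples for one input (e.g. from a
    diffusion sampler). Returns the posterior mean, the top-k principal
    components reshaped to image size, and their standard deviations."""
    n, h, w = samples.shape
    flat = samples.reshape(n, h * w)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Thin SVD of the centered sample matrix gives the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    stds = s[:k] / np.sqrt(n - 1)            # std of the samples along each PC
    pcs = vt[:k].reshape(k, h, w)
    return mean.reshape(h, w), pcs, stds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(32, 32))
    # Toy "posterior": base image plus variation along two hidden directions.
    dirs = rng.normal(size=(2, 32 * 32))
    coeff = rng.normal(size=(64, 2)) * np.array([3.0, 1.0])
    samples = base.reshape(1, -1) + coeff @ dirs
    mean, pcs, stds = posterior_principal_components(samples.reshape(64, 32, 32), k=2)
    print(pcs.shape, np.round(stds, 2))
```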

Missing-modality Enabled Multi-modal Fusion Architecture for Medical Data

  • paper_url: http://arxiv.org/abs/2309.15529
  • repo_url: None
  • paper_authors: Muyu Wang, Shiyu Fan, Yichen Li, Hui Chen
  • for: 这个研究旨在开发一个可靠的多模式融合架构,以实现医疗资料中缺失的模式不断影响深度学习模型的性能。
  • methods: 本研究使用了一个基于Transformer的多模式融合模组,将双模式融合为一个三模式融合架构。此外,研究者还引入多变量损失函数,以提高模型对缺失模式的Robustness。
  • results: 实验结果显示,提案的多模式融合架构能够有效地融合三种模式,并在缺失模式情况下保持优秀的性能。这个方法可能会扩展到更多模式,以提高临床实用性。
    Abstract Fusing multi-modal data can improve the performance of deep learning models. However, missing modalities are common for medical data due to patients' specificity, which is detrimental to the performance of multi-modal models in applications. Therefore, it is critical to adapt the models to missing modalities. This study aimed to develop an efficient multi-modal fusion architecture for medical data that was robust to missing modalities and further improved the performance on disease diagnosis.X-ray chest radiographs for the image modality, radiology reports for the text modality, and structured value data for the tabular data modality were fused in this study. Each modality pair was fused with a Transformer-based bi-modal fusion module, and the three bi-modal fusion modules were then combined into a tri-modal fusion framework. Additionally, multivariate loss functions were introduced into the training process to improve model's robustness to missing modalities in the inference process. Finally, we designed comparison and ablation experiments for validating the effectiveness of the fusion, the robustness to missing modalities and the enhancements from each key component. Experiments were conducted on MIMIC-IV, MIMIC-CXR with the 14-label disease diagnosis task. Areas under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC) were used to evaluate models' performance. The experimental results demonstrated that our proposed multi-modal fusion architecture effectively fused three modalities and showed strong robustness to missing modalities. This method is hopeful to be scaled to more modalities to enhance the clinical practicality of the model.
    摘要 融合多Modal数据可以提高深度学习模型的性能。然而,医疗数据中缺失Modalities是常见的,这会导致多Modal模型在应用中表现不佳。因此,适应缺失Modalities是非常重要的。这项研究旨在开发一种可靠的多Modal融合架构,可以在医疗数据中融合多种Modalities,并且在缺失Modalities时保持模型的性能。本研究使用的Modalities包括X射成像(image modality)、 radiology report(text modality)和结构化数据(tabular data modality)。每个Modal pair使用Transformer基于的bi-Modal融合模块进行融合,并将三个bi-Modal融合模块组合成一个tri-Modal融合架构。此外,我们还引入了多个变量损失函数来改善模型在推理过程中对缺失Modalities的Robustness。最后,我们设计了比较和减少实验来验证融合的有效性、Robustness和每个关键组件的改进。实验使用MIMIC-IV和MIMIC-CXR datasets,并使用14个疾病诊断任务来评估模型的性能。实验结果表明,我们提posed的多Modal融合架构可以有效地融合三种Modalities,并且在缺失Modalities时保持模型的性能。这种方法可以在更多Modalities上进行扩展,以提高临床实用性。
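The abstract attributes robustness to missing modalities to the tri-modal fusion design and the multivariate loss functions. A commonly used complementary trick — assumed here for illustration rather than taken from the paper — is to randomly zero out whole modalities during training:

```python
import torch


def drop_modalities(batch, p_drop=0.3, always_keep="image"):
    """Randomly zero out whole modalities during training so a fusion model
    learns to cope with missing inputs at inference time.
    batch: dict mapping modality name -> feature tensor."""
    out = {}
    for name, x in batch.items():
        drop = (name != always_keep) and (torch.rand(()).item() < p_drop)
        # Zeroing is a simple stand-in; a learned "missing" embedding or an
        # explicit mask passed to the fusion module are common alternatives.
        out[name] = torch.zeros_like(x) if drop else x
    return out


if __name__ == "__main__":
    batch = {
        "image": torch.randn(4, 256),    # pooled chest X-ray features
        "text": torch.randn(4, 256),     # radiology report embedding
        "tabular": torch.randn(4, 32),   # structured values
    }
    kept = drop_modalities(batch)
    print({k: bool(v.abs().sum() > 0) for k, v in kept.items()})
```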

P2I-NET: Mapping Camera Pose to Image via Adversarial Learning for New View Synthesis in Real Indoor Environments

  • paper_url: http://arxiv.org/abs/2309.15526
  • repo_url: None
  • paper_authors: Xujie Kang, Kanglin Liu, Jiang Duan, Yuanhao Gong, Guoping Qiu
  • For: 给定室内环境中一个新的6DoF摄像头位姿,基于一组参考RGBD视图预测该位姿下的视图,并使用条件生成对抗网络(P2I-NET)直接从位姿预测新视图。
  • Methods: 提出了两个辅助判别器约束,分别在潜在特征空间和真实世界位姿空间中,强制生成图像的位姿与对应真实图像的位姿保持一致;此外,还引入一个深度卷积神经网络(CNN)在像素空间进一步强化这一一致性。
  • Results: 在真实室内数据集上进行了大量新视角合成实验,结果显示P2I-NET优于多个基于NeRF的强基线模型,在合成质量相近的情况下速度快40-100倍;此外,还公开了一个新的室内环境数据集,包含22段高分辨率RGBD视频,每一帧均带有准确的摄像头位姿参数。
    Abstract Given a new $6DoF$ camera pose in an indoor environment, we study the challenging problem of predicting the view from that pose based on a set of reference RGBD views. Existing explicit or implicit 3D geometry construction methods are computationally expensive while those based on learning have predominantly focused on isolated views of object categories with regular geometric structure. Differing from the traditional \textit{render-inpaint} approach to new view synthesis in the real indoor environment, we propose a conditional generative adversarial neural network (P2I-NET) to directly predict the new view from the given pose. P2I-NET learns the conditional distribution of the images of the environment for establishing the correspondence between the camera pose and its view of the environment, and achieves this through a number of innovative designs in its architecture and training lost function. Two auxiliary discriminator constraints are introduced for enforcing the consistency between the pose of the generated image and that of the corresponding real world image in both the latent feature space and the real world pose space. Additionally a deep convolutional neural network (CNN) is introduced to further reinforce this consistency in the pixel space. We have performed extensive new view synthesis experiments on real indoor datasets. Results show that P2I-NET has superior performance against a number of NeRF based strong baseline models. In particular, we show that P2I-NET is 40 to 100 times faster than these competitor techniques while synthesising similar quality images. Furthermore, we contribute a new publicly available indoor environment dataset containing 22 high resolution RGBD videos where each frame also has accurate camera pose parameters.
    摘要

Improving Facade Parsing with Vision Transformers and Line Integration

  • paper_url: http://arxiv.org/abs/2309.15523
  • repo_url: https://github.com/wbw520/rtfp
  • paper_authors: Bowen Wang, Jiaxing Zhang, Ran Zhang, Yunqin Li, Liangzhi Li, Yuta Nakashima
  • For: The paper develops a new dataset (Comprehensive Facade Parsing) and a novel pipeline (Revision-based Transformer Facade Parsing) for real-world facade parsing tasks, aiming to improve computational efficiency and accuracy.
  • Methods: The paper introduces Vision Transformers (ViT) into facade parsing and proposes a new revision algorithm (Line Acquisition, Filtering, and Revision) to improve segmentation results.
  • Results: The proposed method outperforms existing methods on three datasets (ECP 2011, RueMonge 2014, and CFP).
    Abstract Facade parsing stands as a pivotal computer vision task with far-reaching applications in areas like architecture, urban planning, and energy efficiency. Despite the recent success of deep learning-based methods in yielding impressive results on certain open-source datasets, their viability for real-world applications remains uncertain. Real-world scenarios are considerably more intricate, demanding greater computational efficiency. Existing datasets often fall short in representing these settings, and previous methods frequently rely on extra models to enhance accuracy, which requires much computation cost. In this paper, we introduce Comprehensive Facade Parsing (CFP), a dataset meticulously designed to encompass the intricacies of real-world facade parsing tasks. Comprising a total of 602 high-resolution street-view images, this dataset captures a diverse array of challenging scenarios, including sloping angles and densely clustered buildings, with painstakingly curated annotations for each image. We introduce a new pipeline known as Revision-based Transformer Facade Parsing (RTFP). This marks the pioneering utilization of Vision Transformers (ViT) in facade parsing, and our experimental results definitively substantiate its merit. We also design Line Acquisition, Filtering, and Revision (LAFR), an efficient yet accurate revision algorithm that can improve the segment result solely from simple line detection using prior knowledge of the facade. In ECP 2011, RueMonge 2014, and our CFP, we evaluate the superiority of our method.
    摘要 外墙解析作为计算机视觉任务,在建筑、城市规划和能效环境等领域具有广泛的应用前景。尽管最近的深度学习方法在某些开源数据集上实现了卓越的结果,但其在实际应用中的可靠性仍存在uncertainty。实际场景相对较复杂,需要更高的计算效率。现有的数据集frequently不能 полностью反映这些场景,而前一些方法通常需要额外的模型来提高准确性,这需要大量的计算成本。在这篇论文中,我们介绍了全面的外墙解析(CFP)数据集,这个数据集仔细地设计,以涵盖实际场景中的复杂性。总共包含602个高分辨率街景图像,这个数据集包括倾斜角和密集建筑等挑战性场景,并且对每个图像进行了精心的标注。我们提出了一个新的管道,称为修订基于转换器的外墙解析(RTFP)。这是首次在外墙解析中使用视transformer(ViT),我们的实验结果证明了它的优势。我们还设计了线性获取、筛选和修订(LAFR)算法,这是一种高效又准确的修订算法,可以通过对外墙的基本线段进行优化来提高 segment结果。在ECP 2011、RueMonge 2014和我们的CFP中,我们评估了我们的方法的优越性。

MLOps for Scarce Image Data: A Use Case in Microscopic Image Analysis

  • paper_url: http://arxiv.org/abs/2309.15521
  • repo_url: None
  • paper_authors: Angelo Yamachui Sitcheu, Nils Friederich, Simon Baeuerle, Oliver Neumann, Markus Reischl, Ralf Mikut
  • for: This paper aims to enhance biomedical image analysis using a holistic approach to Machine Learning Operations (MLOps) in the context of scarce data.
  • methods: The proposed method includes a fingerprinting process to select the best models, datasets, and development strategy for image analysis tasks, as well as automated model development and continuous deployment and monitoring to ensure continuous learning.
  • results: The paper presents preliminary results of a proof of concept for fingerprinting in microscopic image datasets.
    Abstract Nowadays, Machine Learning (ML) is experiencing tremendous popularity that has never been seen before. The operationalization of ML models is governed by a set of concepts and methods referred to as Machine Learning Operations (MLOps). Nevertheless, researchers, as well as professionals, often focus more on the automation aspect and neglect the continuous deployment and monitoring aspects of MLOps. As a result, there is a lack of continuous learning through the flow of feedback from production to development, causing unexpected model deterioration over time due to concept drifts, particularly when dealing with scarce data. This work explores the complete application of MLOps in the context of scarce data analysis. The paper proposes a new holistic approach to enhance biomedical image analysis. Our method includes: a fingerprinting process that enables selecting the best models, datasets, and model development strategy relative to the image analysis task at hand; an automated model development stage; and a continuous deployment and monitoring process to ensure continuous learning. For preliminary results, we perform a proof of concept for fingerprinting in microscopic image datasets.
    摘要

SAF-Net: Self-Attention Fusion Network for Myocardial Infarction Detection using Multi-View Echocardiography

  • paper_url: http://arxiv.org/abs/2309.15520
  • repo_url: None
  • paper_authors: Ilke Adalioglu, Mete Ahishali, Aysen Degerli, Serkan Kiranyaz, Moncef Gabbouj
  • for: This paper proposes a novel view-fusion model named SAF-Net to detect myocardial infarction (MI) from multi-view echocardiography recordings.
  • methods: The proposed framework utilizes apical 2-chamber (A2C) and apical 4-chamber (A4C) view echocardiography recordings, and extracts highly representative features using pre-trained deep networks. The SAF-Net model uses a self-attention mechanism to learn dependencies in the extracted feature vectors.
  • results: The proposed SAF-Net model achieves 88.26% precision, 77.64% sensitivity, and 78.13% accuracy, the most accurate MI detection over multi-view echocardiography recordings.
    Abstract Myocardial infarction (MI) is a severe case of coronary artery disease (CAD) and ultimately, its detection is substantial to prevent progressive damage to the myocardium. In this study, we propose a novel view-fusion model named self-attention fusion network (SAF-Net) to detect MI from multi-view echocardiography recordings. The proposed framework utilizes apical 2-chamber (A2C) and apical 4-chamber (A4C) view echocardiography recordings for classification. Three reference frames are extracted from each recording of both views and deployed pre-trained deep networks to extract highly representative features. The SAF-Net model utilizes a self-attention mechanism to learn dependencies in extracted feature vectors. The proposed model is computationally efficient thanks to its compact architecture having three main parts: a feature embedding to reduce dimensionality, self-attention for view-pooling, and dense layers for the classification. Experimental evaluation is performed using the HMC-QU-TAU dataset which consists of 160 patients with A2C and A4C view echocardiography recordings. The proposed SAF-Net model achieves a high-performance level with 88.26% precision, 77.64% sensitivity, and 78.13% accuracy. The results demonstrate that the SAF-Net model achieves the most accurate MI detection over multi-view echocardiography recordings.
    摘要 心肌梗死(MI)是冠状动脉疾病(CAD)的严重病症,及时检测MI对防止心肌进一步受损至关重要。在这项研究中,我们提出了一种新的视图融合模型,名为自注意力融合网络(SAF-Net),用于从多视图超声心动图记录中检测MI。我们的框架使用心尖二腔(A2C)和心尖四腔(A4C)视图的超声心动图记录,从每个视图的记录中提取三个参考帧,并使用预训练的深度网络提取高度有代表性的特征。SAF-Net模型使用自注意力机制来学习提取特征向量之间的依赖关系。该模型架构紧凑,包括用于降维的特征嵌入、用于视图汇聚的自注意力模块以及用于分类的全连接层,因而计算高效。我们在HMC-QU-TAU数据集上进行了实验评估,该数据集包含160名患者的A2C和A4C视图超声心动图记录。SAF-Net模型取得了88.26%的精确率、77.64%的敏感度和78.13%的准确率,表明其在多视图超声心动图记录上实现了最准确的MI检测。

Defending Against Physical Adversarial Patch Attacks on Infrared Human Detection

  • paper_url: http://arxiv.org/abs/2309.15519
  • repo_url: None
  • paper_authors: Lukas Strack, Futa Waseda, Huy H. Nguyen, Yinqiang Zheng, Isao Echizen
  • for: 本研究旨在提高红外检测系统的安全性,对Physically-Realizable Adversarial Patches(PRAP)的攻击进行防御。
  • methods: 我们提出了一种简单的防御策略——基于贴片的遮挡感知检测(POD),通过随机添加贴片来增强训练样本,并在检测行人的同时检测贴片。
  • results: POD不仅可以准确地检测行人,还能定位对抗贴片的位置,并在训练时未见过的多种形状和大小的贴片攻击下保持高度的鲁棒性。
    Abstract Infrared detection is an emerging technique for safety-critical tasks owing to its remarkable anti-interference capability. However, recent studies have revealed that it is vulnerable to physically-realizable adversarial patches, posing risks in its real-world applications. To address this problem, we are the first to investigate defense strategies against adversarial patch attacks on infrared detection, especially human detection. We have devised a straightforward defense strategy, patch-based occlusion-aware detection (POD), which efficiently augments training samples with random patches and subsequently detects them. POD not only robustly detects people but also identifies adversarial patch locations. Surprisingly, while being extremely computationally efficient, POD easily generalizes to state-of-the-art adversarial patch attacks that are unseen during training. Furthermore, POD improves detection precision even in a clean (i.e., no-patch) situation due to the data augmentation effect. Evaluation demonstrated that POD is robust to adversarial patches of various shapes and sizes. The effectiveness of our baseline approach is shown to be a viable defense mechanism for real-world infrared human detection systems, paving the way for exploring future research directions.
    摘要 红外检测是一种出现的技术,具有很好的防障特性,因此在安全关键任务中得到广泛应用。然而,最近的研究发现,红外检测系统容易受到physically realizable adversarial patches的威胁,这会影响其在实际应用中的安全性。为了解决这个问题,我们是第一个调查红外检测系统中 adversarial patch 攻击的防御策略,特别是人体检测。我们提出了一种简单的防御策略,即 patch-based occlusion-aware detection(POD),它可以增加训练样本中的随机贴图,并在后续检测它们。POD不仅可以准确检测人体,还可以识别隐藏在贴图中的敌意贴图位置。意外地,POD的计算效率非常低,同时它可以在未见过训练时的攻击中保持高效。此外,POD在清洁(即无贴图)情况下也可以提高检测精度,这是因为数据增强效果。我们的基线方法在不同形状和大小的敌意贴图攻击中都能够保持高效。这些结果表明,POD是一种可靠的防御策略,可以保护实际中的红外人体检测系统,开辟了未来研究的新途径。
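POD, as described, augments training frames with random patches and trains the detector to localize them alongside people. Below is a hedged numpy sketch of such an augmentation step; the single-channel IR frames, noise patches, and class ids are assumptions for illustration.

```python
import numpy as np


def add_random_patches(image, boxes, labels, n_patches=2, size_range=(20, 60),
                       patch_label=1, rng=None):
    """Paste random-noise patches into a single-channel infrared frame and
    append their bounding boxes, so a detector can be trained to localize
    patches as an extra class (person = 0, patch = 1 by convention here)."""
    rng = rng or np.random.default_rng()
    img = image.copy()
    h, w = img.shape[:2]
    boxes, labels = list(boxes), list(labels)
    for _ in range(n_patches):
        ph = int(rng.integers(*size_range))
        pw = int(rng.integers(*size_range))
        y0 = int(rng.integers(0, max(1, h - ph)))
        x0 = int(rng.integers(0, max(1, w - pw)))
        # Random intensity pattern stands in for an unknown adversarial patch.
        img[y0:y0 + ph, x0:x0 + pw] = rng.integers(0, 256, (ph, pw), dtype=img.dtype)
        boxes.append([x0, y0, x0 + pw, y0 + ph])
        labels.append(patch_label)
    return img, np.array(boxes), np.array(labels)


if __name__ == "__main__":
    frame = np.zeros((240, 320), dtype=np.uint8)          # dummy IR frame
    aug, bxs, lbs = add_random_patches(frame, [[100, 50, 160, 200]], [0])
    print(bxs.shape, lbs)
```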

DreamCom: Finetuning Text-guided Inpainting Model for Image Composition

  • paper_url: http://arxiv.org/abs/2309.15508
  • repo_url: None
  • paper_authors: Lingxiao Lu, Bo Zhang, Li Niu
  • for: 合成具有真实感的图像,即将前景对象渗透到背景图像中。
  • methods: 受 DreamBooth 启发,利用少量主体参考图像微调文本引导的修补(inpainting)扩散模型,将该主体与一个特殊 token 关联,并在指定的包围框内补绘该主体;同时构建了一个为该任务量身定制的新数据集 MureCom。
  • results: 现有在大量前景-背景对上预训练、测试时直接应用的生成式合成方法往往丢失前景细节并产生明显的伪影,而 DreamCom 通过针对主体的微调缓解了这一问题。
    Abstract The goal of image composition is merging a foreground object into a background image to obtain a realistic composite image. Recently, generative composition methods are built on large pretrained diffusion models, due to their unprecedented image generation ability. They train a model on abundant pairs of foregrounds and backgrounds, so that it can be directly applied to a new pair of foreground and background at test time. However, the generated results often lose the foreground details and exhibit noticeable artifacts. In this work, we propose an embarrassingly simple approach named DreamCom inspired by DreamBooth. Specifically, given a few reference images for a subject, we finetune text-guided inpainting diffusion model to associate this subject with a special token and inpaint this subject in the specified bounding box. We also construct a new dataset named MureCom well-tailored for this task.
    摘要 “目的是将前景物体合并到背景图像中,以获得实际的合成图像。现在,生成作业方法基于大量预训数据模型,因为它们可以实现前无之纪录的图像生成能力。它们在丰富的前景和背景组合中训练模型,以便在测试时直接应用到新的前景和背景。然而,生成结果经常失去前景细节,并表现出明显的错误。在这个工作中,我们提出了一个轻松简单的方法名为DreamCom,受 DreamBooth 的启发。具体来说,我们将一些对主题的参考图片给调整文本导向填充扩散模型,将主题与特殊的token相关,并在指定的矩形盒中填充这个主题。我们还建立了一个新的数据集名为MureCom,专门用于这个任务。”

Finite Scalar Quantization: VQ-VAE Made Simple

  • paper_url: http://arxiv.org/abs/2309.15505
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen
  • for: The paper aims to propose a simple scheme for vector quantization (VQ) in the latent representation of VQ-VAEs, which is called finite scalar quantization (FSQ).
  • methods: The paper uses FSQ to project the VAE representation down to a few dimensions, and each dimension is quantized to a small set of fixed values. The authors use an appropriate choice of the number of dimensions and values each dimension can take to obtain the same codebook size as in VQ.
  • results: The authors train the same models that have been trained on VQ-VAE representations using FSQ, including autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Despite the much simpler design of FSQ, the authors obtain competitive performance in all these tasks.
    Abstract We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations. For example, autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
    摘要 我们提议用一种简单的方案——有限标量量化(FSQ)——替代VQ-VAE潜在表示中的向量量化(VQ):将VAE表示投影到很少的维度(通常少于10维),每个维度被量化为一小组固定值,由这些取值集合的笛卡尔积构成一个(隐式)codebook。通过恰当地选择维度数和每个维度可取的值数,我们可以获得与VQ相同的codebook大小。在这种离散表示之上,我们可以训练与VQ-VAE表示上相同的模型,例如用于图像生成、多模态生成和稠密预测等计算机视觉任务的自回归模型和masked transformer模型。具体来说,我们将FSQ与MaskGIT结合用于图像生成,与UViM结合用于深度估计、上色和全景分割。尽管FSQ的设计简单得多,我们在所有这些任务中都获得了有竞争力的性能。我们强调,FSQ不会出现codebook collapse,也不需要VQ中为学习有表达力的离散表示而使用的复杂机制(如commitment损失、codebook重新播种、code splitting、熵惩罚等)。
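Since FSQ is deliberately simple, the description in the abstract can be sketched almost directly: bound each latent dimension, round it to a small set of values, and pass gradients straight through. The snippet below is an illustrative approximation, not the official google-research implementation (which, for example, adds an offset for even numbers of levels).

```python
import torch


def fsq_quantize(z, levels=(8, 5, 5, 5)):
    """Finite scalar quantization sketch: bound each latent dimension with
    tanh, round it to one of `levels[i]` values, and use a straight-through
    estimator. The implicit codebook size is the product of `levels`
    (8 * 5 * 5 * 5 = 1000 here)."""
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half          # dim i lives in [-half_i, half_i]
    quantized = torch.round(bounded)
    # Straight-through estimator: quantized values in the forward pass,
    # gradients flow as if the rounding were the identity.
    return bounded + (quantized - bounded).detach()


if __name__ == "__main__":
    z = torch.randn(4, 4, requires_grad=True)   # last dim = len(levels)
    zq = fsq_quantize(z)
    zq.sum().backward()
    print(zq)
    print(z.grad.shape)
```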

Investigating the changes in BOLD responses during viewing of images with varied complexity: An fMRI time-series based analysis on human vision

  • paper_url: http://arxiv.org/abs/2309.15495
  • repo_url: https://github.com/naveen7102/fmri-time-series-classification
  • paper_authors: Naveen Kanigiri, Manohar Suggula, Debanjali Bhattacharya, Neelam Sinha
  • for: investigate the neurological variation of human brain responses during viewing of images with varied complexity using fMRI time series (TS) analysis.
  • methods: employ classical machine learning and deep learning strategies to classify image complexity-specific fMRI TS, and perform temporal semantic segmentation on whole fMRI TS.
  • results: established a baseline in studying how differently human brain functions while looking into images of diverse complexities, and provided insightful explanations for how static images with diverse complexities are perceived.
    Abstract Functional MRI (fMRI) is widely used to examine brain functionality by detecting alteration in oxygenated blood flow that arises with brain activity. This work aims to investigate the neurological variation of human brain responses during viewing of images with varied complexity using fMRI time series (TS) analysis. Publicly available BOLD5000 dataset is used for this purpose which contains fMRI scans while viewing 5254 distinct images of diverse categories, drawn from three standard computer vision datasets: COCO, Imagenet and SUN. To understand vision, it is important to study how brain functions while looking at images of diverse complexities. Our first study employs classical machine learning and deep learning strategies to classify image complexity-specific fMRI TS, represents instances when images from COCO, Imagenet and SUN datasets are seen. The implementation of this classification across visual datasets holds great significance, as it provides valuable insights into the fluctuations in BOLD signals when perceiving images of varying complexities. Subsequently, temporal semantic segmentation is also performed on whole fMRI TS to segment these time instances. The obtained result of this analysis has established a baseline in studying how differently human brain functions while looking into images of diverse complexities. Therefore, accurate identification and distinguishing of variations in BOLD signals from fMRI TS data serves as a critical initial step in vision studies, providing insightful explanations for how static images with diverse complexities are perceived.
    摘要

CauDR: A Causality-inspired Domain Generalization Framework for Fundus-based Diabetic Retinopathy Grading

  • paper_url: http://arxiv.org/abs/2309.15493
  • repo_url: None
  • paper_authors: Hao Wei, Peilun Shi, Juzheng Miao, Minqing Zhang, Guitao Bai, Jianing Qiu, Furui Liu, Wu Yuan
  • for: 这个研究旨在提高computer-aided diabetic retinopathy grading system的准确性和泛化能力,以帮助眼科医生快速筛查和诊断。
  • methods: 这个研究提出了一个通用的结构因果模型(SCM)来分析眼底影像中的伪相关,并据此设计了因果启发的糖尿病视网膜病变分级框架CauDR,以减少成像协议与设备差异带来的域偏移影响。
  • results: 研究结果显示,CauDR能够消除伪相关、减少域别差异的影响,并在重组的4DR基准上达到state-of-the-art性能。
    Abstract Diabetic retinopathy (DR) is the most common diabetic complication, which usually leads to retinal damage, vision loss, and even blindness. A computer-aided DR grading system has a significant impact on helping ophthalmologists with rapid screening and diagnosis. Recent advances in fundus photography have precipitated the development of novel retinal imaging cameras and their subsequent implementation in clinical practice. However, most deep learning-based algorithms for DR grading demonstrate limited generalization across domains. This inferior performance stems from variance in imaging protocols and devices inducing domain shifts. We posit that declining model performance between domains arises from learning spurious correlations in the data. Incorporating do-operations from causality analysis into model architectures may mitigate this issue and improve generalizability. Specifically, a novel universal structural causal model (SCM) was proposed to analyze spurious correlations in fundus imaging. Building on this, a causality-inspired diabetic retinopathy grading framework named CauDR was developed to eliminate spurious correlations and achieve more generalizable DR diagnostics. Furthermore, existing datasets were reorganized into 4DR benchmark for DG scenario. Results demonstrate the effectiveness and the state-of-the-art (SOTA) performance of CauDR.
    摘要 糖尿病 retinopathy (DR) 是糖尿病最常见的侵犯,通常会导致视力损害、视力损伤和甚至是失明。一个计算机支持的 DR 分级系统有助于医生快速评估和诊断。在最近的投影照相技术发展后,新的视觉内部影像相机被实施到临床实践中。但大多数深度学习基于的 DR 分级算法显示有限的应用普遍性。这是由于几何图像协议和设备之间的差异引起的领域转移。我们认为,模型在不同领域之间的性能下降是由于学习伪的相互关联。将 causality 分析中的动作从事件给到模型架构中可能将这个问题解决,并提高普遍性。特别是,一个新的通用结构 causality 模型 (SCM) 被提出供分析视觉内部影像中的伪的相互关联。基于这个 SCM,一个以 causality 为基础的糖尿病 retinopathy 分级框架 (CauDR) 被开发,以消除伪的相互关联并 дости得更高的普遍性。此外,现有的数据被重新排序为 4DR 参考景,结果显示 CauDR 的效果和顶尖性能。

Survey on Deep Face Restoration: From Non-blind to Blind and Beyond

  • paper_url: http://arxiv.org/abs/2309.15490
  • repo_url: https://github.com/24wenjie-li/awesome-face-restoration
  • paper_authors: Wenjie Li, Mei Wang, Kai Zhang, Juncheng Li, Xiaoming Li, Yuhang Zhang, Guangwei Gao, Weihong Deng, Chia-Wen Lin
  • for: 本研究目的是为了提高低质量(LQ)图像的面部图像修复(FR)技术。
  • methods: 本文首先梳理了现实场景中导致LQ图像的常见因素,并介绍了用于合成LQ图像的降质技术;然后按不同任务对FR方法进行分类,并讲解它们的发展历程;此外,还介绍了修复过程中常用的面部先验,并讨论了提升其效果的策略。
  • results: 在实验部分,作者使用统一基准全面评估了当前最佳FR方法在多个任务上的性能,并从不同角度分析它们的表现。
    Abstract Face restoration (FR) is a specialized field within image restoration that aims to recover low-quality (LQ) face images into high-quality (HQ) face images. Recent advances in deep learning technology have led to significant progress in FR methods. In this paper, we begin by examining the prevalent factors responsible for real-world LQ images and introduce degradation techniques used to synthesize LQ images. We also discuss notable benchmarks commonly utilized in the field. Next, we categorize FR methods based on different tasks and explain their evolution over time. Furthermore, we explore the various facial priors commonly utilized in the restoration process and discuss strategies to enhance their effectiveness. In the experimental section, we thoroughly evaluate the performance of state-of-the-art FR methods across various tasks using a unified benchmark. We analyze their performance from different perspectives. Finally, we discuss the challenges faced in the field of FR and propose potential directions for future advancements. The open-source repository corresponding to this work can be found at https:// github.com/ 24wenjie-li/ Awesome-Face-Restoration.
    摘要 面部恢复(FR)是图像恢复的一个专业领域,旨在将低质量(LQ)的面部图像恢复到高质量(HQ)的面部图像。 current deep learning技术的进步已经导致FR方法得到了重要的进步。在这篇论文中,我们首先检查了实际中LQ图像的主要因素,并介绍了用于生成LQ图像的降低技术。我们还讨论了在领域中常用的标准 bencmarks。然后,我们将FR方法分为不同任务,并解释它们的演化历史。此外,我们探讨了常用的面部先验和如何提高它们的效果。在实验部分,我们对现今FR方法的性能进行了广泛的评估,使用了一个统一的 bencmark。我们从不同的角度分析了它们的性能。最后,我们讨论了FR领域面临的挑战和未来的发展方向。相关的开源存储库可以在https://github.com/24wenjie-li/Awesome-Face-Restoration中找到。

Tackling VQA with Pretrained Foundation Models without Further Training

  • paper_url: http://arxiv.org/abs/2309.15487
  • repo_url: None
  • paper_authors: Alvin De Jun Tan, Bingquan Shen
  • for: 这个论文的目的是探讨如何使用预训练的大语言模型(LLMs)解决视觉问答(VQA)问题,无需进一步训练。
  • methods: 这个论文使用了将预训练的LLMs和其他基础模型结合使用,以便在不进一步训练的情况下解决VQA问题。文章探讨了不同的解码策略来生成图像的文本表示,并评估了其性能在VQAv2数据集上。
  • results: 研究发现,通过使用自然语言来表示图像,LLMs可以快速理解图像,并且不需要进一步训练。不同的解码策略对图像的文本表示具有不同的性能,但是综合评估结果表明,使用自然语言来表示图像是一个有效的方法。
    Abstract Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability of these LLMs, researchers have looked into how to adopt them for use with Visual Question Answering (VQA). Many methods require further training to align the image and text embeddings. However, these methods are computationally expensive and requires large scale image-text dataset for training. In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem. The general idea is to use natural language to represent the images such that the LLM can understand the images. We explore different decoding strategies for generating textual representation of the image and evaluate their performance on the VQAv2 dataset.
    摘要

Transferability of Representations Learned using Supervised Contrastive Learning Trained on a Multi-Domain Dataset

  • paper_url: http://arxiv.org/abs/2309.15486
  • repo_url: None
  • paper_authors: Alvin De Jun Tan, Clement Tan, Chai Kiat Yeo
  • for: 本研究使用 Supervised Contrastive Learning 框架来学习 DomainNet 多域数据集上的表示,并评估这些表示的传递性在不同域的下游数据集上。
  • methods: 本研究使用 Supervised Contrastive Learning 框架,并使用 fixed feature linear evaluation protocol 评估表示的传递性。
  • results: 实验结果显示,Supervised Contrastive Learning 模型在 7 个不同域的下游数据集上的平均表现比基eline模型优于 6.05%。这些结果表明,Supervised Contrastive Learning 模型可能可以在多域数据集上学习更robust的表示,并且这些表示可以更好地传递到其他域。
    Abstract Contrastive learning has shown to learn better quality representations than models trained using cross-entropy loss. They also transfer better to downstream datasets from different domains. However, little work has been done to explore the transferability of representations learned using contrastive learning when trained on a multi-domain dataset. In this paper, a study has been conducted using the Supervised Contrastive Learning framework to learn representations from the multi-domain DomainNet dataset and then evaluate the transferability of the representations learned on other downstream datasets. The fixed feature linear evaluation protocol will be used to evaluate the transferability on 7 downstream datasets that were chosen across different domains. The results obtained are compared to a baseline model that was trained using the widely used cross-entropy loss. Empirical results from the experiments showed that on average, the Supervised Contrastive Learning model performed 6.05% better than the baseline model on the 7 downstream datasets. The findings suggest that Supervised Contrastive Learning models can potentially learn more robust representations that transfer better across domains than cross-entropy models when trained on a multi-domain dataset.
    摘要 使用对比学习可以学习出比交叉熵损失训练的模型更高质量的表示,并且这些表示可以更好地迁移到不同领域的下游数据集。然而,在多领域数据集上使用对比学习所学表示的迁移性尚未得到充分研究。本文使用Supervised Contrastive Learning框架,从多领域数据集DomainNet上学习表示,然后使用固定特征线性评估协议,在跨不同领域选取的7个下游数据集上评估这些表示的迁移性,并与使用常见交叉熵损失训练的基线模型进行比较。实验结果表明,在这7个下游数据集上,Supervised Contrastive Learning模型平均比基线模型高6.05%。这些结果表明,在多领域数据集上训练时,Supervised Contrastive Learning模型可能学习到比交叉熵模型更稳健、跨域迁移性更好的表示。
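The pre-training objective referenced here is the standard supervised contrastive (SupCon) loss. A compact single-view PyTorch version is sketched below; the original formulation uses two augmented views per image, and the temperature is just a common default.

```python
import torch
import torch.nn.functional as F


def supcon_loss(features, labels, temperature=0.07):
    """Supervised contrastive loss over a batch of embeddings.
    features: [N, D] (L2-normalized inside), labels: [N]."""
    device = features.device
    n = features.shape[0]
    feats = F.normalize(features, dim=1)
    logits = feats @ feats.t() / temperature

    # Numerical stability and masking of self-similarity.
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()
    self_mask = torch.eye(n, dtype=torch.bool, device=device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True))

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    mean_log_prob_pos = (log_prob * pos_mask).sum(dim=1) / pos_counts
    # Only anchors that have at least one positive contribute.
    valid = pos_mask.any(dim=1)
    return -(mean_log_prob_pos[valid]).mean()


if __name__ == "__main__":
    emb = torch.randn(8, 128, requires_grad=True)
    lbl = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    loss = supcon_loss(emb, lbl)
    loss.backward()
    print(float(loss))
```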

Style Transfer and Self-Supervised Learning Powered Myocardium Infarction Super-Resolution Segmentation

  • paper_url: http://arxiv.org/abs/2309.15485
  • repo_url: None
  • paper_authors: Lichao Wang, Jiahao Huang, Xiaodan Xing, Yinzhe Wu, Ramyah Rajakulasingam, Andrew D. Scott, Pedro F Ferreira, Ranil De Silva, Sonia Nielles-Vallespin, Guang Yang
  • For: The paper aims to enhance diffusion tensor imaging (DTI) images by translating them into the late gadolinium enhancement (LGE) domain, which offers a larger amount of data with high resolution and distinct highlighting of myocardial infarction (MI) areas.
  • Methods: The proposed pipeline incorporates a novel style transfer model and a simultaneous super-resolution and segmentation model. An end-to-end super-resolution segmentation model generates a high-resolution mask from the low-resolution LGE-style DTI image, and a multi-task self-supervised learning strategy is employed to pre-train it.
  • Results: Pre-training allows the segmentation model to acquire more representative knowledge and improves its segmentation performance after fine-tuning.
    Abstract This study proposes a pipeline that incorporates a novel style transfer model and a simultaneous super-resolution and segmentation model. The proposed pipeline aims to enhance diffusion tensor imaging (DTI) images by translating them into the late gadolinium enhancement (LGE) domain, which offers a larger amount of data with high-resolution and distinct highlighting of myocardium infarction (MI) areas. Subsequently, the segmentation task is performed on the LGE style image. An end-to-end super-resolution segmentation model is introduced to generate high-resolution mask from low-resolution LGE style DTI image. Further, to enhance the performance of the model, a multi-task self-supervised learning strategy is employed to pre-train the super-resolution segmentation model, allowing it to acquire more representative knowledge and improve its segmentation performance after fine-tuning. https: github.com/wlc2424762917/Med_Img
    摘要 这项研究提出了一个管道,其中包括一种新的风格迁移模型和一个同时进行超分辨率与分割的模型。该管道的目标是将扩散张量成像(DTI)图像转换到晚期钆增强(LGE)域,后者拥有更多的数据、更高的分辨率,并能清晰地突出心肌梗死(MI)区域;随后在LGE风格图像上执行分割任务。研究引入了一种端到端超分辨率分割模型,从低分辨率的LGE风格DTI图像生成高分辨率mask。此外,为了提升模型性能,研究采用多任务自监督学习策略对超分辨率分割模型进行预训练,使其获得更有代表性的知识,并在微调后提升分割性能。代码见 https://github.com/wlc2424762917/Med_Img 。

The Robust Semantic Segmentation UNCV2023 Challenge Results

  • paper_url: http://arxiv.org/abs/2309.15478
  • repo_url: None
  • paper_authors: Xuanlong Yu, Yi Zuo, Zitao Wang, Xiaowen Zhang, Jiaxuan Zhao, Yuting Yang, Licheng Jiao, Rui Peng, Xinyi Wang, Junpei Zhang, Kexin Zhang, Fang Liu, Roberto Alcover-Couso, Juan C. SanMiguel, Marcos Escudero-Viñolo, Hanlin Tian, Kenta Matsui, Tianhao Wang, Fahmy Adan, Zhitong Gao, Xuming He, Quentin Bouniot, Hossein Moghaddam, Shyam Nandan Rai, Fabio Cermelli, Carlo Masone, Andrea Pilzer, Elisa Ricci, Andrei Bursuc, Arno Solin, Martin Trapp, Rui Li, Angela Yao, Wenlong Chen, Ivor Simpson, Neill D. F. Campbell, Gianni Franchi
  • for: 本文描述了在ICCV 2023 年举行的 MUAD 不确定量评估挑战中使用的赢利解决方案。挑战的目标是提高城市环境下的semantic segmentation robustness,特别是面对自然的反对抗情况下。
  • methods: 本文介绍了参与挑战的19个提交的方法,其中许多技术启发自过去几年Computer Vision和Machine Learning领域的主要会议和学术期刊上的先进uncertainty quantification方法。
  • results: 本文介绍了挑战的topperforming解决方案,并提供了所有参与者使用的多种方法的全面概述,以便让读者更深入地了解城市环境下的semantic segmentation中的不确定性处理策略。
    Abstract This paper outlines the winning solutions employed in addressing the MUAD uncertainty quantification challenge held at ICCV 2023. The challenge was centered around semantic segmentation in urban environments, with a particular focus on natural adversarial scenarios. The report presents the results of 19 submitted entries, with numerous techniques drawing inspiration from cutting-edge uncertainty quantification methodologies presented at prominent conferences in the fields of computer vision and machine learning and journals over the past few years. Within this document, the challenge is introduced, shedding light on its purpose and objectives, which primarily revolved around enhancing the robustness of semantic segmentation in urban scenes under varying natural adversarial conditions. The report then delves into the top-performing solutions. Moreover, the document aims to provide a comprehensive overview of the diverse solutions deployed by all participants. By doing so, it seeks to offer readers a deeper insight into the array of strategies that can be leveraged to effectively handle the inherent uncertainties associated with autonomous driving and semantic segmentation, especially within urban environments.
    摘要 这份报告详细介绍了在ICCV 2023 年举行的 MUAD 不确定性量化挑战中的获胜解决方案。挑战的核心是城市环境下的 semantic segmentation,特别关注自然对抗场景。报告介绍了19个提交的解决方案,其中许多技术借鉴了过去几年计算机视觉与机器学习领域重要会议和期刊上提出的前沿不确定性量化方法。挑战的目的是提高城市场景下 semantic segmentation 在各种自然对抗条件下的鲁棒性。报告首先介绍了挑战及其目标,随后深入分析了表现最佳的解决方案,并提供了所有参与者所用多种方案的全面概述,以便让读者更深入地了解在城市环境中应对自动驾驶与 semantic segmentation 固有不确定性的各种策略。

A Tutorial on Uniform B-Spline

  • paper_url: http://arxiv.org/abs/2309.15477
  • repo_url: None
  • paper_authors: Yi Zhou
  • for: 本教程旨在帮助理解均匀B-spline(uniform B-spline)的核心概念。
  • methods: 通过推导均匀B-spline的矩阵表示来讲解这些概念。
  • results: 给出了便于实现与计算的均匀B-spline矩阵表示。
    Abstract This document facilitates understanding of core concepts about uniform B-spline and its matrix representation.
    摘要 这份文档帮助理解均匀B-spline的核心概念以及其矩阵表示方法。
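For the cubic case, the matrix representation the tutorial discusses can be written down explicitly. The sketch below evaluates one curve segment from four consecutive control points using the standard uniform cubic B-spline basis matrix; it is a minimal illustration, not a full spline library.

```python
import numpy as np

# Basis matrix of the uniform cubic B-spline, written for a row vector
# [1, u, u^2, u^3] with local parameter u in [0, 1).
M = (1.0 / 6.0) * np.array([
    [ 1,  4,  1, 0],
    [-3,  0,  3, 0],
    [ 3, -6,  3, 0],
    [-1,  3, -3, 1],
], dtype=float)


def eval_uniform_cubic_bspline(ctrl, u):
    """Evaluate one cubic segment from four consecutive control points.
    ctrl: [4, D] control points P0..P3, u: local parameter in [0, 1)."""
    basis = np.array([1.0, u, u * u, u ** 3]) @ M      # blending weights
    return basis @ ctrl


if __name__ == "__main__":
    P = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 2.0], [3.0, 0.0]])
    print(eval_uniform_cubic_bspline(P, 0.0))   # = (P0 + 4*P1 + P2) / 6
    print(eval_uniform_cubic_bspline(P, 1.0))   # = (P1 + 4*P2 + P3) / 6
    # Sliding the four-point control window by one point per segment
    # yields a C2-continuous curve across segment boundaries.
```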

Cross-Dataset Experimental Study of Radar-Camera Fusion in Bird’s-Eye View

  • paper_url: http://arxiv.org/abs/2309.15465
  • repo_url: None
  • paper_authors: Lukas Stäcker, Philipp Heidenreich, Jason Rambach, Didier Stricker
  • for: 提高自动驾驶系统的可靠性和可靠性,通过利用卫星信息和摄像头的融合。
  • methods: 提出一种新的和灵活的融合网络,并在nuScenes和View-of-Delft两个数据集上进行测试。
  • results: 研究发现,摄像头分支需要大量和多样化的训练数据,而雷达分支受益于高性能的雷达。通过传输学习,我们提高了摄像头的性能在较小的数据集上。结果还表明,雷达-摄像头融合方法在相比摄像头只和雷达只的基准下显著超越。
    Abstract By exploiting complementary sensor information, radar and camera fusion systems have the potential to provide a highly robust and reliable perception system for advanced driver assistance systems and automated driving functions. Recent advances in camera-based object detection offer new radar-camera fusion possibilities with bird's eye view feature maps. In this work, we propose a novel and flexible fusion network and evaluate its performance on two datasets: nuScenes and View-of-Delft. Our experiments reveal that while the camera branch needs large and diverse training data, the radar branch benefits more from a high-performance radar. Using transfer learning, we improve the camera's performance on the smaller dataset. Our results further demonstrate that the radar-camera fusion approach significantly outperforms the camera-only and radar-only baselines.
    摘要 通过利用互补的传感器信息,雷达与摄像头融合系统有潜力为先进驾驶辅助系统和自动驾驶功能提供高度鲁棒且可靠的感知系统。基于摄像头的物体检测最新进展带来了利用鸟瞰视图(bird's eye view)特征图进行雷达-摄像头融合的新可能。在这项工作中,我们提出了一种新颖且灵活的融合网络,并在 nuScenes 和 View-of-Delft 两个数据集上评估其性能。实验表明,摄像头分支需要大量且多样化的训练数据,而雷达分支则更受益于高性能雷达。通过迁移学习,我们提升了摄像头分支在较小数据集上的性能。结果还表明,雷达-摄像头融合方法显著优于仅摄像头和仅雷达的基线。

GAMMA: Graspability-Aware Mobile MAnipulation Policy Learning based on Online Grasping Pose Fusion

  • paper_url: http://arxiv.org/abs/2309.15459
  • repo_url: None
  • paper_authors: Jiazhao Zhang, Nandiraju Gireesh, Jilong Wang, Xiaomeng Fang, Chaoyi Xu, Weiguang Chen, Liu Dai, He Wang
  • for: 本研究旨在提高机器人助手的移动抓取能力,增强机器人在不稳定环境中的抓取能力。
  • methods: 该研究提出了一种基于在线抓取姿态融合框架的抓取可见性感知方法,可以在实时下对抓取姿态进行筛选和融合,从而实现时间上的一致性。
  • results: 该方法可以剔除冗余和离群的抓取姿态,同时提高抓取姿态的整体质量,从而提升机器人的移动抓取能力。
    Abstract Mobile manipulation constitutes a fundamental task for robotic assistants and garners significant attention within the robotics community. A critical challenge inherent in mobile manipulation is the effective observation of the target while approaching it for grasping. In this work, we propose a graspability-aware mobile manipulation approach powered by an online grasping pose fusion framework that enables a temporally consistent grasping observation. Specifically, the predicted grasping poses are online organized to eliminate the redundant, outlier grasping poses, which can be encoded as a grasping pose observation state for reinforcement learning. Moreover, on-the-fly fusing the grasping poses enables a direct assessment of graspability, encompassing both the quantity and quality of grasping poses.
    摘要 Mobile manipulation 是Robotic assistant的基本任务,在Robotics community中吸引了广泛的关注。一个关键的挑战在于有效地观察目标而 approaching 它进行抓取。在这种工作中,我们提议一种基于在线抓取姿 pose 融合框架的抓取可能性感知approach,使抓取观察得到时间协调。具体来说,预测的抓取姿 pose 被在线组织,以消除重复和异常的抓取姿 pose,这些可以编码为抓取观察状态 для reinforcement learning。此外,在线融合抓取姿 pose 直接评估抓取可能性,包括抓取姿 pose 的量和质量。

Semantics-Driven Cloud-Edge Collaborative Inference

  • paper_url: http://arxiv.org/abs/2309.15435
  • repo_url: None
  • paper_authors: Yuche Gao, Beibei Zhang
  • for: 这篇论文主要针对智能城市应用中的视频数据进行高效分析,以智能交通为例。
  • methods: 本论文提出了一个基于 semantics 的云端-边缘协作方法,将视频分析过程分为两个阶段:在边缘服务器上提取视频内容的 semantics (车牌号码图像),然后将该内容提交到云端或边缘服务器进行识别。这种分析方法可以降低端到端传输时间和提高throughput。
  • results: 实验结果显示,相比于将所有视频分析工作负载到云端或边缘服务器进行处理,这种云端-边缘协作方法可以提高端到端传输速度(最多5倍快)、throughput(最多9帧/秒)和网页流量量(50%减少)。这显示了这种方法的有效性。
    Abstract With the proliferation of video data in smart city applications like intelligent transportation, efficient video analytics has become crucial but also challenging. This paper proposes a semantics-driven cloud-edge collaborative approach for accelerating video inference, using license plate recognition as a case study. The method separates semantics extraction and recognition, allowing edge servers to only extract visual semantics (license plate patches) from video frames and offload computation-intensive recognition to the cloud or neighboring edges based on load. This segmented processing coupled with a load-aware work distribution strategy aims to reduce end-to-end latency and improve throughput. Experiments demonstrate significant improvements in end-to-end inference speed (up to 5x faster), throughput (up to 9 FPS), and reduced traffic volumes (50% less) compared to cloud-only or edge-only processing, validating the efficiency of the proposed approach. The cloud-edge collaborative framework with semantics-driven work partitioning provides a promising solution for scaling video analytics in smart cities.
    摘要 随着智能城市应用中视频数据的普遍化,高效的视频分析已成为非常重要,同时也变得非常困难。这篇论文提议一种基于 semantics的云端-边缘集成方法,用 license plate recognition 作为案例研究。该方法将 semantics 提取和识别分开,让边缘服务器只提取视频帧中的视觉 semantics(车牌补丁),并将 computation-intensive 识别 tasks 提交到云或邻近的边缘服务器进行处理,根据负载情况进行异步分配工作。这种分解处理和根据负载情况分配工作的策略,可以降低端到端 latency 和提高吞吐量。实验结果显示,与云只处理或边缘只处理相比,提出的方法可以提高端到端推理速度(最多5倍)、吞吐量(最多9 FPS)和发送流量量(50% 降低)。云端-边缘集成框架,带有基于 semantics 的工作分配策略,为视频分析在智能城市中扩大 scalability 提供了一个可靠的解决方案。
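The two-stage, load-aware split described above (extract plate patches at the edge, offload recognition based on load) can be illustrated with a toy dispatcher. The node model, threshold, and load metric below are assumptions for illustration, not the paper's actual scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    queue_len: int        # pending recognition jobs
    capacity_fps: float   # sustained recognition throughput


def dispatch(patch_count: int, edge: Node, neighbors: list, cloud: Node,
             edge_threshold: int = 8) -> str:
    """Load-aware placement of recognition work: keep it local while the
    edge queue is short, otherwise send it to the node with the smallest
    estimated completion time (queued + new jobs divided by capacity)."""
    candidates = [edge] + neighbors + [cloud]
    if edge.queue_len + patch_count <= edge_threshold:
        target = edge
    else:
        target = min(candidates,
                     key=lambda n: (n.queue_len + patch_count) / n.capacity_fps)
    target.queue_len += patch_count
    return target.name


if __name__ == "__main__":
    edge = Node("edge-A", queue_len=7, capacity_fps=5)
    peers = [Node("edge-B", queue_len=2, capacity_fps=5)]
    cloud = Node("cloud", queue_len=20, capacity_fps=50)
    print(dispatch(4, edge, peers, cloud))   # offloaded away from the busy edge
```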

NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions

  • paper_url: http://arxiv.org/abs/2309.15426
  • repo_url: https://github.com/oppo-us-research/NeuRBF
  • paper_authors: Zhang Chen, Zhong Li, Liangchen Song, Lele Chen, Jingyi Yu, Junsong Yuan, Yi Xu
  • for: 本研究旨在提出一种基于通用径向基函数(radial basis function)的神经场,以更好地表示各种应用中的信号。
  • methods: 本研究使用位置与形状均可自适应的通用径向基函数作为信号表示,而不是传统的网格化神经场;这种方法能更贴合目标信号,并通过与多频正弦函数复合来扩展径向基函数的通道容量,同时与网格基相结合以兼顾自适应性与插值平滑性。
  • results: 实验表明,在2D图像和3D符号距离场表示上,该方法比先前工作具有更高的精度和更小的模型体积;借助精心设计的加权方案,径向基函数能有效适应不同类型的信号。应用于神经辐射场重建时,该方法在模型小、训练速度相当的情况下达到了最先进的渲染质量。
    Abstract We present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed.
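As a rough illustration of composing radial bases with multi-frequency sinusoids, the NumPy sketch below modulates isotropic Gaussian radial bases with a small bank of frequencies, producing several Fourier radial features per kernel without adding per-frequency parameters. The Gaussian kernel form and the exact composition are assumptions made for clarity; the paper's kernels have learnable positions and generally anisotropic shapes.

```python
import numpy as np

def fourier_radial_features(x, centers, inv_sigma, freqs):
    # x: (N, D) query points; centers: (K, D) kernel positions;
    # inv_sigma: (K,) isotropic kernel widths; freqs: (F,) frequency bands
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
    g = np.exp(-d2 * inv_sigma[None, :] ** 2)                   # (N, K) Gaussian radial bases
    s = np.sin(g[..., None] * freqs[None, None, :])             # (N, K, F) multi-frequency composition
    return s.reshape(x.shape[0], -1)                            # (N, K*F) features for a linear head

rng = np.random.default_rng(0)
feats = fourier_radial_features(
    rng.uniform(size=(8, 2)),                  # 8 query points in 2D
    rng.uniform(size=(64, 2)),                 # 64 kernel centers
    np.full(64, 8.0),                          # shared isotropic width
    np.pi * np.array([1.0, 2.0, 4.0, 8.0]))    # 4 frequency bands
print(feats.shape)  # (8, 256)
```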

Inherit with Distillation and Evolve with Contrast: Exploring Class Incremental Semantic Segmentation Without Exemplar Memory

  • paper_url: http://arxiv.org/abs/2309.15413
  • repo_url: None
  • paper_authors: Danpei Zhao, Bo Yuan, Zhenwei Shi
  • for: Addressing class incremental semantic segmentation (CISS) without exemplar memory and resolving catastrophic forgetting and semantic drift simultaneously.
  • methods: The proposed method IDEC consists of a Dense Knowledge Distillation on all Aspects (DADA) manner and an Asymmetric Region-wise Contrastive Learning (ARCL) module, driven by a dynamic class-specific pseudo-labelling strategy.
  • results: Achieves state-of-the-art performance on multiple CISS tasks, with superior anti-forgetting ability, particularly in multi-step CISS tasks.
    Abstract As a front-burner problem in incremental learning, class incremental semantic segmentation (CISS) is plagued by catastrophic forgetting and semantic drift. Although recent methods have utilized knowledge distillation to transfer knowledge from the old model, they are still unable to avoid pixel confusion, which results in severe misclassification after incremental steps due to the lack of annotations for past and future classes. Meanwhile data-replay-based approaches suffer from storage burdens and privacy concerns. In this paper, we propose to address CISS without exemplar memory and resolve catastrophic forgetting as well as semantic drift synchronously. We present Inherit with Distillation and Evolve with Contrast (IDEC), which consists of a Dense Knowledge Distillation on all Aspects (DADA) manner and an Asymmetric Region-wise Contrastive Learning (ARCL) module. Driven by the devised dynamic class-specific pseudo-labelling strategy, DADA distils intermediate-layer features and output-logits collaboratively with more emphasis on semantic-invariant knowledge inheritance. ARCL implements region-wise contrastive learning in the latent space to resolve semantic drift among known classes, current classes, and unknown classes. We demonstrate the effectiveness of our method on multiple CISS tasks by state-of-the-art performance, including Pascal VOC 2012, ADE20K and ISPRS datasets. Our method also shows superior anti-forgetting ability, particularly in multi-step CISS tasks.
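As a hedged sketch of what distilling intermediate features and output logits together can look like in this setting, the PyTorch snippet below combines a feature-matching term with a temperature-scaled KL term restricted to old-class logits. The layer weighting, temperature, and loss balance are illustrative assumptions, not the exact DADA formulation.

```python
import torch
import torch.nn.functional as F

def dense_distillation_loss(old_feats, new_feats, old_logits, new_logits, temp=2.0):
    """old_feats/new_feats: lists of intermediate feature maps from the frozen old
    model and the current model; *_logits: (B, C, H, W) segmentation logits."""
    # Feature-level inheritance: match intermediate representations layer by layer
    feat_term = sum(F.mse_loss(n, o.detach()) for n, o in zip(new_feats, old_feats))
    # Logit-level inheritance: distil only over the channels of previously learned classes
    c_old = old_logits.shape[1]
    kd_term = F.kl_div(
        F.log_softmax(new_logits[:, :c_old] / temp, dim=1),
        F.softmax(old_logits.detach() / temp, dim=1),
        reduction="batchmean") * temp ** 2
    return feat_term + kd_term
```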

3D Multiple Object Tracking on Autonomous Driving: A Literature Review

  • paper_url: http://arxiv.org/abs/2309.15411
  • repo_url: None
  • paper_authors: Peng Zhang, Xin Li, Liang He, Xin Lin
  • for: Surveys the research landscape of 3D multi-object tracking (3D MOT) for autonomous driving and suggests directions for future work.
  • methods: Systematically describes and categorizes existing 3D MOT approaches, dissecting each category's challenges, rationale, progress, merits, and drawbacks, and summarizes common experimental metrics and datasets to enable quantitative comparison.
  • results: Concludes that 3D MOT still faces many open challenges, such as abrupt appearance changes, occlusion, small targets, data sparsity, and missed detections, and charts possible research directions for addressing them.
    Abstract 3D multi-object tracking (3D MOT) stands as a pivotal domain within autonomous driving, experiencing a surge in scholarly interest and commercial promise over recent years. Despite its paramount significance, 3D MOT confronts a myriad of formidable challenges, encompassing abrupt alterations in object appearances, pervasive occlusion, the presence of diminutive targets, data sparsity, missed detections, and the unpredictable initiation and termination of object motion trajectories. Countless methodologies have emerged to grapple with these issues, yet 3D MOT endures as a formidable problem that warrants further exploration. This paper undertakes a comprehensive examination, assessment, and synthesis of the research landscape in this domain, remaining attuned to the latest developments in 3D MOT while suggesting prospective avenues for future investigation. Our exploration commences with a systematic exposition of key facets of 3D MOT and its associated domains, including problem delineation, classification, methodological approaches, fundamental principles, and empirical investigations. Subsequently, we categorize these methodologies into distinct groups, dissecting each group meticulously with regard to its challenges, underlying rationale, progress, merits, and demerits. Furthermore, we present a concise recapitulation of experimental metrics and offer an overview of prevalent datasets, facilitating a quantitative comparison for a more intuitive assessment. Lastly, our deliberations culminate in a discussion of the prevailing research landscape, highlighting extant challenges and charting possible directions for 3D MOT research. We present a structured and lucid road-map to guide forthcoming endeavors in this field.

KDD-LOAM: Jointly Learned Keypoint Detector and Descriptors Assisted LiDAR Odometry and Mapping

  • paper_url: http://arxiv.org/abs/2309.15394
  • repo_url: None
  • paper_authors: Renlang Huang, Minglei Zhao, Jiming Chen, Liang Li
  • for: Improve the efficiency and robustness of point cloud registration through sparse keypoint matching.
  • methods: Proposes a tightly coupled keypoint detector and descriptor (TCKDD) based on a multi-task fully convolutional network with a probabilistic detection loss, so the detector fully adapts to the jointly learned descriptors.
  • results: Extensive experiments on indoor and outdoor datasets show state-of-the-art point cloud registration performance; a keypoint detector and descriptors-assisted LiDAR odometry and mapping framework (KDD-LOAM) is further designed, whose real-time odometry relies on RANSAC over keypoint descriptor matches, and on KITTI it significantly surpasses LOAM while showing competitive odometry performance.
    Abstract Sparse keypoint matching based on distinct 3D feature representations can improve the efficiency and robustness of point cloud registration. Existing learning-based 3D descriptors and keypoint detectors are either independent or loosely coupled, so they cannot fully adapt to each other. In this work, we propose a tightly coupled keypoint detector and descriptor (TCKDD) based on a multi-task fully convolutional network with a probabilistic detection loss. In particular, this self-supervised detection loss fully adapts the keypoint detector to any jointly learned descriptors and benefits the self-supervised learning of descriptors. Extensive experiments on both indoor and outdoor datasets show that our TCKDD achieves state-of-the-art performance in point cloud registration. Furthermore, we design a keypoint detector and descriptors-assisted LiDAR odometry and mapping framework (KDD-LOAM), whose real-time odometry relies on keypoint descriptor matching-based RANSAC. The sparse keypoints are further used for efficient scan-to-map registration and mapping. Experiments on KITTI dataset demonstrate that KDD-LOAM significantly surpasses LOAM and shows competitive performance in odometry.
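The real-time odometry described above relies on RANSAC over keypoint descriptor matches. The NumPy sketch below shows the standard form of that registration step, assuming descriptors have already been matched so that src_kp[i] corresponds to dst_kp[i]; the inlier threshold and iteration count are placeholders.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) aligning P to Q, both (M, 3)."""
    cp, cq = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    return R, cq - R @ cp

def ransac_register(src_kp, dst_kp, iters=500, thresh=0.3, seed=0):
    """RANSAC over putative keypoint correspondences (src_kp[i] <-> dst_kp[i])."""
    rng = np.random.default_rng(seed)
    best_R, best_t = np.eye(3), np.zeros(3)
    best_inliers = np.zeros(len(src_kp), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src_kp), 3, replace=False)          # minimal 3-point sample
        R, t = kabsch(src_kp[idx], dst_kp[idx])
        err = np.linalg.norm(src_kp @ R.T + t - dst_kp, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_R, best_t, best_inliers = R, t, inliers
    if best_inliers.sum() >= 3:                                  # refine on all inliers
        best_R, best_t = kabsch(src_kp[best_inliers], dst_kp[best_inliers])
    return best_R, best_t, best_inliers
```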

Subjective Face Transform using Human First Impressions

  • paper_url: http://arxiv.org/abs/2309.15381
  • repo_url: https://github.com/CV-Lehigh/subjective_face_transform
  • paper_authors: Chaitanya Roygaga, Joshua Krinsky, Kai Zhang, Kenny Kwok, Aparna Bharati
  • for: understand and explain biases in subjective interpretation of faces
  • methods: uses generative models to find semantically meaningful edits to a face image that change perceived attributes
  • results: demonstrates the generalizability of the approach by training on real and synthetic faces and evaluating on in-domain and out-of-domain images using predictive models and human ratings
    Abstract Humans tend to form quick subjective first impressions of non-physical attributes when seeing someone's face, such as perceived trustworthiness or attractiveness. To understand what variations in a face lead to different subjective impressions, this work uses generative models to find semantically meaningful edits to a face image that change perceived attributes. Unlike prior work that relied on statistical manipulation in feature space, our end-to-end framework considers trade-offs between preserving identity and changing perceptual attributes. It maps identity-preserving latent space directions to changes in attribute scores, enabling transformation of any input face along an attribute axis according to a target change. We train on real and synthetic faces, evaluate for in-domain and out-of-domain images using predictive models and human ratings, demonstrating the generalizability of our approach. Ultimately, such a framework can be used to understand and explain biases in subjective interpretation of faces that are not dependent on the identity.
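A minimal sketch of the attribute-axis editing idea: move a latent code along an identity-preserving direction and pick the step whose predicted attribute change is closest to the requested target. The attribute predictor attr_fn, the linear search over step sizes, and the search range are assumptions for illustration; the paper learns the mapping from latent directions to attribute-score changes end to end.

```python
import numpy as np

def edit_along_attribute(w, direction, attr_fn, target_delta,
                         alphas=np.linspace(-3.0, 3.0, 61)):
    """w: latent code of a face; direction: identity-preserving latent direction;
    attr_fn: predictor mapping a latent code to a perceived-attribute score."""
    base = attr_fn(w)
    candidates = [w + a * direction for a in alphas]             # walk the attribute axis
    deltas = np.array([attr_fn(c) - base for c in candidates])   # predicted score changes
    best = int(np.argmin(np.abs(deltas - target_delta)))         # closest to the target change
    return candidates[best], float(deltas[best])
```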

Towards Foundation Models Learned from Anatomy in Medical Imaging via Self-Supervision

  • paper_url: http://arxiv.org/abs/2309.15358
  • repo_url: https://github.com/jlianglab/eden
  • paper_authors: Mohammad Reza Hosseinzadeh Taher, Michael B. Gotway, Jianming Liang
  • for: This paper aims to develop a foundation model for medical imaging that can “understand” human anatomy and possess fundamental properties of medical imaging.
  • methods: The authors propose a novel self-supervised learning (SSL) strategy that exploits the hierarchical nature of human anatomy, which encapsulates the intrinsic attributes of anatomical structures-locality and compositionality-within the embedding space.
  • results: The SSL pretrained model derived from the training strategy outperforms state-of-the-art (SOTA) fully/self-supervised baselines and enhances annotation efficiency, offering potential few-shot segmentation capabilities with performance improvements ranging from 9% to 30% for segmentation tasks compared to SSL baselines.
    Abstract Human anatomy is the foundation of medical imaging and boasts one striking characteristic: its hierarchy in nature, exhibiting two intrinsic properties: (1) locality: each anatomical structure is morphologically distinct from the others; and (2) compositionality: each anatomical structure is an integrated part of a larger whole. We envision a foundation model for medical imaging that is consciously and purposefully developed upon this foundation to gain the capability of "understanding" human anatomy and to possess the fundamental properties of medical imaging. As our first step in realizing this vision towards foundation models in medical imaging, we devise a novel self-supervised learning (SSL) strategy that exploits the hierarchical nature of human anatomy. Our extensive experiments demonstrate that the SSL pretrained model, derived from our training strategy, not only outperforms state-of-the-art (SOTA) fully/self-supervised baselines but also enhances annotation efficiency, offering potential few-shot segmentation capabilities with performance improvements ranging from 9% to 30% for segmentation tasks compared to SSL baselines. This performance is attributed to the significance of anatomy comprehension via our learning strategy, which encapsulates the intrinsic attributes of anatomical structures-locality and compositionality-within the embedding space, yet overlooked in existing SSL methods. All code and pretrained models are available at https://github.com/JLiangLab/Eden.
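To illustrate how locality and compositionality could be expressed in an embedding space, the sketch below recursively splits an image into parts and contrasts each part against its parent (positive) and its sibling parts (negatives). The quadrant decomposition, the loss form, and the assumption that the encoder pools any input size to a (B, D) embedding are all illustrative choices; the paper's anatomy-driven decomposition and objective are not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def hierarchical_part_contrast(encoder, image, depth=2, temp=0.1):
    """image: (B, C, H, W) with H, W divisible by 2**depth; encoder maps any
    spatial size to (B, D) embeddings (e.g. a CNN with global average pooling)."""
    loss, level = 0.0, [image]
    for _ in range(depth):
        children = []
        for parent in level:
            B, _, H, W = parent.shape
            quads = [parent[..., :H // 2, :W // 2], parent[..., :H // 2, W // 2:],
                     parent[..., H // 2:, :W // 2], parent[..., H // 2:, W // 2:]]
            zp = F.normalize(encoder(parent), dim=-1)                              # (B, D) parent
            zq = F.normalize(torch.stack([encoder(q) for q in quads], 1), dim=-1)  # (B, 4, D) parts
            pos = (zq @ zp.unsqueeze(-1)) / temp                                   # part-to-parent (compositionality)
            neg = (zq @ zq.transpose(1, 2)) / temp                                 # part-to-sibling (locality)
            neg = neg.masked_fill(torch.eye(4, dtype=torch.bool,
                                            device=image.device), float("-inf"))
            logits = torch.cat([pos, neg], dim=-1)                                 # (B, 4, 5), class 0 = own parent
            target = torch.zeros(B, 4, dtype=torch.long, device=image.device)
            loss = loss + F.cross_entropy(logits.flatten(0, 1), target.flatten())
            children.extend(quads)
        level = children
    return loss
```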

Multimodal Dataset for Localization, Mapping and Crop Monitoring in Citrus Tree Farms

  • paper_url: http://arxiv.org/abs/2309.15332
  • repo_url: https://github.com/ucr-robotics/citrus-farm-dataset
  • paper_authors: Hanzhe Teng, Yipeng Wang, Xiaoao Song, Konstantinos Karydis
  • for: Supports the development of autonomous agricultural robots operating in citrus tree environments, particularly for localization, mapping, and crop monitoring tasks.
  • methods: Collects multimodal sensing data with a wheeled mobile robot, including stereo RGB with depth, monochrome, near-infrared, and thermal images, together with wheel odometry, LiDAR, IMU, and GNSS with RTK providing centimeter-level positioning ground truth.
  • results: Releases the CitrusFarm dataset: seven sequences from three citrus fields with different tree species, growth stages, planting patterns, and daylight conditions, spanning 1.7 hours of operation, 7.5 km of travel, and 1.3 TB of data.
    Abstract In this work we introduce the CitrusFarm dataset, a comprehensive multimodal sensory dataset collected by a wheeled mobile robot operating in agricultural fields. The dataset offers stereo RGB images with depth information, as well as monochrome, near-infrared and thermal images, presenting diverse spectral responses crucial for agricultural research. Furthermore, it provides a range of navigational sensor data encompassing wheel odometry, LiDAR, inertial measurement unit (IMU), and GNSS with Real-Time Kinematic (RTK) as the centimeter-level positioning ground truth. The dataset comprises seven sequences collected in three fields of citrus trees, featuring various tree species at different growth stages, distinctive planting patterns, as well as varying daylight conditions. It spans a total operation time of 1.7 hours, covers a distance of 7.5 km, and constitutes 1.3 TB of data. We anticipate that this dataset can facilitate the development of autonomous robot systems operating in agricultural tree environments, especially for localization, mapping and crop monitoring tasks. Moreover, the rich sensing modalities offered in this dataset can also support research in a range of robotics and computer vision tasks, such as place recognition, scene understanding, object detection and segmentation, and multimodal learning. The dataset, in conjunction with related tools and resources, is made publicly available at https://github.com/UCR-Robotics/Citrus-Farm-Dataset.

BASED: Bundle-Adjusting Surgical Endoscopic Dynamic Video Reconstruction using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.15329
  • repo_url: None
  • paper_authors: Shreya Saha, Sainan Liu, Shan Lin, Jingpei Lu, Michael Yip
  • for: Reconstruct deformable scenes from endoscopic videos, a foundational requirement for autonomous robotic interventions in minimally invasive surgery.
  • methods: Adopts a Neural Radiance Fields (NeRF) approach to learn 3D implicit representations of scenes that are both dynamic and deformable over time, without requiring known camera poses.
  • results: Experiments on several surgical datasets show the model adapts to diverse camera and scene settings and is promising for current and future robotic surgical systems.
    Abstract Reconstruction of deformable scenes from endoscopic videos is important for many applications such as intraoperative navigation, surgical visual perception, and robotic surgery. It is a foundational requirement for realizing autonomous robotic interventions for minimally invasive surgery. However, previous approaches in this domain have been limited by their modular nature and are confined to specific camera and scene settings. Our work adopts the Neural Radiance Fields (NeRF) approach to learning 3D implicit representations of scenes that are both dynamic and deformable over time, and furthermore with unknown camera poses. We demonstrate this approach on endoscopic surgical scenes from robotic surgery. This work removes the constraints of known camera poses and overcomes the drawbacks of the state-of-the-art unstructured dynamic scene reconstruction technique, which relies on the static part of the scene for accurate reconstruction. Through several experimental datasets, we demonstrate the versatility of our proposed model to adapt to diverse camera and scene settings, and show its promise for both current and future robotic surgical systems.
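For reference, dynamic NeRF-style methods such as this one build on the discrete volume rendering of the original NeRF formulation; whether BASED uses exactly this discretization is not stated in the abstract. A ray with samples $i = 1, \dots, N$, predicted densities $\sigma_i$, colors $\mathbf{c}_i$, and sample spacings $\delta_i$ is rendered as

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),$$

and deformable variants typically warp each sample location with a time-conditioned deformation field before querying a canonical radiance field.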