cs.CV - 2023-08-21

Improving Continuous Sign Language Recognition with Cross-Lingual Signs

  • paper_url: http://arxiv.org/abs/2308.10809
  • repo_url: None
  • paper_authors: Fangyun Wei, Yutong Chen
  • for: This work addresses continuous sign language recognition (CSLR), a weakly supervised task that recognizes continuous signs from video without any prior knowledge of the temporal boundaries between consecutive signs.
  • methods: The approach builds on the observation of cross-lingual signs: signs from different sign languages that share similar visual signals (e.g., hand shape and motion) and can therefore serve as auxiliary training data for another sign language. Two sign language dictionaries are first built from the isolated signs appearing in the two datasets; a well-optimized isolated sign language recognition model then identifies sign-to-sign mappings between the two languages; finally, a CSLR model is trained on the target data with original labels plus the auxiliary data with mapped labels (see the sketch below).
  • results: The method achieves state-of-the-art performance on two widely used CSLR datasets, Phoenix-2014 and Phoenix-2014T.
    Abstract This work dedicates to continuous sign language recognition (CSLR), which is a weakly supervised task dealing with the recognition of continuous signs from videos, without any prior knowledge about the temporal boundaries between consecutive signs. Data scarcity heavily impedes the progress of CSLR. Existing approaches typically train CSLR models on a monolingual corpus, which is orders of magnitude smaller than that of speech recognition. In this work, we explore the feasibility of utilizing multilingual sign language corpora to facilitate monolingual CSLR. Our work is built upon the observation of cross-lingual signs, which originate from different sign languages but have similar visual signals (e.g., hand shape and motion). The underlying idea of our approach is to identify the cross-lingual signs in one sign language and properly leverage them as auxiliary training data to improve the recognition capability of another. To achieve the goal, we first build two sign language dictionaries containing isolated signs that appear in two datasets. Then we identify the sign-to-sign mappings between two sign languages via a well-optimized isolated sign language recognition model. At last, we train a CSLR model on the combination of the target data with original labels and the auxiliary data with mapped labels. Experimentally, our approach achieves state-of-the-art performance on two widely-used CSLR datasets: Phoenix-2014 and Phoenix-2014T.
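The sign-to-sign mapping step lends itself to a short illustration. The snippet below is a hypothetical PyTorch-style sketch, not the authors' released code: `islr_model`, the dictionary structures, and the 0.9 confidence threshold are all assumptions.

```python
import torch
import torch.nn.functional as F

def map_cross_lingual_signs(islr_model, target_glosses, aux_dictionary, threshold=0.9):
    """Hypothetical sketch: map isolated signs of an auxiliary sign language onto the target
    vocabulary with an isolated sign language recognition (ISLR) model trained on the target
    dictionary. Only high-confidence matches are kept and later used to relabel auxiliary data."""
    mappings = {}
    islr_model.eval()
    with torch.no_grad():
        for aux_gloss, clips in aux_dictionary.items():   # clips: list of video tensors (C, T, H, W)
            probs = torch.stack([F.softmax(islr_model(c.unsqueeze(0)), dim=-1)[0] for c in clips])
            conf, idx = probs.mean(dim=0).max(dim=0)      # average over exemplar clips
            if conf.item() >= threshold:                  # accept only confident matches
                mappings[aux_gloss] = target_glosses[idx.item()]
    return mappings                                       # auxiliary gloss -> target gloss
```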

MGMAE: Motion Guided Masking for Video Masked Autoencoding

  • paper_url: http://arxiv.org/abs/2308.10794
  • repo_url: None
  • paper_authors: Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang
  • for: The paper aims to improve representation learning for self-supervised video pre-training.
  • methods: Building on VideoMAE, where temporal redundancy motivates a high masking ratio and a customized masking strategy, the paper introduces a motion-guided masking strategy that uses motion information (estimated with an efficient online optical flow estimator) to build a temporally consistent masking volume; the masking map is warped backward along the flow so that unmasked tokens can be tracked in time (see the sketch below).
  • results: Compared with the original VideoMAE, MGMAE performs better on Something-Something V2 and Kinetics-400, and visualization analysis shows that it samples temporally consistent cubes in a motion-adaptive manner for more effective video pre-training.
    Abstract Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build temporal consistent masking volume. Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos. These temporal aligned unmasked tokens will further relieve the information leakage issue in time and encourage the MGMAE to learn more useful structure information. We implement our MGMAE with an online efficient optical flow estimator and backward masking map warping strategy. We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE. In addition, we provide the visualization analysis to illustrate that our MGMAE can sample temporal consistent cubes in a motion-adaptive manner for more effective video pre-training.
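To picture how a masking map can be propagated along estimated motion, here is a minimal, hypothetical sketch of backward mask warping with `grid_sample`; the flow convention and nearest-neighbour sampling are assumptions, not MGMAE's actual implementation.

```python
import torch
import torch.nn.functional as F

def warp_mask_backward(mask, flow):
    """Warp a binary masking map from frame t to frame t+1 along an optical flow field so
    that the same spatial content stays masked/unmasked over time (temporal consistency).
    mask: (B, 1, H, W) map for frame t; flow: (B, 2, H, W) flow from frame t+1 back to t, in pixels."""
    B, _, H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(mask.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                              # where to sample in frame t
    grid_x = 2.0 * coords[:, 0] / (W - 1) - 1.0                    # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(mask, grid, mode="nearest", align_corners=True)
```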

Extraction of Text from Optic Nerve Optical Coherence Tomography Reports

  • paper_url: http://arxiv.org/abs/2308.10790
  • repo_url: None
  • paper_authors: Iyad Majid, Youchen Victor Zhang, Robert Chang, Sophia Y. Wang
  • for: The study aims to develop and evaluate rule-based algorithms that enhance the extraction of text data, including retinal nerve fiber layer (RNFL) values and ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports.
  • methods: DICOM files containing encapsulated PDF reports were converted into image files and processed with the PaddleOCR Python package for optical character recognition; rule-based algorithms were then designed and iteratively optimized to extract the RNFL and GCC data (see the sketch below).
  • results: The developed algorithms extract data from RNFL and GCC reports with high precision, as confirmed by manual review; some values remain challenging, notably clock hours 5 and 6 for RNFL thickness and the signal strength for GCC.
    Abstract Purpose: The purpose of this study was to develop and evaluate rule-based algorithms to enhance the extraction of text data, including retinal nerve fiber layer (RNFL) values and other ganglion cell count (GCC) data, from Zeiss Cirrus optical coherence tomography (OCT) scan reports. Methods: DICOM files that contained encapsulated PDF reports with RNFL or Ganglion Cell in their document titles were identified from a clinical imaging repository at a single academic ophthalmic center. PDF reports were then converted into image files and processed using the PaddleOCR Python package for optical character recognition. Rule-based algorithms were designed and iteratively optimized for improved performance in extracting RNFL and GCC data. Evaluation of the algorithms was conducted through manual review of a set of RNFL and GCC reports. Results: The developed algorithms demonstrated high precision in extracting data from both RNFL and GCC scans. Precision was slightly better for the right eye in RNFL extraction (OD: 0.9803 vs. OS: 0.9046), and for the left eye in GCC extraction (OD: 0.9567 vs. OS: 0.9677). Some values presented more challenges in extraction, particularly clock hours 5 and 6 for RNFL thickness, and signal strength for GCC. Conclusions: A customized optical character recognition algorithm can identify numeric results from optical coherence scan reports with high precision. Automated processing of PDF reports can greatly reduce the time to extract OCT results on a large scale.
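As a rough illustration of the pipeline (PDF rendering, OCR, then rule-based parsing), here is a hedged sketch. The field names, regular expressions, and output structure are invented for illustration, and the result indexing assumes PaddleOCR's common list-of-(box, (text, score)) output format; this is not the study's actual rule set.

```python
import re
from pdf2image import convert_from_path
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")  # English OCR model

def extract_report_values(pdf_path):
    """Render each report page to an image, OCR it, and pull numeric fields with simple rules."""
    results = {}
    for page in convert_from_path(pdf_path, dpi=300):
        page.save("page.png")
        detections = ocr.ocr("page.png")
        if not detections or not detections[0]:
            continue
        text = " ".join(item[1][0] for item in detections[0])   # concatenate recognized strings
        m = re.search(r"Average\s+RNFL\s+Thickness\D*(\d+)", text, re.IGNORECASE)
        if m:
            results["average_rnfl_thickness_um"] = int(m.group(1))
        m = re.search(r"Signal\s+Strength\D*(\d+)", text, re.IGNORECASE)
        if m:
            results["signal_strength"] = int(m.group(1))
    return results
```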

Dense Error Map Estimation for MRI-Ultrasound Registration in Brain Tumor Surgery Using Swin UNETR

  • paper_url: http://arxiv.org/abs/2308.10784
  • repo_url: None
  • paper_authors: Soorena Salari, Amirhossein Rasoulian, Hassan Rivaz, Yiming Xiao
  • for: Early surgical treatment of brain tumors is crucial for reducing patient mortality, but brain tissue deformation during surgery (brain shift) renders pre-operative images invalid, so a cost-effective and portable tool such as intra-operative ultrasound (iUS) is needed to track it.
  • methods: The authors propose a deep-learning (DL) framework built on Swin UNETR that automatically assesses the quality of MRI-iUS registration by estimating 3D patch-wise dense error maps (see the sketch below).
  • results: The framework is evaluated with real clinical data for iUS-guided brain tumor resection, demonstrating its performance for the first time.
    Abstract Early surgical treatment of brain tumors is crucial in reducing patient mortality rates. However, brain tissue deformation (called brain shift) occurs during the surgery, rendering pre-operative images invalid. As a cost-effective and portable tool, intra-operative ultrasound (iUS) can track brain shift, and accurate MRI-iUS registration techniques can update pre-surgical plans and facilitate the interpretation of iUS. This can boost surgical safety and outcomes by maximizing tumor removal while avoiding eloquent regions. However, manual assessment of MRI-iUS registration results in real-time is difficult and prone to errors due to the 3D nature of the data. Automatic algorithms that can quantify the quality of inter-modal medical image registration outcomes can be highly beneficial. Therefore, we propose a novel deep-learning (DL) based framework with the Swin UNETR to automatically assess 3D-patch-wise dense error maps for MRI-iUS registration in iUS-guided brain tumor resection and show its performance with real clinical data for the first time.
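The regression setup can be summarized with a small sketch; a tiny 3D CNN stands in for the Swin UNETR used in the paper, and the patch size, two-channel MRI+iUS input layout, and MSE objective are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Placeholder network: the paper uses a Swin UNETR, but any 3D network with a single-channel
# dense output fits the same error-map regression formulation sketched here.
error_net = nn.Sequential(
    nn.Conv3d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 1, kernel_size=1),             # one registration-error value per voxel
)

mri_ius_patch = torch.randn(1, 2, 64, 64, 64)    # co-registered MRI and iUS patches, stacked
target_error = torch.rand(1, 1, 64, 64, 64)      # dense error map used as supervision (e.g., in mm)

loss = nn.functional.mse_loss(error_net(mri_ius_patch), target_error)
loss.backward()
```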

CoNe: Contrast Your Neighbours for Supervised Image Classification

  • paper_url: http://arxiv.org/abs/2308.10761
  • repo_url: https://github.com/mingkai-zheng/cone
  • paper_authors: Mingkai Zheng, Shan You, Lang Huang, Xiu Su, Fei Wang, Chen Qian, Xiaogang Wang, Chang Xu
  • for: Improve supervised image classification by accounting for intra-class variance instead of pulling all intra-class samples tightly toward their class centers.
  • methods: Contrast Your Neighbours (CoNe) supervises each sample not only by its class center but also by the features of its similar neighbors, which serve as anchors that generate more adaptive and refined targets; a "distributional consistency" regularization additionally encourages similar instances to have similar probability distributions (see the sketch below).
  • results: CoNe achieves state-of-the-art performance across different benchmark datasets, network architectures, and settings, including 80.8% Top-1 accuracy on ImageNet with ResNet-50, surpassing the recent Timm training recipe (80.4%).
    Abstract Image classification is a longstanding problem in computer vision and machine learning research. Most recent works (e.g. SupCon , Triplet, and max-margin) mainly focus on grouping the intra-class samples aggressively and compactly, with the assumption that all intra-class samples should be pulled tightly towards their class centers. However, such an objective will be very hard to achieve since it ignores the intra-class variance in the dataset. (i.e. different instances from the same class can have significant differences). Thus, such a monotonous objective is not sufficient. To provide a more informative objective, we introduce Contrast Your Neighbours (CoNe) - a simple yet practical learning framework for supervised image classification. Specifically, in CoNe, each sample is not only supervised by its class center but also directly employs the features of its similar neighbors as anchors to generate more adaptive and refined targets. Moreover, to further boost the performance, we propose ``distributional consistency" as a more informative regularization to enable similar instances to have a similar probability distribution. Extensive experimental results demonstrate that CoNe achieves state-of-the-art performance across different benchmark datasets, network architectures, and settings. Notably, even without a complicated training recipe, our CoNe achieves 80.8\% Top-1 accuracy on ImageNet with ResNet-50, which surpasses the recent Timm training recipe (80.4\%). Code and pre-trained models are available at \href{https://github.com/mingkai-zheng/CoNe}{https://github.com/mingkai-zheng/CoNe}.
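A hedged sketch of the "distributional consistency" idea follows; the in-batch nearest-neighbour search, temperature, and stop-gradient on the neighbour distribution are assumptions for illustration, not the exact CoNe loss.

```python
import torch
import torch.nn.functional as F

def distributional_consistency(logits, features, tau=0.1):
    """For each sample, find its most similar neighbour in the batch (by feature cosine
    similarity) and penalize the KL divergence between their predicted class distributions."""
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                               # pairwise cosine similarity
    sim.fill_diagonal_(-float("inf"))                     # exclude self-matches
    nn_idx = sim.argmax(dim=1)                            # nearest neighbour per sample
    log_p = F.log_softmax(logits / tau, dim=1)            # sample's distribution (log)
    q = F.softmax(logits[nn_idx] / tau, dim=1).detach()   # neighbour's distribution as target
    return F.kl_div(log_p, q, reduction="batchmean")
```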

Boosting Adversarial Attack with Similar Target

  • paper_url: http://arxiv.org/abs/2308.10743
  • repo_url: https://github.com/huanranchen/Similar-Target-Attacker
  • paper_authors: Shuo Zhang, Ziruo Wang, Zikai Zhou, Huanran Chen
  • for: Study the strong transferability of adversarial examples, which threatens the deployment of deep neural networks and raises security concerns.
  • methods: A targeted ensemble attack named Similar Target (ST) promotes cosine similarity between the gradients of the surrogate models, regularizing the optimization direction so that all surrogates are attacked simultaneously and generalization is improved (see the sketch below).
  • results: On ImageNet, the method improves adversarial transferability and outperforms state-of-the-art attackers on 18 discriminative classifiers and adversarially trained models.
    Abstract Deep neural networks are vulnerable to adversarial examples, posing a threat to the models' applications and raising security concerns. An intriguing property of adversarial examples is their strong transferability. Several methods have been proposed to enhance transferability, including ensemble attacks which have demonstrated their efficacy. However, prior approaches simply average logits, probabilities, or losses for model ensembling, lacking a comprehensive analysis of how and why model ensembling significantly improves transferability. In this paper, we propose a similar targeted attack method named Similar Target~(ST). By promoting cosine similarity between the gradients of each model, our method regularizes the optimization direction to simultaneously attack all surrogate models. This strategy has been proven to enhance generalization ability. Experimental results on ImageNet validate the effectiveness of our approach in improving adversarial transferability. Our method outperforms state-of-the-art attackers on 18 discriminative classifiers and adversarially trained models.
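The gradient-alignment idea can be sketched as follows; the loss weighting, step size, and the exact way the similarity term enters the update are assumptions, so treat this as an illustration of the principle rather than the paper's attack.

```python
import torch
import torch.nn.functional as F

def similar_target_step(x_adv, y_target, models, alpha=2 / 255, lam=0.1):
    """One targeted attack step over an ensemble of surrogate models (at least two): minimize
    the averaged targeted loss while rewarding pairwise cosine similarity between the per-model
    input gradients, then take a signed gradient-descent step on the combined objective."""
    x = x_adv.clone().detach().requires_grad_(True)
    losses = [F.cross_entropy(m(x), y_target) for m in models]
    grads = [torch.autograd.grad(l, x, create_graph=True)[0].flatten() for l in losses]
    cos = torch.stack([F.cosine_similarity(grads[i], grads[j], dim=0)
                       for i in range(len(grads)) for j in range(i + 1, len(grads))]).mean()
    objective = torch.stack(losses).mean() - lam * cos    # similarity acts as a reward
    step, = torch.autograd.grad(objective, x)             # differentiates through the cosine term
    return (x - alpha * step.sign()).detach().clamp(0, 1)
```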

Patch Is Not All You Need

  • paper_url: http://arxiv.org/abs/2308.10729
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Changzhen Li, Jie Zhang, Yang Wei, Zhilong Ji, Jinfeng Bai, Shiguang Shan
  • for: Improve vision Transformers by avoiding the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity.
  • methods: The proposed Pattern Transformer (Patternformer) adaptively converts images into pattern sequences for Transformer input: a convolutional neural network (CNN) extracts various patterns from the input image, and each channel, representing a unique pattern, is fed into the succeeding Transformer as a visual token (see the sketch below).
  • results: Using only a vanilla ResNet and Transformer, the method achieves state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.
    Abstract Vision Transformers have achieved great success in computer visions, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity. To handle this, we propose a novel Pattern Transformer (Patternformer) to adaptively convert images to pattern sequences for Transformer input. Specifically, we employ the Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Only employing the vanilla ResNet and Transformer, we have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
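A toy sketch of the "one token per CNN channel" idea; the layer sizes, pooling, and classification head are assumptions, and the actual Patternformer is considerably more elaborate.

```python
import torch
import torch.nn as nn

class ChannelsAsTokens(nn.Module):
    """Each of the C feature maps produced by a small CNN is pooled, flattened, and projected
    into one visual token, so the Transformer sees C pattern tokens instead of image patches."""
    def __init__(self, in_ch=3, num_patterns=64, token_dim=256, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, num_patterns, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(num_patterns, num_patterns, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(4)                  # each channel -> a 4x4 map
        self.proj = nn.Linear(16, token_dim)                 # flatten each channel into a token
        layer = nn.TransformerEncoderLayer(token_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(token_dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        feats = self.pool(self.cnn(x))                       # (B, C, 4, 4)
        tokens = self.proj(feats.flatten(2))                 # (B, C, token_dim): one token per channel
        return self.head(self.encoder(tokens).mean(dim=1))   # classify from the pooled tokens
```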

Test-time augmentation-based active learning and self-training for label-efficient segmentation

  • paper_url: http://arxiv.org/abs/2308.10727
  • repo_url: None
  • paper_authors: Bella Specktor-Fadida, Anna Levchakov, Dana Schonberger, Liat Ben-Sira, Dafna Ben-Bashat, Leo Joskowicz
  • for: Propose a new combination of self-training (ST) and active learning (AL) to reduce the annotation burden of deep learning segmentation models and clarify when each strategy is useful.
  • methods: Test-time augmentation (TTA) is first applied to an initial teacher network; cases with the lowest estimated Dice scores are selected for annotation, while cases with high estimated scores are used as soft pseudo-labels for ST; the model is then retrained on the existing annotations, the newly annotated cases, and the ST cases with border-slice annotations (see the sketch below).
  • results: ST is highly effective for both tasks, boosting performance on in-distribution (ID) and out-of-distribution (OOD) data; combined with AL it improves single-sequence fetal body segmentation but slightly deteriorates multi-sequence placenta segmentation on ID data, and AL helps for the high-variability placenta data but does not beat random selection for the single-sequence body data. For fetal body sequence transfer, combining AL with ST after an ST iteration reaches a Dice of 0.961 with only 6 original scans and 2 new-sequence scans, and 15 high-variability placenta cases perform similarly to 50.
    Abstract Deep learning techniques depend on large datasets whose annotation is time-consuming. To reduce annotation burden, the self-training (ST) and active-learning (AL) methods have been developed as well as methods that combine them in an iterative fashion. However, it remains unclear when each method is the most useful, and when it is advantageous to combine them. In this paper, we propose a new method that combines ST with AL using Test-Time Augmentations (TTA). First, TTA is performed on an initial teacher network. Then, cases for annotation are selected based on the lowest estimated Dice score. Cases with high estimated scores are used as soft pseudo-labels for ST. The selected annotated cases are trained with existing annotated cases and ST cases with border slices annotations. We demonstrate the method on MRI fetal body and placenta segmentation tasks with different data variability characteristics. Our results indicate that ST is highly effective for both tasks, boosting performance for in-distribution (ID) and out-of-distribution (OOD) data. However, while self-training improved the performance of single-sequence fetal body segmentation when combined with AL, it slightly deteriorated performance of multi-sequence placenta segmentation on ID data. AL was helpful for the high variability placenta data, but did not improve upon random selection for the single-sequence body data. For fetal body segmentation sequence transfer, combining AL with ST following ST iteration yielded a Dice of 0.961 with only 6 original scans and 2 new sequence scans. Results using only 15 high-variability placenta cases were similar to those using 50 cases. Code is available at: https://github.com/Bella31/TTA-quality-estimation-ST-AL
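The case-selection logic can be illustrated with a small sketch; `predict`, the TTA transform pairs, the agreement-as-Dice quality proxy, and the thresholds are assumptions standing in for the paper's actual quality-estimation procedure.

```python
import numpy as np

def dice(a, b, eps=1e-6):
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def select_cases(predict, tta_variants, unlabeled_images, n_annotate=5, pseudo_thresh=0.95):
    """unlabeled_images: dict mapping case id -> image volume. Estimate per-case quality as the
    agreement (Dice) between the base prediction and predictions under test-time augmentations;
    send the lowest-scoring cases to annotation (AL) and keep high-scoring ones as ST pseudo-labels."""
    scores = {}
    for case_id, image in unlabeled_images.items():
        base = predict(image)
        agreements = [dice(base, inv_aug(predict(aug(image)))) for aug, inv_aug in tta_variants]
        scores[case_id] = float(np.mean(agreements))
    ranked = sorted(scores, key=scores.get)            # ascending by estimated quality
    return ranked[:n_annotate], [c for c in ranked if scores[c] >= pseudo_thresh]
```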

Backdooring Textual Inversion for Concept Censorship

  • paper_url: http://arxiv.org/abs/2308.10718
  • repo_url: None
  • paper_authors: Yutong Wu, Jie Zhang, Florian Kerschbaum, Tianwei Zhang
  • for: Regulate personalization techniques for AI-generated content (concept censorship) to prevent malicious uses such as spreading fake news or defaming individual reputations.
  • methods: Backdoors are injected into Textual Inversion (TI) embeddings: sensitive words are selected as triggers during TI training, so that whenever a trigger is combined with the personalized embedding in a prompt, the model outputs a pre-defined target image instead of an image containing the malicious concept.
  • results: Extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model, demonstrate that the approach effectively censors malicious concepts in text-to-image generation.
    Abstract Recent years have witnessed success in AIGC (AI Generated Content). People can make use of a pre-trained diffusion model to generate images of high quality or freely modify existing pictures with only prompts in nature language. More excitingly, the emerging personalization techniques make it feasible to create specific-desired images with only a few images as references. However, this induces severe threats if such advanced techniques are misused by malicious users, such as spreading fake news or defaming individual reputations. Thus, it is necessary to regulate personalization models (i.e., concept censorship) for their development and advancement. In this paper, we focus on the personalization technique dubbed Textual Inversion (TI), which is becoming prevailing for its lightweight nature and excellent performance. TI crafts the word embedding that contains detailed information about a specific object. Users can easily download the word embedding from public websites like Civitai and add it to their own stable diffusion model without fine-tuning for personalization. To achieve the concept censorship of a TI model, we propose leveraging the backdoor technique for good by injecting backdoors into the Textual Inversion embeddings. Briefly, we select some sensitive words as triggers during the training of TI, which will be censored for normal use. In the subsequent generation stage, if the triggers are combined with personalized embeddings as final prompts, the model will output a pre-defined target image rather than images including the desired malicious concept. To demonstrate the effectiveness of our approach, we conduct extensive experiments on Stable Diffusion, a prevailing open-sourced text-to-image model. Our code, data, and results are available at https://concept-censorship.github.io.

Rethinking Person Re-identification from a Projection-on-Prototypes Perspective

  • paper_url: http://arxiv.org/abs/2308.10717
  • repo_url: None
  • paper_authors: Qizao Wang, Xuelin Qian, Bin Li, Yanwei Fu, Xiangyang Xue
  • for: Rethink the role of the classifier in person re-identification (Re-ID), a retrieval task that has developed tremendously over the past decade.
  • methods: Existing state-of-the-art methods extract features from input images and categorize them with a classifier, but discard the classifier at inference because the training and test identities do not overlap, relying only on distance metrics over the extracted features. This paper instead views the classifier as a projection from image features onto class prototypes (the classifier's learned parameters), so the identity of an input image can be described by its similarities to all prototypes and used as a more discriminative feature for retrieval (see the sketch below).
  • results: The proposed baseline ProNet keeps the classifier's function at inference and applies triplet and identity classification losses to the projected features; an improved ProNet++ further incorporates multi-granularity designs. Experiments on four benchmarks show that ProNet is simple yet effective and significantly beats previous baselines, while ProNet++ achieves competitive or even better results than transformer-based competitors.
    Abstract Person Re-IDentification (Re-ID) as a retrieval task, has achieved tremendous development over the past decade. Existing state-of-the-art methods follow an analogous framework to first extract features from the input images and then categorize them with a classifier. However, since there is no identity overlap between training and testing sets, the classifier is often discarded during inference. Only the extracted features are used for person retrieval via distance metrics. In this paper, we rethink the role of the classifier in person Re-ID, and advocate a new perspective to conceive the classifier as a projection from image features to class prototypes. These prototypes are exactly the learned parameters of the classifier. In this light, we describe the identity of input images as similarities to all prototypes, which are then utilized as more discriminative features to perform person Re-ID. We thereby propose a new baseline ProNet, which innovatively reserves the function of the classifier at the inference stage. To facilitate the learning of class prototypes, both triplet loss and identity classification loss are applied to features that undergo the projection by the classifier. An improved version of ProNet++ is presented by further incorporating multi-granularity designs. Experiments on four benchmarks demonstrate that our proposed ProNet is simple yet effective, and significantly beats previous baselines. ProNet++ also achieves competitive or even better results than transformer-based competitors.
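A minimal sketch of the projection-on-prototypes view at inference; treating the rows of the classifier weight matrix as prototypes and cosine similarity as the projection is an illustrative simplification, not the full ProNet model.

```python
import torch
import torch.nn.functional as F

def prototype_similarity_features(backbone_features, classifier_weight):
    """Describe each image by its cosine similarities to all class prototypes, i.e. the rows
    of the (learned) classifier weight matrix, and use that vector as the retrieval feature."""
    feats = F.normalize(backbone_features, dim=1)     # (B, D) image features
    protos = F.normalize(classifier_weight, dim=1)    # (C, D), one prototype per training identity
    return feats @ protos.t()                         # (B, C) similarity descriptor

def reid_distances(query_feats, gallery_feats, classifier_weight):
    """Rank gallery images by Euclidean distance in the prototype-similarity space."""
    q = prototype_similarity_features(query_feats, classifier_weight)
    g = prototype_similarity_features(gallery_feats, classifier_weight)
    return torch.cdist(q, g)                          # (num_query, num_gallery)
```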

Color Prompting for Data-Free Continual Unsupervised Domain Adaptive Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10716
  • repo_url: https://github.com/vimar-gu/colorpromptreid
  • paper_authors: Jianyang Gu, Hao Luo, Kai Wang, Wei Jiang, Yang You, Jian Zhao
  • for: Propose a data-free continual unsupervised domain adaptive person re-identification (Re-ID) method that eases the annotation burden while avoiding the privacy risk of image rehearsal.
  • methods: The Color Prompting (CoP) method uses a light-weighted prompter network, trained jointly with Re-ID, to fit the color distribution of the current task; for incoming new tasks, the learned color distribution serves as color style transfer guidance to render images in past styles (see the sketch below).
  • results: CoP achieves better anti-forgetting than image rehearsal methods and adapts quickly to new domains given only a small amount of unlabeled images; after the continual training pipeline it improves average rank-1 by 6.7% and 8.1% over the replay method on seen and unseen domains, respectively.
    Abstract Unsupervised domain adaptive person re-identification (Re-ID) methods alleviate the burden of data annotation through generating pseudo supervision messages. However, real-world Re-ID systems, with continuously accumulating data streams, simultaneously demand more robust adaptation and anti-forgetting capabilities. Methods based on image rehearsal addresses the forgetting issue with limited extra storage but carry the risk of privacy leakage. In this work, we propose a Color Prompting (CoP) method for data-free continual unsupervised domain adaptive person Re-ID. Specifically, we employ a light-weighted prompter network to fit the color distribution of the current task together with Re-ID training. Then for the incoming new tasks, the learned color distribution serves as color style transfer guidance to transfer the images into past styles. CoP achieves accurate color style recovery for past tasks with adequate data diversity, leading to superior anti-forgetting effects compared with image rehearsal methods. Moreover, CoP demonstrates strong generalization performance for fast adaptation into new domains, given only a small amount of unlabeled images. Extensive experiments demonstrate that after the continual training pipeline the proposed CoP achieves 6.7% and 8.1% average rank-1 improvements over the replay method on seen and unseen domains, respectively. The source code for this work is publicly available in https://github.com/vimar-gu/ColorPromptReID.
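As a rough illustration of color-statistics-guided restyling, the sketch below replaces the learned CoP prompter network with simple channel-wise moment matching; this substitution is an assumption made purely for illustration.

```python
import torch

def record_color_stats(images):
    """Per-task color statistics: channel-wise mean and std over a batch of images (B, 3, H, W)."""
    return images.mean(dim=(0, 2, 3)), images.std(dim=(0, 2, 3))

def transfer_color_style(images, past_mean, past_std, eps=1e-6):
    """Restyle current-task images with the stored color statistics of a past task by
    normalizing each image's channels and re-coloring them with the past mean/std."""
    cur_mean = images.mean(dim=(2, 3), keepdim=True)
    cur_std = images.std(dim=(2, 3), keepdim=True)
    normalized = (images - cur_mean) / (cur_std + eps)
    return normalized * past_std.view(1, 3, 1, 1) + past_mean.view(1, 3, 1, 1)
```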

Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction

  • paper_url: http://arxiv.org/abs/2308.10694
  • repo_url: https://github.com/cvg/vp-estimation-with-prior-gravity
  • paper_authors: Rémi Pautrat, Shaohui Liu, Petr Hruby, Marc Pollefeys, Daniel Barath
  • for: Estimate a Manhattan frame (three orthogonal vanishing points) and the unknown focal length of the camera, leveraging a prior vertical direction.
  • methods: Using the gravity direction from an Inertial Measurement Unit (IMU), a standard component of recent consumer devices such as smartphones, the paper derives two new 2-line solvers, one of which does not suffer from the singularities affecting existing solvers, and designs a new non-minimal method running on an arbitrary number of lines to boost local optimization; all solvers are combined in a hybrid robust estimator.
  • results: The method achieves higher accuracy than related approaches on synthetic and real-world data, even with a rough prior, while having comparable runtimes; the solvers are also applicable to relative rotation estimation. Code is available at https://github.com/cvg/VP-Estimation-with-Prior-Gravity.
    Abstract We tackle the problem of estimating a Manhattan frame, i.e. three orthogonal vanishing points, and the unknown focal length of the camera, leveraging a prior vertical direction. The direction can come from an Inertial Measurement Unit that is a standard component of recent consumer devices, e.g., smartphones. We provide an exhaustive analysis of minimal line configurations and derive two new 2-line solvers, one of which does not suffer from singularities affecting existing solvers. Additionally, we design a new non-minimal method, running on an arbitrary number of lines, to boost the performance in local optimization. Combining all solvers in a hybrid robust estimator, our method achieves increased accuracy even with a rough prior. Experiments on synthetic and real-world datasets demonstrate the superior accuracy of our method compared to the state of the art, while having comparable runtimes. We further demonstrate the applicability of our solvers for relative rotation estimation. The code is available at https://github.com/cvg/VP-Estimation-with-Prior-Gravity.

Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10692
  • repo_url: None
  • paper_authors: Qizao Wang, Xuelin Qian, Bin Li, Ying Fu, Yanwei Fu, Xiangyang Xue
  • for: This paper aims to tackle the challenges of cloth-changing person Re-ID, which suffers from two limitations: inferior identity-relevant features and limited training samples.
  • methods: The proposed FIRe$^{2}$ framework consists of a Fine-grained Feature Mining (FFM) module and a Fine-grained Attribute Recomposition (FAR) module, which learn identity-relevant features and recompose image features with different attributes in the latent space.
  • results: The proposed method achieves state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.
    Abstract Cloth-changing person Re-IDentification (Re-ID) is a particularly challenging task, suffering from two limitations of inferior identity-relevant features and limited training samples. Existing methods mainly leverage auxiliary information to facilitate discriminative feature learning, including soft-biometrics features of shapes and gaits, and additional labels of clothing. However, these information may be unavailable in real-world applications. In this paper, we propose a novel FIne-grained Representation and Recomposition (FIRe$^{2}$) framework to tackle both limitations without any auxiliary information. Specifically, we first design a Fine-grained Feature Mining (FFM) module to separately cluster images of each person. Images with similar so-called fine-grained attributes (e.g., clothes and viewpoints) are encouraged to cluster together. An attribute-aware classification loss is introduced to perform fine-grained learning based on cluster labels, which are not shared among different people, promoting the model to learn identity-relevant features. Furthermore, by taking full advantage of the clustered fine-grained attributes, we present a Fine-grained Attribute Recomposition (FAR) module to recompose image features with different attributes in the latent space. It can significantly enhance representations for robust feature learning. Extensive experiments demonstrate that FIRe$^{2}$ can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks.

Co-Speech Gesture Detection through Multi-phase Sequence Labeling

  • paper_url: http://arxiv.org/abs/2308.10680
  • repo_url: None
  • paper_authors: Esam Ghaleb, Ilya Burenko, Marlou Rasenberg, Wim Pouw, Peter Uhrig, Judith Holler, Ivan Toni, Aslı Özyürek, Raquel Fernández
  • for: Propose a new framing of automatic co-speech gesture detection that captures the inherently sequential and contextual nature of gestures, rather than treating detection as binary classification.
  • methods: The task is reframed as multi-phase sequence labeling over skeletal movement sequences: Transformer encoders learn contextual embeddings over time windows and Conditional Random Fields perform the sequence labeling of the gesture phases (preparation, stroke, retraction); see the sketch below.
  • results: On a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues, the method significantly outperforms strong baseline models in detecting gesture strokes, and the contextual embeddings learned by the Transformer encoders substantially improve gesture unit detection.
    Abstract Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
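A hedged sketch of a Transformer-plus-CRF labeler follows; it relies on the third-party `pytorch-crf` package, and the pose dimensionality, model sizes, and 4-tag phase scheme are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed dependency, not the authors' code)

class GesturePhaseLabeler(nn.Module):
    """Label every frame of a skeletal sequence with a gesture-phase tag (e.g. neutral,
    preparation, stroke, retraction): a Transformer encoder produces contextual per-frame
    embeddings and a CRF layer performs the sequence labeling."""
    def __init__(self, joint_dim=2 * 17, d_model=128, num_tags=4):   # e.g. 17 keypoints x (x, y)
        super().__init__()
        self.embed = nn.Linear(joint_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.emission = nn.Linear(d_model, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, poses, tags=None):               # poses: (B, T, joint_dim), tags: (B, T) long
        emissions = self.emission(self.encoder(self.embed(poses)))
        if tags is not None:
            return -self.crf(emissions, tags)          # training: negative log-likelihood
        return self.crf.decode(emissions)              # inference: best tag sequence per clip
```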

Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term Person Re-Identification

  • paper_url: http://arxiv.org/abs/2308.10658
  • repo_url: None
  • paper_authors: Feng Liu, Minchul Kim, ZiAng Gu, Anil Jian, Xiaoming Liu
  • for: Extend long-term person re-identification (LT-ReID) beyond pedestrian recognition to a wider range of real-world human activities, while still accounting for clothing changes over large time gaps.
  • methods: The proposed 3DInvarReID disentangles identity from the non-identity components (pose, clothing shape, and texture) of 3D clothed humans, and jointly reconstructs accurate 3D clothed body shapes while learning discriminative features of naked body shapes for person Re-ID.
  • results: Experiments on a newly collected real-world dataset, CCDA, which contains a wide variety of human activities and clothing changes, show the superior person Re-ID performance of the approach.
    Abstract Long-Term Person Re-Identification (LT-ReID) has become increasingly crucial in computer vision and biometrics. In this work, we aim to extend LT-ReID beyond pedestrian recognition to include a wider range of real-world human activities while still accounting for cloth-changing scenarios over large time gaps. This setting poses additional challenges due to the geometric misalignment and appearance ambiguity caused by the diversity of human pose and clothing. To address these challenges, we propose a new approach 3DInvarReID for (i) disentangling identity from non-identity components (pose, clothing shape, and texture) of 3D clothed humans, and (ii) reconstructing accurate 3D clothed body shapes and learning discriminative features of naked body shapes for person ReID in a joint manner. To better evaluate our study of LT-ReID, we collect a real-world dataset called CCDA, which contains a wide variety of human activities and clothing changes. Experimentally, we show the superior performance of our approach for person ReID.

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

  • paper_url: http://arxiv.org/abs/2308.10647
  • repo_url: https://github.com/BengaliAI/bbocr
  • paper_authors: Imam Mohammad Zulkarnain, Shayekh Bin Islam, Md. Zami Al Zunaed Farabe, Md. Mehedi Hasan Shawon, Jawaril Munshad Abedin, Beig Rajibul Hasan, Marsia Haque, Istiak Shihab, Syed Mobassir, MD. Nazmuddoha Ansary, Asif Sushmit, Farig Sadeque
  • for: Provide a scalable open-source document OCR system for document digitization in low-resource languages, with a focus on Bengali.
  • methods: The bbOCR pipeline reconstructs Bengali documents into a structured, searchable digitized format, leveraging a novel Bengali text recognition model and two novel synthetic datasets.
  • results: Extensive component-level and system-level evaluations on a new diversified evaluation dataset with comprehensive metrics suggest that the proposed solution outperforms current state-of-the-art Bengali OCR systems.
    Abstract Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction; which are available as individual modules in high-resource languages. In this paper, we introduce Bengali$.$AI-BRACU-OCR (bbOCR): an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format that leverages a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluation: both use a novel diversified evaluation dataset and comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed solution is preferable over the current state-of-the-art Bengali OCR systems. The source codes and datasets are available here: https://bengaliai.github.io/bbocr.

Automated Identification of Failure Cases in Organ at Risk Segmentation Using Distance Metrics: A Study on CT Data

  • paper_url: http://arxiv.org/abs/2308.10636
  • repo_url: None
  • paper_authors: Amin Honarmandi Shandiz, Attila Rádics, Rajesh Tamada, Makk Árpád, Karolina Glowacka, Lehel Ferenczi, Sandeep Dutta, Michael Fanariotis
  • for: The paper is aimed at improving the accuracy of automated organ segmentation in radiation therapy planning by detecting and correcting failure cases during the training process.
  • methods: The proposed method uses a combination of Dice and Hausdorff distances to automatically identify failure cases, and sets thresholds for these distances to differentiate between various states of failure cases (see the sketch below).
  • results: The method was evaluated on 20 cases of six different organs in CT images from clinical expert curated datasets, and was able to automatically identify 12 cases with high accuracy.
    Abstract Automated organ at risk (OAR) segmentation is crucial for radiation therapy planning in CT scans, but the generated contours by automated models can be inaccurate, potentially leading to treatment planning issues. The reasons for these inaccuracies could be varied, such as unclear organ boundaries or inaccurate ground truth due to annotation errors. To improve the model's performance, it is necessary to identify these failure cases during the training process and to correct them with some potential post-processing techniques. However, this process can be time-consuming, as traditionally it requires manual inspection of the predicted output. This paper proposes a method to automatically identify failure cases by setting a threshold for the combination of Dice and Hausdorff distances. This approach reduces the time-consuming task of visually inspecting predicted outputs, allowing for faster identification of failure case candidates. The method was evaluated on 20 cases of six different organs in CT images from clinical expert curated datasets. By setting the thresholds for the Dice and Hausdorff distances, the study was able to differentiate between various states of failure cases and evaluate over 12 cases visually. This thresholding approach could be extended to other organs, leading to faster identification of failure cases and thereby improving the quality of radiation therapy planning.
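A compact sketch of the Dice-plus-Hausdorff flagging rule; the SciPy-based Hausdorff computation on voxel coordinates and the specific threshold values are illustrative assumptions, not those used in the study.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, gt, eps=1e-6):
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def hausdorff_distance(pred, gt):
    p, g = np.argwhere(pred), np.argwhere(gt)      # voxel coordinates (multiply by spacing for mm)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

def flag_failure(pred, gt, dice_thresh=0.8, hd_thresh=15.0):
    """Flag a predicted contour as a failure-case candidate when its Dice with the reference
    is too low or its Hausdorff distance is too high."""
    d, hd = dice_score(pred, gt), hausdorff_distance(pred, gt)
    return (d < dice_thresh) or (hd > hd_thresh), {"dice": d, "hausdorff": hd}
```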

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

  • paper_url: http://arxiv.org/abs/2308.10632
  • repo_url: None
  • paper_authors: Peiyan Zhang, Haoyang Liu, Chaozhuo Li, Xing Xie, Sunghun Kim, Haohan Wang
  • for: This paper aims to provide a new method for evaluating the robustness of image classification models by comparing their performance to a surrogate oracle (i.e., a foundation model).
  • methods: The paper introduces a new method that extends the image datasets with new samples that are sufficiently perturbed to be distinct from the original sets, but are still bounded within the same image-label structure. The method uses a foundation model pretrained with a large amount of samples to constrain the perturbations.
  • results: The paper reports that the new method offers a new way to evaluate the models' robustness performance, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. The paper also leverages the generated data to understand the behaviors of the model and the new evaluation strategies.
    Abstract Machine learning has demonstrated remarkable performance over finite datasets, yet whether the scores over the fixed benchmarks can sufficiently indicate the model's performance in the real world is still in discussion. In reality, an ideal robust model will probably behave similarly to the oracle (e.g., the human users), thus a good evaluation protocol is probably to evaluate the models' behaviors in comparison to the oracle. In this paper, we introduce a new robustness measurement that directly measures the image classification model's performance compared with a surrogate oracle (i.e., a foundation model). Besides, we design a simple method that can accomplish the evaluation beyond the scope of the benchmarks. Our method extends the image datasets with new samples that are sufficiently perturbed to be distinct from the ones in the original sets, but are still bounded within the same image-label structure the original test image represents, constrained by a foundation model pretrained with a large amount of samples. As a result, our new method will offer us a new way to evaluate the models' robustness performance, free of limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we also leverage our generated data to understand the behaviors of the model and our new evaluation strategies.

PsyMo: A Dataset for Estimating Self-Reported Psychological Traits from Gait

  • paper_url: http://arxiv.org/abs/2308.10631
  • repo_url: None
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: Introduce PsyMo, a novel multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns.
  • methods: Walking sequences were gathered from 312 subjects in 7 different walking variations and 6 camera angles; in conjunction, participants filled in 6 psychological questionnaires, totalling 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness, and mental health.
  • results: Two evaluation protocols are proposed for estimating self-reported psychological traits from gait, and the dataset can also serve as a drop-in replacement for benchmarking gait recognition methods; all identity-related cues are anonymized and only silhouettes, 2D/3D human skeletons, and 3D SMPL human meshes are released.
    Abstract Psychological trait estimation from external factors such as movement and appearance is a challenging and long-standing problem in psychology, and is principally based on the psychological theory of embodiment. To date, attempts to tackle this problem have utilized private small-scale datasets with intrusive body-attached sensors. Potential applications of an automated system for psychological trait estimation include estimation of occupational fatigue and psychology, and marketing and advertisement. In this work, we propose PsyMo (Psychological traits from Motion), a novel, multi-purpose and multi-modal dataset for exploring psychological cues manifested in walking patterns. We gathered walking sequences from 312 subjects in 7 different walking variations and 6 camera angles. In conjunction with walking sequences, participants filled in 6 psychological questionnaires, totalling 17 psychometric attributes related to personality, self-esteem, fatigue, aggressiveness and mental health. We propose two evaluation protocols for psychological trait estimation. Alongside the estimation of self-reported psychological traits from gait, the dataset can be used as a drop-in replacement to benchmark methods for gait recognition. We anonymize all cues related to the identity of the subjects and publicly release only silhouettes, 2D / 3D human skeletons and 3D SMPL human meshes.

Polarimetric Information for Multi-Modal 6D Pose Estimation of Photometrically Challenging Objects with Limited Data

  • paper_url: http://arxiv.org/abs/2308.10627
  • repo_url: None
  • paper_authors: Patrick Ruhkamp, Daoyi Gao, HyunJun Jung, Nassir Navab, Benjamin Busam
  • for: Improve 6D object pose estimation for photometrically challenging objects (e.g., textureless surfaces, reflections, or transparency), where RGB-only or RGB-D pipelines show limitations.
  • methods: Complementary polarisation information is used as an input modality in a supervised learning-based method, which is then extended to a self-supervised paradigm by leveraging the physical characteristics of polarised light together with shape priors and invertible physical constraints, eliminating the need for annotated real data.
  • results: Leveraging geometric information from polarised light yields significant advancements in pose estimation for photometrically challenging objects, even with limited data.
    Abstract 6D pose estimation pipelines that rely on RGB-only or RGB-D data show limitations for photometrically challenging objects with e.g. textureless surfaces, reflections or transparency. A supervised learning-based method utilising complementary polarisation information as input modality is proposed to overcome such limitations. This supervised approach is then extended to a self-supervised paradigm by leveraging physical characteristics of polarised light, thus eliminating the need for annotated real data. The methods achieve significant advancements in pose estimation by leveraging geometric information from polarised light and incorporating shape priors and invertible physical constraints.

GaitPT: Skeletons Are All You Need For Gait Recognition

  • paper_url: http://arxiv.org/abs/2308.10623
  • repo_url: None
  • paper_authors: Andy Catruna, Adrian Cosma, Emilian Radoi
  • for: Propose a novel skeleton-based gait recognition model for automatic person identification at a distance.
  • methods: The Gait Pyramid Transformer (GaitPT) leverages pose estimation skeletons to capture unique walking patterns without relying on appearance information, adopting a hierarchical transformer architecture that extracts both spatial and temporal movement features in an anatomically consistent manner, guided by the structure of the human skeleton.
  • results: GaitPT achieves state-of-the-art performance among skeleton-based approaches in both controlled and in-the-wild scenarios: 82.6% average accuracy on CASIA-B, surpassing other works by a margin of 6%, and 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based methods.
    Abstract The analysis of patterns of walking is an important area of research that has numerous applications in security, healthcare, sports and human-computer interaction. Lately, walking patterns have been regarded as a unique fingerprinting method for automatic person identification at a distance. In this work, we propose a novel gait recognition architecture called Gait Pyramid Transformer (GaitPT) that leverages pose estimation skeletons to capture unique walking patterns, without relying on appearance information. GaitPT adopts a hierarchical transformer architecture that effectively extracts both spatial and temporal features of movement in an anatomically consistent manner, guided by the structure of the human skeleton. Our results show that GaitPT achieves state-of-the-art performance compared to other skeleton-based gait recognition works, in both controlled and in-the-wild scenarios. GaitPT obtains 82.6% average accuracy on CASIA-B, surpassing other works by a margin of 6%. Moreover, it obtains 52.16% Rank-1 accuracy on GREW, outperforming both skeleton-based and appearance-based approaches.

Multi-Modal Dataset Acquisition for Photometrically Challenging Object

  • paper_url: http://arxiv.org/abs/2308.10621
  • repo_url: None
  • paper_authors: HyunJun Jung, Patrick Ruhkamp, Nassir Navab, Benjamin Busam
  • for: Address the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects.
  • methods: A novel annotation and acquisition pipeline enhances existing 3D perception and 6D object pose datasets by integrating robotic forward-kinematics, external infrared trackers, and improved calibration and annotation procedures.
  • results: A multi-modal sensor rig mounted on a robotic end-effector is presented and integrated into the creation of highly accurate datasets, complemented by a freehand procedure for wider viewpoint coverage; both approaches yield high-quality 3D data with accurate object and camera pose annotations.
    Abstract This paper addresses the limitations of current datasets for 3D vision tasks in terms of accuracy, size, realism, and suitable imaging modalities for photometrically challenging objects. We propose a novel annotation and acquisition pipeline that enhances existing 3D perception and 6D object pose datasets. Our approach integrates robotic forward-kinematics, external infrared trackers, and improved calibration and annotation procedures. We present a multi-modal sensor rig, mounted on a robotic end-effector, and demonstrate how it is integrated into the creation of highly accurate datasets. Additionally, we introduce a freehand procedure for wider viewpoint coverage. Both approaches yield high-quality 3D data with accurate object and camera pose annotations. Our methods overcome the limitations of existing datasets and provide valuable resources for 3D vision research.

Ultrafast and Ultralight Network-Based Intelligent System for Real-time Diagnosis of Ear diseases in Any Devices

  • paper_url: http://arxiv.org/abs/2308.10610
  • repo_url: None
  • paper_authors: Yubiao Yue, Xinyu Zeng, Xiaoqiang Shi, Meiping Zhang, Haihua Liang, Fan Zhang, Yanmei Chen, Zefeng Xie, Wenrui Wu, Zhenzhang Li
  • for: Improve the efficiency and accuracy of ear disease diagnosis and provide a reliable, deployable intelligent diagnosis system.
  • methods: A large-scale dataset covering eight ear disease categories and normal ear canal samples from two hospitals is built, and an ultrafast, ultralight network named Best-EarNet (0.77M parameters, inspired by ShuffleNetV2) is developed, featuring a Local-Global Spatial Feature Fusion Module and multiple auxiliary classification heads.
  • results: Best-EarNet reaches 95.23% accuracy with five-fold cross-validation on 22,581 images from Hospital-1, 92.14% accuracy on 1,652 external images from Hospital-2, and an average of 80 frames per second on CPU; an intelligent diagnosis system called Ear Keeper, deployable on common electronic devices with a compact electronic otoscope, is also developed.
    Abstract Traditional ear disease diagnosis heavily depends on experienced specialists and specialized equipment, frequently resulting in misdiagnoses, treatment delays, and financial burdens for some patients. Utilizing deep learning models for efficient ear disease diagnosis has proven effective and affordable. However, existing research overlooked model inference speed and parameter size required for deployment. To tackle these challenges, we constructed a large-scale dataset comprising eight ear disease categories and normal ear canal samples from two hospitals. Inspired by ShuffleNetV2, we developed Best-EarNet, an ultrafast and ultralight network enabling real-time ear disease diagnosis. Best-EarNet incorporates the novel Local-Global Spatial Feature Fusion Module which can capture global and local spatial information simultaneously and guide the network to focus on crucial regions within feature maps at various levels, mitigating low accuracy issues. Moreover, our network uses multiple auxiliary classification heads for efficient parameter optimization. With 0.77M parameters, Best-EarNet achieves an average frames per second of 80 on CPU. Employing transfer learning and five-fold cross-validation with 22,581 images from Hospital-1, the model achieves an impressive 95.23% accuracy. External testing on 1,652 images from Hospital-2 validates its performance, yielding 92.14% accuracy. Compared to state-of-the-art networks, Best-EarNet establishes a new state-of-the-art (SOTA) in practical applications. Most importantly, we developed an intelligent diagnosis system called Ear Keeper, which can be deployed on common electronic devices. By manipulating a compact electronic otoscope, users can perform comprehensive scanning and diagnosis of the ear canal using real-time video. This study provides a novel paradigm for ear endoscopy and other medical endoscopic image recognition applications.

FocalDreamer: Text-driven 3D Editing via Focal-fusion Assembly

  • paper_url: http://arxiv.org/abs/2308.10608
  • repo_url: None
  • paper_authors: Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, Bingbing Ni
  • for: Fine-grained, text-driven 3D editing confined to desired regions.
  • methods: Merges a base shape with editable parts according to text prompts; geometry union and dual-path rendering assemble independent 3D parts into a complete object, enabling convenient instance reuse and part-wise control.
  • results: Extensive experiments show that FocalDreamer delivers superior fine-grained editing, generating high-fidelity geometry and PBR textures compatible with widely used graphics engines.
    Abstract While text-3D editing has made significant strides in leveraging score distillation sampling, emerging approaches still fall short in delivering separable, precise and consistent outcomes that are vital to content creation. In response, we introduce FocalDreamer, a framework that merges base shape with editable parts according to text prompts for fine-grained editing within desired regions. Specifically, equipped with geometry union and dual-path rendering, FocalDreamer assembles independent 3D parts into a complete object, tailored for convenient instance reuse and part-wise control. We propose geometric focal loss and style consistency regularization, which encourage focal fusion and congruent overall appearance. Furthermore, FocalDreamer generates high-fidelity geometry and PBR textures which are compatible with widely-used graphics engines. Extensive experiments have highlighted the superior editing capabilities of FocalDreamer in both quantitative and qualitative evaluations.
    摘要 而文本3D编辑技术也在利用分数采样方面做出了重要进展,但现有方法仍然缺乏可靠、精细和一致的结果,这些结果是内容创作的关键。为此,我们介绍了FocalDreamer框架,它将基本形状与可编辑部分按照文本提示进行细致的编辑,并在所需区域内进行精细的控制。FocalDreamer具有geometry union和双路渲染功能,可以将独立的3D部件组合成完整的对象,并且可以方便地进行实例重用和部件控制。我们还提出了几何吸引损失和风格一致 regularization,以促进吸引融合和一致的整体外观。此外,FocalDreamer还可以生成高质量的几何学和PBR文件,与广泛使用的图形引擎相容。经过广泛的实验,我们发现FocalDreamer在量化和质量上的评价都有所提高。

A step towards understanding why classification helps regression

  • paper_url: http://arxiv.org/abs/2308.10603
  • repo_url: https://github.com/arkavb/Natural-Language-Processing-of-Company-Review-Data
  • paper_authors: Silvia L. Pintea, Yancong Lin, Jouke Dijkstra, Jan C. van Gemert
  • for: Understanding why adding a classification loss to deep regression methods improves results, and when it helps.
  • methods: Precisely controlled dataset variations and data samplings show that the added classification loss is most effective when the regression data is imbalanced.
  • results: The gain under imbalanced sampling is confirmed experimentally and explained by formalizing the relation between the balanced and imbalanced regression losses.
    Abstract A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss.
    摘要 “一些计算机视觉深度回归方法报告了添加分类损失可以提高结果的情况。在这里,我们查讨了这种方法在实践中的利用和有利之处。我们从精心控制的数据集变化和采样开始,发现在不均衡数据时添加分类损失的效果最为明显。我们形式化了不均衡和均衡回归损失之间的关系,并证明了这些实验结论在实际中的影像深度估计(NYUD2-DIR)、年龄估计(IMDB-WIKI-DIR)和视频进程预测(Breakfast)中都有效。我们的主要结论是:如果数据采样不均衡,那么在回归任务中添加分类损失。”
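The recipe the paper studies is easy to sketch: discretize the continuous target into coarse bins and add a cross-entropy term on those bins to the regression loss. The snippet below is a minimal illustration, not the authors' code; the bin count, equal-width binning, and the 0.1 loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegClsHead(nn.Module):
    """Joint regression + classification head over a shared backbone feature."""
    def __init__(self, feat_dim=128, num_bins=10):
        super().__init__()
        self.reg = nn.Linear(feat_dim, 1)         # continuous prediction
        self.cls = nn.Linear(feat_dim, num_bins)  # coarse bin prediction
        self.num_bins = num_bins

    def forward(self, feats):
        return self.reg(feats).squeeze(-1), self.cls(feats)

def joint_loss(pred_y, logits, target_y, y_min, y_max, num_bins, lam=0.1):
    # Regression term on the continuous target.
    l_reg = nn.functional.mse_loss(pred_y, target_y)
    # Discretize targets into equal-width bins and add a cross-entropy term.
    bins = ((target_y - y_min) / (y_max - y_min) * num_bins).long().clamp(0, num_bins - 1)
    l_cls = nn.functional.cross_entropy(logits, bins)
    return l_reg + lam * l_cls

# toy usage
head = RegClsHead()
feats = torch.randn(32, 128)
target = torch.rand(32) * 80                # e.g. ages in [0, 80)
pred_y, logits = head(feats)
loss = joint_loss(pred_y, logits, target, 0.0, 80.0, head.num_bins)
loss.backward()
```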

Improving the Transferability of Adversarial Examples with Arbitrary Style Transfer

  • paper_url: http://arxiv.org/abs/2308.10601
  • repo_url: https://github.com/zhijin-ge/stm
  • paper_authors: Zhijin Ge, Fanhua Shang, Hongying Liu, Yuanyuan Liu, Liang Wan, Wei Feng, Xiaosen Wang
  • for: This paper aims to improve the transferability of adversarial attacks in the black-box setting by leveraging domain generalization.
  • methods: The proposed Style Transfer Method (STM) utilizes an arbitrary style transfer network to transform images into different domains, while maintaining semantic consistency through fine-tuning and the addition of random noise.
  • results: The proposed method significantly improves adversarial transferability on both normally trained and adversarially trained models, outperforming state-of-the-art input transformation-based attacks.
    Abstract Deep neural networks are vulnerable to adversarial examples crafted by applying human-imperceptible perturbations on clean inputs. Although many attack methods can achieve high success rates in the white-box setting, they also exhibit weak transferability in the black-box setting. Recently, various methods have been proposed to improve adversarial transferability, in which the input transformation is one of the most effective methods. In this work, we notice that existing input transformation-based works mainly adopt the transformed data in the same domain for augmentation. Inspired by domain generalization, we aim to further improve the transferability using the data augmented from different domains. Specifically, a style transfer network can alter the distribution of low-level visual features in an image while preserving semantic content for humans. Hence, we propose a novel attack method named Style Transfer Method (STM) that utilizes a proposed arbitrary style transfer network to transform the images into different domains. To avoid inconsistent semantic information of stylized images for the classification network, we fine-tune the style transfer network and mix up the generated images added by random noise with the original images to maintain semantic consistency and boost input diversity. Extensive experimental results on the ImageNet-compatible dataset show that our proposed method can significantly improve the adversarial transferability on either normally trained models or adversarially trained models than state-of-the-art input transformation-based attacks. Code is available at: https://github.com/Zhijin-Ge/STM.
    摘要 深度神经网络容易受到人类不可见的攻击,而这些攻击通常可以在白盒模式下达到高成功率。然而,这些攻击方法在黑盒模式下的可迁移性很弱。在这个工作中,我们注意到现有的输入转换基于工作主要采用同一个频谱域中的转换数据进行增强。
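The input-transformation step described above can be sketched as follows. This is a hedged illustration, not the released STM code: `style_transfer` stands in for the arbitrary style transfer network, and the number of copies, mixing ratio, and noise scale are assumptions.

```python
import torch

def style_mix_inputs(x, style_transfer, num_copies=4, mix=0.5, noise_std=8 / 255):
    """Build a batch of style-augmented copies of x for gradient averaging.

    x: input images in [0, 1], shape (N, C, H, W).
    style_transfer: any callable mapping images to stylized images of the same
                    shape (a stand-in for the arbitrary style transfer network).
    """
    copies = []
    for _ in range(num_copies):
        stylized = style_transfer(x)
        # Mix stylized and original content to keep semantics, then add random
        # noise to boost input diversity, as described in the abstract.
        mixed = mix * stylized + (1.0 - mix) * x
        mixed = mixed + torch.randn_like(x) * noise_std
        copies.append(mixed.clamp(0.0, 1.0))
    return torch.stack(copies)  # (num_copies, N, C, H, W)

# toy usage: a horizontal flip stands in for the style network just to run the code
x = torch.rand(2, 3, 224, 224)
aug = style_mix_inputs(x, style_transfer=lambda img: img.flip(-1))
print(aug.shape)  # torch.Size([4, 2, 3, 224, 224])
```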

Image-free Classifier Injection for Zero-Shot Classification

  • paper_url: http://arxiv.org/abs/2308.10599
  • repo_url: https://github.com/explainableml/imagefreezsl
  • paper_authors: Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata
  • for: Equipping pre-trained models with zero-shot classification capabilities without requiring any image data.
  • methods: Image-free Classifier Injection with Semantics (ICIS) injects classifiers for new, unseen classes into pre-trained models post hoc, learning classifier weights from simple class descriptors (class names or attributes) with two encoder-decoder networks regularized by (cross-)reconstruction and cosine losses.
  • results: Experiments on standard ZSL benchmarks show strong (generalized) zero-shot classification performance.
    Abstract Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification capabilities without the use of image data. We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data. Instead, the existing classifier weights and simple class-wise descriptors, such as class names or attributes, are used. ICIS has two encoder-decoder networks that learn to reconstruct classifier weights from descriptors (and vice versa), exploiting (cross-)reconstruction and cosine losses to regularise the decoding process. Notably, ICIS can be cheaply trained and applied directly on top of pre-trained classification models. Experiments on benchmark ZSL datasets show that ICIS produces unseen classifier weights that achieve strong (generalised) zero-shot classification performance. Code is available at https://github.com/ExplainableML/ImageFreeZSL .
    摘要 Zero-shot learning模型在图像分类任务上具有非常出色的成绩,但是这些模型必须通过特殊的方法进行训练,因此在需要零shot分类时需要训练数据集。在这篇论文中,我们目标是在不使用图像数据的情况下,为预训练模型添加零shot分类能力。我们提出了一种名为图像自由分类插入法(ICIS),可以在预训练分类模型的后续进行post-hoc插入,并不需要图像数据。我们利用预训练分类器的权重和简单的分类描述符,如分类名称或特征,来进行插入。ICIS包括两个encoder-decoder网络,用于从描述符中学习重构分类器权重,并将描述符与分类器权重进行匹配。我们利用(交叉)重构和偏度损失来规范解码过程。可以说,ICIS可以轻松地在预训练分类模型之上进行训练和应用。我们在ZSL数据集上进行了实验,并证明ICIS可以生成高性能的零shot分类器。代码可以在https://github.com/ExplainableML/ImageFreeZSL上找到。
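A minimal sketch of the weight-injection idea under assumed dimensions, not the released ICIS implementation: an encoder maps class descriptors to classifier weights, a decoder maps them back, training on seen classes uses cosine and reconstruction losses, and unseen-class weights are then predicted and appended to the pre-trained classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorToWeight(nn.Module):
    """Maps class descriptors (e.g. attribute vectors) to classifier weight vectors."""
    def __init__(self, desc_dim=85, feat_dim=512, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(desc_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim))
        self.dec = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, desc_dim))

    def forward(self, desc):
        w = self.enc(desc)       # predicted classifier weight
        desc_rec = self.dec(w)   # reconstructed descriptor (cross-reconstruction path)
        return w, desc_rec

def injection_loss(pred_w, true_w, desc, desc_rec):
    # Cosine term aligns predicted and real weights of seen classes;
    # a reconstruction term regularizes the decoding path.
    l_cos = (1.0 - F.cosine_similarity(pred_w, true_w, dim=-1)).mean()
    l_rec = F.mse_loss(desc_rec, desc)
    return l_cos + l_rec

# toy usage: seen-class weights supervise the mapping, unseen weights are then injected
model = DescriptorToWeight()
seen_desc, seen_w = torch.randn(40, 85), torch.randn(40, 512)
pred_w, desc_rec = model(seen_desc)
loss = injection_loss(pred_w, seen_w, seen_desc, desc_rec)
loss.backward()

unseen_desc = torch.randn(10, 85)
with torch.no_grad():
    unseen_w, _ = model(unseen_desc)  # rows to append to the pre-trained classifier
```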

CHORD: Category-level Hand-held Object Reconstruction via Shape Deformation

  • paper_url: http://arxiv.org/abs/2308.10574
  • repo_url: None
  • paper_authors: Kailin Li, Lixin Yang, Haoyu Zhen, Zenan Lin, Xinyu Zhan, Licheng Zhong, Jian Xu, Kejian Wu, Cewu Lu
  • for: Category-level reconstruction of hand-held objects, helping AI understand hand-object relations in daily tasks and learn manipulation skills.
  • methods: Reconstructs intra-class objects by deforming a categorical shape prior, equipped with three types of awareness: appearance, shape, and interacting pose.
  • results: CHORD outperforms state-of-the-art approaches both quantitatively and qualitatively; code, models, and the new COMIC dataset are available at https://kailinli.github.io/CHORD.
    Abstract In daily life, humans utilize hands to manipulate objects. Modeling the shape of objects that are manipulated by the hand is essential for AI to comprehend daily tasks and to learn manipulation skills. However, previous approaches have encountered difficulties in reconstructing the precise shapes of hand-held objects, primarily owing to a deficiency in prior shape knowledge and inadequate data for training. As illustrated, given a particular type of tool, such as a mug, despite its infinite variations in shape and appearance, humans have a limited number of 'effective' modes and poses for its manipulation. This can be attributed to the fact that humans have mastered the shape prior of the 'mug' category, and can quickly establish the corresponding relations between different mug instances and the prior, such as where the rim and handle are located. In light of this, we propose a new method, CHORD, for Category-level Hand-held Object Reconstruction via shape Deformation. CHORD deforms a categorical shape prior for reconstructing the intra-class objects. To ensure accurate reconstruction, we empower CHORD with three types of awareness: appearance, shape, and interacting pose. In addition, we have constructed a new dataset, COMIC, of category-level hand-object interaction. COMIC contains a rich array of object instances, materials, hand interactions, and viewing directions. Extensive evaluation shows that CHORD outperforms state-of-the-art approaches in both quantitative and qualitative measures. Code, model, and datasets are available at https://kailinli.github.io/CHORD.
    摘要 日常生活中,人类通过手部 manipulate 物体。模拟手部 manipulation 物体的形状是 AI 理解日常任务和学习 manipulate 技能的关键。然而, previous approaches 遇到了重建具体手持物体的精确形状的困难,主要是因为缺乏先前形状知识和训练数据不充分。例如,给定一种工具,如杯子,尽管它有无数个形态和外观,但人类在 manipulate 这些杯子时只有有限多个有效的模式和姿势。这可以归结于人类已经掌握了杯子的形状先验,可以快速地确定杯子的rim和握 Handle的位置。基于这一点,我们提出了一种新的方法,即 CHORD,用于类型级手持物体重建 via 形态扭曲。CHORD 使用类别形状先验来重建内部类 объек。为确保准确重建,我们赋予 CHORD 三种意识:外观、形状和互动姿势。此外,我们还建立了一个新的数据集,即 COMIC,包括类型级手持物体的丰富数组、物体实例、材质、手动互动和视角方向。EXTensive 评估表明 CHORD 在量化和质量上都高于当前状态的方法。代码、模型和数据集可以在 获取。

Self-Feedback DETR for Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2308.10570
  • repo_url: None
  • paper_authors: Jihwan Kim, Miso Lee, Jae-Pil Heo
  • for: Addressing the temporal collapse of self-attention in DETR-based models for temporal action detection (TAD).
  • methods: Self-DETR reactivates the self-attention modules using the decoder's cross-attention maps: multiplying a cross-attention map by its transpose recovers the relations among encoder features and among decoder queries, which then guide the collapsed self-attention maps.
  • results: Extensive experiments show that Self-DETR resolves the temporal collapse problem, keeping attention highly diverse across all layers.
    Abstract Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD but have not performed well yet. In this paper, we point out the problem in the self-attention of DETR for TAD; the attention modules focus on a few key elements, called temporal collapse problem. It degrades the capability of the encoder and decoder since their self-attention modules play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes cross-attention maps of the decoder to reactivate self-attention modules. We recover the relationship between encoder features by simple matrix multiplication of the cross-attention map and its transpose. Likewise, we also get the information within decoder queries. By guiding collapsed self-attention maps with the guidance map calculated, we settle down the temporal collapse of self-attention modules in the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by keeping high diversity of attention over all layers.
    摘要 Temporal Action Detection (TAD) 是一项挑战性强且基础性强的视频应用领域问题。最近,基于 DE TR 的模型已经为 TAD 提出了方案,但它们的性能还不是很好。在这篇论文中,我们指出 DE TR 中的自注意力问题,即自注意力模块在几个关键元素上的集中注意问题。这会使Encoder 和 Decoder 的自注意力模块失去作用。为解决这个问题,我们提出了一种新的框架,即 Self-DETR,它利用 Decoder 的跨注意地图来重新活化 Encoder 和 Decoder 的自注意力模块。我们通过简单的矩阵乘法将跨注意地图与其规定的转置矩阵进行相乘,恢复 Encoder 中的关键特征之间的关系。同时,我们还可以通过指导 collapse 自注意力地图来解决 Encoder 和 Decoder 中的自注意力塌陷问题。我们的广泛的实验表明,Self-DETR 可以解决自注意力塌陷问题,保持高度多样性的注意力于所有层。
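The core operation, recovering self-attention guidance from the decoder's cross-attention map by multiplying it with its transpose, can be sketched as below. The shapes and the row normalization are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def guidance_maps(cross_attn):
    """cross_attn: (num_queries, num_encoder_tokens) decoder cross-attention map
    (rows sum to one). Returns guidance for encoder and decoder self-attention."""
    # Encoder tokens attended by the same queries should attend to each other:
    enc_guidance = cross_attn.T @ cross_attn   # (tokens, tokens)
    # Queries that look at the same encoder tokens are related:
    dec_guidance = cross_attn @ cross_attn.T   # (queries, queries)
    # Normalize rows so each guidance map is again a distribution.
    enc_guidance = enc_guidance / enc_guidance.sum(-1, keepdim=True).clamp_min(1e-6)
    dec_guidance = dec_guidance / dec_guidance.sum(-1, keepdim=True).clamp_min(1e-6)
    return enc_guidance, dec_guidance

attn = torch.softmax(torch.randn(30, 100), dim=-1)  # toy cross-attention map
enc_g, dec_g = guidance_maps(attn)
print(enc_g.shape, dec_g.shape)  # torch.Size([100, 100]) torch.Size([30, 30])
```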

RT-MonoDepth: Real-time Monocular Depth Estimation on Embedded Systems

  • paper_url: http://arxiv.org/abs/2308.10569
  • repo_url: None
  • paper_authors: Cheng Feng, Zhen Chen, Congxuan Zhang, Weiming Hu, Bing Li, Feng Lu
  • for: Real-time monocular depth estimation on embedded systems.
  • methods: Two efficient and lightweight encoder-decoder network architectures, RT-MonoDepth and RT-MonoDepth-S, designed to reduce computational complexity and latency.
  • results: The networks run at 18.4/30.5 FPS on NVIDIA Jetson Nano and 253.0/364.1 FPS on NVIDIA Jetson AGX Orin on a single 640×192 RGB image while achieving accuracy close to the state of the art on KITTI; to the authors' knowledge, this is the best accuracy and fastest inference among fast monocular depth estimation methods.
    Abstract Depth sensing is a crucial function of unmanned aerial vehicles and autonomous vehicles. Due to the small size and simple structure of monocular cameras, there has been a growing interest in depth estimation from a single RGB image. However, state-of-the-art monocular CNN-based depth estimation methods using fairly complex deep neural networks are too slow for real-time inference on embedded platforms. This paper addresses the problem of real-time depth estimation on embedded systems. We propose two efficient and lightweight encoder-decoder network architectures, RT-MonoDepth and RT-MonoDepth-S, to reduce computational complexity and latency. Our methodologies demonstrate that it is possible to achieve similar accuracy as prior state-of-the-art works on depth estimation at a faster inference speed. Our proposed networks, RT-MonoDepth and RT-MonoDepth-S, runs at 18.4\&30.5 FPS on NVIDIA Jetson Nano and 253.0\&364.1 FPS on NVIDIA Jetson AGX Orin on a single RGB image of resolution 640$\times$192, and achieve relative state-of-the-art accuracy on the KITTI dataset. To the best of the authors' knowledge, this paper achieves the best accuracy and fastest inference speed compared with existing fast monocular depth estimation methods.
    摘要 深度感知是无人机和自动车辆的关键功能。由于单目频道摄像头的小尺寸和简单结构,有增加对单个RGB图像的深度估计的兴趣。然而,使用较复杂的深度神经网络的现状-of-the-art单目CNN基于深度估计方法在嵌入式平台上实时推理是太慢。这篇论文解决了实时深度估计在嵌入式系统上的问题。我们提议了两种高效和轻量级编码器-解码器网络架构,RT-MonoDepth和RT-MonoDepth-S,以减少计算复杂度和延迟。我们的方法ologies表明,可以在单个RGB图像的640×192分辨率上达到与先前状态艺术工作相同的准确性,但是在更快的推理速度上。我们提出的网络RT-MonoDepth和RT-MonoDepth-S在NVIDIA Jetson Nano和NVIDIA Jetson AGX Orin上运行于640×192分辨率的单个RGB图像上,具有18.4和30.5帧/秒的推理速度,并在KITTI数据集上达到相对状态艺术的准确性。根据作者们所知,这篇论文在快速单目深度估计方法中实现了最高准确性和最快的推理速度。

Seeing the Intangible: Surveying Automatic High-Level Visual Understanding from Still Images

  • paper_url: http://arxiv.org/abs/2308.10562
  • repo_url: None
  • paper_authors: Delfina Sol Martinez Pandiani, Valentina Presutti
  • for: Surveying the automatic detection of abstract social concepts (emotions, social values, ideologies) from still images, i.e., high-level visual semantic understanding.
  • methods: Studies and clusters the semantic elements of high-level visual understanding from a multidisciplinary perspective (computer science, visual studies, and cognitive science), and likewise clusters the computer vision tasks that deal with these elements.
  • results: Identifies CV work that explicitly or implicitly addresses abstract social concept detection, often under differing terminology, and organizes it into a systematic review.
    Abstract The field of Computer Vision (CV) was born with the single grand goal of complete image understanding: providing a complete semantic interpretation of an input image. What exactly this goal entails is not immediately straightforward, but theoretical hierarchies of visual understanding point towards a top level of full semantics, within which sits the most complex and subjective information humans can detect from visual data. In particular, non-concrete concepts including emotions, social values and ideologies seem to be protagonists of this "high-level" visual semantic understanding. While such "abstract concepts" are critical tools for image management and retrieval, their automatic recognition is still a challenge, exactly because they rest at the top of the "semantic pyramid": the well-known semantic gap problem is worsened given their lack of unique perceptual referents, and their reliance on more unspecific features than concrete concepts. Given that there seems to be very scarce explicit work within CV on the task of abstract social concept (ASC) detection, and that many recent works seem to discuss similar non-concrete entities by using different terminology, in this survey we provide a systematic review of CV work that explicitly or implicitly approaches the problem of abstract (specifically social) concept detection from still images. Specifically, this survey performs and provides: (1) A study and clustering of high level visual understanding semantic elements from a multidisciplinary perspective (computer science, visual studies, and cognitive perspectives); (2) A study and clustering of high level visual understanding computer vision tasks dealing with the identified semantic elements, so as to identify current CV work that implicitly deals with AC detection.
    摘要 Computer Vision (CV) 的 birth 也标志着完整的图像理解的单一大目标:提供一个完整的semantic解释的输入图像。 CV 的这个目标并不是立即明确的,但理论上的视觉层次结构显示了一个高层的概念理解,其中包括了非具体的概念,如情感、社会价值观和意识形态。这些“抽象概念”在图像管理和检索中是关键工具,但自动识别它们仍然是一个挑战,因为它们位于“semantic gap”问题的顶层,而且它们的唯一特征是与具体的feature不同。由于在 CV 中对抽象社会概念(ASC)的探测工作很少,而且许多最近的工作通过不同的术语来描述类似的非具体元素,因此在这篇评论中,我们提供了一个系统性的CV工作评论,以下是我们的研究和归类:1. 从多元视角来研究和归类高级视觉理解的semantic元素,包括计算机科学、视觉学和认知科学的视角。2. 研究和归类高级视觉理解的CV任务,以确定当前CV工作中是否有对ASC探测的隐式工作。

Spatial Transform Decoupling for Oriented Object Detection

  • paper_url: http://arxiv.org/abs/2308.10561
  • repo_url: https://github.com/yuhongtian17/spatial-transform-decoupling
  • paper_authors: Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu
  • for: oriented object detection with ViTs
  • methods: separate network branches for position, size, and angle prediction, and aggregating cascaded activation masks (CAMs)
  • results: state-of-the-art performance on the benchmark datasets DOTA-v1.0 and HRSC2016, demonstrating the effectiveness of the proposed method.
    Abstract Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
    摘要 视transformer(ViTs)在计算机视觉任务中已经取得了非常出色的成绩。然而,它们在旋转敏感场景中的潜力尚未得到了充分利用,这可能是因为数据传递过程中缺乏空间不变性的问题。在这种情况下,我们提出了一种新的方法,即空间转换解coupling(STD),它提供了一种简单而有效的解决方案 для oriented object detection with ViTs。基于堆叠的 ViT 块,STD 使用了独立的网络分支来预测矩形框的位置、大小和角度,从而有效地利用了 ViTs 的空间转换潜力。此外,通过聚合缓存在计算的 CAMs,STD 逐渐增强了在区域 interess(RoIs)中的特征,这与自我注意机制相结合。没有各种饰物,STD 在 DOTA-v1.0 和 HRSC2016 的标准 datasets 上达到了状态机器人的性能(82.24% mAP和98.55% mAP),这表明了提案的方法的效果。源代码可以在 GitHub 上找到:https://github.com/yuhongtian17/Spatial-Transform-Decoupling。
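A hedged sketch of the decoupled prediction idea: separate branches regress position, size, and angle from the same feature. Layer sizes and the angle parameterization are assumptions, and the cascaded activation masks are not reproduced here.

```python
import torch
import torch.nn as nn

class DecoupledOBBHead(nn.Module):
    """Predicts oriented boxes with separate branches for position, size and angle,
    mirroring the decoupled prediction described in the abstract (sizes assumed)."""
    def __init__(self, in_dim=256):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(), nn.Linear(in_dim, out_dim))
        self.pos_branch = branch(2)    # (cx, cy)
        self.size_branch = branch(2)   # (w, h)
        self.angle_branch = branch(1)  # rotation angle

    def forward(self, roi_feats):      # roi_feats: (N, in_dim)
        pos = self.pos_branch(roi_feats)
        size = self.size_branch(roi_feats).exp()              # keep sizes positive
        angle = torch.tanh(self.angle_branch(roi_feats)) * 3.14159 / 2
        return torch.cat([pos, size, angle], dim=-1)          # (N, 5) oriented boxes

head = DecoupledOBBHead()
print(head(torch.randn(8, 256)).shape)  # torch.Size([8, 5])
```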

Local Spherical Harmonics Improve Skeleton-Based Hand Action Recognition

  • paper_url: http://arxiv.org/abs/2308.10557
  • repo_url: https://github.com/kathpra/lshr_lsht
  • paper_authors: Katharina Prasse, Steffen Jung, Yuxuan Zhou, Margret Keuper
  • for: Hand action recognition, where hands remain among the classes that are hardest to recognize correctly.
  • methods: Builds novel hand representations from relative angular embeddings and local Spherical Harmonics; the Spherical Harmonics yield rotation-invariant representations that are more robust to inter-subject differences and viewpoint changes.
  • results: Extensive experiments on the First-Person Hand Action Benchmark (RGB-D videos with 3D hand pose annotations) and on NTU RGB+D 120 demonstrate the benefit of the local Spherical Harmonics representations.
    Abstract Hand action recognition is essential. Communication, human-robot interactions, and gesture control are dependent on it. Skeleton-based action recognition traditionally includes hands, which belong to the classes which remain challenging to correctly recognize to date. We propose a method specifically designed for hand action recognition which uses relative angular embeddings and local Spherical Harmonics to create novel hand representations. The use of Spherical Harmonics creates rotation-invariant representations which make hand action recognition even more robust against inter-subject differences and viewpoint changes. We conduct extensive experiments on the hand joints in the First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations, and on the NTU RGB+D 120 dataset, demonstrating the benefit of using Local Spherical Harmonics Representations. Our code is available at https://github.com/KathPra/LSHR_LSHT.
    摘要 手势识别是必备的。人机交互、通信和手势控制都需要它。基于骨架的手势识别传统上包括手部,这些手部类型仍然存在识别错误的问题。我们提出一种特有的手势识别方法,使用相对angular嵌入和本地球形函数来创建新的手势表示。使用球形函数创造旋转不变的表示,使手势识别更加鲁棒对受人体差异和视点变化。我们在First-Person Hand Action Benchmark中使用RGB-D视频和3D手势注解进行了广泛的实验,以及NTU RGB+D 120 dataset,并证明了使用本地球形函数表示的优点。我们的代码可以在https://github.com/KathPra/LSHR_LSHT上找到。
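One reason spherical harmonics help is that the per-degree power of an SH expansion is invariant to global rotations of a point set. The sketch below computes such a rotation-robust descriptor for a set of 3D joints; it only illustrates the principle, not the paper's representation (which combines relative angular embeddings with local spherical harmonics), and the joint count and maximum degree are assumptions.

```python
import numpy as np
from scipy.special import sph_harm  # order m, degree l

def sh_power_spectrum(joints, l_max=4):
    """Rotation-robust descriptor for a set of 3D joints (e.g. 21 hand joints).
    Expands the radius-weighted joint directions in spherical harmonics and keeps
    the per-degree power, which is invariant to global rotations of the set."""
    rel = joints - joints.mean(axis=0, keepdims=True)        # centre the point set
    r = np.linalg.norm(rel, axis=1) + 1e-8
    theta = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)   # azimuth in [0, 2pi)
    phi = np.arccos(np.clip(rel[:, 2] / r, -1.0, 1.0))       # polar angle in [0, pi]
    feats = []
    for l in range(l_max + 1):
        power = 0.0
        for m in range(-l, l + 1):
            # coefficient of the radius-weighted joint "density" on the sphere
            c_lm = np.sum(r * np.conj(sph_harm(m, l, theta, phi)))
            power += np.abs(c_lm) ** 2
        feats.append(power)
    return np.array(feats)                                   # (l_max + 1,) descriptor

joints = np.random.randn(21, 3)                              # toy hand joints
print(sh_power_spectrum(joints).shape)                       # (5,)
```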

Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations

  • paper_url: http://arxiv.org/abs/2308.10554
  • repo_url: None
  • paper_authors: Seogkyu Jeon, Bei Liu, Pilhyeon Lee, Kibeom Hong, Jianlong Fu, Hyeran Byun
  • for: Tackling mode collapse in zero-shot GAN adaptation, so that a well-trained generator can synthesize images of an unseen target domain without any further training samples.
  • methods: Finds diverse semantic variations of the target text in the CLIP space, and introduces a directional moment loss matching the first and second moments of image and text direction distributions, together with elastic weight consolidation and a relation consistency loss to preserve valuable content from the source domain.
  • results: Extensive experiments and ablations show improved sample diversity across zero-shot GAN adaptation scenarios, setting a new state of the art in both diversity and quality.
    Abstract Training deep generative models usually requires a large amount of data. To alleviate the data collection cost, the task of zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain without any further training samples. Due to the data absence, the textual description of the target domain and the vision-language models, e.g., CLIP, are utilized to effectively guide the generator. However, with only a single representative text feature instead of real images, the synthesized images gradually lose diversity as the model is optimized, which is also known as mode collapse. To tackle the problem, we propose a novel method to find semantic variations of the target text in the CLIP space. Specifically, we explore diverse semantic variations based on the informative text feature of the target domain while regularizing the uncontrolled deviation of the semantic information. With the obtained variations, we design a novel directional moment loss that matches the first and second moments of image and text direction distributions. Moreover, we introduce elastic weight consolidation and a relation consistency loss to effectively preserve valuable content information from the source domain, e.g., appearances. Through extensive experiments, we demonstrate the efficacy of the proposed methods in ensuring sample diversity in various scenarios of zero-shot GAN adaptation. We also conduct ablation studies to validate the effect of each proposed component. Notably, our model achieves a new state-of-the-art on zero-shot GAN adaptation in terms of both diversity and quality.
    摘要 通常需要大量数据来训练深度生成模型。为了减少数据收集成本,Zero-shot GAN适应任务的目标是 reuse 已经训练过的生成器,无需更多的训练样本来生成未看到的目标领域的图像。由于数据缺失,文本描述和视觉语言模型,例如 CLIP,被利用来有效地指导生成器。然而,只有单个文本特征而不是真实的图像,生成的图像会逐渐失去多样性,这也被称为模式塌损。为解决这个问题,我们提出了一种新的方法,即在 CLIP 空间找到目标文本的semantic variation。特别是,我们在目标领域的文本特征上探索具有信息的semantic variation,并对这些变化进行规范化。使用获得的变化,我们设计了一种新的方向分布匹配权重整合loss,以保持图像和文本方向分布的第一和第二 moment的匹配。此外,我们引入了塑性重叠和关系一致损失,以有效地保留源领域的有价值内容信息,例如外观。通过广泛的实验,我们证明了我们提出的方法的效果,可以在零shot GAN适应中确保样本多样性。我们还进行了剥离研究,以验证每个提出的组件的效果。值得一提的是,我们的模型在零shot GAN适应中达到了新的state-of-the-art, both 多样性和质量方面。
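The directional moment loss can be sketched as matching the mean and variance of image-direction and text-direction distributions. The tensors below are stand-ins for CLIP-space directions; this is an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def directional_moment_loss(img_dirs, text_dirs):
    """Match first and second moments of two sets of direction vectors.

    img_dirs:  (N, D) directions of edited images in CLIP space (stand-in tensors)
    text_dirs: (M, D) directions of the semantic text variations (stand-in tensors)
    """
    img_dirs = F.normalize(img_dirs, dim=-1)
    text_dirs = F.normalize(text_dirs, dim=-1)
    mu_i, mu_t = img_dirs.mean(0), text_dirs.mean(0)
    var_i, var_t = img_dirs.var(0), text_dirs.var(0)
    return F.mse_loss(mu_i, mu_t) + F.mse_loss(var_i, var_t)

loss = directional_moment_loss(torch.randn(16, 512), torch.randn(8, 512))
print(loss.item())
```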

Learning Weakly Convex Regularizers for Convergent Image-Reconstruction Algorithms

  • paper_url: http://arxiv.org/abs/2308.10542
  • repo_url: None
  • paper_authors: Alexis Goujon, Sebastian Neumayer, Michael Unser
  • for: Learning non-convex regularizers with a prescribed upper bound on their weak-convexity modulus, so that the resulting variational denoisers minimize a convex energy.
  • methods: The regularizers rely on few parameters (fewer than 15,000), admit a signal-processing interpretation, and mimic handcrafted sparsity-promoting regularizers.
  • results: The learned denoisers outperform convex-regularization methods and the BM3D denoiser, and the regularizer plugs into provably convergent iterative schemes for inverse problems, offering an excellent trade-off for CT and MRI reconstruction.
    Abstract We propose to learn non-convex regularizers with a prescribed upper bound on their weak-convexity modulus. Such regularizers give rise to variational denoisers that minimize a convex energy. They rely on few parameters (less than 15,000) and offer a signal-processing interpretation as they mimic handcrafted sparsity-promoting regularizers. Through numerical experiments, we show that such denoisers outperform convex-regularization methods as well as the popular BM3D denoiser. Additionally, the learned regularizer can be deployed to solve inverse problems with iterative schemes that provably converge. For both CT and MRI reconstruction, the regularizer generalizes well and offers an excellent tradeoff between performance, number of parameters, guarantees, and interpretability when compared to other data-driven approaches.
    摘要 我们提议使用严格上界的非凸调理器,以实现强度下降的 convex 能量。这些调理器具有少于 15,000 个参数,并且具有信号处理解释,因为它们模仿了手工创建的稀疏过滤器。通过数据验证,我们发现这些推断器在扩散过滤器中表现更好,并且可以透过迭代方法来解决反射问题。在 CT 和 MRI 重建中,这个调理器具有卓越的协调性、表现能力、保证和解释性,并且与其他数据驱动方法进行比较。
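The role of the prescribed bound on the weak-convexity modulus can be seen from the standard decomposition below; this is a sketch of the usual argument, and the paper's exact parameterization and normalization of R may differ.

```latex
% Variational denoiser with a learned, \rho-weakly convex regularizer R
% (i.e. x \mapsto R(x) + \tfrac{\rho}{2}\|x\|_2^2 is convex):
\hat{x}(y) = \arg\min_{x} E(x), \qquad
E(x) = \tfrac{1}{2}\|x-y\|_2^2 + \lambda R(x).
% Splitting the quadratic data term shows why the upper bound on \rho matters:
E(x) =
\underbrace{\tfrac{1-\lambda\rho}{2}\|x\|_2^2 - \langle x, y\rangle + \tfrac{1}{2}\|y\|_2^2}_{\text{convex if } \lambda\rho \le 1}
\;+\;
\lambda\,\underbrace{\Bigl(R(x) + \tfrac{\rho}{2}\|x\|_2^2\Bigr)}_{\text{convex by weak convexity}},
% so E is convex and forward-backward / proximal-gradient iterations on it
% provably converge, even though R itself is non-convex.
```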

Joint learning of images and videos with a single Vision Transformer

  • paper_url: http://arxiv.org/abs/2308.10533
  • repo_url: None
  • paper_authors: Shuki Shimizu, Toru Tamaki
  • for: Joint learning of images and videos with a single model, instead of training separate image and video models.
  • methods: A Vision Transformer, IV-ViT, takes a batch of images as input together with sets of video frames that are temporally aggregated by late fusion.
  • results: Experimental results are reported on two image datasets and two action recognition datasets.
    Abstract In this study, we propose a method for jointly learning of images and videos using a single model. In general, images and videos are often trained by separate models. We propose in this paper a method that takes a batch of images as input to Vision Transformer IV-ViT, and also a set of video frames with temporal aggregation by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
    摘要 在本研究中,我们提出了一种方法,用于同时学习图像和视频,使用单个模型。通常情况下,图像和视频通常由分别的模型进行训练。我们在本篇论文中提出了一种方法,将批处图像输入到视transformer IV-ViT中,并使用延迟融合来将视频帧进行时间聚合。我们在两个图像数据集和两个动作识别数据集上进行了实验。

SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation

  • paper_url: http://arxiv.org/abs/2308.10531
  • repo_url: https://github.com/retsuh-bqw/SRFormer-Text-Det
  • paper_authors: Qingwen Bu, Sungrae Park, Minsoo Khang, Yichuan Cheng
  • for: A unified DETR-based text detection model that combines instance-level segmentation and regression.
  • methods: Segmentation branches are incorporated only in the first few decoder layers, with progressive regression refinement in subsequent layers; a Mask-informed Query Enhancement module treats the segmentation result as a soft ROI to pool robust pixel representations that enhance and diversify instance queries.
  • results: Extensive experiments on multiple benchmarks show exceptional robustness, superior training and data efficiency, and state-of-the-art performance.
    Abstract Existing techniques for text detection can be broadly classified into two primary groups: segmentation-based methods and regression-based methods. Segmentation models offer enhanced robustness to font variations but require intricate post-processing, leading to high computational overhead. Regression-based methods undertake instance-aware prediction but face limitations in robustness and data efficiency due to their reliance on high-level representations. In our academic pursuit, we propose SRFormer, a unified DETR-based model with amalgamated Segmentation and Regression, aiming at the synergistic harnessing of the inherent robustness in segmentation representations, along with the straightforward post-processing of instance-level regression. Our empirical analysis indicates that favorable segmentation predictions can be obtained at the initial decoder layers. In light of this, we constrain the incorporation of segmentation branches to the first few decoder layers and employ progressive regression refinement in subsequent layers, achieving performance gains while minimizing additional computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module. We take the segmentation result as a natural soft-ROI to pool and extract robust pixel representations, which are then employed to enhance and diversify instance queries. Extensive experimentation across multiple benchmarks has yielded compelling findings, highlighting our method's exceptional robustness, superior training and data efficiency, as well as its state-of-the-art performance.
    摘要 现有的文本检测技术可以分为两个主要群组:分割方法和回归方法。分割模型可以提供更高的稳定性,但需要复杂的后处理,导致计算 overhead 高。回归方法可以进行实例化预测,但由于依赖高级表示,限制了其稳定性和数据效率。在我们的学术尝试中,我们提出了 SRFormer,一种基于 DETR 的混合 Segmentation 和 Regression 模型,旨在同时利用分割表示中的内在稳定性,以及实例化预测的直观后处理。我们的实验分析表明,在初始decoder层可以获得有利的分割预测。因此,我们对分割支持进行了限制,只在首些decoder层进行 incorporation,并采用渐进式回归精化,从而实现性能提升而不增加计算负担。此外,我们还提出了一个Mask-informed Query Enhancement模块,通过使用分割结果作为自然的软 ROI, Pool 和提取稳定的像素表示,然后将其用于增强和多样化实例查询。我们在多个 benchmark 上进行了广泛的实验,得到了吸引人的发现,表明我们的方法具有出色的稳定性、超过常规训练和数据效率,以及当前领域的最优表现。

LightDepth: Single-View Depth Self-Supervision from Illumination Decline

  • paper_url: http://arxiv.org/abs/2308.10525
  • repo_url: None
  • paper_authors: Javier Rodríguez-Puigvert, Víctor M. Batlle, J. M. M. Montiel, Ruben Martinez Cantin, Pascal Fua, Juan D. Tardós, Javier Civera
  • for: Single-view self-supervised depth estimation for settings such as endoscopy, where ground-truth depth cannot be obtained, avoiding the performance drop of multi-view self-supervision and synthetic-to-real transfer.
  • methods: Exploits the fact that, with the camera and light source co-located close to the target surface, pixel brightness is inversely proportional to the squared distance to the surface, which provides a strong single-view self-supervisory signal.
  • results: Experiments show that the self-supervised models deliver accuracies comparable to fully supervised ones, without requiring depth ground truth.
    Abstract Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, however, with a considerable performance reduction in comparison to supervised case. Instead, we propose a single-view self-supervised method that achieves a performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data.
    摘要 一视图深度估计可以非常有效,只要有足够的真实深度数据进行监督训练。但在医学中的内镜拍摄等场景下,这些数据往往无法获得。在这些情况下,多视图自我监督和 sintetic-to-real 传播作为备选方案,但与监督 caso 的性能有很大的差异。而我们提议的单视图自我监督方法可以达到与监督 caso 相似的性能。在某些医疗设备中,如内镜,镜头和光源都位于 targets 表面的小距离上。因此,我们可以利用这一点,对于任何给定的反射率和表面方向,每个像素的亮度与表面 distance 平方成正比,从而提供了一个强大的单视图自我监督信号。在我们的实验中,我们的自我监督模型可以与完全监督模型具有相同的准确率,而无需深度真实数据。
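The supervisory signal is simple to write down: with the light source co-located with the camera, observed brightness is roughly proportional to albedo over squared depth, so a predicted depth map can be checked by re-rendering brightness. The sketch below assumes the albedo map is given; in a full system it would be estimated jointly.

```python
import torch

def illumination_decline_loss(pred_depth, image, albedo, gain=1.0, eps=1e-6):
    """Single-view self-supervision from illumination decline.

    With the light co-located with the camera, pixel brightness is approximately
    proportional to albedo / depth^2; we re-render brightness from the predicted
    depth and compare it to the observed image. All tensors: (N, 1, H, W).
    """
    rendered = gain * albedo / (pred_depth.clamp_min(eps) ** 2)
    return torch.nn.functional.l1_loss(rendered, image)

# toy usage with a synthetic observation that obeys the model exactly
depth = torch.rand(2, 1, 64, 64) * 5 + 0.1
albedo = torch.rand(2, 1, 64, 64)
image = albedo / depth ** 2
loss = illumination_decline_loss(depth.requires_grad_(), image, albedo)
loss.backward()
```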

Information Theory-Guided Heuristic Progressive Multi-View Coding

  • paper_url: http://arxiv.org/abs/2308.10522
  • repo_url: None
  • paper_authors: Jiangmeng Li, Hang Gao, Wenwen Qiang, Changwen Zheng
  • for: A generalized, information-theory-guided framework for multi-view representation learning that addresses shortcomings of pairwise contrastive approaches.
  • methods: Information theory-guided hierarchical Progressive Multi-view Coding (IPMC) operates at three tiers: a distribution tier that aligns distributions between views to reduce view-specific noise, a set tier that builds self-adjusted contrasting pools adaptively modified by a view filter, and an instance tier with a unified loss that learns representations while reducing gradient interference.
  • results: Theoretically and empirically, IPMC is shown to outperform state-of-the-art methods.
    Abstract Multi-view representation learning aims to capture comprehensive information from multiple views of a shared context. Recent works intuitively apply contrastive learning to different views in a pairwise manner, which is still scalable: view-specific noise is not filtered in learning view-shared representations; the fake negative pairs, where the negative terms are actually within the same class as the positive, and the real negative pairs are coequally treated; evenly measuring the similarities between terms might interfere with optimization. Importantly, few works study the theoretical framework of generalized self-supervised multi-view learning, especially for more than two views. To this end, we rethink the existing multi-view learning paradigm from the perspective of information theory and then propose a novel information theoretical framework for generalized multi-view learning. Guided by it, we build a multi-view coding method with a three-tier progressive architecture, namely Information theory-guided hierarchical Progressive Multi-view Coding (IPMC). In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC constructs self-adjusted contrasting pools, which are adaptively modified by a view filter. Lastly, in the instance-tier, we adopt a designed unified loss to learn representations and reduce the gradient interference. Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods.
    摘要 To address these limitations, we propose a novel information theoretical framework for generalized multi-view learning. This framework is based on information theory and provides a theoretical foundation for multi-view learning. Guided by this framework, we develop a multi-view coding method called Information theory-guided hierarchical Progressive Multi-view Coding (IPMC).IPMC consists of three tiers: distribution-tier, set-tier, and instance-tier. In the distribution-tier, IPMC aligns the distribution between views to reduce view-specific noise. In the set-tier, IPMC constructs self-adjusted contrasting pools, which are adaptively modified by a view filter. Finally, in the instance-tier, we adopt a designed unified loss to learn representations and reduce the gradient interference.Theoretically and empirically, we demonstrate the superiority of IPMC over state-of-the-art methods. Our approach is able to capture more comprehensive information from multiple views and reduce the impact of view-specific noise, leading to better performance in various tasks.

PHE-SICH-CT-IDS: A Benchmark CT Image Dataset for Evaluation Semantic Segmentation, Object Detection and Radiomic Feature Extraction of Perihematomal Edema in Spontaneous Intracerebral Hemorrhage

  • paper_url: http://arxiv.org/abs/2308.10521
  • repo_url: None
  • paper_authors: Deguo Ma, Chen Li, Lin Qiao, Tianming Du, Dechao Tang, Zhiyu Ma, Marcin Grzegorzek Hongzan, Hongzan Sun
  • For: Researchers and clinicians interested in developing and evaluating computer-aided diagnostic methods for perihematomal edema (PHE) in spontaneous intracerebral hemorrhage (SICH).
  • Methods: A publicly available CT dataset named PHE-SICH-CT-IDS, comprising 120 brain CT scans and 7,022 CT images along with corresponding medical information of the patients, suitable for assessing segmentation, detection, and radiomic feature extraction methods.
  • Results: Evaluations of classical algorithms for semantic segmentation, object detection, and radiomic feature extraction confirm the suitability of PHE-SICH-CT-IDS for benchmarking these methods.
    Abstract Intracerebral hemorrhage is one of the diseases with the highest mortality and poorest prognosis worldwide. Spontaneous intracerebral hemorrhage (SICH) typically presents acutely, prompt and expedited radiological examination is crucial for diagnosis, localization, and quantification of the hemorrhage. Early detection and accurate segmentation of perihematomal edema (PHE) play a critical role in guiding appropriate clinical intervention and enhancing patient prognosis. However, the progress and assessment of computer-aided diagnostic methods for PHE segmentation and detection face challenges due to the scarcity of publicly accessible brain CT image datasets. This study establishes a publicly available CT dataset named PHE-SICH-CT-IDS for perihematomal edema in spontaneous intracerebral hemorrhage. The dataset comprises 120 brain CT scans and 7,022 CT images, along with corresponding medical information of the patients. To demonstrate its effectiveness, classical algorithms for semantic segmentation, object detection, and radiomic feature extraction are evaluated. The experimental results confirm the suitability of PHE-SICH-CT-IDS for assessing the performance of segmentation, detection and radiomic feature extraction methods. To the best of our knowledge, this is the first publicly available dataset for PHE in SICH, comprising various data formats suitable for applications across diverse medical scenarios. We believe that PHE-SICH-CT-IDS will allure researchers to explore novel algorithms, providing valuable support for clinicians and patients in the clinical setting. PHE-SICH-CT-IDS is freely published for non-commercial purpose at: https://figshare.com/articles/dataset/PHE-SICH-CT-IDS/23957937.
    摘要 急性脑出血是全球最高死亡率和最差预后的疾病之一。不自愿的急性脑出血(SICH)通常会出现突然,对诊断、定位和量化脑出血进行急速的医学实验是非常重要。早期检测和精准分类脑出血附近的肿瘤(PHE)是指导适当的临床干预和提高病人预后的关键。但是,计算机协助诊断技术的进步和评估受到脑出血附近肿瘤数据集的缺乏的公共存储问题所困扰。本研究建立了一个公共可用的脑CT数据集,名为PHE-SICH-CT-IDS,用于肿瘤脑出血的评估。这个数据集包含120个脑CT扫描和7,022个CT图像,以及病人的医疗信息。为证明其效果,本研究评估了古典的分类算法、物体检测算法和几何特征EXTRACTING算法。实验结果确认PHE-SICH-CT-IDS的适用性。根据我们所知,这是肿瘤脑出血中第一个公共可用的数据集,包括多种数据格式,适合各种医疗enario中的应用。我们相信PHE-SICH-CT-IDS将吸引研究人员探索新的算法,提供宝贵的支持 для医生和病人在临床设置中。PHE-SICH-CT-IDS是免费发布的非商业用途,可以在https://figshare.com/articles/dataset/PHE-SICH-CT-IDS/23957937中找到。

QD-BEV : Quantization-aware View-guided Distillation for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.10515
  • repo_url: None
  • paper_authors: Yifan Zhang, Zhen Dong, Huanrui Yang, Ming Lu, Cheng-Ching Tseng, Yuan Du, Kurt Keutzer, Li Du, Shanghang Zhang
  • for: Efficient multi-view 3D object detection with bird's-eye-view (BEV) methods, whose memory consumption and latency hinder on-vehicle, real-time deployment.
  • methods: QD-BEV introduces a view-guided distillation (VGD) objective that stabilizes quantization-aware training (QAT) while leveraging both image features and BEV features to improve performance.
  • results: The 4-bit weight / 6-bit activation QD-BEV-Tiny reaches 37.2% NDS with a 15.8 MB model, outperforming BevFormer-Tiny by 1.8% under 8x compression; the Small and Base variants reach 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
    Abstract Multi-view 3D detection based on BEV (bird-eye-view) has recently achieved significant improvements. However, the huge memory consumption of state-of-the-art models makes it hard to deploy them on vehicles, and the non-trivial latency will affect the real-time perception of streaming applications. Despite the wide application of quantization to lighten models, we show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation. To solve these issues, our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance by leveraging both image features and BEV features. Our experiments show that QD-BEV achieves similar or even better accuracy than previous methods with significant efficiency gains. On the nuScenes datasets, the 4-bit weight and 6-bit activation quantized QD-BEV-Tiny model achieves 37.2% NDS with only 15.8 MB model size, outperforming BevFormer-Tiny by 1.8% with an 8x model compression. On the Small and Base variants, QD-BEV models also perform superbly and achieve 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
    摘要 Recently, multi-view 3D detection based on bird-eye-view (BEV) has made significant improvements. However, the large memory consumption of state-of-the-art models makes it difficult to deploy them on vehicles, and the non-trivial latency will affect the real-time perception of streaming applications. Although quantization is widely used to lighten models, we show in our paper that directly applying quantization to BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation. To solve these issues, our method QD-BEV uses a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance by leveraging both image features and BEV features. Our experiments show that QD-BEV achieves similar or even better accuracy than previous methods with significant efficiency gains. On the nuScenes datasets, the 4-bit weight and 6-bit activation quantized QD-BEV-Tiny model achieves 37.2% NDS with only 15.8 MB model size, outperforming BevFormer-Tiny by 1.8% with an 8x model compression. On the Small and Base variants, QD-BEV models also perform superbly and achieve 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively.
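QD-BEV builds on standard quantization-aware training; the uniform fake quantization that QAT relies on is sketched below (the view-guided distillation objective itself is not reproduced). The symmetric per-tensor scheme and the straight-through estimator are the usual choices, assumed here rather than taken from the paper.

```python
import torch

def fake_quantize(x, num_bits):
    """Uniform symmetric fake quantization for quantization-aware training:
    values are rounded to a low-bit grid in the forward pass while gradients
    pass through unchanged (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp_min(1e-8) / qmax
    q = (x / scale).round().clamp(-qmax - 1, qmax)
    dq = q * scale
    return x + (dq - x).detach()  # forward: dq, backward: identity

w = torch.randn(64, 64, requires_grad=True)
a = torch.randn(8, 64)
w_q = fake_quantize(w, num_bits=4)   # 4-bit weights as in QD-BEV-Tiny
a_q = fake_quantize(a, num_bits=6)   # 6-bit activations
loss = (a_q @ w_q.t()).pow(2).mean()
loss.backward()                       # gradients still reach the FP32 weights
```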

Frequency Compensated Diffusion Model for Real-scene Dehazing

  • paper_url: http://arxiv.org/abs/2308.10510
  • repo_url: None
  • paper_authors: Jing Wang, Songtao Wu, Kuanhong Xu, Zhiqiang Yuan
  • for: Improving deep-learning-based image dehazing so that it generalizes to real-world hazy images.
  • methods: A dehazing framework based on conditional diffusion models, with a Frequency Compensation Block (a bank of filters that jointly emphasize mid-to-high frequencies) to counter the spectral bias of deep networks, plus a HazeAug data synthesis pipeline that augments haze in degree and diversity.
  • results: The proposed dehazing diffusion model significantly outperforms state-of-the-art methods on real-world hazy images.
    Abstract Due to distribution shift, deep learning based methods for image dehazing suffer from performance degradation when applied to real-world hazy images. In this paper, we consider a dehazing framework based on conditional diffusion models for improved generalization to real haze. First, we find that optimizing the training objective of diffusion models, i.e., Gaussian noise vectors, is non-trivial. The spectral bias of deep networks hinders the higher frequency modes in Gaussian vectors from being learned and hence impairs the reconstruction of image details. To tackle this issue, we design a network unit, named Frequency Compensation block (FCB), with a bank of filters that jointly emphasize the mid-to-high frequencies of an input signal. We demonstrate that diffusion models with FCB achieve significant gains in both perceptual and distortion metrics. Second, to further boost the generalization performance, we propose a novel data synthesis pipeline, HazeAug, to augment haze in terms of degree and diversity. Within the framework, a solid baseline for blind dehazing is set up where models are trained on synthetic hazy-clean pairs, and directly generalize to real data. Extensive evaluations show that the proposed dehazing diffusion model significantly outperforms state-of-the-art methods on real-world images.
    摘要 由于分布shift,深度学习基于方法的图像霾化受到实际图像的影响,性能下降。在这篇论文中,我们考虑了基于条件扩散模型的霾化框架,以提高对实际霾的泛化性。首先,我们发现优化扩散模型的训练目标,即高斯噪声矢量,是非常困难的。深度网络的spectral bias导致高频模式在高斯矢量中不能学习,从而降低图像细节的重建。为解决这个问题,我们设计了一个网络单元,名为频率补偿块(FCB),它通过一系列滤波器联合强调输入信号中的中高频谱。我们展示了带有FCB的扩散模型在感知和失真指标上均获得显著改进。其次,为进一步提高泛化性,我们提出了一种新的数据生成管道,名为HazeAug,用于增强霾的度和多样性。在这个框架中,我们设置了一个坚实的基线用于盲目霾化,其中模型通过synthetic霾混清对应的训练,直接泛化到实际数据。我们进行了广泛的评估,并证明了我们的霾化扩散模型在实际图像上具有显著性能优势。
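The Frequency Compensation Block is a learned unit; the toy module below only mirrors the idea of re-weighting mid-to-high frequencies against the spectral bias toward low frequencies, using a fixed blur to split low and high bands and learnable gains to mix them. All specifics (kernel, gains, placement) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyEmphasisBlock(nn.Module):
    """Illustrative filter bank that re-weights frequency content of a feature map:
    a low-pass (blur) branch and a high-pass (detail) branch are mixed with
    learnable scalars. The paper's FCB is a learned unit; this only mirrors the
    idea of compensating the bias toward low frequencies."""
    def __init__(self, channels):
        super().__init__()
        blur = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        self.register_buffer("blur", blur.expand(channels, 1, 3, 3).clone())
        self.low_gain = nn.Parameter(torch.tensor(1.0))
        self.high_gain = nn.Parameter(torch.tensor(1.5))  # start by boosting details
        self.channels = channels

    def forward(self, x):
        low = F.conv2d(x, self.blur, padding=1, groups=self.channels)
        high = x - low                                    # residual = mid/high freqs
        return self.low_gain * low + self.high_gain * high

fcb = FrequencyEmphasisBlock(channels=8)
y = fcb(torch.randn(2, 8, 32, 32))
print(y.shape)  # torch.Size([2, 8, 32, 32])
```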

Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

  • paper_url: http://arxiv.org/abs/2308.10493
  • repo_url: https://github.com/liuzhuang1024/SAM
  • paper_authors: Zhuang Liu, Ye Yuan, Zhilong Ji, Jingfeng Bai, Xiang Bai
  • for: Improving handwritten mathematical expression recognition (HMER).
  • methods: A simple yet efficient semantic interaction learning (SIL) scheme: a semantic graph is built from statistical symbol co-occurrence probabilities, and a semantic aware module (SAM) projects visual and classification features into a semantic space where cosine distances indicate symbol correlations; HMER and SIL are optimized jointly, and SAM can be plugged into existing attention-based HMER models.
  • results: Extensive experiments on public benchmarks show consistent gains; the method outperforms prior art on both CROHME and HME100K.
    Abstract Handwritten mathematical expression recognition (HMER) has attracted extensive attention recently. However, current methods cannot explicitly study the interactions between different symbols, which may fail when faced similar symbols. To alleviate this issue, we propose a simple but efficient method to enhance semantic interaction learning (SIL). Specifically, we firstly construct a semantic graph based on the statistical symbol co-occurrence probabilities. Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space. The cosine distance between different projected vectors indicates the correlation between symbols. And jointly optimizing HMER and SIL can explicitly enhances the model's understanding of symbol relationships. In addition, SAM can be easily plugged into existing attention-based models for HMER and consistently bring improvement. Extensive experiments on public benchmark datasets demonstrate that our proposed module can effectively enhance the recognition performance. Our method achieves better recognition performance than prior arts on both CROHME and HME100K datasets.
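The semantic graph the method starts from can be sketched directly: count which symbols co-occur in ground-truth label sequences and compare symbols by the cosine similarity of their co-occurrence profiles. The learned semantic-aware projection (SAM) is not reproduced; the vocabulary and sequences below are toy examples.

```python
import numpy as np

def cooccurrence_graph(label_sequences, vocab):
    """Symbol co-occurrence matrix built from ground-truth LaTeX label sequences."""
    idx = {s: i for i, s in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for seq in label_sequences:
        present = {idx[s] for s in seq if s in idx}
        for i in present:
            for j in present:
                if i != j:
                    C[i, j] += 1
    return C

def cosine_similarity_matrix(C):
    """Symbols with similar co-occurrence profiles get high cosine similarity."""
    X = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-8)
    return X @ X.T

vocab = ["x", "y", "^", "{", "}", "\\frac", "2"]
seqs = [["x", "^", "{", "2", "}"], ["\\frac", "x", "y"], ["y", "^", "{", "2", "}"]]
S = cosine_similarity_matrix(cooccurrence_graph(seqs, vocab))
print(np.round(S[vocab.index("x")], 2))  # correlation of "x" with every symbol
```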

SynDrone – Multi-modal UAV Dataset for Urban Scenarios

  • paper_url: http://arxiv.org/abs/2308.10491
  • repo_url: https://github.com/lttm/syndrone
  • paper_authors: Giulia Rizzoli, Francesco Barbato, Matteo Caligiuri, Pietro Zanuttigh
  • For: Addresses the challenge of limited annotated high-resolution aerial data for developing computer vision algorithms for Unmanned Aerial Vehicles (UAVs).
  • Methods: Proposes a multimodal synthetic dataset that includes both images and 3D data captured at multiple flying heights, with object-level annotations and pixel-level labeling in 28 classes.
  • Results: The dataset contains 72k labeled samples that enable effective training of deep architectures, showing promising results in synthetic-to-real adaptation.
    Abstract The development of computer vision algorithms for Unmanned Aerial Vehicles (UAVs) imagery heavily relies on the availability of annotated high-resolution aerial data. However, the scarcity of large-scale real datasets with pixel-level annotations poses a significant challenge to researchers as the limited number of images in existing datasets hinders the effectiveness of deep learning models that require a large amount of training data. In this paper, we propose a multimodal synthetic dataset containing both images and 3D data taken at multiple flying heights to address these limitations. In addition to object-level annotations, the provided data also include pixel-level labeling in 28 classes, enabling exploration of the potential advantages in tasks like semantic segmentation. In total, our dataset contains 72k labeled samples that allow for effective training of deep architectures showing promising results in synthetic-to-real adaptation. The dataset will be made publicly available to support the development of novel computer vision methods targeting UAV applications.
    摘要 Computer vision algorithms for Unmanned Aerial Vehicles (UAVs) 图像的开发受到高分辨率飞行数据的可用性的限制。然而,现有数据集中的图像数量有限,使得深度学习模型的训练数据量有限,从而降低了模型的效果。在本文中,我们提出了一个多modal的 sintetic 数据集,包括图像和3D数据, captured at multiple flying heights。此外,数据集还包括对象级别的标注,以及每个像素的28个分类标注,使得可以进行 semantic segmentation 任务的探索。总的来说,我们的数据集包含72k个标注样本,可以有效地训练深度建筑,并且在 sintetic-to-real 适应中显示出了良好的结果。这个数据集将会公开发布,以支持计算机视觉方法的开发,targeting UAV应用。

Enhancing Medical Image Segmentation: Optimizing Cross-Entropy Weights and Post-Processing with Autoencoders

  • paper_url: http://arxiv.org/abs/2308.10488
  • repo_url: None
  • paper_authors: Pranav Singh, Luoyao Chen, Mei Chen, Jinqian Pan, Raviteja Chukkapalli, Shravan Chaudhari, Jacopo Cirrone
  • for: A deep learning approach tailored to medical image segmentation, aimed at improving accuracy and efficiency.
  • methods: Builds on U-Net and U-Net++ architectures, optimizing the cross-entropy loss weights and post-processing predictions with autoencoders.
  • results: On a dermatomyositis dataset, the method outperforms the current state of the art by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders, and is further benchmarked on three challenging medical image segmentation tasks.
    Abstract The task of medical image segmentation presents unique challenges, necessitating both localized and holistic semantic understanding to accurately delineate areas of interest, such as critical tissues or aberrant features. This complexity is heightened in medical image segmentation due to the high degree of inter-class similarities, intra-class variations, and possible image obfuscation. The segmentation task further diversifies when considering the study of histopathology slides for autoimmune diseases like dermatomyositis. The analysis of cell inflammation and interaction in these cases has been less studied due to constraints in data acquisition pipelines. Despite the progressive strides in medical science, we lack a comprehensive collection of autoimmune diseases. As autoimmune diseases globally escalate in prevalence and exhibit associations with COVID-19, their study becomes increasingly essential. While there is existing research that integrates artificial intelligence in the analysis of various autoimmune diseases, the exploration of dermatomyositis remains relatively underrepresented. In this paper, we present a deep-learning approach tailored for Medical image segmentation. Our proposed method outperforms the current state-of-the-art techniques by an average of 12.26% for U-Net and 12.04% for U-Net++ across the ResNet family of encoders on the dermatomyositis dataset. Furthermore, we probe the importance of optimizing loss function weights and benchmark our methodology on three challenging medical image segmentation tasks
    摘要 医疗影像分割任务存在独特的挑战,需要同时具备本地化和整体性的semantic理解,以准确划分关键组织或异常特征。这种复杂性在医疗影像分割中更加强调,因为影像中的间类 similarity和内部变化很高,同时可能存在图像遮挡。医疗影像分割任务在研究 histopathology 报告中进一步复杂,特别是在 autoimmune 疾病如 dermatomyositis 中。在这些情况下,分析细胞Inflammation和互动尚未得到了足够的研究,这主要归结于数据获取管道的限制。虽然医学科技在进步,但我们仍然缺乏一个全面的 autoimmune 疾病集成。随着 autoimmune 疾病全球蔓延和 COVID-19 的相关性,其研究变得越来越重要。虽然现有的研究已经将人工智能integrated into the analysis of various autoimmune diseases,但dermatomyositis 的研究仍然相对落后。在这篇文章中,我们提出了一种适应医疗影像分割的深度学习方法。我们的提议方法在 ResNet 家族的encoder上的 U-Net 和 U-Net++ 中平均提高了12.26%和12.04%。此外,我们还评估了优化损失函数权重的重要性,并在三个医疗影像分割任务上进行了 benchmark。
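One of the knobs the paper tunes, the per-class cross-entropy weights, is commonly initialized from inverse class frequency; a minimal sketch follows. The pixel counts and the normalization are illustrative, not the paper's exact weighting scheme.

```python
import torch
import torch.nn as nn

def inverse_frequency_weights(pixel_counts):
    """Per-class cross-entropy weights from inverse class pixel frequency,
    normalized so that the weights average to one."""
    counts = torch.as_tensor(pixel_counts, dtype=torch.float)
    w = counts.sum() / (counts * len(counts))
    return w / w.mean()

# e.g. background vs. two minority tissue classes (toy pixel counts)
weights = inverse_frequency_weights([9_000_000, 150_000, 60_000])
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 3, 64, 64)             # (N, classes, H, W) segmentation logits
target = torch.randint(0, 3, (4, 64, 64))
loss = criterion(logits, target)
print(weights, loss.item())
```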

ADNet: Lane Shape Prediction via Anchor Decomposition

  • paper_url: http://arxiv.org/abs/2308.10481
  • repo_url: None
  • paper_authors: Lingyu Xiao, Xiang Li, Sen Yang, Wankou Yang
  • for: Overcoming the inflexibility of fixed, image-edge anchors so that anchor-based lane detection adapts to different lane types across datasets.
  • methods: Anchors are decomposed into a learned heat map of starting points and their associated directions, and Large Kernel Attention (LKA) is added to the Feature Pyramid Network (FPN) to enlarge the receptive field and capture the full-image context of lane lines.
  • results: Experiments on three widely used benchmarks show that the method outperforms existing approaches on VIL-100 and achieves competitive accuracy on CULane and TuSimple.
    Abstract In this paper, we revisit the limitations of anchor-based lane detection methods, which have predominantly focused on fixed anchors that stem from the edges of the image, disregarding their versatility and quality. To overcome the inflexibility of anchors, we decompose them into learning the heat map of starting points and their associated directions. This decomposition removes the limitations on the starting point of anchors, making our algorithm adaptable to different lane types in various datasets. To enhance the quality of anchors, we introduce the Large Kernel Attention (LKA) for Feature Pyramid Network (FPN). This significantly increases the receptive field, which is crucial in capturing the sufficient context as lane lines typically run throughout the entire image. We have named our proposed system the Anchor Decomposition Network (ADNet). Additionally, we propose the General Lane IoU (GLIoU) loss, which significantly improves the performance of ADNet in complex scenarios. Experimental results on three widely used lane detection benchmarks, VIL-100, CULane, and TuSimple, demonstrate that our approach outperforms the state-of-the-art methods on VIL-100 and exhibits competitive accuracy on CULane and TuSimple. Code and models will be released on https://github.com/ Sephirex-X/ADNet.
    摘要 在这篇论文中,我们重新检视了锚点基本的巡航检测方法的局限性,这些方法主要关注于图像边缘的固定锚点,忽略它们的灵活性和质量。为了超越锚点的限制,我们将锚点 decomposed into 学习热点的热度地图和其相关的方向。这种 decomposing 方法使我们的算法可以适应不同的巡航类型在不同的数据集中。为了提高锚点的质量,我们引入了大kernels attention(LKA) для feature pyramid network(FPN)。这些 significantly increases the receptive field,这是关键的 capture 巡航线通常在整个图像中进行。我们称我们的提出的系统为 anchor decomposition network(ADNet)。此外,我们提出了通用巡航 IoU(GLIoU)损失,这些 Significantly improves 我们的ADNet在复杂的情况下的性能。实验结果表明,我们的方法在 VIL-100 上 exceeds state-of-the-art 方法的性能,并在 CULane 和 TuSimple 上展现了竞争性的准确率。代码和模型将在 上发布。

STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning

  • paper_url: http://arxiv.org/abs/2308.10468
  • repo_url: https://github.com/taohan10200/steerer
  • paper_authors: Tao Han, Lei Bai, Lingbo Liu, Wanli Ouyang
  • for: solves the problem of scale variations in object counting, which has not been effectively addressed by existing scale-aware algorithms.
  • methods: proposes a novel method called STEERER, which selectively forwards scale-customized features at each scale and uses a Masked Selection and Inheritance Loss to achieve high-quality density maps across all scales.
  • results: demonstrates unprecedented scale generalization ability on nine datasets with counting and localization tasks, achieving high accuracy and outperforming state-of-the-art methods.
    Abstract Scale variation is a deep-rooted problem in object counting, which has not been effectively addressed by existing scale-aware algorithms. An important factor is that they typically involve cooperative learning across multi-resolutions, which could be suboptimal for learning the most discriminative features from each scale. In this paper, we propose a novel method termed STEERER (\textbf{S}elec\textbf{T}iv\textbf{E} inh\textbf{ER}itance l\textbf{E}a\textbf{R}ning) that addresses the issue of scale variations in object counting. STEERER selects the most suitable scale for patch objects to boost feature extraction and only inherits discriminative features from lower to higher resolution progressively. The main insights of STEERER are a dedicated Feature Selection and Inheritance Adaptor (FSIA), which selectively forwards scale-customized features at each scale, and a Masked Selection and Inheritance Loss (MSIL) that helps to achieve high-quality density maps across all scales. Our experimental results on nine datasets with counting and localization tasks demonstrate the unprecedented scale generalization ability of STEERER. Code is available at \url{https://github.com/taohan10200/STEERER}.
    摘要 尺度变化是目标计数中根深蒂固的问题,现有的尺度感知算法尚未有效解决这一问题。一个重要原因是,这些算法通常在多个分辨率之间进行协同学习,这可能不利于从每个尺度学习最具判别力的特征。在这篇论文中,我们提出了一种名为 STEERER(SelecTivE inhERitance lEaRning)的新方法,用以解决目标计数中的尺度变化问题。STEERER 为图像块中的目标选择最合适的尺度以增强特征提取,并仅将判别性特征从低分辨率逐级继承到高分辨率。STEERER 的核心设计包括一个专门的特征选择与继承适配器(FSIA),它在每个尺度上有选择地传递尺度定制的特征;以及一个掩码选择与继承损失(MSIL),它有助于在所有尺度上获得高质量的密度图。我们在九个包含计数与定位任务的数据集上的实验结果表明,STEERER 具有前所未有的尺度泛化能力。代码可在 https://github.com/taohan10200/STEERER 获取。

Privacy-Preserving Face Recognition Using Random Frequency Components

  • paper_url: http://arxiv.org/abs/2308.10461
  • repo_url: https://github.com/Tencent/TFace
  • paper_authors: Yuxi Mi, Yuge Huang, Jiazhen Ji, Minyi Zhao, Jiaxiang Wu, Xingkun Xu, Shouhong Ding, Shuigeng Zhou
  • for: 隐私保护 face 图像的视觉信息和模型可还原性
  • methods: 提出了一种基于人类视觉的低频成分剔除技术,以降低模型可还原性;同时提出了一种基于 latest 理论发现和模型注意力的解决方案,即在训练和推理过程中随机选择频谱成分。
  • results: 对比 experiments 表明,提出的方法可以均衡隐私保护目标和识别率的要求。
    Abstract The ubiquitous use of face recognition has sparked increasing privacy concerns, as unauthorized access to sensitive face images could compromise the information of individuals. This paper presents an in-depth study of the privacy protection of face images' visual information and against recovery. Drawing on the perceptual disparity between humans and models, we propose to conceal visual information by pruning human-perceivable low-frequency components. For impeding recovery, we first elucidate the seeming paradox between reducing model-exploitable information and retaining high recognition accuracy. Based on recent theoretical insights and our observation on model attention, we propose a solution to the dilemma, by advocating for the training and inference of recognition models on randomly selected frequency components. We distill our findings into a novel privacy-preserving face recognition method, PartialFace. Extensive experiments demonstrate that PartialFace effectively balances privacy protection goals and recognition accuracy. Code is available at: https://github.com/Tencent/TFace.
    摘要 “人脸识别技术的普遍使用导致隐私问题的增加,因为未经授权的存取具有敏感信息的人脸图像可能会妥协个人资讯。本文对人脸图像的隐私保护进行了深入的研究,并提出了一种隐私保护方法。基于人类和模型之间的感知差异,我们提议使用人类可见的低频率成分剔除可见信息。为防止回复,我们首先解释了模型可以利用的信息减少和高识别率之间的悖论,然后提出了一种基于最近的理论发现和我们对模型注意力的观察,将训练和测试识别模型的阶段随机选择频率成分。我们将结果整合为一个名为PartialFace的隐私保护人脸识别方法。实验结果显示,PartialFace能够均衡隐私保护目标和识别率。代码可以在https://github.com/Tencent/TFace上获取。”
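As a rough illustration of training and inference on "random frequency components", the sketch below converts face crops into 8x8 block-DCT channels, drops the lowest (human-perceivable) frequency bands, and randomly keeps a subset of the remaining channels. The block size, the `low_cut` threshold, and the number of kept channels are assumptions, not PartialFace's published recipe.

```python
import torch

def dct_matrix(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis of size n x n (rows = frequencies)."""
    k = torch.arange(n).float()
    basis = torch.cos(torch.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= (1.0 / n) ** 0.5
    basis[1:] *= (2.0 / n) ** 0.5
    return basis

def block_dct_channels(img: torch.Tensor, n: int = 8) -> torch.Tensor:
    """Split (B, C, H, W) into n x n blocks and return per-block DCT
    coefficients as (B, C*n*n, H/n, W/n) frequency channels."""
    b, c, h, w = img.shape
    d = dct_matrix(n).to(img)
    blocks = img.reshape(b, c, h // n, n, w // n, n).permute(0, 1, 2, 4, 3, 5)
    coeff = d @ blocks @ d.T                          # 2-D DCT of every block
    return coeff.permute(0, 1, 4, 5, 2, 3).reshape(b, c * n * n, h // n, w // n)

def sample_frequency_channels(coeff, low_cut=3, keep=24, n=8, c=3):
    """Discard coefficients with frequency index u+v < low_cut (the most
    perceptible low frequencies), then keep a random subset of the rest."""
    u, v = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
    usable = ((u + v) >= low_cut).flatten().repeat(c)  # per colour channel
    idx = usable.nonzero().flatten()
    pick = idx[torch.randperm(len(idx))[:keep]]
    return coeff[:, pick]

faces = torch.randn(4, 3, 112, 112)          # hypothetical aligned face crops
coeff = block_dct_channels(faces)            # (4, 192, 14, 14)
inputs = sample_frequency_channels(coeff)    # (4, 24, 14, 14) fed to the CNN
print(inputs.shape)
```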

DOMINO++: Domain-aware Loss Regularization for Deep Learning Generalizability

  • paper_url: http://arxiv.org/abs/2308.10453
  • repo_url: None
  • paper_authors: Skylar E. Stolte, Kyle Volle, Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods, Kevin Brink, Matthew Hale, Ruogu Fang
  • for: 提高现代深度学习(DL)中的对外部数据(OOD)泛化性能,解决现有DL模型在不同测试数据上的表现不稳定问题。
  • methods: 提出了一种基于双指导和动态领域相关loss regularization的DOMINO++模型,将专家指导和数据指导知识 integrate到模型中,并通过动态 scaling factor和适应regularization rate来调整regulation。
  • results: 在 MRI 头部组织分割任务上与 DOMINO 及基线模型的对比表明,DOMINO++ 在 OOD 数据上表现出色,显示出其提升深度学习在真实临床数据上可靠部署的潜力。
    Abstract Out-of-distribution (OOD) generalization poses a serious challenge for modern deep learning (DL). OOD data consists of test data that is significantly different from the model's training data. DL models that perform well on in-domain test data could struggle on OOD data. Overcoming this discrepancy is essential to the reliable deployment of DL. Proper model calibration decreases the number of spurious connections that are made between model features and class outputs. Hence, calibrated DL can improve OOD generalization by only learning features that are truly indicative of the respective classes. Previous work proposed domain-aware model calibration (DOMINO) to improve DL calibration, but it lacks designs for model generalizability to OOD data. In this work, we propose DOMINO++, a dual-guidance and dynamic domain-aware loss regularization focused on OOD generalizability. DOMINO++ integrates expert-guided and data-guided knowledge in its regularization. Unlike DOMINO which imposed a fixed scaling and regularization rate, DOMINO++ designs a dynamic scaling factor and an adaptive regularization rate. Comprehensive evaluations compare DOMINO++ with DOMINO and the baseline model for head tissue segmentation from magnetic resonance images (MRIs) on OOD data. The OOD data consists of synthetic noisy and rotated datasets, as well as real data using a different MRI scanner from a separate site. DOMINO++'s superior performance demonstrates its potential to improve the trustworthy deployment of DL on real clinical data.
    摘要 分布外(OOD)泛化是现代深度学习(DL)面临的严峻挑战。OOD 数据指与模型训练数据差异显著的测试数据;在域内测试数据上表现良好的 DL 模型,在 OOD 数据上可能表现不佳。克服这种差距对于 DL 的可靠部署至关重要。恰当的模型校准可以减少模型特征与类别输出之间的虚假关联,因此经过校准的 DL 只学习真正指示相应类别的特征,从而改善 OOD 泛化。先前的工作提出了领域感知模型校准(DOMINO)来改进 DL 校准,但其缺乏针对 OOD 数据泛化能力的设计。在本工作中,我们提出 DOMINO++,一种面向 OOD 泛化的双重引导、动态领域感知损失正则化方法。DOMINO++ 在正则化中融合了专家引导与数据引导的知识。与采用固定缩放和固定正则化率的 DOMINO 不同,DOMINO++ 设计了动态缩放因子和自适应正则化率。我们在 OOD 数据上将 DOMINO++ 与 DOMINO 及基线模型在磁共振图像(MRI)头部组织分割任务上进行了全面评估,OOD 数据包括合成的噪声与旋转数据集,以及来自另一站点、使用不同 MRI 扫描仪采集的真实数据。DOMINO++ 的优越性能表明其有潜力提升 DL 在真实临床数据上的可信部署。
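DOMINO++'s exact regularizer is not given here, so the following is only a hedged sketch of a domain-aware loss with a dynamic scaling factor: cross-entropy plus an expected class-confusion penalty whose weight decays over training. The `penalty_matrix`, the linear decay schedule, and `beta_max` are hypothetical stand-ins, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def domino_style_loss(logits, targets, penalty_matrix, epoch, max_epochs,
                      beta_max=1.0):
    """Cross-entropy plus a domain-aware regulariser. penalty_matrix[i, j]
    encodes how costly it is to put probability on class j when the true class
    is i (e.g. built from expert knowledge or validation confusions). The
    scaling factor decays over training as a stand-in for dynamic scaling."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    # expected penalty of the predicted distribution given the true class
    reg = (probs * penalty_matrix[targets]).sum(dim=1).mean()
    beta = beta_max * (1.0 - epoch / max_epochs)   # dynamic scaling factor
    return ce + beta * reg

# toy usage with 4 tissue classes
torch.manual_seed(0)
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
penalty = 1.0 - torch.eye(4)      # uniform off-diagonal penalty as a placeholder
print(domino_style_loss(logits, targets, penalty, epoch=10, max_epochs=100))
```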

COCA: Classifier-Oriented Calibration for Source-Free Universal Domain Adaptation via Textual Prototype

  • paper_url: http://arxiv.org/abs/2308.10450
  • repo_url: None
  • paper_authors: Xinghong Liu, Yi Zhou, Tao Zhou, Chun-Mei Feng, Ling Shao
  • for: 本研究旨在解决通用领域适应(UniDA)及无源 UniDA(SF-UniDA)方法仍需大量带标注源样本的问题,以便在实际应用中减少标注成本。
  • methods: 我们提出了一种基于文本原型的分类器导向校准(COCA)方法,利用少样本学习来替代对大量标注源样本的需求,从而大幅减少标注成本。
  • results: 我们的方法优于当前最先进的 UniDA 和 SF-UniDA 模型,并在大幅降低源域标注成本的同时实现了有效的适应。
    Abstract Universal Domain Adaptation (UniDA) aims to distinguish common and private classes between the source and target domains where domain shift exists. Recently, due to more stringent data restrictions, researchers have introduced Source-Free UniDA (SF-UniDA) in more realistic scenarios. SF-UniDA methods eliminate the need for direct access to source samples when performing adaptation to the target domain. However, existing SF-UniDA methods still require an extensive quantity of labeled source samples to train a source model, resulting in significant labeling costs. To tackle this issue, we present a novel Classifier-Oriented Calibration (COCA) method. This method, which leverages textual prototypes, is formulated for the source model based on few-shot learning. Specifically, we propose studying few-shot learning, usually explored for closed-set scenarios, to identify common and domain-private classes despite a significant domain shift between source and target domains. Essentially, we present a novel paradigm based on the vision-language model to learn SF-UniDA and hugely reduce the labeling costs on the source domain. Experimental results demonstrate that our approach outperforms state-of-the-art UniDA and SF-UniDA models.
    摘要 通用领域适应(UniDA)旨在源域与目标域之间存在域偏移的情况下区分公共类别与私有类别。近来,由于数据限制愈发严格,研究者在更贴近现实的场景中提出了无源 UniDA(SF-UniDA)。SF-UniDA 方法在向目标域适应时无需直接访问源样本。然而,现有的 SF-UniDA 方法仍需要大量带标注的源样本来训练源模型,导致高昂的标注成本。为了解决这一问题,我们提出了一种新的分类器导向校准(COCA)方法。该方法利用文本原型,并基于少样本学习来构建源模型。具体而言,我们将通常用于闭集场景的少样本学习引入该任务,使模型即便在源域与目标域之间存在显著域偏移时,也能识别公共类别与域私有类别。本质上,我们提出了一种基于视觉-语言模型的新范式来学习 SF-UniDA,从而大幅降低源域的标注成本。实验结果表明,我们的方法优于当前最先进的 UniDA 和 SF-UniDA 模型。

Explore and Tell: Embodied Visual Captioning in 3D Environments

  • paper_url: http://arxiv.org/abs/2308.10447
  • repo_url: https://github.com/AIM3-RUC/ExploreAndTell
  • paper_authors: Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin
  • for: 提高视觉描述模型的准确率和可靠性,使其能够更好地理解场景和描述对象。
  • methods: 提出一种新任务 called Embodied Captioning,即让视觉描述模型具有导航能力,从不同视点获取场景信息,提高场景理解和描述准确率。
  • results: 建立了ET-Cap数据集,并提出了一种Cascade Embodied Captioning模型(CaBOT),可以帮助解决这个任务。模型在实验中表现出色,超过了其他特制基elines。
    Abstract While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises of a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, codes and models are available at https://aim3-ruc.github.io/ExploreAndTell.
    摘要 当前的视觉描述模型已经实现了各种印象的表现,但是它们通常假设图像是完整、有利的捕捉了场景。然而,在实际情况下,单个图像可能不会提供好的视点,从而降低细腻的场景理解。为了解决这个限制,我们提出了一项新任务,即Embodied Captioning,这个任务使得视觉描述模型具有探索能力,以减少从不佳视点的视觉含义不确定性。具体来说,从随机视点开始,一个代理人需要在环境中探索信息,并生成描述全场景中所有物品的完整段落。为支持这个任务,我们构建了ET-Cap数据集,使用Kubric simulator,包含10000个3D场景,每个场景有杂乱的物品,并且每个场景有三个注释段落。我们提出了CaBOT模型,它包括探索器和描述器,用于解决这个任务。探索器预测在环境中需要执行哪些动作,而描述器基于整个探索轨迹生成段落描述。我们的实验表明,我们的模型在其他优化的基elines上表现出色。我们的数据集、代码和模型可以在https://aim3-ruc.github.io/ExploreAndTell中获取。

When Prompt-based Incremental Learning Does Not Meet Strong Pretraining

  • paper_url: http://arxiv.org/abs/2308.10445
  • repo_url: https://github.com/tom-tym/apg
  • paper_authors: Yu-Ming Tang, Yi-Xing Peng, Wei-Shi Zheng
  • for: 这篇论文主要针对 incremental learning 问题,即如何使深度网络在不同任务之间快速学习,而不导致 catastrophic forgetting。
  • methods: 这篇论文提出了一种 learnable Adaptive Prompt Generator (APG),即将 prompt retrieval 和 prompt learning 过程结合在一起,以便在不同任务之间快速适应。此外, authors 还提出了一种知识池来规范 APG,以避免学习不效的知识。
  • results: 与先进方法相比,该方法在没有(强)预训练(通常指在 ImageNet-21k 上预训练)的情况下,在 exemplar-free incremental learning 中显著超越了其他方法;此外,在强预训练条件下,该方法也能取得与现有基于提示的模型相当的性能,表明它仍能从预训练中获益。
    Abstract Incremental learning aims to overcome catastrophic forgetting when learning deep networks from sequential tasks. With impressive learning efficiency and performance, prompt-based methods adopt a fixed backbone to sequential tasks by learning task-specific prompts. However, existing prompt-based methods heavily rely on strong pretraining (typically trained on ImageNet-21k), and we find that their models could be trapped if the potential gap between the pretraining task and unknown future tasks is large. In this work, we develop a learnable Adaptive Prompt Generator (APG). The key is to unify the prompt retrieval and prompt learning processes into a learnable prompt generator. Hence, the whole prompting process can be optimized to reduce the negative effects of the gap between tasks effectively. To make our APG avoid learning ineffective knowledge, we maintain a knowledge pool to regularize APG with the feature distribution of each class. Extensive experiments show that our method significantly outperforms advanced methods in exemplar-free incremental learning without (strong) pretraining. Besides, under strong retraining, our method also has comparable performance to existing prompt-based models, showing that our method can still benefit from pretraining. Codes can be found at https://github.com/TOM-tym/APG
    摘要 “增量学习”是为了解决深度网络在顺序任务上学习时的“迷失学习”问题。现有的方法通过学习任务特定的提示来采用固定的背部来sequential tasks。然而,现有的方法具有强大的预训练(通常是在ImageNet-21k上进行),而我们发现其模型在未知的未来任务之间的 gap 太大时会被锁定。在这种情况下,我们开发了一个可学习的 Adaptive Prompt Generator (APG)。关键在于将提取提示和学习提示过程统一到一个可学习的提示生成器中。因此,整个提示过程可以被优化,以降低任务之间的负面效果。此外,为了保证 APG 不会学习无用的知识,我们维护了一个知识池,用于规范 APG 的特征分布。我们的方法在无(强)预训练情况下与先进方法相比较出色,并且在强 RETRAINING 情况下也有相当的性能,表明我们的方法可以从预训练中受益。代码可以在 GitHub 上找到:https://github.com/TOM-tym/APG。

Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10438
  • repo_url: https://github.com/akimoto-cris/rd_vit_prune
  • paper_authors: Kaixin Xu, Zhe Wang, Xue Geng, Jie Lin, Min Wu, Xiaoli Li, Weisi Lin
  • for: 提高深度神经网络(DNN)的表现,特别是在输出误差最小化的情况下,同时满足一定的剔除率约束。
  • methods: 提出了一种层 adaptive Weight 剔除方法,利用多层weight的加法性来转化剔除问题为一个 combinatorial 优化问题,通过动态编程解决。
  • results: 在 CIFAR-10 和 ImageNet datasets 上实现了优于现有方法的表现,在 ResNet-32、VGG-16 和 DenseNet-121 上实现了remarkable 的提高,对于 VGG-16 和 ResNet-50 在 ImageNet 上实现了4.7% 和 4.6% 的顶部一Accuracy 提高。
    Abstract In this paper, we propose a novel layer-adaptive weight-pruning approach for Deep Neural Networks (DNNs) that addresses the challenge of optimizing the output distortion minimization while adhering to a target pruning ratio constraint. Our approach takes into account the collective influence of all layers to design a layer-adaptive pruning scheme. We discover and utilize a very important additivity property of output distortion caused by pruning weights on multiple layers. This property enables us to formulate the pruning as a combinatorial optimization problem and efficiently solve it through dynamic programming. By decomposing the problem into sub-problems, we achieve linear time complexity, making our optimization algorithm fast and feasible to run on CPUs. Our extensive experiments demonstrate the superiority of our approach over existing methods on the ImageNet and CIFAR-10 datasets. On CIFAR-10, our method achieves remarkable improvements, outperforming others by up to 1.0% for ResNet-32, 0.5% for VGG-16, and 0.7% for DenseNet-121 in terms of top-1 accuracy. On ImageNet, we achieve up to 4.7% and 4.6% higher top-1 accuracy compared to other methods for VGG-16 and ResNet-50, respectively. These results highlight the effectiveness and practicality of our approach for enhancing DNN performance through layer-adaptive weight pruning. Code will be available on https://github.com/Akimoto-Cris/RD_VIT_PRUNE.
    摘要 在这篇论文中,我们提出了一种新的层adaptive weight采样方法,用于深度神经网络(DNNs),以最小化输出误差的同时遵循target采样比例约束。我们的方法考虑了所有层的集合影响,设计了层adaptive采样方案。我们发现和利用了多层采样weight对输出误差的加性性质。这个性质使得我们可以将采样转换为一个 combinatorial 优化问题,通过动态Programming efficiently解决。我们将问题 decomposes into sub-problems,实现了线性时间复杂度,使我们的优化算法快速并可以在CPU上运行。我们的广泛实验表明我们的方法在ImageNet和CIFAR-10 datasets上表现出色,与其他方法相比,在CIFAR-10上出色的提高了top-1准确率,对ResNet-32、VGG-16和DenseNet-121的top-1准确率提高了0.5%、1.0%和0.7%。在ImageNet上,我们的方法对VGG-16和ResNet-50 achieved up to 4.7%和4.6% higher top-1 accuracy than other methods。这些结果表明我们的方法可以增强DNN性能 through layer-adaptive weight pruning。代码将于https://github.com/Akimoto-Cris/RD_VIT_PRUNE上发布。
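The additivity property makes the per-layer ratio allocation a knapsack-style problem. The sketch below shows such a dynamic program over per-layer pruning options; the distortion table is synthetic and the discrete ratio grid is an assumption, whereas in the paper the distortions would come from its output-distortion analysis.

```python
import numpy as np

def allocate_pruning(removed, distortion, target):
    """removed[l][k]   : number of weights removed in layer l under option k
       distortion[l][k]: resulting output distortion (assumed additive)
       target          : total number of weights that must be removed
    dp[b] = min total distortion with exactly b weights removed so far."""
    budget = sum(max(r) for r in removed)
    dp = np.full(budget + 1, np.inf); dp[0] = 0.0
    choice = []
    for r_l, d_l in zip(removed, distortion):
        nxt = np.full_like(dp, np.inf)
        arg = np.zeros(budget + 1, dtype=int)
        for k, (r, d) in enumerate(zip(r_l, d_l)):
            cand = np.full_like(dp, np.inf)
            cand[r:] = dp[:budget + 1 - r] + d
            better = cand < nxt
            nxt[better], arg[better] = cand[better], k
        dp, choice = nxt, choice + [arg]
    best_b = int(np.argmin(np.where(np.arange(budget + 1) >= target, dp, np.inf)))
    picks, b = [], best_b
    for arg, r_l in zip(reversed(choice), reversed(removed)):
        k = int(arg[b]); picks.append(k); b -= r_l[k]
    return picks[::-1], dp[best_b]

# toy example: 3 layers, options = prune {0%, 25%, 50%, 75%} of each layer
layer_sizes = [1000, 4000, 2000]
ratios = [0.0, 0.25, 0.5, 0.75]
removed = [[int(s * r) for r in ratios] for s in layer_sizes]
distortion = [[0.0, 0.1, 0.5, 2.0], [0.0, 0.05, 0.2, 1.0], [0.0, 0.3, 1.2, 4.0]]
picks, cost = allocate_pruning(removed, distortion, target=int(0.5 * sum(layer_sizes)))
print(picks, cost)   # per-layer option indices and the total distortion
```

Because the table over removed-weight budgets is swept once per layer, the search stays linear in the number of layers, in line with the complexity claim above.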

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.10421
  • repo_url: https://github.com/hollow-503/unim2ae
  • paper_authors: Jian Zou, Tianyu Huang, Guanglei Yang, Zhenhua Guo, Wangmeng Zuo
  • for: 本研究旨在开发一种多模态自适应预训练方法,以提高自动驾驶中多个感知器的数据融合。
  • methods: 本研究提出了一种名为UniM$^2$AE的多模态遮盲自动encoder方法,包括两个主要设计:首先,将多modalities的特征项映射到一个共同的3D卷积空间中,然后使用多模态3D交互模块(MMIM)来促进多模态之间的有效交互。
  • results: 实验表明,UniM$^2$AE可以提高3D物体检测和bird’s eye view(BEV)地图分割的性能,相比原始方法提高1.2%(NDS)和6.5%(mIoU)。
    Abstract Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it's commonplace to deploy multiple sensors for comprehensive environment perception. While integrating multi-modal features from these sensors can produce rich and powerful features, there is a noticeable gap in MAE methods addressing this integration. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming to pioneer a more efficient fusion of two distinct modalities. To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, the UniM$^2$AE is proposed. This model stands as a potent yet straightforward, multi-modal self-supervised pre-training framework, mainly consisting of two designs. First, it projects the features from both modalities into a cohesive 3D volume space, ingeniously expanded from the bird's eye view (BEV) to include the height dimension. The extension makes it possible to back-project the informative features, obtained by fusing features from both modalities, into their native modalities to reconstruct the multiple masked inputs. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate the efficient inter-modal interaction during the interaction process. Extensive experiments conducted on the nuScenes Dataset attest to the efficacy of UniM$^2$AE, indicating enhancements in 3D object detection and BEV map segmentation by 1.2\%(NDS) and 6.5\% (mIoU), respectively. Code is available at https://github.com/hollow-503/UniM2AE.
    摘要 masked autoencoders (MAE) 在学习强大表示方面发挥着重要作用,在多种3D感知任务中达到了出色的结果。在实际驾驶场景中,通常会使用多种感知器来捕捉环境。虽然将多Modal特征集成到这些感知器可以生成丰富和强大的特征,但是MAE方法中的集成还有一定的空白。这项研究探讨了针对自动驾驶场景的多Modal Masked Autoencoders,以实现更有效的感知器融合。为了结合图像和LiDAR点云中的semantics和几何特征,我们提出了UniM$^2$AE模型。这是一种简单而强大的多Modal自我超vised预训练框架,主要包括两个设计。首先,我们将两种模式的特征映射到一个共同的3D体积空间中,通过从鸟瞰视角(BEV)扩展而包含高度维度。这使得可以将多modal特征拼接而成的有用特征回溯到其原始模式来重建多个遮盲输入。其次,我们采用多Modal3D互动模块(MMIM)来促进多Modal之间的效率互动。对于nuScenes Dataset进行了广泛的实验,结果表明UniM$^2$AE模型在3D物体检测和BEV地图分割方面提高了1.2%(NDS)和6.5%(mIoU)。代码可以在https://github.com/hollow-503/UniM2AE找到。

The Change You Want to See (Now in 3D)

  • paper_url: http://arxiv.org/abs/2308.10417
  • repo_url: https://github.com/ragavsachdeva/CYWS-3D
  • paper_authors: Ragav Sachdeva, Andrew Zisserman
  • for: 检测三维场景中的变化,即两个不同摄像机位置和时间点上的图像之间的差异。
  • methods: 完全使用合成数据进行训练,采用"配准-求差"策略,利用自监督的冻结特征嵌入与特征差分来检测变化,且模型与类别无关。
  • results: 可以直接以两幅 RGB 图像作为输入进行检测,无需相机内外参、深度图或点云等辅助数据;此外还发布了一个带有人工标注差异的新评测数据集。
    Abstract The goal of this paper is to detect what has changed, if anything, between two "in the wild" images of the same 3D scene acquired from different camera positions and at different temporal instances. The open-set nature of this problem, occlusions/dis-occlusions due to the shift in viewpoint, and the lack of suitable training datasets, presents substantial challenges in devising a solution. To address this problem, we contribute a change detection model that is trained entirely on synthetic data and is class-agnostic, yet it is performant out-of-the-box on real world images without requiring fine-tuning. Our solution entails a "register and difference" approach that leverages self-supervised frozen embeddings and feature differences, which allows the model to generalise to a wide variety of scenes and domains. The model is able to operate directly on two RGB images, without requiring access to ground truth camera intrinsics, extrinsics, depth maps, point clouds, or additional before-after images. Finally, we collect and release a new evaluation dataset consisting of real-world image pairs with human-annotated differences and demonstrate the efficacy of our method. The code, datasets and pre-trained model can be found at: https://github.com/ragavsachdeva/CYWS-3D
    摘要 “这篇论文的目标是检测在不同摄像机位和时间点上Acquired的同一个3D场景中的变化。由于这是一个开放集合问题,加上 occlusions/dis-occlusions 以及lack of suitable training datasets,这具有很大的挑战。为解决这个问题,我们提出了一种基于Synthetic Data的变化检测模型,这种模型是无类别的,可以在实际世界图像上表现出高性能,而不需要微调。我们的解决方案基于"register and difference"方法,利用自我supervised frozen embeddings和特征差异,使得模型可以适应各种场景和频谱。该模型可以直接操作两个RGB图像,不需要访问地面 truth camera intrinsics, extrinsics, depth maps, point clouds, or additional before-after images。最后,我们收集了和发布了一个新的评估数据集,该数据集包含人注解的差异,并证明了我们的方法的有效性。代码、数据集和预训练模型可以在:https://github.com/ragavsachdeva/CYWS-3D 找到。”
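A minimal sketch of the "register and difference" idea: once image B has been warped into image A's frame, frozen backbone features are extracted for both and their difference drives a small change head. The ResNet-18 encoder, the head architecture, and the assumption that registration happens upstream are illustrative choices, not the paper's model.

```python
import torch
import torch.nn as nn
import torchvision

class RegisterAndDifference(nn.Module):
    """Frozen feature extractor + feature difference + small change head."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # 512-ch map
        for p in self.encoder.parameters():            # keep embeddings frozen
            p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Conv2d(512, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1))                        # per-location change score

    def forward(self, img_a, img_b_warped):
        # img_b_warped is image B already registered (warped) into A's frame
        fa, fb = self.encoder(img_a), self.encoder(img_b_warped)
        return self.head((fa - fb).abs())               # change heat map in A's view

if __name__ == "__main__":
    model = RegisterAndDifference().eval()
    a, b = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        print(model(a, b).shape)    # torch.Size([1, 1, 7, 7])
```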

In-Rack Test Tube Pose Estimation Using RGB-D Data

  • paper_url: http://arxiv.org/abs/2308.10411
  • repo_url: None
  • paper_authors: Hao Chen, Weiwei Wan, Masaki Matsushita, Takeyuki Kotaka, Kensuke Harada
  • for: 本研究旨在提高生物和医疗领域 robotic manipulate 试管的精度和安全性,通过检测和确定试管的位置和orientation。
  • methods: 本研究使用 YOLO 对象检测器和点云注册技术来检测和确定试管和试管架的位置和orientation。
  • results: 本研究提出了一种基于优化算法的 pose 估计方法,可以在受到噪声和不完整的点云数据的情况下提供稳定和准确的 pose 估计结果。
    Abstract Accurate robotic manipulation of test tubes in biology and medical industries is becoming increasingly important to address workforce shortages and improve worker safety. The detection and localization of test tubes are essential for the robots to successfully manipulate test tubes. In this paper, we present a framework to detect and estimate poses for the in-rack test tubes using color and depth data. The methodology involves the utilization of a YOLO object detector to effectively classify and localize both the test tubes and the tube racks within the provided image data. Subsequently, the pose of the tube rack is estimated through point cloud registration techniques. During the process of estimating the poses of the test tubes, we capitalize on constraints derived from the arrangement of rack slots. By employing an optimization-based algorithm, we effectively evaluate and refine the pose of the test tubes. This strategic approach ensures the robustness of pose estimation, even when confronted with noisy and incomplete point cloud data.
    摘要 现代生物和医疗行业中的精度机器人操作试管是面临人力短缺和工作者安全问题的关键。检测和确定试管的位置是机器人操作试管的关键。在这篇论文中,我们提出了一种框架,用于通过色彩和深度数据检测和估算试管的位置。我们利用YOLO对象检测器有效地分类和定位试管和试管架 Within the provided image data.然后,我们通过点云配准技术来估算试管架的 pose。在估算试管的位置时,我们利用试管架的槽布局约束来约束估算。通过使用优化算法,我们有效地评估和修正试管的位置,从而确保pose估算的稳定性,即使面临噪点云数据时。
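The rack pose comes from point-cloud registration; the snippet below illustrates the core rigid-alignment step (Kabsch/SVD with known correspondences) on synthetic points. A real pipeline would run ICP or a robust variant on the depth-derived cloud, so treat this only as a demonstration of the geometry.

```python
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) aligning src to dst, both (N, 3),
    with known correspondences (Kabsch algorithm via SVD)."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t

# synthetic check: recover a known rack pose
rng = np.random.default_rng(0)
model_points = rng.normal(size=(200, 3))            # rack model point cloud
angle = np.deg2rad(30)
r_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.1, -0.2, 0.5])
observed = model_points @ r_true.T + t_true          # depth-camera observation
r_est, t_est = rigid_transform(model_points, observed)
print(np.allclose(r_est, r_true), np.allclose(t_est, t_true))  # True True
```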

Turning a CLIP Model into a Scene Text Spotter

  • paper_url: http://arxiv.org/abs/2308.10408
  • repo_url: https://github.com/wenwenyu/tcm
  • paper_authors: Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai
  • for: 这paper aimed to enhance scene text detection and spotting tasks using the large-scale Contrastive Language-Image Pretraining (CLIP) model.
  • methods: 这paper proposed a new backbone called FastTCM-CR50, which utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. It also introduces an instance-language matching process to refine text regions.
  • results: FastTCM-CR50 offers several advantages, including improved performance (1.7% and 1.5% on average), faster inference speed (48.5% increase), and robust few-shot training capabilities (26.5% and 5.5% improvement on text detection and spotting tasks, respectively). It also consistently enhances performance on out-of-distribution text detection and spotting datasets.
    Abstract We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks, transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge. Using predefined and learnable prompts, FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings, thereby refining text regions. Our Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt generation, enabling offline computations and improving performance. FastTCM-CR50 offers several advantages: 1) It can enhance existing text detectors and spotters, improving performance by an average of 1.7% and 1.5%, respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an average improvement of 0.2% and 0.56% in text detection and spotting tasks, along with a 48.5% increase in inference speed. 3) It showcases robust few-shot training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50 improves performance by an average of 26.5% and 5.5% for text detection and spotting tasks, respectively. 4) It consistently enhances performance on out-of-distribution text detection and spotting datasets, particularly the NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented object detection. The code is available at https://github.com/wenwenyu/TCM.
    摘要 我们利用大规模的对比语言图像预训练(CLIP)模型来提高场景文本检测和搜寻任务,转化其成一种强大的基础模型,快速TCM-CR50。这个基础模型利用视觉推议和跨Modal的注意力在CLIP中提取图像和文本基于的先前知识。使用预定义和学习的推议,快速TCM-CR50引入了一个实例语言匹配过程,以提高图像和文本嵌入的同步性,从而细化文本区域。我们的双模态匹配(BSM)模块实现了动态语言推议生成,可以进行离线计算,提高性能。快速TCM-CR50具有以下优点:1)可以提高现有的文本检测和搜寻器,提高性能约1.7%和1.5%,分别。2)超过之前的TCM-CR50基础模型,平均提高0.2%和0.56%的文本检测和搜寻任务,同时提高执行速度48.5%。3)展现了强大的几个shot训练能力。只使用10%的超vised数据,快速TCM-CR50提高性能约26.5%和5.5%的文本检测和搜寻任务,分别。4)一致提高了对于Out-of-distribution的文本检测和搜寻数据集的性能,特别是ICDAR2019-ArT的夜间Time-ArT子集和DOTA数据集用于方向对象检测。代码可以在https://github.com/wenwenyu/TCM上获取。

Towards Generalizable Morph Attack Detection with Consistency Regularization

  • paper_url: http://arxiv.org/abs/2308.10392
  • repo_url: None
  • paper_authors: Hossein Kashiani, Niloufar Alipour Talemi, Mohammad Saeed Ebrahimi Saadabadi, Nasser M. Nasrabadi
  • for: 本文主要针对 morph attack detection 的泛化能力增强。
  • methods: 本文提出了两种简单 yet effective的 morph-wise 扩展方法,以探索在实际场景中的各种可能的杂化变换。然后,通过对模型的 logit 和嵌入层进行常规化训练,实现模型在不同来源的杂化图像上学习一致性。
  • results: 实验结果表明,提出的方法在对抗不同来源的杂化图像时,具有更高的泛化性和鲁棒性表现。
    Abstract Though recent studies have made significant progress in morph attack detection by virtue of deep neural networks, they often fail to generalize well to unseen morph attacks. With numerous morph attacks emerging frequently, generalizable morph attack detection has gained significant attention. This paper focuses on enhancing the generalization capability of morph attack detection from the perspective of consistency regularization. Consistency regularization operates under the premise that generalizable morph attack detection should output consistent predictions irrespective of the possible variations that may occur in the input space. In this work, to reach this objective, two simple yet effective morph-wise augmentations are proposed to explore a wide space of realistic morph transformations in our consistency regularization. Then, the model is regularized to learn consistently at the logit as well as embedding levels across a wide range of morph-wise augmented images. The proposed consistency regularization aligns the abstraction in the hidden layers of our model across the morph attack images which are generated from diverse domains in the wild. Experimental results demonstrate the superior generalization and robustness performance of our proposed method compared to the state-of-the-art studies.
    摘要 尽管最近的研究已经在深度神经网络的支持下做出了重要的进步,但它们经常无法良好地泛化到未经见过的形态攻击。随着形态攻击的不断出现,泛化形态攻击检测已经受到了广泛的关注。这篇论文从一致规范的角度来提高形态攻击检测的泛化能力。我们认为,泛化形态攻击检测应该输出不同变化可能出现在输入空间时的一致预测。为达到这个目标,我们提出了两种简单 yet effective的形态增强技术,以探索在实际中可能出现的广泛的形态变换空间。然后,我们将模型规范化以在logit和嵌入层之间学习一致的方法,以涵盖各种形态增强图像的广泛范围。我们的一致规范将模型在不同来源的动态形态图像中的抽象层级划入一致。实验结果表明,我们的提议方法在比较方法中的泛化和鲁棒性性能表现出色。
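The consistency objective at the logit and embedding levels can be sketched as below, assuming a detector that returns both an embedding and two-class logits for bona fide vs. morph. The KL and cosine forms, the loss weights, and the tiny stand-in network are assumptions rather than the paper's exact losses, and the two morph-wise augmentations themselves are not reproduced here.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, view_a, view_b, lam_logit=1.0, lam_embed=1.0):
    """Regularise a morph detector to give consistent predictions for two
    morph-wise augmented views of the same image. `model` is assumed to
    return (embedding, logits)."""
    emb_a, log_a = model(view_a)
    emb_b, log_b = model(view_b)
    l_logit = F.kl_div(F.log_softmax(log_a, dim=1), F.softmax(log_b, dim=1),
                       reduction="batchmean")
    l_embed = (1.0 - F.cosine_similarity(emb_a, emb_b, dim=1)).mean()
    return lam_logit * l_logit + lam_embed * l_embed

class TinyDetector(torch.nn.Module):        # stand-in bona fide / morph classifier
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 8, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
        self.fc = torch.nn.Linear(8, 2)
    def forward(self, x):
        emb = self.backbone(x)
        return emb, self.fc(emb)

views = torch.randn(2, 4, 3, 112, 112)       # two augmented views per image
print(consistency_loss(TinyDetector(), views[0], views[1]))
```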

Automated mapping of virtual environments with visual predictive coding

  • paper_url: http://arxiv.org/abs/2308.10913
  • repo_url: None
  • paper_authors: James Gornet, Matthew Thomson
  • for: 本研究旨在探讨人类如何直接从感知输入中构建内部的认知地图,以及如何使用预测编码来建立这些地图。
  • methods: 本研究使用了一种自我注意力插入 convolutional neural network 来实现预测编码,并在虚拟环境中训练 Agent 进行下一幅图像预测任务。
  • results: 研究发现,通过使用预测编码,Agent 自动构建了一个内部表示环境的vectorized编码,可以支持vector导航并准确地确定自己的位置 relative to 附近的地标。
    Abstract Humans construct internal cognitive maps of their environment directly from sensory inputs without access to a system of explicit coordinates or distance measurements. While machine learning algorithms like SLAM utilize specialized visual inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified mapping algorithmic strategy that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps using sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information.The predictive coding network generates a vectorized encoding of the environment that supports vector navigation where individual latent space units delineate localized, overlapping neighborhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.
    摘要 人类直接从感知输入中构建内部的认知地图,无需访问特定坐标或距离测量系统。而机器学习算法如SLAM使用专门的视觉推理过程来识别视觉特征并构建空间地图从视觉和运动数据,然而大脑内部的认知地图具有一般的算法策略,这种策略可以普适地应用于auditory、感觉和语言输入。在这里,我们示示了预测编码提供了一种自然的神经网络算法,用于使用感知数据构建空间地图。我们介绍了一个框架,在该框架中,一个代理人在虚拟环境中游走,同时使用自我注意力扩展 convolutional neural network 进行视觉预测任务。在学习过程中,代理人自动构建了一个内部表示环境,该表示环境可以量化地表示距离。内部地图允许代理人使用仅视觉信息定位自己的位置。预测编码网络生成了一个矢量化编码环境,该编码环境支持矢量导航,其中个别的latent空间单元界定了环境中的地方化、重叠的地方。总之,我们的工作引入预测编码作为一种统一的算法框架,可以自然地扩展到auditory、感觉和语言输入的认知地图构建。

HoSNN: Adversarially-Robust Homeostatic Spiking Neural Networks with Adaptive Firing Thresholds

  • paper_url: http://arxiv.org/abs/2308.10373
  • repo_url: None
  • paper_authors: Hejia Geng, Peng Li
  • for: 本研究旨在开发一种可以抵抗 adversarial 攻击的神经网络模型,以提高神经网络的可靠性和安全性。
  • methods: 我们采用了一种基于 neural homeostasis 的解决方案,并提出了一种bio-inspired的 threshold-adapting leaky integrate-and-fire (TA-LIF) neuron模型,以建立我们的 adversarially robust homeostatic SNN (HoSNN)。
  • results: 我们的 HoSNN 在 CIFAR-10 上表现出了很好的 robustness,其中without explicit adversarial training,我们的 HoSNN 在 FGSM 和 PGD 攻击下的准确率提高至 72.6% 和 54.19%,比传统 LIF 神经网络高出 29.99% 和 47.83%。
    Abstract Spiking neural networks (SNNs) offer promise for efficient and powerful neurally inspired computation. Common to other types of neural networks, however, SNNs face the severe issue of vulnerability to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to develop a bio-inspired solution that counters the susceptibilities of SNNs to adversarial onslaughts. At the heart of our approach is a novel threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model, which we adopt to construct the proposed adversarially robust homeostatic SNN (HoSNN). Distinct from traditional LIF models, our TA-LIF model incorporates a self-stabilizing dynamic thresholding mechanism, curtailing adversarial noise propagation and safeguarding the robustness of HoSNNs in an unsupervised manner. Theoretical analysis is presented to shed light on the stability and convergence properties of the TA-LIF neurons, underscoring their superior dynamic robustness under input distributional shifts over traditional LIF neurons. Remarkably, without explicit adversarial training, our HoSNNs demonstrate inherent robustness on CIFAR-10, with accuracy improvements to 72.6% and 54.19% against FGSM and PGD attacks, up from 20.97% and 0.6%, respectively. Furthermore, with minimal FGSM adversarial training, our HoSNNs surpass previous models by 29.99% under FGSM and 47.83% under PGD attacks on CIFAR-10. Our findings offer a new perspective on harnessing biological principles for bolstering SNNs adversarial robustness and defense, paving the way to more resilient neuromorphic computing.
    摘要 脉冲神经网络(SNN)有望实现高效而强大的类脑计算。然而,与其他类型的神经网络一样,SNN 也面临着易受对抗攻击的严重问题。我们首次从神经内稳态(neural homeostasis)中汲取灵感,提出一种生物启发的解决方案,以应对 SNN 在对抗攻击下的脆弱性。我们方法的核心是一种新颖的阈值自适应泄漏积分发放(TA-LIF)神经元模型,并据此构建了所提出的对抗鲁棒内稳态 SNN(HoSNN)。与传统 LIF 模型不同,TA-LIF 模型引入了自稳定的动态阈值机制,能够抑制对抗噪声的传播,并以无监督的方式保障 HoSNN 的鲁棒性。我们给出了理论分析,阐明 TA-LIF 神经元的稳定性与收敛性质,表明其在输入分布偏移下比传统 LIF 神经元具有更强的动态鲁棒性。值得注意的是,在不进行显式对抗训练的情况下,HoSNN 在 CIFAR-10 上就展现出固有的鲁棒性:在 FGSM 和 PGD 攻击下的准确率分别提升至 72.6% 和 54.19%,而此前仅为 20.97% 和 0.6%。此外,仅需少量 FGSM 对抗训练,HoSNN 在 CIFAR-10 上即可在 FGSM 和 PGD 攻击下分别超越此前模型 29.99% 和 47.83%。我们的发现为借助生物学原理增强 SNN 的对抗鲁棒性与防御提供了新视角,为更具韧性的类脑计算铺平了道路。
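The abstract does not give the TA-LIF update equations, so the sketch below shows one plausible homeostatic rule: a leaky integrate-and-fire neuron whose firing threshold drifts so that its firing rate approaches a target. The decay constant, learning rate `eta`, and `target_rate` are hypothetical choices, not HoSNN's published dynamics.

```python
import torch

def ta_lif_forward(inputs, decay=0.9, theta0=1.0, eta=0.05, target_rate=0.2):
    """Leaky integrate-and-fire with a homeostatic, self-adapting threshold.
    inputs: (T, B, N) presynaptic currents. Returns spikes of the same shape."""
    t_steps, b, n = inputs.shape
    v = torch.zeros(b, n)
    theta = torch.full((b, n), theta0)
    spikes = []
    for t in range(t_steps):
        v = decay * v + inputs[t]
        s = (v >= theta).float()                 # fire when the threshold is crossed
        v = v * (1.0 - s)                        # hard reset after a spike
        theta = (theta + eta * (s - target_rate)).clamp(min=0.1)  # homeostatic drift
        spikes.append(s)
    return torch.stack(spikes)

x = torch.rand(20, 4, 128)                       # 20 time steps, batch 4, 128 neurons
out = ta_lif_forward(x)
print(out.shape, out.mean().item())              # firing rate drifts toward ~0.2
```

Raising the threshold of over-active neurons and lowering it for silent ones is one way such a rule can damp adversarially injected activity without any supervision.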

Developing a Machine Learning-Based Clinical Decision Support Tool for Uterine Tumor Imaging

  • paper_url: http://arxiv.org/abs/2308.10372
  • repo_url: None
  • paper_authors: Darryl E. Wright, Adriana V. Gregory, Deema Anaam, Sepideh Yadollahi, Sumana Ramanathan, Kafayat A. Oyemade, Reem Alsibai, Heather Holmes, Harrison Gottlich, Cherie-Akilah G. Browne, Sarah L. Cohen Rassier, Isabel Green, Elizabeth A. Stewart, Hiroaki Takahashi, Bohyun Kim, Shannon Laughlin-Tommaso, Timothy L. Kline
  • for: 这些研究旨在开发一种自动 segmentation 方法,以便 diferenciar between uterine tumors (UTs) and distinguish between different types of UTs.
  • methods: 研究人员使用 nnU-Net 模型,并探索了不同训练集大小对性能的影响。他们还使用了 radiomic(放射组学)特征来分类 UTs。
  • results: 研究人员发现,使用整个训练集可以达到人类水平的性能,但是在分类 benign versus malignant 和 degenerated LM versus LMS 任务中,自动分类仍然是一个挑战。
    Abstract Uterine leiomyosarcoma (LMS) is a rare but aggressive malignancy. On imaging, it is difficult to differentiate LMS from, for example, degenerated leiomyoma (LM), a prevalent but benign condition. We curated a data set of 115 axial T2-weighted MRI images from 110 patients (mean [range] age=45 [17-81] years) with UTs that included five different tumor types. These data were randomly split stratifying on tumor volume into training (n=85) and test sets (n=30). An independent second reader (reader 2) provided manual segmentations for all test set images. To automate segmentation, we applied nnU-Net and explored the effect of training set size on performance by randomly generating subsets with 25, 45, 65 and 85 training set images. We evaluated the ability of radiomic features to distinguish between types of UT individually and when combined through feature selection and machine learning. Using the entire training set the mean [95% CI] fibroid DSC was measured as 0.87 [0.59-1.00] and the agreement between the two readers was 0.89 [0.77-1.0] on the test set. When classifying degenerated LM from LMS we achieve a test set F1-score of 0.80. Classifying UTs based on radiomic features we identify classifiers achieving F1-scores of 0.53 [0.45, 0.61] and 0.80 [0.80, 0.80] on the test set for the benign versus malignant, and degenerated LM versus LMS tasks. We show that it is possible to develop an automated method for 3D segmentation of the uterus and UT that is close to human-level performance with fewer than 150 annotated images. For distinguishing UT types, while we train models that merit further investigation with additional data, reliable automatic differentiation of UTs remains a challenge.
    摘要 uterine leiomyosarcoma (LMS) 是一种罕见但侵袭性强的恶性肿瘤。在影像检查中,很难将 LMS 与例如变性的子宫平滑肌瘤(LM)相区分,后者虽然常见但为良性。为了解决这个问题,我们收集了115个轴向T2加权的MRI图像,来自110名患者(平均年龄45岁,范围17-81岁),这些图像包含五种不同的肿瘤类型。这些数据按肿瘤体积分层随机划分,85个图像作为训练集,30个图像作为测试集。一名独立的第二读者(读者2)为测试集图像提供了手动分割。为了自动分割,我们使用了nnU-Net,并研究了训练集大小对性能的影响,随机生成了25、45、65和85个图像的训练子集。我们评估了各种放射组学特征单独以及经特征选择与机器学习组合后区分不同UT类型的能力。使用整个训练集时,测试集的 fibroid DSC 均值为0.87(95% CI:0.59-1.00),两名读者之间的一致性为0.89(0.77-1.00)。在区分变性LM与LMS时,我们在测试集上取得了0.80的F1分数。基于放射组学特征的分类器在测试集上分别取得了0.53(0.45-0.61)和0.80(0.80-0.80)的F1分数,对应良性与恶性、变性LM与LMS两项任务。我们证明了可以使用少于150个标注图像开发一种接近人类水平的子宫与UT自动3D分割方法。然而,在区分UT类型方面,尽管我们训练的模型值得在更多数据上进一步研究,可靠的自动区分仍然是一个挑战。

Prediction of Pneumonia and COVID-19 Using Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10368
  • repo_url: None
  • paper_authors: M. S. Haque, M. S. Taluckder, S. B. Shawkat, M. A. Shahriyar, M. A. Sayed, C. Modak
  • for: 这个研究旨在探讨医疗影像分析可以如何帮助早期识别感染病毒和细菌所致的肺炎。
  • methods: 本研究使用机器学习技术预测肺炎基于胸部X射像。
  • results: 研究发现,使用DenseNet121模型可以实现肺炎患者的准确预测,精度为99.58%。
    Abstract Pneumonia, caused by bacteria and viruses, is a rapidly spreading viral infection with global implications. Prompt identification of infected individuals is crucial for containing its transmission. This study explores the potential of medical image analysis to address this challenge. We propose machine-learning techniques for predicting Pneumonia from chest X-ray images. Chest X-ray imaging is vital for Pneumonia diagnosis due to its accessibility and cost-effectiveness. However, interpreting X-rays for Pneumonia detection can be complex, as radiographic features can overlap with other respiratory conditions. We evaluate the performance of different machine learning models, including DenseNet121, Inception Resnet-v2, Inception Resnet-v3, Resnet50, and Xception, using chest X-ray images of pneumonia patients. Performance measures and confusion matrices are employed to assess and compare the models. The findings reveal that DenseNet121 outperforms other models, achieving an accuracy rate of 99.58%. This study underscores the significance of machine learning in the accurate detection of Pneumonia, leveraging chest X-ray images. Our study offers insights into the potential of technology to mitigate the spread of pneumonia through precise diagnostics.
    摘要 《肺炎,由病毒和细菌引起,是一种迅速传播的病毒感染,具有全球化意义。promptly identifying infected individuals是控制其传播的关键。本研究探讨了医疗图像分析在此挑战中的潜在作用。我们提议使用机器学习技术预测肺炎。胸部X射线成像是肺炎诊断的关键手段,因为它具有访问性和成本效果。然而,解释X射线图像可以困难,因为肺炎的放射学特征可能与其他呼吸道疾病重叠。我们评估了不同的机器学习模型,包括DenseNet121、Inception Resnet-v2、Inception Resnet-v3、Resnet50和Xception,使用胸部X射线图像。我们使用性能指标和冲突矩阵来评估和比较这些模型。研究发现,DenseNet121模型在识别肺炎方面表现出色,达到了99.58%的准确率。本研究强调了机器学习在准确诊断肺炎方面的重要性,利用胸部X射线图像。我们的研究为抑止肺炎传播提供了新的思路,并为医疗技术的进步提供了新的可能性。
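A minimal sketch of the DenseNet121 transfer-learning setup described above; the binary head, optimizer settings, input size, and the choice to start from random weights (in practice ImageNet-pretrained weights would normally be loaded) are assumptions for illustration, not the study's exact training configuration.

```python
import torch
import torch.nn as nn
import torchvision

# DenseNet121 adapted for binary pneumonia / normal classification of chest X-rays.
model = torchvision.models.densenet121(weights=None)   # load pretrained weights in practice
model.classifier = nn.Linear(model.classifier.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# dummy chest X-ray batch (real inputs would be normalised 224x224 crops)
print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, 2, (4,))))
```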

Vehicle Cameras Guide mmWave Beams: Approach and Real-World V2V Demonstration

  • paper_url: http://arxiv.org/abs/2308.10362
  • repo_url: None
  • paper_authors: Tawfik Osman, Gouranga Charan, Ahmed Alkhateeb
  • For: The paper is written for the purpose of exploring the use of vision sensors in vehicle-to-vehicle (V2V) communication scenarios to improve the accuracy of millimeter-wave (mmWave) and terahertz (THz) beam prediction.* Methods: The paper proposes a deep learning solution that uses images from a 360 camera attached to the vehicle to predict future beams in V2V communication scenarios.* Results: The proposed vision-aided solution achieves an accuracy of approximately 85% in top-5 beam prediction, while significantly reducing the beam training overhead, highlighting the potential of utilizing vision for enabling highly-mobile V2V communications.Here are the three points in Simplified Chinese text:* For: 这篇论文是为了探讨在交通自动车(V2V)通信场景中使用视觉感知器来提高毫米波(mmWave)和tera射频(THz)的扫描方向预测的目的而写的。* Methods: 论文提出了一种基于视觉学习的解决方案,使用附加到车辆上的360度摄像机拍摄的图像来预测V2V通信场景中的未来扫描方向。* Results: 提出的视觉协助解决方案在实际的多模式mmWave V2V通信数据集上被评估,实现了约85%的top-5扫描方向预测精度,同时显著减少了扫描方向训练过程的开销,这表明可以通过使用视觉来实现高度移动的V2V通信。
    Abstract Accurately aligning millimeter-wave (mmWave) and terahertz (THz) narrow beams is essential to satisfy reliability and high data rates of 5G and beyond wireless communication systems. However, achieving this objective is difficult, especially in vehicle-to-vehicle (V2V) communication scenarios, where both transmitter and receiver are constantly mobile. Recently, additional sensing modalities, such as visual sensors, have attracted significant interest due to their capability to provide accurate information about the wireless environment. To that end, in this paper, we develop a deep learning solution for V2V scenarios to predict future beams using images from a 360 camera attached to the vehicle. The developed solution is evaluated on a real-world multi-modal mmWave V2V communication dataset comprising co-existing 360 camera and mmWave beam training data. The proposed vision-aided solution achieves $\approx 85\%$ top-5 beam prediction accuracy while significantly reducing the beam training overhead. This highlights the potential of utilizing vision for enabling highly-mobile V2V communications.
    摘要 Accurately aligning millimeter-wave (mmWave) and terahertz (THz) narrow beams is crucial for ensuring the reliability and high data rates of 5G and beyond wireless communication systems. However, achieving this goal is challenging, especially in vehicle-to-vehicle (V2V) communication scenarios where both the transmitter and receiver are constantly moving. Recently, additional sensing modalities, such as visual sensors, have attracted significant attention due to their ability to provide accurate information about the wireless environment. To address this challenge, we propose a deep learning solution for V2V scenarios that uses images from a 360-degree camera attached to the vehicle to predict future beams. Our proposed solution is evaluated on a real-world multi-modal mmWave V2V communication dataset that includes co-existing 360-degree camera and mmWave beam training data. The vision-aided solution achieves approximately 85% top-5 beam prediction accuracy while significantly reducing the beam training overhead, which highlights the potential of utilizing vision for enabling highly mobile V2V communications.

Strata-NeRF : Neural Radiance Fields for Stratified Scenes

  • paper_url: http://arxiv.org/abs/2308.10337
  • repo_url: None
  • paper_authors: Ankit Dhiman, Srinath R, Harsh Rangwani, Rishubh Parihar, Lokesh R Boregowda, Srinath Sridhar, R Venkatesh Babu
  • for: 这篇论文旨在对具有多层(分层)结构的场景进行建模,并使用神经辐射场(NeRF)学习场景的底层表示。
  • methods: 本论文通过以向量量化(VQ)潜在表示作为条件来调控 NeRF,使其能够表达场景结构的突变。
  • results: 在多层合成数据集和 RealEstate10K 真实数据集上的评估表明,Strata-NeRF 能有效捕捉分层场景,减少伪影,并合成高保真的视图。
    Abstract Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photo-realistic novel views with high fidelity. However, most proposed settings concentrate on modelling a single object or a single level of a scene. However, in the real world, we may capture a scene at multiple levels, resulting in a layered capture. For example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences. However, most existing techniques struggle in modelling such scenes. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations which allow sudden changes in scene structure. We evaluate the effectiveness of our approach in multi-layered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate10K dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.
    摘要 神经辐射场(NeRF)方法学习场景的底层3D表示,并能生成高保真的新视图。然而,大多数现有设置只针对单个物体或场景的单一层级进行建模。而在真实世界中,我们往往会在多个层级上拍摄同一场景,形成分层的采集:例如,游客通常会先拍摄一座纪念碑的外部结构,再拍摄其内部结构。若能在3D建模中无缝切换不同层级,将极大提升沉浸式体验,但现有技术大多难以对这类场景建模。我们提出 Strata-NeRF,一个能隐式刻画多层级场景的单一神经辐射场。Strata-NeRF 通过以向量量化(VQ)潜在表示作为 NeRF 的条件来实现这一点,从而允许场景结构发生突变。我们先在包含多样场景的多层合成数据集上评估该方法的有效性,随后在真实世界的 RealEstate10K 数据集上进一步验证其泛化能力。结果表明,与现有方法相比,Strata-NeRF 能有效捕捉分层场景、减少伪影,并合成高保真的视图。

Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

  • paper_url: http://arxiv.org/abs/2308.10334
  • repo_url: None
  • paper_authors: Haoyuan Li, Haoye Dong, Hanchao Jia, Dong Huang, Michael C. Kampffmeyer, Liang Lin, Xiaodan Liang
  • for: 这篇论文旨在提高从视频中进行多人3D网格重建的精度,以便在虚拟现实、物理治疗等领域自动感知群体行为。
  • methods: 该论文提出了 Coordinate transFormer(CoordFormer)模型,直接建模多人的空间-时间关系,并以端到端方式同时完成多人网格重建。它不将特征图划分为粗尺度的 patch-wise token,而是使用一种新的坐标感知注意力来保留像素级的空间-时间坐标信息;此外,还提出了一种简单而有效的 Body Center Attention 机制来融合位置信息。
  • results: 实验结果表明,CoordFormer 在 3DPW 数据集上显著提升了当前最佳水平,在 MPJPE、PAMPJPE 和 PVE 指标上分别以 4.2%、8.8% 和 4.7% 的幅度超越此前最佳结果,同时比最近的基于视频的方法快 40%。代码可以在 https://github.com/Li-Hao-yuan/CoordFormer 中找到。
    Abstract Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.
    摘要 多人3D矩阵恢复从视频是自动识别人群行为的关键首先步骤,在虚拟现实、物理治疗等领域中具有广泛的应用前景。然而,现有的方法都是基于多个阶段 paradigm,其中人员检测和跟踪阶段在多人场景中进行,而时间动力模型只是对一个人进行模型。因此,它们的性能受到多人间空间temporal矩阵恢复中缺失的人员交互作用的限制,以及检测和跟踪 Defects。为解决这些挑战,我们提出了坐标变换器(CoordFormer),它直接模型多人空间temporal关系,并在端到端 manner中同时进行多个矩阵恢复。而不是将特征图分解成粗略的 patch-wise 块,CoordFormer 利用了一种新的坐标相关注意力来保留像素级别的空间temporal坐标信息。此外,我们还提出了一种简单 yet effective的 Body Center Attention 机制来融合位置信息。我们对3DPW数据集进行了广泛的实验,结果显示,CoordFormer 可以 Significantly improve the state-of-the-art,与之前最佳结果相比,提高 MPJPE、PAMPJPE 和 PVE 度量的成绩,分别提高4.2%、8.8%和4.7%,而且比最近的视频基于方法更快40%。代码可以在 找到。

Towards Real-World Visual Tracking with Temporal Contexts

  • paper_url: http://arxiv.org/abs/2308.10330
  • repo_url: https://github.com/vision4robotics/tctrack
  • paper_authors: Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu
  • for: 提高视觉跟踪性能,特别是在真实世界条件下。
  • methods: 提议一个两级框架(TCTrack),利用时间上下文来增强特征提取和相似度地图匹配。
  • results: 对8个知名测试集进行了广泛的实验,证明TCTrack++的超越性。真实世界测试直接证明TCTrack++可以快速应用于实际应用场景。
    Abstract Visual tracking has made significant improvements in the past few decades. Most existing state-of-the-art trackers 1) merely aim for performance in ideal conditions while overlooking the real-world conditions; 2) adopt the tracking-by-detection paradigm, neglecting rich temporal contexts; 3) only integrate the temporal information into the template, where temporal contexts among consecutive frames are far from being fully utilized. To handle those problems, we propose a two-level framework (TCTrack) that can exploit temporal contexts efficiently. Based on it, we propose a stronger version for real-world visual tracking, i.e., TCTrack++. It boils down to two levels: features and similarity maps. Specifically, for feature extraction, we propose an attention-based temporally adaptive convolution to enhance the spatial features using temporal information, which is achieved by dynamically calibrating the convolution weights. For similarity map refinement, we introduce an adaptive temporal transformer to encode the temporal knowledge efficiently and decode it for the accurate refinement of the similarity map. To further improve the performance, we additionally introduce a curriculum learning strategy. Also, we adopt online evaluation to measure performance in real-world conditions. Exhaustive experiments on 8 wellknown benchmarks demonstrate the superiority of TCTrack++. Real-world tests directly verify that TCTrack++ can be readily used in real-world applications.
    摘要 视觉跟踪在过去几十年中取得了显著进展。然而,现有的大多数先进跟踪器 1)仅追求理想条件下的性能,而忽视了真实世界条件;2)采用跟踪-检测(tracking-by-detection)范式,忽略了丰富的时间上下文;3)仅将时间信息融入模板,连续帧之间的时间上下文远未被充分利用。为解决这些问题,我们提出了一个能够高效利用时间上下文的两级框架(TCTrack),并在此基础上提出了面向真实世界视觉跟踪的更强版本 TCTrack++。该框架分为两个层面:特征与相似度图。具体而言,在特征提取方面,我们提出了一种基于注意力的时间自适应卷积,通过动态校准卷积权重,利用时间信息增强空间特征;在相似度图细化方面,我们引入了一种自适应时间变换器,高效地编码时间知识,并将其解码用于相似度图的精确细化。为进一步提升性能,我们还引入了课程学习策略,并采用在线评估来衡量真实世界条件下的性能。在8个知名基准上的大量实验表明了 TCTrack++ 的优越性,真实世界测试也直接验证了 TCTrack++ 可以直接用于实际应用。
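The temporally adaptive convolution can be pictured as a convolution whose kernel is re-calibrated every frame from an online temporal context. The sketch below uses an exponential moving average of channel statistics and a sigmoid calibration head, which is an assumed simplification of TCTrack's actual design; the momentum and the 2x scaling are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporallyAdaptiveConv(nn.Module):
    """3x3 conv whose base weight is re-calibrated per frame by factors
    generated from an exponential moving average of past features."""
    def __init__(self, channels: int, momentum: float = 0.9):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.02)
        self.calib = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.momentum = momentum
        self.register_buffer("context", torch.zeros(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # update the temporal context with the current frame's channel statistics
        frame_stat = x.mean(dim=(0, 2, 3)).detach()
        self.context = self.momentum * self.context + (1 - self.momentum) * frame_stat
        scale = 2.0 * self.calib(self.context)          # per-output-channel factor
        w = self.weight * scale.view(-1, 1, 1, 1)       # calibrated kernel
        return F.conv2d(x, w, padding=1)

layer = TemporallyAdaptiveConv(32)
for t in range(3):                                      # consecutive frames
    frame_feat = torch.randn(1, 32, 64, 64)
    print(layer(frame_feat).shape)                      # torch.Size([1, 32, 64, 64])
```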

Hyper Association Graph Matching with Uncertainty Quantification for Coronary Artery Semantic Labeling

  • paper_url: http://arxiv.org/abs/2308.10320
  • repo_url: None
  • paper_authors: Chen Zhao, Michele Esposito, Zhihui Xu, Weihua Zhou
  • for: 本研究旨在通过对侵入性冠状动脉造影(ICA)中的动脉分支进行语义标注,辅助冠状动脉疾病(CAD)的狭窄检测与诊断。
  • methods: 提出了一种带不确定性量化的超关联图匹配神经网络(HAGMN-UQ),通过图匹配将未标注的动脉段与已标注的动脉段对应,从而完成 ICA 中冠状动脉的语义标注。
  • results: 该模型在冠状动脉语义标注上取得了0.9345的准确率,且推理速度快,能够在实时临床决策场景中提供有效且高效的预测。
    Abstract Coronary artery disease (CAD) is one of the primary causes leading to death worldwide. Accurate extraction of individual arterial branches on invasive coronary angiograms (ICA) is important for stenosis detection and CAD diagnosis. However, deep learning-based models face challenges in generating semantic segmentation for coronary arteries due to the morphological similarity among different types of coronary arteries. To address this challenge, we propose an innovative approach using the hyper association graph-matching neural network with uncertainty quantification (HAGMN-UQ) for coronary artery semantic labeling on ICAs. The graph-matching procedure maps the arterial branches between two individual graphs, so that the unlabeled arterial segments are classified by the labeled segments, and the coronary artery semantic labeling is achieved. By incorporating the anatomical structural loss and uncertainty, our model achieved an accuracy of 0.9345 for coronary artery semantic labeling with a fast inference speed, leading to an effective and efficient prediction in real-time clinical decision-making scenarios.
    摘要 coronary artery disease (CAD) 是全球主要的死亡原因之一。精准地提取个体动脉分支在侵入性 coronary angiography (ICA) 上对梗阻检测和 CAD 诊断而言是必要的。然而,深度学习基本模型在生成 semantic segmentation 的 coronary arteries 方面遇到了挑战,因为不同类型的 coronary arteries 之间存在形态学 similarity。为解决这个挑战,我们提出了一种创新的方法,利用 hyper association graph-matching neural network with uncertainty quantification (HAGMN-UQ),用于在 ICA 上进行 coronary artery semantic labeling。图形匹配过程将两个个体图表中的动脉分支映射到一起,以便通过已知的分支进行分类,并实现 coronary artery semantic labeling。通过 incorporating 解剖结构损失和不确定性,我们的模型实现了 coronary artery semantic labeling 的准确率为 0.9345,并且具有快速的推理速度,从而在实时临床决策场景中实现了有效和高效的预测。

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

  • paper_url: http://arxiv.org/abs/2308.10315
  • repo_url: https://github.com/shikiw/robustmae
  • paper_authors: Qidong Huang, Xiaoyi Dong, Dongdong Chen, Yinpeng Chen, Lu Yuan, Gang Hua, Weiming Zhang, Nenghai Yu
  • for: 这个论文研究了基于BERT预训练的视觉转换器的逆向攻击性 robustness。
  • methods: 这个论文使用了BEiT和MAE等基于BERT预训练方法,并对这些方法进行了比较。
  • results: 研究发现,MAE的逆向攻击性 robustness比其他预训练方法更差,这导致了对这些预训练方法的基本差异和对逆向攻击性 robustness的影响进行了分析。研究还发现,预训练方法的逆向攻击性 robustness与重建目标有关,即预测图像裂割的raw像素会导致模型的逆向攻击性 robustness下降,而预测图像裂割的semantic context会导致模型的逆向攻击性 robustness上升。根据这种分析, authors提出了一种简单 yet有效的方法来提高MAE的逆向攻击性 robustness,即通过使用预训练数据集中提取的频谱知识来填充频谱空间,从而缩小攻击者的优化空间。
    Abstract In this paper, we investigate the adversarial robustness of vision transformers that are equipped with BERT pretraining (e.g., BEiT, MAE). A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods. This observation drives us to rethink the basic differences between these BERT pretraining methods and how these differences affect the robustness against adversarial perturbations. Our empirical analysis reveals that the adversarial robustness of BERT pretraining is highly related to the reconstruction target, i.e., predicting the raw pixels of masked image patches will degrade more adversarial robustness of the model than predicting the semantic context, since it guides the model to concentrate more on medium-/high-frequency components of images. Based on our analysis, we provide a simple yet effective way to boost the adversarial robustness of MAE. The basic idea is using the dataset-extracted domain knowledge to occupy the medium-/high-frequency of images, thus narrowing the optimization space of adversarial perturbations. Specifically, we group the distribution of pretraining data and optimize a set of cluster-specific visual prompts on frequency domain. These prompts are incorporated with input images through prototype-based prompt selection during test period. Extensive evaluation shows that our method clearly boost MAE's adversarial robustness while maintaining its clean performance on ImageNet-1k classification. Our code is available at: https://github.com/shikiw/RobustMAE.
    摘要 在这篇论文中,我们研究了具有BERT预训练的视觉变换器的抗对抗性。我们发现MAE具有较差的抗对抗性,这使我们思考这些BERT预训练方法之间的基本差异以及如何这些差异影响对抗扰动的Robustness。我们的实验分析表明,BERT预训练的抗对抗性强相关于预测目标,即预测图像覆盖区域的Raw像素或semantic上下文,而不是强制预测图像的媒体/高频成分。基于我们的分析,我们提出了一种简单 yet有效的方法来提高MAE的抗对抗性。这种方法是通过使用预处理数据集中提取的频谱知识来占用对抗扰动的优化空间。具体来说,我们将预处理数据集分布分组,然后在测试期间使用频谱域上优化一组特定的视觉提示。这些提示与输入图像进行权重平均,以提高MAE的抗对抗性。我们的实验表明,我们的方法可以明显提高MAE的抗对抗性,而不会影响其在ImageNet-1k分类任务中的清晰性。我们的代码可以在GitHub上找到:https://github.com/shikiw/RobustMAE。
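A hedged sketch of the test-time frequency-domain prompting idea: a learned prompt is added only to the medium/high-frequency band of the image spectrum, with one prompt per pre-training data cluster selected by prototype matching. The radial band mask, the cluster count, and the prompt shape are assumptions, not the paper's exact construction.

```python
import torch

def apply_frequency_prompt(images, prompt, low_radius=8):
    """Add a learnable prompt to the medium/high-frequency part of the image
    spectrum and return the modified image. `prompt` has the same spatial
    size as the images."""
    b, c, h, w = images.shape
    freq = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h // 2).float() ** 2 + (xx - w // 2).float() ** 2)
    band = (dist > low_radius).to(images)        # leave low frequencies untouched
    freq = freq + band * prompt                  # prompt occupies the mid/high band
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

# one prompt per pre-training data cluster; selection by nearest prototype
prompts = torch.nn.Parameter(torch.zeros(4, 1, 1, 224, 224))  # 4 hypothetical clusters
images = torch.rand(2, 3, 224, 224)
cluster_id = 1                                   # chosen by prototype matching (not shown)
print(apply_frequency_prompt(images, prompts[cluster_id]).shape)
```

Occupying the medium/high-frequency band with dataset-derived content is what narrows the optimization space left for adversarial perturbations, per the analysis above.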

DVGaze: Dual-View Gaze Estimation

  • paper_url: http://arxiv.org/abs/2308.10310
  • repo_url: https://github.com/yihuacheng/dvgaze
  • paper_authors: Yihua Cheng, Feng Lu
  • for: 这个论文是为了提出一种基于双摄像头的眼动估计方法(DV-Gaze),该方法可以利用双摄像头获取更多的面部信息,从而改善眼动估计性能。
  • methods: 该方法使用了一种叫做 dual-view interactive convolution (DIC) 块,该块在多个特征尺度上进行了双摄像头之间的交互式 convolution,以捕捉双摄像头之间的关系。此外,该方法还使用了一种 dual-view transformer 来估计眼动方向。
  • results: 该方法在 ETH-XGaze 和 EVE 数据集上 achieved state-of-the-art 性能,并且我们的实验也证明了双摄像头 gaze estimation 的潜在优势。 codes 可以在 https://github.com/yihuacheng/DVGaze 中下载。
    Abstract Gaze estimation methods estimate gaze from facial appearance with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information and thus complicate the gaze estimation problem. Recently, camera devices are rapidly updated. Dual cameras are affordable for users and have been integrated in many devices. This development suggests that we can further improve gaze estimation performance with dual-view gaze estimation. In this paper, we propose a dual-view gaze estimation network (DV-Gaze). DV-Gaze estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution in multiple feature scales. It fuses dual-view features along epipolar lines and compensates for the original feature with the fused feature. We further propose a dual-view transformer to estimate gaze from dual-view features. Camera poses are encoded to indicate the position information in the transformer. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on ETH-XGaze and EVE datasets. Our experiments also prove the potential of dual-view gaze estimation. We release codes in https://github.com/yihuacheng/DVGaze.
    摘要 “ gaze estimation 方法 根据 facial appearance 来Estimate gaze ,但因为单一 camera 的局限,所Captured facial appearance 无法提供完整的 facial information,因此复杂了 gaze estimation 问题。然而,随着 camera 设备的快速更新,用户可以轻松地使用 dual cameras,这些开发建议我们可以透过 dual-view gaze estimation 进一步改善 gaze estimation 性能。在这篇文章中,我们提出了 dual-view gaze estimation 网络 (DV-Gaze),DV-Gaze 可以从对照两个图像中Estimate dual-view gaze direction。我们首先提出了 dual-view interactive convolution (DIC) 封顶,DIC 封顶在多个尺度中进行了互动式 convolution,并将 dual-view 的特征融合到 epipolar 线上。我们还提出了 dual-view transformer 来估算 gaze direction,并将 camera pose 编码为 transformer 中的位置信息。我们还考虑了 dual-view gaze direction 的几何关系,并提出了 dual-view gaze consistency loss 来确保 DV-Gaze 的性能。实验结果显示 DV-Gaze 在 ETH-XGaze 和 EVE 数据集上实现了 state-of-the-art 的性能。我们释出了代码在 https://github.com/yihuacheng/DVGaze。”

Representation Disparity-aware Distillation for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.10308
  • repo_url: None
  • paper_authors: Yanjing Li, Sheng Xu, Mingbao Lin, Jihao Yin, Baochang Zhang, Xianbin Cao
  • for: The paper develops a novel knowledge distillation (KD) method for compact 3D detectors that addresses the representation disparity between the teacher model and its student counterpart.
  • methods: The proposed representation disparity-aware distillation (RDD) is built on the information bottleneck (IB) principle to minimize the disparity of proposal-region pairs in features and logits.
  • results: RDD outperforms existing KD methods, achieving an mAP of 57.1% on the nuScenes dataset with only 42% of the FLOPs, even surpassing the teacher's performance.
    Abstract In this paper, we focus on developing knowledge distillation (KD) for compact 3D detectors. We observe that off-the-shelf KD methods manifest their efficacy only when the teacher model and the student counterpart share similar intermediate feature representations. This might explain why they are less effective in building extremely compact 3D detectors, where significant representation disparity arises, primarily due to the intrinsic sparsity and irregularity of 3D point clouds. This paper presents a novel representation disparity-aware distillation (RDD) method to address the representation disparity issue and reduce the performance gap between compact students and over-parameterized teachers. This is accomplished by building our RDD from an innovative perspective of the information bottleneck (IB), which can effectively minimize the disparity of proposal region pairs from student and teacher in features and logits. Extensive experiments are performed to demonstrate the superiority of our RDD over existing KD methods. For example, our RDD increases the mAP of CP-Voxel-S to 57.1% on the nuScenes dataset, which even surpasses the teacher's performance while requiring only 42% of the FLOPs.
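
The abstract describes RDD only as an information-bottleneck-motivated objective that minimizes the disparity of matched proposal-region pairs in features and logits. As a rough orientation, the sketch below shows a generic proposal-pair distillation term (feature mimicking plus temperature-scaled logit KL); it does not reproduce the paper's IB derivation, and all names, shapes, and weightings are assumptions.

```python
import torch
import torch.nn.functional as F


def proposal_pair_distillation(stu_feats: torch.Tensor,
                               tea_feats: torch.Tensor,
                               stu_logits: torch.Tensor,
                               tea_logits: torch.Tensor,
                               adapter: torch.nn.Module,
                               temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical sketch of distilling matched proposal-region pairs in both
    features and logits (illustrative shapes, not the paper's RDD objective).

    stu_feats / tea_feats:   (N, C_s) / (N, C_t) pooled features of N matched proposals.
    stu_logits / tea_logits: (N, K) classification logits for the same proposals.
    adapter: a small student-side projection so channel widths match the teacher.
    """
    # Feature term: align adapted student features with teacher features per proposal.
    feat_loss = F.mse_loss(adapter(stu_feats), tea_feats)

    # Logit term: soft-label KL divergence with temperature scaling.
    t = temperature
    logit_loss = F.kl_div(F.log_softmax(stu_logits / t, dim=-1),
                          F.softmax(tea_logits / t, dim=-1),
                          reduction="batchmean") * (t * t)
    return feat_loss + logit_loss
```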

Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation

  • paper_url: http://arxiv.org/abs/2308.10306
  • repo_url: None
  • paper_authors: Jinyu Chen, Wenguan Wang, Si Liu, Hongsheng Li, Yi Yang
  • for: This paper presents ORAN, an audio-visual navigator based on cross-task navigation skill transfer, which enables an agent to locate a sounding source in a 3D environment.
  • methods: ORAN is trained with a confidence-aware cross-task policy distillation (CCPD) strategy that transfers basic point-to-point wayfinding skills to ORAN so it can better master audio-visual navigation. ORAN is also equipped with an omnidirectional information gathering (OIG) mechanism that collects visual-acoustic observations from different directions.
  • results: Compared with previous competitors, ORAN took 1st place in the Soundspaces Challenge 2022, improving SPL and SR by a relative 53% and 35%.
    Abstract Audio-visual navigation is an audio-targeted wayfinding task in which a robot agent is required to travel through a never-before-seen 3D environment towards the sounding source. In this article, we present ORAN, an omnidirectional audio-visual navigator based on cross-task navigation skill transfer. In particular, ORAN sharpens its two basic abilities for such a challenging task, namely wayfinding and audio-visual information gathering. First, ORAN is trained with a confidence-aware cross-task policy distillation (CCPD) strategy. CCPD transfers the fundamental, point-to-point wayfinding skill that is well trained on the large-scale PointGoal task to ORAN, so as to help ORAN better master audio-visual navigation with far fewer training samples. To improve the efficiency of knowledge transfer and address the domain gap, CCPD is made adaptive to the decision confidence of the teacher policy. Second, ORAN is equipped with an omnidirectional information gathering (OIG) mechanism, i.e., gleaning visual-acoustic observations from different directions before decision-making. As a result, ORAN yields more robust navigation behaviour. Taking CCPD and OIG together, ORAN significantly outperforms previous competitors. With model ensembling, we placed 1st in the Soundspaces Challenge 2022, improving SPL and SR by a relative 53% and 35%.
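
The abstract says CCPD adapts knowledge transfer to the decision confidence of the teacher (PointGoal) policy but does not spell out the weighting. A minimal sketch of one such confidence-weighted distillation term is shown below; the choice of the teacher's max action probability as the confidence measure, and the function and variable names, are assumptions rather than the paper's exact CCPD formulation.

```python
import torch
import torch.nn.functional as F


def confidence_weighted_policy_distillation(student_logits: torch.Tensor,
                                            teacher_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: the per-step KL between teacher and student action
    distributions is down-weighted when the teacher policy is unsure.

    student_logits / teacher_logits: (B, A) action logits for a batch of steps.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Per-step KL(teacher || student).
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-8).log() - student_logp)).sum(dim=-1)

    # Decision confidence of the teacher: probability of its argmax action.
    confidence = teacher_probs.max(dim=-1).values.detach()

    return (confidence * kl).mean()
```

In a full agent this term would be combined with the audio-visual navigation objective, with the omnidirectional visual-acoustic observations fused before the policy network; none of that is shown here.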