cs.CV - 2023-09-02

SEPAL: Spatial Gene Expression Prediction from Local Graphs

  • paper_url: http://arxiv.org/abs/2309.01036
  • repo_url: https://github.com/bcv-uniandes/sepal
  • paper_authors: Gabriel Mejia, Paula Cárdenas, Daniela Ruiz, Angela Castillo, Pablo Arbeláez
  • for: This study aims to develop a new model for predicting gene expression profiles from the visual appearance of tissue.
  • methods: The method exploits the biological biases of the problem by directly supervising relative differences with respect to mean expression, and uses a graph neural network to leverage the local visual context at every coordinate for prediction.
  • results: SEPAL outperforms previous state-of-the-art methods and other mechanisms for including spatial context on two human breast cancer datasets.
    Abstract Spatial transcriptomics is an emerging technology that aligns histopathology images with spatially resolved gene expression profiling. It holds the potential for understanding many diseases but faces significant bottlenecks such as specialized equipment and domain expertise. In this work, we present SEPAL, a new model for predicting genetic profiles from visual tissue appearance. Our method exploits the biological biases of the problem by directly supervising relative differences with respect to mean expression, and leverages local visual context at every coordinate to make predictions using a graph neural network. This approach closes the gap between complete locality and complete globality in current methods. In addition, we propose a novel benchmark that aims to better define the task by following current best practices in transcriptomics and restricting the prediction variables to only those with clear spatial patterns. Our extensive evaluation in two different human breast cancer datasets indicates that SEPAL outperforms previous state-of-the-art methods and other mechanisms of including spatial context.
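To make the mean-relative supervision concrete, below is a minimal PyTorch sketch of a loss that supervises per-gene deviations from the training-set mean expression. The class and variable names (MeanRelativeExpressionLoss, gene_means, delta_pred) are illustrative assumptions, not SEPAL's released code.

```python
import torch
import torch.nn as nn

class MeanRelativeExpressionLoss(nn.Module):
    """Supervise deviations from mean expression (a sketch, not SEPAL's exact loss).

    `gene_means` holds the per-gene mean expression computed on the training set.
    The model predicts `delta_pred`, a deviation from that mean, and the target
    is the measured expression minus the mean.
    """

    def __init__(self, gene_means: torch.Tensor):
        super().__init__()
        self.register_buffer("gene_means", gene_means)  # shape: (num_genes,)
        self.mse = nn.MSELoss()

    def forward(self, delta_pred: torch.Tensor, expression: torch.Tensor) -> torch.Tensor:
        delta_target = expression - self.gene_means      # relative difference w.r.t. mean
        return self.mse(delta_pred, delta_target)

    def to_expression(self, delta_pred: torch.Tensor) -> torch.Tensor:
        """Recover absolute expression from a predicted deviation."""
        return delta_pred + self.gene_means


if __name__ == "__main__":
    # Toy usage with random numbers standing in for predicted deltas and measured spots.
    num_genes, batch = 256, 8
    criterion = MeanRelativeExpressionLoss(torch.rand(num_genes))
    delta_pred = torch.randn(batch, num_genes)           # model output
    expression = torch.rand(batch, num_genes)            # measured expression
    print(float(criterion(delta_pred, expression)))
```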

Contrastive Grouping with Transformer for Referring Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.01017
  • repo_url: https://github.com/toneyaya/cgformer
  • paper_authors: Jiajin Tang, Ge Zheng, Cheng Shi, Sibei Yang
  • for: This work aims to improve the accuracy of referring image segmentation by using a mask classification framework with a token-based querying and grouping strategy to capture object-level information.
  • methods: The proposed Contrastive Grouping with Transformer network (CGFormer) introduces learnable query tokens to represent objects, then alternately queries linguistic features and groups visual features into the query tokens in every two consecutive layers for object-aware cross-modal reasoning. CGFormer further applies contrastive learning to identify the query token and the mask corresponding to the referent.
  • results: Experiments show that CGFormer consistently and significantly outperforms state-of-the-art one-stage methods in both segmentation and generalization settings.
    Abstract Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly.
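As a rough illustration of the token-based querying, grouping, and contrastive identification described above, here is a hedged PyTorch sketch of one alternating layer plus a token-selection loss. The module layout, dimensions, and loss form are assumptions for exposition and do not reproduce CGFormer's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGroupingLayer(nn.Module):
    """One alternating query/grouping step (a sketch of the idea, not CGFormer itself)."""

    def __init__(self, dim: int = 256, num_queries: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable object tokens
        self.attn_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang_feats, vis_feats):
        # lang_feats: (B, L, dim), vis_feats: (B, HW, dim)
        B = lang_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q, _ = self.attn_lang(q, lang_feats, lang_feats)       # query linguistic features
        q, grouping = self.attn_vis(q, vis_feats, vis_feats)   # group visual features into tokens
        return q, grouping                                      # grouping: (B, num_queries, HW)


def referent_contrastive_loss(query_tokens, sentence_embed, referent_idx, tau=0.07):
    """Pick the query token matching the referent by contrasting token/sentence similarity."""
    sim = F.cosine_similarity(query_tokens, sentence_embed.unsqueeze(1), dim=-1) / tau
    return F.cross_entropy(sim, referent_idx)


if __name__ == "__main__":
    layer = QueryGroupingLayer()
    lang = torch.randn(2, 12, 256)
    vis = torch.randn(2, 14 * 14, 256)
    tokens, grouping = layer(lang, vis)
    loss = referent_contrastive_loss(tokens, lang.mean(dim=1), torch.tensor([0, 3]))
    print(tokens.shape, grouping.shape, float(loss))
```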

Comparative Analysis of Deep Learning Architectures for Breast Cancer Diagnosis Using the BreaKHis Dataset

  • paper_url: http://arxiv.org/abs/2309.01007
  • repo_url: None
  • paper_authors: İrem Sayın, Muhammed Ali Soydaş, Yunus Emre Mert, Arda Yarkataş, Berk Ergun, Selma Sözen Yeh, Hüseyin Üvet
  • for: To evaluate how well deep learning models identify breast cancer.
  • methods: Five well-known deep learning models are used and compared for diagnosis: VGG, ResNet, Xception, Inception, and InceptionResNet.
  • results: The Xception model performed best, with an F1 score of 0.9 and an accuracy of 89%; the Inception and InceptionResNet models both reached 87% accuracy, with F1 scores of 87 and 86 respectively. These results underline the value of deep learning methods for breast cancer diagnosis and their potential to provide better diagnostic services to patients.
    Abstract Cancer is an extremely difficult and dangerous health problem because it manifests in so many different ways and affects so many different organs and tissues. The primary goal of this research was to evaluate deep learning models' ability to correctly identify breast cancer cases using the BreakHis dataset. The BreakHis dataset covers a wide range of breast cancer subtypes through its huge collection of histopathological pictures. In this study, we use and compare the performance of five well-known deep learning models for cancer classification: VGG, ResNet, Xception, Inception, and InceptionResNet. The results placed the Xception model at the top, with an F1 score of 0.9 and an accuracy of 89%. At the same time, the Inception and InceptionResNet models both hit accuracy of 87% . However, the F1 score for the Inception model was 87, while that for the InceptionResNet model was 86. These results demonstrate the importance of deep learning methods in making correct breast cancer diagnoses. This highlights the potential to provide improved diagnostic services to patients. The findings of this study not only improve current methods of cancer diagnosis, but also make significant contributions to the creation of new and improved cancer treatment strategies. In a nutshell, the results of this study represent a major advancement in the direction of achieving these vital healthcare goals.
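A comparison of this kind can be set up by attaching a fresh binary head to ImageNet-pretrained backbones. The sketch below uses torchvision for three of the five architectures (Xception and InceptionResNet are not shipped with torchvision and would typically come from another library such as timm); the assumption of ImageNet pre-training and the helper name build_classifier are illustrative, not taken from the paper.

```python
import torch.nn as nn
from torchvision import models

def build_classifier(name: str, num_classes: int = 2) -> nn.Module:
    """Return an ImageNet-pretrained backbone with a fresh head for benign/malignant."""
    if name == "vgg":
        net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, num_classes)
    elif name == "resnet":
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
    elif name == "inception":
        net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
        net.fc = nn.Linear(net.fc.in_features, num_classes)
        net.AuxLogits.fc = nn.Linear(net.AuxLogits.fc.in_features, num_classes)  # aux head too
    else:
        raise ValueError(f"unknown backbone: {name}")
    return net

if __name__ == "__main__":
    for name in ["vgg", "resnet", "inception"]:
        model = build_classifier(name)
        print(name, round(sum(p.numel() for p in model.parameters()) / 1e6, 1), "M params")
```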

RevColV2: Exploring Disentangled Representations in Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2309.01005
  • repo_url: https://github.com/megvii-research/revcol
  • paper_authors: Qi Han, Yuxuan Cai, Xiangyu Zhang
  • for: This work proposes a new architecture, RevColV2, to address the fact that existing masked image modeling (MIM) methods discard the decoder network in downstream applications, with the goal of improving downstream task performance.
  • methods: RevColV2 consists of bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design lets the architecture maintain disentangled low-level and semantic information at the end of MIM pre-training.
  • results: Experiments show that a foundation model built on RevColV2 achieves competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation, and object detection. For example, after intermediate fine-tuning on ImageNet-22K, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation.
    Abstract Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. Such design enables our architecture with the nice property: maintaining disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection. For example, after intermediate fine-tuning on ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With extra teacher and large scale dataset, RevColv2-L achieves 62.1 box AP on COCO detection and 60.4 mIoU on ADE20K semantic segmentation. Code and models are released at https://github.com/megvii-research/RevCol

Constrained CycleGAN for Effective Generation of Ultrasound Sector Images of Improved Spatial Resolution

  • paper_url: http://arxiv.org/abs/2309.00995
  • repo_url: https://github.com/xfsun99/ccyclegan-tf2
  • paper_authors: Xiaofei Sun, He Li, Wei-Ning Lee
  • for: The goal of this study is to improve the spatial resolution of ultrasound (US) sector images so that cardiac motion can be assessed with higher quality.
  • methods: A constrained CycleGAN (CCycleGAN) directly generates US images from unpaired images acquired by different ultrasound array probes. Beyond the conventional adversarial and cycle-consistency losses, CCycleGAN introduces an identical loss and a correlation coefficient loss based on intrinsic backscattered signal properties to constrain structural consistency and backscattering patterns, and it operates on envelope data from beamformed radio-frequency signals rather than post-processed B-mode images.
  • results: In vitro phantom experiments show that CCycleGAN generates images with improved spatial resolution and higher PSNR and SSIM than benchmarks, and CCycleGAN-generated images of the in vivo human beating heart yield higher-quality heart wall motion estimation, particularly in deep regions.
    Abstract Objective. A phased or a curvilinear array produces ultrasound (US) images with a sector field of view (FOV), which inherently exhibits spatially-varying image resolution with inferior quality in the far zone and towards the two sides azimuthally. Sector US images with improved spatial resolutions are favorable for accurate quantitative analysis of large and dynamic organs, such as the heart. Therefore, this study aims to translate US images with spatially-varying resolution to ones with less spatially-varying resolution. CycleGAN has been a prominent choice for unpaired medical image translation; however, it neither guarantees structural consistency nor preserves backscattering patterns between input and generated images for unpaired US images. Approach. To circumvent this limitation, we propose a constrained CycleGAN (CCycleGAN), which directly performs US image generation with unpaired images acquired by different ultrasound array probes. In addition to conventional adversarial and cycle-consistency losses of CycleGAN, CCycleGAN introduces an identical loss and a correlation coefficient loss based on intrinsic US backscattered signal properties to constrain structural consistency and backscattering patterns, respectively. Instead of post-processed B-mode images, CCycleGAN uses envelope data directly obtained from beamformed radio-frequency signals without any other non-linear postprocessing. Main Results. In vitro phantom results demonstrate that CCycleGAN successfully generates images with improved spatial resolution as well as higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) compared with benchmarks. Significance. CCycleGAN-generated US images of the in vivo human beating heart further facilitate higher quality heart wall motion estimation than benchmarks-generated ones, particularly in deep regions.
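The correlation coefficient loss mentioned in the abstract can be sketched as one minus the Pearson correlation between the generated and reference envelope images, averaged over the batch. This is a generic formulation for illustration; the paper's exact loss may differ.

```python
import torch

def correlation_coefficient_loss(generated: torch.Tensor, reference: torch.Tensor,
                                 eps: float = 1e-8) -> torch.Tensor:
    """1 - Pearson correlation between two envelope images, averaged over the batch."""
    g = generated.flatten(start_dim=1)
    r = reference.flatten(start_dim=1)
    g = g - g.mean(dim=1, keepdim=True)
    r = r - r.mean(dim=1, keepdim=True)
    corr = (g * r).sum(dim=1) / (g.norm(dim=1) * r.norm(dim=1) + eps)
    return (1.0 - corr).mean()


if __name__ == "__main__":
    envelope_in = torch.rand(4, 1, 128, 128)                     # toy envelope data
    envelope_gen = envelope_in + 0.05 * torch.randn_like(envelope_in)
    print(float(correlation_coefficient_loss(envelope_gen, envelope_in)))
```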

Deep-Learning Framework for Optimal Selection of Soil Sampling Sites

  • paper_url: http://arxiv.org/abs/2309.00974
  • repo_url: None
  • paper_authors: Tan-Hanh Pham, Praneel Acharya, Sravanthi Bachina, Kristopher Osterloh, Kim-Doang Nguyen
  • for: This study applies deep learning to select optimal soil sampling sites in a field.
  • methods: Two approaches are compared: an existing state-of-the-art model with a convolutional neural network (CNN) backbone, and a new deep-learning design based on the transformer and self-attention, built as an encoder-decoder in which self-attention extracts features and atrous convolutions fuse them in the decoder.
  • results: The proposed model reaches a mean accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice coefficient of 71.47% on the test set, versus 66.08%, 3.85%, and 1.98% for the state-of-the-art CNN-based model, showing that the proposed model outperforms the CNN-based method on the soil-sampling dataset.
    Abstract This work leverages the recent advancements of deep learning in image processing to find optimal locations that present the important characteristics of a field. The data for training are collected at different fields in local farms with five features: aspect, flow accumulation, slope, NDVI (normalized difference vegetation index), and yield. The soil sampling dataset is challenging because the ground truth is highly imbalanced binary images. Therefore, we approached the problem with two methods, the first approach involves utilizing a state-of-the-art model with the convolutional neural network (CNN) backbone, while the second is to innovate a deep-learning design grounded in the concepts of transformer and self-attention. Our framework is constructed with an encoder-decoder architecture with the self-attention mechanism as the backbone. In the encoder, the self-attention mechanism is the key feature extractor, which produces feature maps. In the decoder, we introduce atrous convolution networks to concatenate, fuse the extracted features, and then export the optimal locations for soil sampling. Currently, the model has achieved impressive results on the testing dataset, with a mean accuracy of 99.52%, a mean Intersection over Union (IoU) of 57.35%, and a mean Dice Coefficient of 71.47%, while the performance metrics of the state-of-the-art CNN-based model are 66.08%, 3.85%, and 1.98%, respectively. This indicates that our proposed model outperforms the CNN-based method on the soil-sampling dataset. To the best of our knowledge, our work is the first to provide a soil-sampling dataset with multiple attributes and leverage deep learning techniques to enable the automatic selection of soil-sampling sites. This work lays a foundation for novel applications of data science and machine-learning technologies to solve other emerging agricultural problems.
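For reference, the IoU and Dice metrics reported above are computed on binary masks as follows; this is a generic implementation, not the authors' evaluation code.

```python
import torch

def iou_and_dice(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-7):
    """Compute IoU and Dice for binary masks of identical shape."""
    pred = pred_mask.bool()
    gt = gt_mask.bool()
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum().float() + gt.sum().float() + eps)
    return iou.item(), dice.item()


if __name__ == "__main__":
    pred = torch.zeros(1, 64, 64); pred[:, 10:40, 10:40] = 1     # toy prediction
    gt = torch.zeros(1, 64, 64); gt[:, 15:45, 15:45] = 1         # toy ground truth
    print(iou_and_dice(pred, gt))
```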

AdLER: Adversarial Training with Label Error Rectification for One-Shot Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.00971
  • repo_url: https://github.com/hsiangyuzhao/adler
  • paper_authors: Xiangyu Zhao, Sheng Wang, Zhiyun Song, Zhenrong Shen, Linlin Yao, Haolei Yuan, Qian Wang, Lichi Zhang
  • for: The goal of this work is to improve the accuracy of automatic medical image segmentation when training data are scarce.
  • methods: Building on one-shot segmentation based on learned transformations (OSSLT), which combines unsupervised deformable registration, data augmentation with learned registration, and segmentation learned from the augmented data, the method adds a dual consistency constraint for anatomy-aligned registration, adversarial training to augment the atlas image, and rectification of potential label errors by estimating segmentation uncertainty.
  • results: The proposed one-shot segmentation method (AdLER) improves segmentation performance and is more robust and accurate when training data are insufficient, outperforming previous state-of-the-art methods on the CANDI and ABIDE datasets.
    Abstract Accurate automatic segmentation of medical images typically requires large datasets with high-quality annotations, making it less applicable in clinical settings due to limited training data. One-shot segmentation based on learned transformations (OSSLT) has shown promise when labeled data is extremely limited, typically including unsupervised deformable registration, data augmentation with learned registration, and segmentation learned from augmented data. However, current one-shot segmentation methods are challenged by limited data diversity during augmentation, and potential label errors caused by imperfect registration. To address these issues, we propose a novel one-shot medical image segmentation method with adversarial training and label error rectification (AdLER), with the aim of improving the diversity of generated data and correcting label errors to enhance segmentation performance. Specifically, we implement a novel dual consistency constraint to ensure anatomy-aligned registration that lessens registration errors. Furthermore, we develop an adversarial training strategy to augment the atlas image, which ensures both generation diversity and segmentation robustness. We also propose to rectify potential label errors in the augmented atlas images by estimating segmentation uncertainty, which can compensate for the imperfect nature of deformable registration and improve segmentation authenticity. Experiments on the CANDI and ABIDE datasets demonstrate that the proposed AdLER outperforms previous state-of-the-art methods by 0.7% (CANDI), 3.6% (ABIDE "seen"), and 4.9% (ABIDE "unseen") in segmentation based on Dice scores, respectively. The source code will be available at https://github.com/hsiangyuzhao/AdLER.
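One simple way to realize uncertainty-based label rectification is to ignore voxels whose predictive entropy is high, as in the sketch below. The entropy-threshold rule and all names here are assumptions made for illustration; AdLER's actual rectification is more involved.

```python
import torch
import torch.nn.functional as F

def rectify_labels_by_uncertainty(logits: torch.Tensor, propagated_labels: torch.Tensor,
                                  entropy_threshold: float = 0.5,
                                  ignore_index: int = 255) -> torch.Tensor:
    """Mask out likely registration errors with a simple entropy-based uncertainty test.

    `logits` are segmentation logits (B, C, ...) from the current model and
    `propagated_labels` are labels warped from the atlas by registration. Voxels whose
    normalised predictive entropy exceeds the threshold are set to `ignore_index` so
    they do not contribute to the segmentation loss.
    """
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)        # (B, ...)
    entropy = entropy / torch.log(torch.tensor(float(logits.size(1))))      # normalise to [0, 1]
    rectified = propagated_labels.clone()
    rectified[entropy > entropy_threshold] = ignore_index
    return rectified


if __name__ == "__main__":
    logits = torch.randn(2, 4, 32, 32, 32)                # toy 3D segmentation logits
    labels = torch.randint(0, 4, (2, 32, 32, 32))
    out = rectify_labels_by_uncertainty(logits, labels)
    print((out == 255).float().mean().item(), "of voxels ignored")
```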

NTU4DRadLM: 4D Radar-centric Multi-Modal Dataset for Localization and Mapping

  • paper_url: http://arxiv.org/abs/2309.00962
  • repo_url: https://github.com/junzhang2016/NTU4DRadLM
  • paper_authors: Jun Zhang, Huayang Zhuge, Yiyao Liu, Guohao Peng, Zhenyu Wu, Haoyuan Zhang, Qiyang Lyu, Heshan Li, Chunyang Zhao, Dogan Kircali, Sanat Mharolkar, Xun Yang, Su Yi, Yuanzhe Wang, Danwei Wang
  • for: This paper is written for researchers and developers who are interested in Simultaneous Localization and Mapping (SLAM) using 4D radar, thermal camera, and Inertial Measurement Unit (IMU).
  • methods: The paper presents a new dataset called NTU4DRadLM, which includes all 6 sensors (4D radar, thermal camera, IMU, 3D LiDAR, visual camera, and RTK GPS) and is specifically designed for SLAM tasks.
  • results: The paper evaluates three types of SLAM algorithms using the NTU4DRadLM dataset and reports the results, which include the accuracy of the algorithms in various environments.
    Abstract Simultaneous Localization and Mapping (SLAM) is moving towards a robust perception age. However, LiDAR- and visual- SLAM may easily fail in adverse conditions (rain, snow, smoke and fog, etc.). In comparison, SLAM based on 4D Radar, thermal camera and IMU can work robustly. But only a few literature can be found. A major reason is the lack of related datasets, which seriously hinders the research. Even though some datasets are proposed based on 4D radar in past four years, they are mainly designed for object detection, rather than SLAM. Furthermore, they normally do not include thermal camera. Therefore, in this paper, NTU4DRadLM is presented to meet this requirement. The main characteristics are: 1) It is the only dataset that simultaneously includes all 6 sensors: 4D radar, thermal camera, IMU, 3D LiDAR, visual camera and RTK GPS. 2) Specifically designed for SLAM tasks, which provides fine-tuned ground truth odometry and intentionally formulated loop closures. 3) Considered both low-speed robot platform and fast-speed unmanned vehicle platform. 4) Covered structured, unstructured and semi-structured environments. 5) Considered both middle- and large- scale outdoor environments, i.e., the 6 trajectories range from 246m to 6.95km. 6) Comprehensively evaluated three types of SLAM algorithms. Totally, the dataset is around 17.6km, 85mins, 50GB and it will be accessible from this link: https://github.com/junzhang2016/NTU4DRadLM

ASF-Net: Robust Video Deraining via Temporal Alignment and Online Adaptive Learning

  • paper_url: http://arxiv.org/abs/2309.00956
  • repo_url: None
  • paper_authors: Xinwei Xue, Jia He, Long Ma, Xiangyu Meng, Wenlin Li, Risheng Liu
  • for: This work addresses two key challenges in learning-based video deraining: exploiting the temporal correlations among adjacent frames and adapting to unknown real-world scenarios.
  • methods: A new computational paradigm, the Alignment-Shift-Fusion Network (ASF-Net), incorporates a temporal shift module that explores temporal information more deeply by facilitating channel-level information exchange within the feature space across neighboring frames.
  • results: Parameters are learned on a newly constructed LArge-scale RAiny video dataset (LARA) with an innovative re-degraded learning strategy that bridges the gap between synthetic and real-world scenes, yielding stronger scene adaptability. The proposed method outperforms existing approaches on three benchmarks and delivers compelling visual quality in real-world scenarios.
    Abstract In recent times, learning-based methods for video deraining have demonstrated commendable results. However, there are two critical challenges that these methods are yet to address: exploiting temporal correlations among adjacent frames and ensuring adaptability to unknown real-world scenarios. To overcome these challenges, we explore video deraining from a paradigm design perspective to learning strategy construction. Specifically, we propose a new computational paradigm, Alignment-Shift-Fusion Network (ASF-Net), which incorporates a temporal shift module. This module is novel to this field and provides deeper exploration of temporal information by facilitating the exchange of channel-level information within the feature space. To fully discharge the model's characterization capability, we further construct a LArge-scale RAiny video dataset (LARA) which also supports the development of this community. On the basis of the newly-constructed dataset, we explore the parameters learning process by developing an innovative re-degraded learning strategy. This strategy bridges the gap between synthetic and real-world scenes, resulting in stronger scene adaptability. Our proposed approach exhibits superior performance in three benchmarks and compelling visual quality in real-world scenarios, underscoring its efficacy. The code is available at https://github.com/vis-opt-group/ASF-Net.
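The temporal shift idea can be illustrated with a TSM-style channel shift across the frame axis, as sketched below; ASF-Net's own module may be parameterised and integrated differently, so treat this as an assumption-laden illustration.

```python
import torch

def temporal_channel_shift(features: torch.Tensor, shift_fraction: int = 8) -> torch.Tensor:
    """Shift a fraction of channels forward/backward along the frame axis.

    `features` has shape (B, T, C, H, W). One C//shift_fraction slice is shifted to the
    previous frame, another to the next frame, and the rest stays put, letting later
    convolutions mix channel-level information across neighboring frames.
    """
    b, t, c, h, w = features.shape
    fold = c // shift_fraction
    out = torch.zeros_like(features)
    out[:, :-1, :fold] = features[:, 1:, :fold]                      # shift backward in time
    out[:, 1:, fold:2 * fold] = features[:, :-1, fold:2 * fold]      # shift forward in time
    out[:, :, 2 * fold:] = features[:, :, 2 * fold:]                 # untouched channels
    return out


if __name__ == "__main__":
    x = torch.randn(2, 5, 64, 32, 32)    # a 5-frame clip of feature maps
    print(temporal_channel_shift(x).shape)
```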

Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning

  • paper_url: http://arxiv.org/abs/2309.00942
  • repo_url: None
  • paper_authors: Sha Meng, Dian Shao, Jiacheng Guo, Shan Gao
  • for: This paper proposes an unsupervised learning method to strengthen object identification and association in multiple object tracking (MOT).
  • methods: The method exploits the latent consistency of sample features across video frames through three contrast modules: self-contrast, cross-contrast, and ambiguity contrast.
  • results: On existing benchmarks, the method outperforms existing unsupervised methods while using only limited help from a ReID head, and even matches or exceeds the accuracy of many fully supervised methods.
    Abstract Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association through an implicit manner. On existing benchmarks, our method outperforms the existing unsupervised methods using only limited help from ReID head, and even provides higher accuracy than lots of fully supervised methods.
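One plausible reading of the self-/cross-contrast idea is a cycle-association loss between detections of two frames: associating frame A to frame B and back should return each detection to itself. The sketch below implements that reading and is not UCSL's actual loss.

```python
import torch
import torch.nn.functional as F

def cycle_self_contrast_loss(feat_a: torch.Tensor, feat_b: torch.Tensor,
                             tau: float = 0.1) -> torch.Tensor:
    """Cycle-style self-contrast between detection embeddings of two frames.

    `feat_a` (N, D) and `feat_b` (M, D) are appearance embeddings of detections in two
    frames. The round-trip association A -> B -> A is encouraged to be the identity
    with a cross-entropy against the diagonal assignment.
    """
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    ab = F.softmax(a @ b.t() / tau, dim=1)        # (N, M): A -> B association
    ba = F.softmax(b @ a.t() / tau, dim=1)        # (M, N): B -> A association
    round_trip = ab @ ba                          # (N, N): should be near identity
    targets = torch.arange(feat_a.size(0))
    return F.nll_loss(torch.log(round_trip.clamp_min(1e-8)), targets)


if __name__ == "__main__":
    fa = torch.randn(6, 128)
    fb = torch.cat([fa[torch.randperm(6)], torch.randn(2, 128)])   # shuffled + clutter
    print(float(cycle_self_contrast_loss(fa, fb)))
```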

Exploring the Robustness of Human Parsers Towards Common Corruptions

  • paper_url: http://arxiv.org/abs/2309.00938
  • repo_url: None
  • paper_authors: Sanyi Zhang, Xiaochun Cao, Rui Wang, Guo-Jun Qi, Jie Zhou
  • for: To improve the robustness of human parsing models so that they better handle various image corruptions.
  • methods: Three corruption robustness benchmarks (LIP-C, ATR-C, and Pascal-Person-Part-C) are constructed, and a heterogeneous augmentation-enhanced mechanism is proposed that sequentially combines two augmentation views, image-aware augmentation and model-aware image-to-image transformation, to adapt to common image corruptions.
  • results: Experiments show that the proposed method improves the robustness of human parsing models, and even of semantic segmentation models, under various common image corruptions while retaining comparable performance on clean data.
    Abstract Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper, we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentations from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated in a sequential manner for adapting to unforeseen image corruptions. The image-aware augmentation can enrich the high diversity of training images with the help of common image operations. The model-aware augmentation strategy that improves the diversity of input data by considering the model's randomness. The proposed method is model-agnostic, and it can plug and play into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method demonstrates good universality which can improve the robustness of the human parsing models and even the semantic segmentation models when facing various image common corruptions. Meanwhile, it can still obtain approximate performance on clean data.
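The image-aware augmentation branch can be approximated with a stack of common image operations, as in the torchvision sketch below. The specific operations and magnitudes are assumptions (the paper does not fix them here), and for parsing the geometric transforms would have to be applied jointly to the image and its label map.

```python
from torchvision import transforms

# Image-aware augmentation: a stack of common image operations to enrich training
# diversity. For human parsing, the crop/flip would be mirrored on the label map.
image_aware_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(473, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

if __name__ == "__main__":
    from PIL import Image
    import numpy as np
    img = Image.fromarray(np.uint8(np.random.rand(512, 512, 3) * 255))
    print(image_aware_augmentation(img).shape)   # torch.Size([3, 473, 473])
```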

Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.00933
  • repo_url: None
  • paper_authors: Zhengming Zhou, Qiulei Dong
  • for: Proposes a Two-in-One self-supervised depth estimation network (TiO-Depth) that handles both monocular and binocular depth estimation while improving prediction accuracy.
  • methods: TiO-Depth uses a Siamese architecture in which each sub-network can serve as a monocular depth estimation model; for binocular depth estimation, a Monocular Feature Matching module incorporates stereo knowledge between the two images, and a multi-stage joint-training strategy combines the relative advantages of the two tasks.
  • results: Experiments on the KITTI, Cityscapes, and DDAD datasets show that TiO-Depth outperforms both monocular and binocular state-of-the-art methods in most cases, confirming the feasibility of a two-in-one network for monocular and binocular depth estimation.
    Abstract Monocular and binocular self-supervised depth estimations are two important and related tasks in computer vision, which aim to predict scene depths from single images and stereo image pairs respectively. In literature, the two tasks are usually tackled separately by two different kinds of models, and binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to binocular models. In this paper, we propose a Two-in-One self-supervised depth estimation network, called TiO-Depth, which could not only compatibly handle the two tasks, but also improve the prediction accuracy. TiO-Depth employs a Siamese architecture and each sub-network of it could be used as a monocular depth estimation model. For binocular depth estimation, a Monocular Feature Matching module is proposed for incorporating the stereo knowledge between the two images, and the full TiO-Depth is used to predict depths. We also design a multi-stage joint-training strategy for improving the performances of TiO-Depth in both two tasks by combining the relative advantages of them. Experimental results on the KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms both the monocular and binocular state-of-the-art methods in most cases, and further verify the feasibility of a two-in-one network for monocular and binocular depth estimation. The code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch.

S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.00928
  • repo_url: None
  • paper_authors: Xuan He, Kailun Yang, Junwei Zheng, Jin Yuan, Luis M. Bergasa, Hui Zhang, Zhiyong Li
  • for: To improve the accuracy of monocular 3D object detection, particularly for multi-category objects.
  • methods: A novel "Supervised Shape&Scale-perceptive Deformable Attention" (S$^3$-DA) module uses visual and depth features to generate diverse local features with various shapes and scales and simultaneously predicts the corresponding matching distribution, imposing valuable shape and scale perception on each query; a Multi-classification-based Shape&Scale Matching (MSM) loss supervises this process.
  • results: Extensive experiments on the KITTI and Waymo Open datasets show that S$^3$-DA significantly improves detection accuracy, achieving state-of-the-art single-category and multi-category 3D object detection in a single training process.
    Abstract Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, which can predict 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, whose quality plays a decisive role in the detection accuracy. However, current unsupervised attention mechanisms without any geometry appearance awareness in transformers are susceptible to producing noisy features for query points, which severely limits the network performance and also makes the model have a poor ability to detect multi-category objects in a single training process. To tackle this problem, this paper proposes a novel "Supervised Shape&Scale-perceptive Deformable Attention" (S$^3$-DA) module for monocular 3D object detection. Concretely, S$^3$-DA utilizes visual and depth features to generate diverse local features with various shapes and scales and predict the corresponding matching distribution simultaneously to impose valuable shape&scale perception for each query. Benefiting from this, S$^3$-DA effectively estimates receptive fields for query points belonging to any category, enabling them to generate robust query features. Besides, we propose a Multi-classification-based Shape$\&$Scale Matching (MSM) loss to supervise the above process. Extensive experiments on KITTI and Waymo Open datasets demonstrate that S$^3$-DA significantly improves the detection accuracy, yielding state-of-the-art performance of single-category and multi-category 3D object detection in a single training process compared to the existing approaches. The source code will be made publicly available at https://github.com/mikasa3lili/S3-MonoDETR.

GBE-MLZSL: A Group Bi-Enhancement Framework for Multi-Label Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2309.00923
  • repo_url: None
  • paper_authors: Ziming Liu, Jingcai Guo, Xiaocheng Lu, Song Guo, Peiran Dong, Jiewei Zhang
  • for: This paper investigates the challenging problem of multi-label zero-shot learning (MLZSL), in which a model must recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge such as semantic information.
  • methods: A novel and effective Group Bi-Enhancement framework (GBE-MLZSL) fully exploits both local and global image features to improve accuracy and robustness. The feature maps are split into several feature groups, each trained independently with a Local Information Distinguishing Module (LID) to ensure uniqueness, while a Global Enhancement Module (GEM) preserves the principal direction and a static graph structure models the correlations among local features.
  • results: Experiments on the large-scale MLZSL benchmarks NUS-WIDE and Open-Images-v4 show that GBE-MLZSL outperforms other state-of-the-art methods by large margins.
    Abstract This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL), wherein, the model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually resort to analyzing the relationship of various seen classes residing in a sample from the dimension of spatial or semantic characteristics, and transfer the learned model to unseen ones. But they ignore the effective integration of local and global features. That is, in the process of inferring unseen classes, global features represent the principal direction of the image in the feature space, while local features should maintain uniqueness within a certain range. This integrated neglect will make the model lose its grasp of the main components of the image. Relying only on the local existence of seen classes during the inference stage introduces unavoidable bias. In this paper, we propose a novel and effective group bi-enhancement framework for MLZSL, dubbed GBE-MLZSL, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. Specifically, we split the feature maps into several feature groups, of which each feature group can be trained independently with the Local Information Distinguishing Module (LID) to ensure uniqueness. Meanwhile, a Global Enhancement Module (GEM) is designed to preserve the principal direction. Besides, a static graph structure is designed to construct the correlation of local features. Experiments on large-scale MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the proposed GBE-MLZSL outperforms other state-of-the-art methods with large margins.

A novel framework employing deep multi-attention channels network for the autonomous detection of metastasizing cells through fluorescence microscopy

  • paper_url: http://arxiv.org/abs/2309.00911
  • repo_url: None
  • paper_authors: Michail Mamalakis, Sarah C. Macfarlane, Scott V. Notley, Annica K. B Gad, George Panoutsos
  • for: distinguishing between normal and metastasizing human cells
  • methods: combines multi-attention channels network and global explainable techniques using fluorescence microscopy images of actin and vimentin filaments
  • results: an unprecedented, biologically relevant understanding of the cytoskeletal changes that accompany oncogenic transformation, and a potential spatial biomarker (the spatial distribution of vimentin) for future diagnostic tools against metastasis
    Abstract We developed a transparent computational large-scale imaging-based framework that can distinguish between normal and metastasizing human cells. The method relies on fluorescence microscopy images showing the spatial organization of actin and vimentin filaments in normal and metastasizing single cells, using a combination of multi-attention channels network and global explainable techniques. We test a classification between normal cells (Bj primary fibroblast), and their isogenically matched, transformed and invasive counterpart (BjTertSV40TRasV12). Manual annotation is not trivial to automate due to the intricacy of the biologically relevant features. In this research, we utilized established deep learning networks and our new multi-attention channel architecture. To increase the interpretability of the network - crucial for this application area - we developed an interpretable global explainable approach correlating the weighted geometric mean of the total cell images and their local GradCam scores. The significant results from our analysis unprecedently allowed a more detailed, and biologically relevant understanding of the cytoskeletal changes that accompany oncogenic transformation of normal to invasive and metastasizing cells. We also paved the way for a possible spatial micrometre-level biomarker for future development of diagnostic tools against metastasis (spatial distribution of vimentin).

MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation

  • paper_url: http://arxiv.org/abs/2309.00908
  • repo_url: None
  • paper_authors: Hanshu Yan, Jun Hao Liew, Long Mai, Shanchuan Lin, Jiashi Feng
  • for: Editing the visual appearance of a video while preserving its motion.
  • methods: The proposed MagicProp framework works in two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify its content and/or style. In the second stage, the edited frame serves as an appearance reference, and the remaining frames are rendered autoregressively by a diffusion-based conditional generation model (PropDPM).
  • results: MagicProp combines the flexibility of image-editing techniques with the temporal consistency of autoregressive modeling, enabling edits to object types and aesthetic styles in arbitrary regions of input videos while maintaining good temporal consistency across frames. Extensive experiments in various video editing scenarios demonstrate its effectiveness.
    Abstract This paper addresses the issue of modifying the visual appearance of videos while preserving their motion. A novel framework, named MagicProp, is proposed, which disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. The flexibility of these techniques enables the editing of arbitrary regions within the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach. To achieve this, a diffusion-based conditional generation model, called PropDPM, is developed, which synthesizes the target frame by conditioning on the reference appearance, the target motion, and its previous appearance. The autoregressive editing approach ensures temporal consistency in the resulting videos. Overall, MagicProp combines the flexibility of image-editing techniques with the superior temporal consistency of autoregressive modeling, enabling flexible editing of object types and aesthetic styles in arbitrary regions of input videos while maintaining good temporal consistency across frames. Extensive experiments in various video editing scenarios demonstrate the effectiveness of MagicProp.

A Generic Fundus Image Enhancement Network Boosted by Frequency Self-supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2309.00885
  • repo_url: https://github.com/liamheng/Annotation-free-Fundus-Image-Enhancement
  • paper_authors: Heng Li, Haofeng Liu, Huazhu Fu, Yanwu Xu, Hui Shu, Ke Niu, Yan Hu, Jiang Liu
  • for: This work develops a generic fundus image enhancement network (GFE-Net) that robustly corrects unknown low-quality fundus images without supervision or extra data, supporting clinical examination and intelligent diagnostic systems.
  • methods: Leveraging image frequency information, self-supervised representation learning is used to learn robust structure-aware representations from degraded images; a seamless architecture couples representation learning with image enhancement so that fundus images are accurately corrected while retinal structures are preserved.
  • results: Compared with state-of-the-art algorithms, GFE-Net shows superior data efficiency, enhancement performance, deployment efficiency, and scale generalizability, and it facilitates follow-up fundus image analysis.
    Abstract Fundus photography is prone to suffer from image quality degradation that impacts clinical examination performed by ophthalmologists or intelligent systems. Though enhancement algorithms have been developed to promote fundus observation on degraded images, high data demands and limited applicability hinder their clinical deployment. To circumvent this bottleneck, a generic fundus image enhancement network (GFE-Net) is developed in this study to robustly correct unknown fundus images without supervised or extra data. Levering image frequency information, self-supervised representation learning is conducted to learn robust structure-aware representations from degraded images. Then with a seamless architecture that couples representation learning and image enhancement, GFE-Net can accurately correct fundus images and meanwhile preserve retinal structures. Comprehensive experiments are implemented to demonstrate the effectiveness and advantages of GFE-Net. Compared with state-of-the-art algorithms, GFE-Net achieves superior performance in data dependency, enhancement performance, deployment efficiency, and scale generalizability. Follow-up fundus image analysis is also facilitated by GFE-Net, whose modules are respectively verified to be effective for image enhancement.

Fearless Luminance Adaptation: A Macro-Micro-Hierarchical Transformer for Exposure Correction

  • paper_url: http://arxiv.org/abs/2309.00872
  • repo_url: None
  • paper_authors: Gehui Li, Jinyuan Liu, Long Ma, Zhiying Jiang, Xin Fan, Risheng Liu
  • for: This paper aims to correct image exposure errors and thereby improve image quality.
  • methods: A Macro-Micro-Hierarchical transformer is proposed, combining macro attention to capture long-range dependencies, micro attention to extract local features, and a hierarchical structure for coarse-to-fine correction; a contrast constraint is coupled into the loss function, pulling the corrected image toward the positive sample and pushing it away from dynamically generated negative samples.
  • results: Experiments show that the method produces more visually appealing corrections than state-of-the-art approaches, and it also performs well as an image enhancer for low-light face recognition and low-light semantic segmentation.
    Abstract Photographs taken with less-than-ideal exposure settings often display poor visual quality. Since the correction procedures vary significantly, it is difficult for a single neural network to handle all exposure problems. Moreover, the inherent limitations of convolutions, hinder the models ability to restore faithful color or details on extremely over-/under- exposed regions. To overcome these limitations, we propose a Macro-Micro-Hierarchical transformer, which consists of a macro attention to capture long-range dependencies, a micro attention to extract local features, and a hierarchical structure for coarse-to-fine correction. In specific, the complementary macro-micro attention designs enhance locality while allowing global interactions. The hierarchical structure enables the network to correct exposure errors of different scales layer by layer. Furthermore, we propose a contrast constraint and couple it seamlessly in the loss function, where the corrected image is pulled towards the positive sample and pushed away from the dynamically generated negative samples. Thus the remaining color distortion and loss of detail can be removed. We also extend our method as an image enhancer for low-light face recognition and low-light semantic segmentation. Experiments demonstrate that our approach obtains more attractive results than state-of-the-art methods quantitatively and qualitatively.
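The contrast constraint, pulling the corrected image toward the ground truth and pushing it away from dynamically generated negatives, can be sketched as a ratio of positive to negative distances. The L1 distances and the ratio form below are assumptions for illustration, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def contrast_constraint(corrected, positive, negatives, eps: float = 1e-6):
    """Pull the corrected image toward the ground truth and away from negative samples.

    `corrected` and `positive` are (B, C, H, W); `negatives` is (B, N, C, H, W) holding
    dynamically generated negative exposures of the same scene.
    """
    d_pos = F.l1_loss(corrected, positive, reduction="none").flatten(1).mean(dim=1)
    d_neg = F.l1_loss(
        corrected.unsqueeze(1).expand_as(negatives), negatives, reduction="none"
    ).flatten(2).mean(dim=2).sum(dim=1)
    return (d_pos / (d_neg + eps)).mean()


if __name__ == "__main__":
    corrected = torch.rand(2, 3, 64, 64)
    gt = torch.rand(2, 3, 64, 64)
    negatives = torch.rand(2, 4, 3, 64, 64)   # e.g. re-exposed versions of the input
    print(float(contrast_constraint(corrected, gt, negatives)))
```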

Boosting Weakly-Supervised Image Segmentation via Representation, Transform, and Compensator

  • paper_url: http://arxiv.org/abs/2309.00871
  • repo_url: None
  • paper_authors: Chunyan Wang, Dong Zhang, Rui Yan
  • for: This work proposes a single-stage weakly-supervised image segmentation (WSIS) method that improves pseudo-mask quality and thus segmentation accuracy.
  • methods: A Siamese network with contrastive learning improves the quality of class activation maps (CAMs) through a self-refinement process: a cross-representation refinement module expands reliable object regions using different feature representations from the backbone, and a cross-transform regularization module learns robust class prototypes and captures global context information to refine rough CAMs.
  • results: On PASCAL VOC 2012, the method reaches 67.2% mIoU on the validation set and 68.76% mIoU on the test set, significantly outperforming other state-of-the-art methods; extended to weakly supervised object localization, it remains highly competitive.
    Abstract Weakly-supervised image segmentation (WSIS) is a critical task in computer vision that relies on image-level class labels. Multi-stage training procedures have been widely used in existing WSIS approaches to obtain high-quality pseudo-masks as ground-truth, resulting in significant progress. However, single-stage WSIS methods have recently gained attention due to their potential for simplifying training procedures, despite often suffering from low-quality pseudo-masks that limit their practical applications. To address this issue, we propose a novel single-stage WSIS method that utilizes a siamese network with contrastive learning to improve the quality of class activation maps (CAMs) and achieve a self-refinement process. Our approach employs a cross-representation refinement method that expands reliable object regions by utilizing different feature representations from the backbone. Additionally, we introduce a cross-transform regularization module that learns robust class prototypes for contrastive learning and captures global context information to feed back rough CAMs, thereby improving the quality of CAMs. Our final high-quality CAMs are used as pseudo-masks to supervise the segmentation result. Experimental results on the PASCAL VOC 2012 dataset demonstrate that our method significantly outperforms other state-of-the-art methods, achieving 67.2% and 68.76% mIoU on PASCAL VOC 2012 val set and test set, respectively. Furthermore, our method has been extended to weakly supervised object localization task, and experimental results demonstrate that our method continues to achieve very competitive results.

Big-model Driven Few-shot Continual Learning

  • paper_url: http://arxiv.org/abs/2309.00862
  • repo_url: None
  • paper_authors: Ziqi Gu, Chunyan Xu, Zihan Lu, Xin Liu, Anbo Dai, Zhen Cui
  • for: To improve the accuracy and robustness of few-shot continual learning (FSCL).
  • methods: Big-model driven transfer learning leverages the powerful encoding capability of existing big models, an instance-level adaptive decision mechanism provides flexible, sample-dependent cognitive support, and adaptive distillation transfers the big model's knowledge into the continual model.
  • results: On three popular datasets (CIFAR100, miniImageNet, and CUB200), the proposed B-FSCL completely surpasses all state-of-the-art FSCL methods.
    Abstract Few-shot continual learning (FSCL) has attracted intensive attention and achieved some advances in recent years, but now it is difficult to again make a big stride in accuracy due to the limitation of only few-shot incremental samples. Inspired by distinctive human cognition ability in life learning, in this work, we propose a novel Big-model driven Few-shot Continual Learning (B-FSCL) framework to gradually evolve the model under the traction of the world's big-models (like human accumulative knowledge). Specifically, we perform the big-model driven transfer learning to leverage the powerful encoding capability of these existing big-models, which can adapt the continual model to a few of newly added samples while avoiding the over-fitting problem. Considering that the big-model and the continual model may have different perceived results for the identical images, we introduce an instance-level adaptive decision mechanism to provide the high-level flexibility cognitive support adjusted to varying samples. In turn, the adaptive decision can be further adopted to optimize the parameters of the continual model, performing the adaptive distillation of big-model's knowledge information. Experimental results of our proposed B-FSCL on three popular datasets (including CIFAR100, minilmageNet and CUB200) completely surpass all state-of-the-art FSCL methods.
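The adaptive distillation of the big model's knowledge can be sketched as an instance-weighted KL distillation term, where the per-sample weights stand in for the adaptive decision mechanism. The weighting scheme and names below are assumptions, not B-FSCL's actual objective.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student_logits, teacher_logits, instance_weights,
                               temperature: float = 2.0) -> torch.Tensor:
    """Instance-weighted knowledge distillation from a big model to the continual model.

    `instance_weights` (B,) would come from an adaptive decision mechanism that judges,
    per sample, how much the big model's prediction should be trusted.
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)   # per-sample KL
    return (instance_weights * kl).mean() * (t * t)


if __name__ == "__main__":
    s = torch.randn(8, 100)                     # continual model logits
    tch = torch.randn(8, 100)                   # big-model logits
    w = torch.sigmoid(torch.randn(8))           # adaptive per-instance weights
    print(float(adaptive_distillation_loss(s, tch, w)))
```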

Correlated and Multi-frequency Diffusion Modeling for Highly Under-sampled MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2309.00853
  • repo_url: https://github.com/yqx7150/cm-dm
  • paper_authors: Yu Guan, Chuanming Yu, Shiyu Lu, Zhuoxu Cui, Dong Liang, Qiegen Liu
  • for: To improve MRI reconstruction accuracy and accelerate the sampling process.
  • methods: The method combines the properties of k-space data with the diffusion process, mining multi-frequency priors with different strategies to preserve fine texture details in the reconstructed image; high-frequency prior extractors bring the target distribution closer to the noise distribution so that the diffusion process converges more quickly.
  • results: Experimental results show that the proposed method reconstructs MR images more accurately than state-of-the-art methods while accelerating the sampling process.
    Abstract Most existing MRI reconstruction methods perform tar-geted reconstruction of the entire MR image without tak-ing specific tissue regions into consideration. This may fail to emphasize the reconstruction accuracy on im-portant tissues for diagnosis. In this study, leveraging a combination of the properties of k-space data and the diffusion process, our novel scheme focuses on mining the multi-frequency prior with different strategies to pre-serve fine texture details in the reconstructed image. In addition, a diffusion process can converge more quickly if its target distribution closely resembles the noise distri-bution in the process. This can be accomplished through various high-frequency prior extractors. The finding further solidifies the effectiveness of the score-based gen-erative model. On top of all the advantages, our method improves the accuracy of MRI reconstruction and accel-erates sampling process. Experimental results verify that the proposed method successfully obtains more accurate reconstruction and outperforms state-of-the-art methods.

A Post-Processing Based Bengali Document Layout Analysis with YOLOV8

  • paper_url: http://arxiv.org/abs/2309.00848
  • repo_url: None
  • paper_authors: Nazmus Sakib Ahmed, Saad Sakib Noor, Ashraful Islam Shanto Sikder, Abhijit Paul
  • for: To improve Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques.
  • methods: Uses data augmentation to improve model robustness and a two-stage prediction strategy for accurate element segmentation.
  • results: Shows that the ensemble model with post-processing outperforms the individual base models and addresses issues identified in the BaDLAD dataset.
    Abstract This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques. We tackle challenges unique to the complex Bengali script by employing data augmentation for model robustness. After careful evaluation on a validation set, we fine-tune our approach on the complete dataset, leading to a two-stage prediction strategy for accurate element segmentation. Our ensemble model, combined with post-processing, outperforms individual base architectures and addresses issues identified in the BaDLAD dataset. By leveraging this approach, we aim to advance Bengali document analysis, contributing to improved OCR and document comprehension; BaDLAD serves as a foundational resource for this endeavor, aiding future research in the field. Furthermore, our experiments provided key insights for incorporating new strategies into the established solution.
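A hedged sketch of what a two-stage prediction with overlap-based post-processing could look like with the Ultralytics YOLOv8 API. The weights file `dla_yolov8.pt`, the confidence thresholds, and the merging rule are hypothetical stand-ins rather than the authors' released pipeline.

```python
from ultralytics import YOLO  # pip install ultralytics

def predict_layout(image_path, weights="dla_yolov8.pt", conf_first=0.5, conf_second=0.25):
    """Two-stage prediction sketch: a confident first pass, then a lower-threshold
    second pass whose extra boxes are kept only if they do not overlap boxes from
    the first pass. Weights and thresholds are hypothetical, not the paper's."""
    model = YOLO(weights)
    first = model(image_path, conf=conf_first)[0]
    second = model(image_path, conf=conf_second)[0]

    kept = list(first.boxes.xyxy.tolist())
    for box in second.boxes.xyxy.tolist():
        if all(iou(box, k) < 0.5 for k in kept):
            kept.append(box)
    return kept

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# boxes = predict_layout("bengali_page.png")  # image path is a placeholder
```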

pSTarC: Pseudo Source Guided Target Clustering for Fully Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2309.00846
  • repo_url: None
  • paper_authors: Manogna Sreenivas, Goirik Chakrabarty, Soma Biswas
  • for: Proposes a new Test-Time Adaptation (TTA) method so that models can perform well in real-world scenarios.
  • methods: The method, pseudo Source guided Target Clustering (pSTarC), targets the relatively unexplored setting of TTA under real-world domain shifts; it draws inspiration from target clustering techniques and exploits the source classifier to generate pseudo-source samples.
  • results: Experiments show that pSTarC improves prediction accuracy with efficient computational requirements, and its effectiveness also extends to the continuous TTA setting.
    Abstract Test-Time Adaptation (TTA) is a pivotal concept in machine learning, enabling models to perform well in real-world scenarios where the test data distribution differs from that of the training data. In this work, we propose a novel approach called pseudo Source guided Target Clustering (pSTarC), addressing the relatively unexplored area of TTA under real-world domain shifts. This method draws inspiration from target clustering techniques and exploits the source classifier to generate pseudo-source samples. The test samples are strategically aligned with these pseudo-source samples, facilitating their clustering and thereby enhancing TTA performance. pSTarC operates solely within the fully test-time adaptation protocol, removing the need for actual source data. Experimental validation on a variety of domain shift datasets, namely VisDA, Office-Home, DomainNet-126, and CIFAR-100C, verifies pSTarC's effectiveness. This method exhibits significant improvements in prediction accuracy along with efficient computational requirements. Furthermore, we also demonstrate the universality of the pSTarC framework by showing its effectiveness for the continuous TTA framework.
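The following sketch illustrates one way pseudo-source samples can be obtained from a frozen source classifier: random feature vectors are optimized until the classifier assigns them confidently to each class, and they then serve as clustering anchors for test features. The optimizer settings and the toy linear classifier are assumptions, not the exact pSTarC procedure.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_source(classifier, num_classes, feat_dim, per_class=4, steps=100, lr=0.1):
    """Optimize random feature vectors so a frozen source classifier assigns
    them confidently to each class; these act as pseudo-source anchors.
    A sketch of the general idea, not the exact pSTarC procedure."""
    feats = torch.randn(num_classes * per_class, feat_dim, requires_grad=True)
    targets = torch.arange(num_classes).repeat_interleave(per_class)
    opt = torch.optim.Adam([feats], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(classifier(feats), targets)
        loss.backward()
        opt.step()
    return feats.detach(), targets

# Toy usage: a linear "source classifier" over 256-d features and 10 classes.
clf = torch.nn.Linear(256, 10)
anchors, anchor_labels = generate_pseudo_source(clf, num_classes=10, feat_dim=256)
# At test time, each test feature would be pulled toward its nearest anchor,
# e.g. by minimizing a similarity-based clustering loss.
print(anchors.shape)
```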

ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data

  • paper_url: http://arxiv.org/abs/2309.00832
  • repo_url: https://github.com/cleanlab/cleanlab
  • paper_authors: Ulyana Tkachenko, Aditya Thyagarajan, Jonas Mueller
  • for: To improve the quality of object detection training data and thereby the accuracy and reliability of detection models.
  • methods: Proposes ObjectLab, a simple and intuitive algorithm for detecting diverse errors in object detection labels, including overlooked bounding boxes, badly located boxes, and incorrect class labels; it uses any trained object detection model to score the label quality of each image so that mislabeled images can be automatically prioritized for review and correction.
  • results: Across several object detection datasets (including COCO) and models (including Detectron-X101 and Faster-RCNN), ObjectLab detects annotation errors with much better precision/recall than other label quality scores.
    Abstract Despite powering sensitive systems like autonomous vehicles, object detection remains fairly brittle in part due to annotation errors that plague most real-world training datasets. We propose ObjectLab, a straightforward algorithm to detect diverse errors in object detection labels, including: overlooked bounding boxes, badly located boxes, and incorrect class label assignments. ObjectLab utilizes any trained object detection model to score the label quality of each image, such that mislabeled images can be automatically prioritized for label review/correction. Properly handling erroneous data enables training a better version of the same object detection model, without any change in existing modeling code. Across different object detection datasets (including COCO) and different models (including Detectron-X101 and Faster-RCNN), ObjectLab consistently detects annotation errors with much better precision/recall compared to other label quality scores.
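The released implementation lives in the cleanlab repository linked above; the toy score below only conveys the flavor of per-image label quality scoring, penalizing annotated boxes that no prediction matches and confident predictions that no annotation matches. The scoring formula and thresholds are illustrative assumptions, not cleanlab's actual method.

```python
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def image_label_quality(annotations, predictions, iou_thr=0.5, conf_thr=0.5):
    """Toy per-image label quality score in the spirit of ObjectLab:
    penalize annotated boxes that no prediction matches (badly located or
    wrong class) and confident predictions that no annotation matches
    (overlooked boxes). Lower scores suggest a mislabeled image."""
    penalties = 0.0
    for gt_box, gt_cls in annotations:
        matched = any(p_cls == gt_cls and box_iou(gt_box, p_box) >= iou_thr
                      for p_box, p_cls, _ in predictions)
        penalties += 0.0 if matched else 1.0
    for p_box, p_cls, p_conf in predictions:
        if p_conf < conf_thr:
            continue
        matched = any(g_cls == p_cls and box_iou(p_box, g_box) >= iou_thr
                      for g_box, g_cls in annotations)
        penalties += 0.0 if matched else p_conf
    n = max(1, len(annotations) + len(predictions))
    return 1.0 - penalties / n

ann = [((10, 10, 50, 50), "cat")]
pred = [((12, 11, 49, 52), "cat", 0.9), ((100, 100, 140, 150), "dog", 0.8)]
print(image_label_quality(ann, pred))  # flags a likely overlooked "dog" box
```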

Multi-scale, Data-driven and Anatomically Constrained Deep Learning Image Registration for Adult and Fetal Echocardiography

  • paper_url: http://arxiv.org/abs/2309.00831
  • repo_url: https://github.com/kamruleee51/ddc-ac-dlir
  • paper_authors: Md. Kamrul Hasan, Haobo Zhu, Guang Yang, Choon Hwai Yap
  • for: To improve the accuracy and robustness of echocardiography image registration for more reliable clinical assessment of cardiac motion and myocardial strain.
  • methods: Uses deep learning image registration (DLIR) and proposes a framework that combines an anatomic shape-encoded loss, an adversarially trained data-driven loss, and a multi-scale training scheme.
  • results: Tests show improved registration accuracy and robustness, with excellent results on both adult and fetal echocardiography.
    Abstract Temporal echocardiography image registration is a basis for clinical quantifications such as cardiac motion estimation, myocardial strain assessments, and stroke volume quantifications. In past studies, deep learning image registration (DLIR) has shown promising results and is consistently accurate and precise, requiring less computational time. We propose that a greater focus on the warped moving image's anatomic plausibility and image quality can support robust DLIR performance. Further, past implementations have focused on adult echocardiography, and there is an absence of DLIR implementations for fetal echocardiography. We propose a framework that combines three strategies for DLIR in both fetal and adult echo: (1) an anatomic shape-encoded loss to preserve physiological myocardial and left ventricular anatomical topologies in warped images; (2) a data-driven loss that is trained adversarially to preserve good image texture features in warped images; and (3) a multi-scale training scheme of a data-driven and anatomically constrained algorithm to improve accuracy. Our tests show that good anatomical topology and image textures are strongly linked to shape-encoded and data-driven adversarial losses. They improve different aspects of registration performance in a non-overlapping way, justifying their combination. Despite fundamental distinctions between adult and fetal echo images, we show that these strategies can provide excellent registration results in both adult and fetal echocardiography using the publicly available CAMUS adult echo dataset and our private multi-demographic fetal echo dataset. Our approach outperforms traditional non-DL gold standard registration approaches, including Optical Flow and Elastix. Registration improvements could be translated to more accurate and precise clinical quantification of cardiac ejection fraction, demonstrating a potential for translation.
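A minimal sketch of how the three ingredients could be combined into one training objective: an image-similarity term, a shape-encoded (Dice) term on warped anatomy masks, and an adversarial texture term. The loss weights, the MSE similarity, and the discriminator interface are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def dice_loss(warped_mask, fixed_mask, eps=1e-6):
    inter = (warped_mask * fixed_mask).sum()
    return 1 - (2 * inter + eps) / (warped_mask.sum() + fixed_mask.sum() + eps)

def registration_loss(warped_img, fixed_img, warped_mask, fixed_mask,
                      discriminator, w_sim=1.0, w_shape=0.5, w_adv=0.1):
    """Combine image similarity, an anatomical shape (Dice) term on warped
    myocardial/LV masks, and an adversarial texture term. Weights and the
    discriminator interface are assumptions for illustration."""
    sim = F.mse_loss(warped_img, fixed_img)                     # data-driven similarity
    shape = dice_loss(warped_mask, fixed_mask)                  # shape-encoded loss
    adv = -torch.log(discriminator(warped_img) + 1e-6).mean()   # fool the texture critic
    return w_sim * sim + w_shape * shape + w_adv * adv

# Toy usage with a trivial "discriminator" that outputs a probability map.
disc = torch.nn.Sequential(torch.nn.Conv2d(1, 1, 3, padding=1), torch.nn.Sigmoid())
img = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(registration_loss(img, img.clone(), mask, mask.clone(), disc))
```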

When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision

  • paper_url: http://arxiv.org/abs/2309.00828
  • repo_url: None
  • paper_authors: Qingtao Yu, Heming Du, Chen Liu, Xin Yu
  • for: To improve weakly supervised 3D point cloud instance segmentation that relies on coarse bounding-box annotations.
  • methods: Leverages the pretrained 2D foundation model SAM together with 3D geometric priors to derive accurate point-wise instance labels from bounding-box annotations.
  • results: Produces high-quality 3D point-wise instance labels on the ScanNet-v2 and S3DIS benchmarks and remains robust and effective under noisy bounding-box annotations.
    Abstract Learning from bounding-box annotations has shown great potential in weakly-supervised 3D point cloud instance segmentation. However, we observed that existing methods suffer severe performance degradation with perturbed bounding box annotations. To tackle this issue, we propose a complementary image prompt-induced weakly-supervised point cloud instance segmentation (CIP-WPIS) method. CIP-WPIS leverages pretrained knowledge embedded in the 2D foundation model SAM and 3D geometric priors to achieve accurate point-wise instance labels from the bounding box annotations. Specifically, CIP-WPIS first selects image views in which the 3D candidate points of an instance are fully visible. Then, we generate complementary background and foreground prompts from projections to obtain SAM 2D instance mask predictions. According to these, we assign confidence values to points indicating the likelihood that they belong to the instance. Furthermore, we utilize the 3D geometric homogeneity provided by superpoints to decide the final instance label assignments. In this fashion, we achieve high-quality 3D point-wise instance labels. Extensive experiments on both the ScanNet-v2 and S3DIS benchmarks demonstrate that our method is robust against noisy 3D bounding-box annotations and achieves state-of-the-art performance.
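A sketch of prompting SAM with the 2D projection of a 3D box plus complementary foreground/background points, using the public `segment_anything` package. The checkpoint path and the way prompts are constructed here are assumptions, not the paper's exact projection and prompt-selection logic.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # pip install segment-anything

def sam_mask_from_projected_box(image, box_2d, fg_points, bg_points,
                                checkpoint="sam_vit_h.pth", model_type="vit_h"):
    """Prompt SAM with the 2D projection of a 3D bounding box plus complementary
    foreground/background point prompts, returning a 2D instance mask.
    Checkpoint path and prompt construction are assumptions for illustration."""
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # HxWx3 uint8 RGB image

    coords = np.concatenate([fg_points, bg_points], axis=0).astype(np.float32)
    labels = np.concatenate([np.ones(len(fg_points)), np.zeros(len(bg_points))]).astype(np.int32)
    masks, scores, _ = predictor.predict(
        point_coords=coords,
        point_labels=labels,
        box=np.asarray(box_2d, dtype=np.float32),  # [x1, y1, x2, y2]
        multimask_output=False,
    )
    return masks[0], float(scores[0])

# Usage (paths and prompt coordinates are placeholders):
# mask, score = sam_mask_from_projected_box(rgb, [50, 40, 200, 180],
#                                           fg_points=[[120, 110]], bg_points=[[10, 10]])
```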

Few shot font generation via transferring similarity guided global style and quantization local style

  • paper_url: http://arxiv.org/abs/2309.00827
  • repo_url: https://github.com/awei669/VQ-Font
  • paper_authors: Wei Pan, Anna Zhu, Xinyu Zhou, Brian Kenji Iwana, Shilin Li
  • for: To achieve automatic few-shot font generation (AFFG), producing new fonts from only a few glyph references and reducing the labor cost of manual font design.
  • methods: Aggregates character-similarity-guided global style features with quantized component-level representations, and transfers the styles of reference glyphs through a cross-attention-based style transfer module, without manually defined glyph components such as strokes or radicals.
  • results: Experiments show the method obtains a complete set of component-level style representations while controlling global glyph characteristics, and it is more effective and generalizable than other state-of-the-art methods across different writing systems. Code: https://github.com/awei669/VQ-Font.
    Abstract Automatic few-shot font generation (AFFG), aiming at generating new fonts with only a few glyph references, reduces the labor cost of manually designing fonts. However, the traditional AFFG paradigm of style-content disentanglement cannot capture the diverse local details of different fonts. So, many component-based approaches are proposed to tackle this problem. The issue with component-based approaches is that they usually require special pre-defined glyph components, e.g., strokes and radicals, which is infeasible for AFFG of different languages. In this paper, we present a novel font generation approach by aggregating styles from character similarity-guided global features and stylized component-level representations. We calculate the similarity scores of the target character and the referenced samples by measuring the distance along the corresponding channels from the content features, and assigning them as the weights for aggregating the global style features. To better capture the local styles, a cross-attention-based style transfer module is adopted to transfer the styles of reference glyphs to the components, where the components are self-learned discrete latent codes through vector quantization without manual definition. With these designs, our AFFG method could obtain a complete set of component-level style representations, and also control the global glyph characteristics. The experimental results reflect the effectiveness and generalization of the proposed method on different linguistic scripts, and also show its superiority when compared with other state-of-the-art methods. The source code can be found at https://github.com/awei669/VQ-Font.
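A minimal sketch of similarity-guided global style aggregation: reference style vectors are weighted by the content similarity between the target character and each reference glyph. Tensor shapes, cosine similarity, and the softmax weighting are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def aggregate_global_style(target_content, ref_contents, ref_styles):
    """Weight reference style vectors by the content similarity between the
    target character and each reference glyph, then aggregate. A minimal
    sketch of similarity-guided global style aggregation."""
    # target_content: (C,), ref_contents: (N, C), ref_styles: (N, D)
    sims = F.cosine_similarity(target_content.unsqueeze(0), ref_contents, dim=1)
    weights = F.softmax(sims, dim=0)                        # (N,) similarity weights
    return (weights.unsqueeze(1) * ref_styles).sum(dim=0)   # (D,) aggregated style

# Toy usage: 3 reference glyphs with 128-d content and 64-d style features.
tgt = torch.randn(128)
refs_c = torch.randn(3, 128)
refs_s = torch.randn(3, 64)
print(aggregate_global_style(tgt, refs_c, refs_s).shape)  # torch.Size([64])
```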

Soil Image Segmentation Based on Mask R-CNN

  • paper_url: http://arxiv.org/abs/2309.00817
  • repo_url: https://github.com/YidaMyth/Mask-RCNN_QtGui
  • paper_authors: Yida Chen, Kang Liu, Yi Xin, Xinru Zhao
  • for: To develop a machine-vision method for real-time segmentation and detection of soil images captured in natural field environments.
  • methods: Uses the deep-learning Mask R-CNN model for soil image instance segmentation: a soil image dataset is built from collected images, soil regions are annotated with the EISeg tool, and the model is trained with GPU acceleration and evaluated on the training and validation sets.
  • results: The trained model segments soil images accurately and generalizes to images collected in different environments, reaching a training loss of 0.1999, a validation mAP (IoU=0.5) of 0.8804, and a per-image segmentation time of only 0.06 s.
    Abstract The complex background in soil images collected in natural field environments affects subsequent machine-vision-based soil image recognition. Segmenting the central soil area from the image eliminates the influence of this complex background and is an important preprocessing step for subsequent recognition. We apply a deep learning method to soil image segmentation for the first time, selecting the Mask R-CNN model to localize and segment soil regions. A soil image dataset was constructed from the collected images, soil areas were annotated with the EISeg tool, and the annotations were saved; a Mask R-CNN soil image instance segmentation model was then trained. The trained model obtains accurate segmentation results and performs well on soil images collected in different environments; it reaches a training loss of 0.1999 and a validation segmentation mAP (IoU=0.5) of 0.8804, and takes only 0.06 s per image with GPU acceleration, which meets the requirements for real-time segmentation and detection of soil images in the field under natural conditions. Our code is available at https://github.com/YidaMyth.
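A short sketch of the inference side of such a pipeline using torchvision's Mask R-CNN. The paper fine-tunes on its own EISeg-annotated soil dataset, whereas this example simply loads COCO-pretrained weights to show the model call and mask filtering.

```python
import torch
import torchvision

def segment_image(image_tensor, score_thr=0.5):
    """Run a COCO-pretrained torchvision Mask R-CNN on one image and keep
    confident instance masks. The paper fine-tunes on a custom soil dataset;
    the pretrained weights here only illustrate the inference pipeline."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()
    with torch.no_grad():
        out = model([image_tensor])[0]  # expects CxHxW floats in [0, 1]
    keep = out["scores"] >= score_thr
    return out["masks"][keep], out["labels"][keep], out["scores"][keep]

# Toy usage with a random tensor standing in for a field soil photo.
masks, labels, scores = segment_image(torch.rand(3, 480, 640))
print(masks.shape)  # (num_instances, 1, H, W) soft masks
```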

Self-Supervised Video Transformers for Isolated Sign Language Recognition

  • paper_url: http://arxiv.org/abs/2309.02450
  • repo_url: None
  • paper_authors: Marcelo Sandoval-Castaneda, Yanhong Li, Diane Brentari, Karen Livescu, Gregory Shakhnarovich
  • for: An in-depth analysis of self-supervised learning methods for isolated sign language recognition (ISLR).
  • methods: Considers four recently introduced transformer-based video self-supervision approaches and four pre-training data regimes, studying all combinations on the WLASL2000 dataset.
  • results: Finds that MaskFeat outperforms pose-based and supervised video models, reaching 79.02% top-1 accuracy on gloss-based WLASL2000; linear probing on diverse phonological features further analyzes how well these models represent ASL signs, underscoring the importance of architecture and pre-training task choices for ISLR.
    Abstract This paper presents an in-depth analysis of various self-supervision methods for isolated sign language recognition (ISLR). We consider four recently introduced transformer-based approaches to self-supervised learning from videos, and four pre-training data regimes, and study all the combinations on the WLASL2000 dataset. Our findings reveal that MaskFeat achieves performance superior to pose-based and supervised video models, with a top-1 accuracy of 79.02% on gloss-based WLASL2000. Furthermore, we analyze these models' ability to produce representations of ASL signs using linear probing on diverse phonological features. This study underscores the value of architecture and pre-training task choices in ISLR. Specifically, our results on WLASL2000 highlight the power of masked reconstruction pre-training, and our linear probing results demonstrate the importance of hierarchical vision transformers for sign language representation.
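The linear-probing protocol mentioned above can be illustrated with a few lines of scikit-learn: frozen video features are fed to a linear classifier predicting a phonological attribute. The synthetic features and the five-class attribute below are placeholders, not WLASL2000 data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe(features, labels, seed=0):
    """Linear probing: freeze the self-supervised video features and fit a
    linear classifier on a phonological attribute (e.g. handshape). The
    synthetic data below only illustrates the protocol."""
    tr_x, te_x, tr_y, te_y = train_test_split(features, labels, test_size=0.2,
                                              random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    return clf.score(te_x, te_y)

# Synthetic stand-ins: 1000 clips, 768-d frozen features, 5 attribute classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 768))
attrs = rng.integers(0, 5, size=1000)
print(f"probe accuracy: {linear_probe(feats, attrs):.3f}")  # near chance on random data
```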

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

  • paper_url: http://arxiv.org/abs/2309.00796
  • repo_url: https://github.com/ZcyMonkey/AttT2M
  • paper_authors: Chongyang Zhong, Lei Hu, Zihao Zhang, Shihong Xia
  • for: To generate natural and diverse 3D human motion from textual descriptions.
  • methods: A two-stage method with a multi-perspective attention mechanism: body-part attention introduces a body-part spatio-temporal encoder into VQ-VAE to learn a more expressive discrete latent space, while global-local motion-text attention learns sentence-level and word-level cross-modal relationships between text and motion.
  • results: On HumanML3D and KIT-ML the method outperforms state-of-the-art approaches in nearly all qualitative and quantitative evaluations, and supports fine-grained synthesis and action-to-motion generation.
    Abstract Generating 3D human motion based on textual descriptions has been a research focus in recent years. It requires the generated motion to be diverse, natural, and conform to the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty in learning the cross-modal relationship between text and motion, text-driven motion generation is still a challenging problem. To address these issues, we propose AttT2M, a two-stage method with a multi-perspective attention mechanism: body-part attention and global-local motion-text attention. The former focuses on the motion embedding perspective, introducing a body-part spatio-temporal encoder into VQ-VAE to learn a more expressive discrete latent space. The latter works from the cross-modal perspective and is used to learn the sentence-level and word-level motion-text cross-modal relationship. The text-driven motion is finally generated with a generative transformer. Extensive experiments conducted on HumanML3D and KIT-ML demonstrate that our method outperforms current state-of-the-art works in terms of qualitative and quantitative evaluation, and achieves fine-grained synthesis and action2motion. Our code is available at https://github.com/ZcyMonkey/AttT2M.
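A minimal sketch of a word-level motion-text cross-attention block, where motion tokens attend to word embeddings via standard multi-head attention. The dimensions, the residual-plus-norm layout, and the module name are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MotionTextCrossAttention(nn.Module):
    """Word-level motion-text cross attention: motion tokens attend to word
    embeddings. A minimal sketch of the cross-modal block described in the
    abstract; dimensions and layer layout are illustrative assumptions."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_tokens, word_tokens):
        # motion_tokens: (B, T, dim), word_tokens: (B, L, dim)
        attended, _ = self.attn(motion_tokens, word_tokens, word_tokens)
        return self.norm(motion_tokens + attended)

block = MotionTextCrossAttention()
motion = torch.randn(2, 50, 256)   # 50 motion tokens from a VQ-VAE encoder
words = torch.randn(2, 12, 256)    # 12 word embeddings from a text encoder
print(block(motion, words).shape)  # torch.Size([2, 50, 256])
```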

FastPoseGait: A Toolbox and Benchmark for Efficient Pose-based Gait Recognition

  • paper_url: http://arxiv.org/abs/2309.00794
  • repo_url: https://github.com/bnu-ivc/fastposegait
  • paper_authors: Shibei Meng, Yang Fu, Saihui Hou, Chunshui Cao, Xu Liu, Yongzhen Huang
  • for: To provide an open-source toolbox for pose-based gait recognition so that researchers can get started in this area quickly.
  • methods: The toolbox integrates several state-of-the-art pose-based gait recognition algorithms and recent advances under a unified framework, and ships with numerous pre-trained models and detailed benchmark results that serve as a valuable reference for future research.
  • results: Delivers a highly modular toolbox for pose-based gait recognition research, together with pre-trained models and benchmarks that offer a baseline for comparison in follow-up studies.
    Abstract We present FastPoseGait, an open-source toolbox for pose-based gait recognition based on PyTorch. Our toolbox supports a set of cutting-edge pose-based gait recognition algorithms and a variety of related benchmarks. Unlike other pose-based projects that focus on a single algorithm, FastPoseGait integrates several state-of-the-art (SOTA) algorithms under a unified framework, incorporating both the latest advancements and best practices to ease the comparison of effectiveness and efficiency. In addition, to promote future research on pose-based gait recognition, we provide numerous pre-trained models and detailed benchmark results, which offer valuable insights and serve as a reference for further investigations. By leveraging the highly modular structure and diverse methods offered by FastPoseGait, researchers can quickly delve into pose-based gait recognition and promote development in the field. In this paper, we outline various features of this toolbox, aiming that our toolbox and benchmarks can further foster collaboration, facilitate reproducibility, and encourage the development of innovative algorithms for pose-based gait recognition. FastPoseGait is available at https://github.com/BNU-IVC/FastPoseGait and is actively maintained. We will continue updating this report as we add new features.
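As a generic illustration of how pose-based gait recognition is typically benchmarked (independent of this toolbox's own API), the sketch below computes rank-1 identification accuracy from gallery and probe embeddings; the data and the cosine matching rule are placeholders, not FastPoseGait's evaluation code.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Rank-1 identification accuracy from embeddings: each probe is matched
    to the closest gallery embedding by cosine similarity. A generic evaluation
    sketch, not FastPoseGait's own evaluation code."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    nearest = np.argmax(p @ g.T, axis=1)            # index of most similar gallery item
    return float(np.mean(gallery_ids[nearest] == probe_ids))

rng = np.random.default_rng(0)
gal = rng.normal(size=(100, 256))
gal_ids = np.arange(100)
probe = gal + 0.1 * rng.normal(size=gal.shape)      # noisy views of the same subjects
print(rank1_accuracy(gal, gal_ids, probe, np.arange(100)))  # close to 1.0
```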

Towards High-Frequency Tracking and Fast Edge-Aware Optimization

  • paper_url: http://arxiv.org/abs/2309.00777
  • repo_url: None
  • paper_authors: Akash Bapat
  • for: To advance the state of the art in AR/VR tracking systems by increasing the tracking frequency by orders of magnitude.
  • methods: Proposes a prototype that uses multiple commodity cameras, exploiting rolling shutter and radial distortion, characteristics traditionally treated as flaws, to achieve high-frequency tracking.
  • results: Experiments show accurate tracking across various degrees of motion, compared comprehensively against state-of-the-art systems.
    Abstract This dissertation advances the state of the art for AR/VR tracking systems by increasing the tracking frequency by orders of magnitude and proposes an efficient algorithm for the problem of edge-aware optimization. AR/VR is a natural way of interacting with computers, where the physical and digital worlds coexist. We are on the cusp of a radical change in how humans perform and interact with computing. Humans are sensitive to small misalignments between the real and the virtual world, and tracking at kilo-Hertz frequencies becomes essential. Current vision-based systems fall short, as their tracking frequency is implicitly limited by the frame-rate of the camera. This thesis presents a prototype system which can track at orders of magnitude higher than the state-of-the-art methods using multiple commodity cameras. The proposed system exploits characteristics of the camera traditionally considered as flaws, namely rolling shutter and radial distortion. The experimental evaluation shows the effectiveness of the method for various degrees of motion. Furthermore, edge-aware optimization is an indispensable tool in the computer vision arsenal for accurate filtering of depth-data and image-based rendering, which is increasingly being used for content creation and geometry processing for AR/VR. As applications increasingly demand higher resolution and speed, there exists a need to develop methods that scale accordingly. This dissertation proposes such an edge-aware optimization framework which is efficient, accurate, and algorithmically scales well, all of which are much desirable traits not found jointly in the state of the art. The experiments show the effectiveness of the framework in a multitude of computer vision tasks such as computational photography and stereo.
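To give a concrete flavor of edge-aware filtering of depth data, here is a plain joint bilateral filter guided by an intensity image; it is a textbook baseline offered only for illustration, not the efficient optimization framework the dissertation proposes.

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Edge-aware depth smoothing guided by an intensity image: weights fall
    off with spatial distance and with guide-image differences, so edges in
    the guide are preserved. A baseline illustration of edge-aware filtering,
    not the dissertation's optimization framework."""
    h, w = depth.shape
    out = np.zeros_like(depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))
    pad_d = np.pad(depth, radius, mode="edge")
    pad_g = np.pad(guide, radius, mode="edge")
    for y in range(h):
        for x in range(w):
            patch_d = pad_d[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            patch_g = pad_g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            range_w = np.exp(-((patch_g - guide[y, x])**2) / (2 * sigma_r**2))
            wgt = spatial * range_w
            out[y, x] = (wgt * patch_d).sum() / wgt.sum()
    return out

# Toy usage with random arrays standing in for a depth map and its guide image.
depth = np.random.rand(32, 32)
guide = np.random.rand(32, 32)
print(joint_bilateral_filter(depth, guide).shape)
```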

Full Reference Video Quality Assessment for Machine Learning-Based Video Codecs

  • paper_url: http://arxiv.org/abs/2309.00769
  • repo_url: None
  • paper_authors: Abrar Majeedi, Babak Naderi, Yasaman Hosseinkashi, Juhee Cho, Ruben Alvarez Martinez, Ross Cutler
  • for: To provide an accurate evaluation metric for machine learning (ML)-based video codecs, since existing metrics do not correlate well with subjective opinion when applied to ML codecs.
  • methods: Builds a new dataset of ML codec videos with accurate quality labels and proposes a full reference video quality assessment (FRVQA) model that achieves a Pearson Correlation Coefficient (PCC) of 0.99 and a Spearman's Rank Correlation Coefficient (SRCC) of 0.99.
  • results: The proposed metric correlates highly with subjective quality; the dataset and FRVQA model are open-sourced to accelerate research in ML video codecs and to allow others to further improve the model.
    Abstract Machine learning-based video codecs have made significant progress in the past few years. A critical area in the development of ML-based video codecs is an accurate evaluation metric that does not require an expensive and slow subjective test. We show that existing evaluation metrics that were designed and trained on DSP-based video codecs are not highly correlated to subjective opinion when used with ML video codecs due to the video artifacts being quite different between ML and video codecs. We provide a new dataset of ML video codec videos that have been accurately labeled for quality. We also propose a new full reference video quality assessment (FRVQA) model that achieves a Pearson Correlation Coefficient (PCC) of 0.99 and a Spearman's Rank Correlation Coefficient (SRCC) of 0.99 at the model level. We make the dataset and FRVQA model open source to help accelerate research in ML video codecs, and so that others can further improve the FRVQA model.
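The two correlation figures quoted above can be reproduced for any metric with SciPy; the sketch below computes PCC and SRCC between a metric's predictions and subjective mean opinion scores, using synthetic placeholder data rather than the paper's dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_with_subjective(predicted_scores, mos):
    """Compare a video quality metric against subjective mean opinion scores
    using the two correlations reported in the abstract. The data below are
    synthetic placeholders, not the paper's dataset."""
    pcc, _ = pearsonr(predicted_scores, mos)
    srcc, _ = spearmanr(predicted_scores, mos)
    return pcc, srcc

rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, size=200)                 # subjective scores on a 1-5 scale
pred = mos + rng.normal(scale=0.3, size=200)      # a hypothetical metric's output
pcc, srcc = correlation_with_subjective(pred, mos)
print(f"PCC={pcc:.3f}, SRCC={srcc:.3f}")
```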