cs.CV - 2023-10-08

Progressive Neural Compression for Adaptive Image Offloading under Timing Constraints

  • paper_url: http://arxiv.org/abs/2310.05306
  • repo_url: https://github.com/rickywrq/Progressive-Neural-Compression
  • paper_authors: Ruiqi Wang, Hanyang Liu, Jiaming Qiu, Moran Xu, Roch Guerin, Chenyang Lu
  • for: This paper proposes an adaptive progressive neural compression approach to maximize the inference performance of ML applications running on edge servers when images are offloaded over networks with limited and variable bandwidth.
  • methods: The paper develops progressive neural compression (PNC), training a multi-objective rateless autoencoder via stochastic taildrop so that the compressed features are ordered by their importance to inference and can be offloaded according to the available bandwidth.
  • results: Compared with state-of-the-art neural compression approaches and traditional compression methods, PNC improves inference performance under timing constraints and adapts to variable network bandwidth.
    Abstract IoT devices are increasingly the source of data for machine learning (ML) applications running on edge servers. Data transmissions from devices to servers are often over local wireless networks whose bandwidth is not just limited but, more importantly, variable. Furthermore, in cyber-physical systems interacting with the physical environment, image offloading is also commonly subject to timing constraints. It is, therefore, important to develop an adaptive approach that maximizes the inference performance of ML applications under timing constraints and the resource constraints of IoT devices. In this paper, we use image classification as our target application and propose progressive neural compression (PNC) as an efficient solution to this problem. Although neural compression has been used to compress images for different ML applications, existing solutions often produce fixed-size outputs that are unsuitable for timing-constrained offloading over variable bandwidth. To address this limitation, we train a multi-objective rateless autoencoder that optimizes for multiple compression rates via stochastic taildrop to create a compression solution that produces features ordered according to their importance to inference performance. Features are then transmitted in that order based on available bandwidth, with classification ultimately performed using the (sub)set of features received by the deadline. We demonstrate the benefits of PNC over state-of-the-art neural compression approaches and traditional compression methods on a testbed comprising an IoT device and an edge server connected over a wireless network with varying bandwidth.
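As a rough illustration of the stochastic-taildrop idea in the abstract (a minimal PyTorch sketch with hypothetical dimensions, not the authors' implementation), the toy encoder below zeroes out a random suffix of its ordered latent features during training, so that earlier features learn to carry the most inference-relevant information:

```python
import torch
import torch.nn as nn

class RatelessEncoder(nn.Module):
    """Toy encoder whose ordered latent features are trained with stochastic taildrop."""
    def __init__(self, in_dim=3 * 32 * 32, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, x, keep=None):
        z = self.enc(x)                                   # (B, latent_dim), ordered features
        if self.training and keep is None:
            # Stochastic taildrop: keep a random prefix, drop the tail.
            keep = torch.randint(1, z.size(1) + 1, (1,)).item()
        if keep is not None:
            mask = torch.zeros_like(z)
            mask[:, :keep] = 1.0
            z = z * mask                                  # features past `keep` are zeroed out
        return z
```

At offload time, the number of transmitted features would be chosen from the available bandwidth and deadline, and the server-side classifier would operate on whatever prefix of features arrives in time.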

GestSync: Determining who is speaking without a talking head

  • paper_url: http://arxiv.org/abs/2310.05304
  • repo_url: https://github.com/Sindhu-Hegde/gestsync
  • paper_authors: Sindhu B Hegde, Andrew Zisserman
  • For: The paper is written for determining if a person's gestures are correlated with their speech or not, and exploring the use of self-supervised learning for this task.
  • Methods: The paper introduces a dual-encoder model for the task of Gesture-Sync, and compares the performance of different input representations, including RGB frames, keypoint images, and keypoint vectors.
  • Results: The paper shows that the model can be trained using self-supervised learning alone, and evaluates its performance on the LRS3 dataset. Additionally, the paper demonstrates applications of Gesture-Sync for audio-visual synchronisation and for determining who is the speaker in a crowd without seeing their faces.
    Abstract In this paper we introduce a new synchronisation task, Gesture-Sync: determining if a person's gestures are correlated with their speech or not. In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement than there is between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and in determining who is the speaker in a crowd, without seeing their faces. The code, datasets and pre-trained models can be found at: \url{https://www.robots.ox.ac.uk/~vgg/research/gestsync}.
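Below is a minimal sketch of a dual-encoder synchronisation model of the kind the abstract describes (input dimensions, pooling, and the contrastive objective are illustrative assumptions, not the paper's exact training setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderSync(nn.Module):
    """Toy dual-encoder: audio and gesture (keypoint-vector) clips are mapped into a shared
    space, and matching (in-sync) pairs are treated as positives in a contrastive loss."""
    def __init__(self, audio_dim=80, pose_dim=34, dim=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pose_enc = nn.Sequential(nn.Linear(pose_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, audio, pose):                       # audio: (B, T, 80), pose: (B, T, 34)
        a = F.normalize(self.audio_enc(audio).mean(dim=1), dim=-1)
        v = F.normalize(self.pose_enc(pose).mean(dim=1), dim=-1)
        logits = a @ v.T / 0.07                           # similarity between all audio/gesture pairs
        labels = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, labels)            # diagonal = in-sync pairs
```

In a synchronisation setting, the negatives would typically include temporally shifted versions of the same clip, so the encoders learn timing rather than speaker identity.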

Image Compression and Decompression Framework Based on Latent Diffusion Model for Breast Mammography

  • paper_url: http://arxiv.org/abs/2310.05299
  • repo_url: https://github.com/neogeoss/EMBED_Mammo_Models
  • paper_authors: InChan Hwang, MinJae Woo
  • for: This research develops a novel framework for the compression and decompression of medical images utilizing the Latent Diffusion Model (LDM), which represents an advancement over the denoising diffusion probabilistic model (DDPM), with the potential to yield superior image quality while requiring fewer computational resources during decompression.
  • methods: The study explores using the LDM together with Torchvision for image upscaling on medical image data, serving as an alternative to traditional image compression and decompression algorithms.
  • results: Experiments show that the approach surpasses a conventional file compression algorithm, and convolutional neural network (CNN) models trained with decompressed files perform comparably to those trained with the original image files. The approach also significantly reduces dataset size for distribution and storage on medical devices, with implications for noise reduction in lossy compression and as a substitute for complex wavelet-based lossless algorithms.
    Abstract This research presents a novel framework for the compression and decompression of medical images utilizing the Latent Diffusion Model (LDM). The LDM represents advancement over the denoising diffusion probabilistic model (DDPM) with a potential to yield superior image quality while requiring fewer computational resources in the image decompression process. A possible application of LDM and Torchvision for image upscaling has been explored using medical image data, serving as an alternative to traditional image compression and decompression algorithms. The experimental outcomes demonstrate that this approach surpasses a conventional file compression algorithm, and convolutional neural network (CNN) models trained with decompressed files perform comparably to those trained with original image files. This approach also significantly reduces dataset size so that it can be distributed with a smaller size, and medical images take up much less space in medical devices. The research implications extend to noise reduction in lossy compression algorithms and substitute for complex wavelet-based lossless algorithms.

MSight: An Edge-Cloud Infrastructure-based Perception System for Connected Automated Vehicles

  • paper_url: http://arxiv.org/abs/2310.05290
  • repo_url: None
  • paper_authors: Rusheng Zhang, Depu Meng, Shengyin Shen, Zhengxia Zou, Houqiang Li, Henry X. Liu
  • for: This paper examines infrastructure-based roadside perception for Connected Automated Vehicle (CAV) applications.
  • methods: The paper presents MSight, a roadside perception system that provides real-time vehicle detection, localization, tracking, and short-term trajectory prediction.
  • results: Evaluations show that MSight upholds lane-level accuracy with minimal latency, revealing a range of potential applications to enhance CAV safety and efficiency; the system currently operates 24/7 at a two-lane roundabout in Ann Arbor, Michigan.
    Abstract As vehicular communication and networking technologies continue to advance, infrastructure-based roadside perception emerges as a pivotal tool for connected automated vehicle (CAV) applications. Due to their elevated positioning, roadside sensors, including cameras and lidars, often enjoy unobstructed views with diminished object occlusion. This provides them a distinct advantage over onboard perception, enabling more robust and accurate detection of road objects. This paper presents MSight, a cutting-edge roadside perception system specifically designed for CAVs. MSight offers real-time vehicle detection, localization, tracking, and short-term trajectory prediction. Evaluations underscore the system's capability to uphold lane-level accuracy with minimal latency, revealing a range of potential applications to enhance CAV safety and efficiency. Presently, MSight operates 24/7 at a two-lane roundabout in the City of Ann Arbor, Michigan.

The Emergence of Reproducibility and Consistency in Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.05264
  • repo_url: None
  • paper_authors: Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Liyue Shen, Qing Qu
  • for: The paper investigates a distinct and prevalent phenomenon in diffusion models, referred to as "consistent model reproducibility".
  • methods: The authors conduct extensive experiments and analysis across different model architectures and training procedures, starting from the same initial noise input and sampling with a deterministic solver.
  • results: The study finds that this reproducibility manifests in two distinct training regimes, a "memorization regime" and a "generalization regime", and that the property generalizes to many variants of diffusion models, including conditional diffusion models, diffusion models for solving inverse problems, and fine-tuned diffusion models.
    Abstract Recently, diffusion models have emerged as powerful deep generative models, showcasing cutting-edge performance across various applications such as image generation, solving inverse problems, and text-to-image synthesis. These models generate new data (e.g., images) by transforming random noise inputs through a reverse diffusion process. In this work, we uncover a distinct and prevalent phenomenon within diffusion models in contrast to most other generative models, which we refer to as ``consistent model reproducibility''. To elaborate, our extensive experiments have consistently shown that when starting with the same initial noise input and sampling with a deterministic solver, diffusion models tend to produce nearly identical output content. This consistency holds true regardless of the choices of model architectures and training procedures. Additionally, our research has unveiled that this exceptional model reproducibility manifests in two distinct training regimes: (i) ``memorization regime,'' characterized by a significantly overparameterized model which attains reproducibility mainly by memorizing the training data; (ii) ``generalization regime,'' in which the model is trained on an extensive dataset, and its reproducibility emerges with the model's generalization capabilities. Our analysis provides theoretical justification for the model reproducibility in ``memorization regime''. Moreover, our research reveals that this valuable property generalizes to many variants of diffusion models, including conditional diffusion models, diffusion models for solving inverse problems, and fine-tuned diffusion models. A deeper understanding of this phenomenon has the potential to yield more interpretable and controllable data generative processes based on diffusion models.
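A minimal sketch of the deterministic-sampling setup the abstract refers to; `eps_model(x, t)` and `alphas_cumprod` are assumed interfaces for a trained noise predictor and its noise schedule, and the update is a plain eta = 0 DDIM step rather than the paper's code:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, x_T, alphas_cumprod):
    """Deterministic DDIM sampling (eta = 0): no noise is injected after x_T, so the output
    is a deterministic function of the initial noise. alphas_cumprod is a 1-D tensor."""
    x = x_T
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_model(x, t)                                  # predicted noise at step t
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps     # deterministic update
    return x

# The reported phenomenon: two independently trained eps_models, started from the same x_T
# and sampled with this deterministic solver, produce nearly identical content.
```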

Structure-Preserving Instance Segmentation via Skeleton-Aware Distance Transform

  • paper_url: http://arxiv.org/abs/2310.05262
  • repo_url: None
  • paper_authors: Zudi Lin, Donglai Wei, Aarush Gupta, Xingyu Liu, Deqing Sun, Hanspeter Pfister
  • for: Histopathology image segmentation
  • methods: Skeleton-aware distance transform (SDT) combining object skeleton and distance transform
  • results: State-of-the-art performance in histopathology image segmentation
    Abstract Objects with complex structures pose significant challenges to existing instance segmentation methods that rely on boundary or affinity maps, which are vulnerable to small errors around contacting pixels that cause noticeable connectivity change. While the distance transform (DT) makes instance interiors and boundaries more distinguishable, it tends to overlook the intra-object connectivity for instances with varying width and result in over-segmentation. To address these challenges, we propose a skeleton-aware distance transform (SDT) that combines the merits of object skeleton in preserving connectivity and DT in modeling geometric arrangement to represent instances with arbitrary structures. Comprehensive experiments on histopathology image segmentation demonstrate that SDT achieves state-of-the-art performance.
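As a toy illustration of combining an object skeleton with the distance transform (the particular normalization below is an assumption for illustration, not the paper's exact SDT formulation):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize

def skeleton_aware_distance(mask, eps=1e-6):
    """Toy skeleton-aware transform for a binary instance mask: the boundary distance is
    normalized by an estimate of the local half-width (distance to boundary plus distance
    to skeleton), so thin and thick parts of the object get comparable values."""
    mask = mask.astype(bool)
    dt_boundary = distance_transform_edt(mask)         # distance to the background
    skeleton = skeletonize(mask)
    dt_skeleton = distance_transform_edt(~skeleton)    # distance to the object skeleton
    half_width = dt_boundary + dt_skeleton             # rough local object half-width
    return np.where(mask, dt_boundary / (half_width + eps), 0.0)  # ~1 on skeleton, 0 at boundary
```

The value stays near 1 along the skeleton and falls to 0 at the boundary regardless of object width, which is the property that helps avoid over-segmenting instances with varying width.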

SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval

  • paper_url: http://arxiv.org/abs/2310.05241
  • repo_url: None
  • paper_authors: Sunjae Yoon, Gwanhyeong Koo, Dahyun Kim, Chang D. Yoo
  • for: This work targets weakly-supervised video moment retrieval (VMR), which localizes the moments in a video that correspond to a given language query.
  • methods: The paper proposes a Scene Complexity Aware Network (SCANet) that measures the scene complexity of multiple scenes in each video and generates adaptive proposals in response to the variable complexity of scenes.
  • results: Experiments on three retrieval benchmarks (Charades-STA, ActivityNet, TVR) achieve state-of-the-art performance and demonstrate the effectiveness of incorporating scene complexity into VMR systems.
    Abstract Video moment retrieval aims to localize moments in video corresponding to a given language query. To avoid the expensive cost of annotating the temporal moments, weakly-supervised VMR (wsVMR) systems have been studied. For such systems, generating a number of proposals as moment candidates and then selecting the most appropriate proposal has been a popular approach. These proposals are assumed to contain many distinguishable scenes in a video as candidates. However, existing proposals of wsVMR systems do not respect the varying numbers of scenes in each video, where the proposals are heuristically determined irrespective of the video. We argue that the retrieval system should be able to counter the complexities caused by varying numbers of scenes in each video. To this end, we present a novel concept of a retrieval system referred to as Scene Complexity Aware Network (SCANet), which measures the `scene complexity' of multiple scenes in each video and generates adaptive proposals responding to variable complexities of scenes in each video. Experimental results on three retrieval benchmarks (i.e., Charades-STA, ActivityNet, TVR) achieve state-of-the-art performances and demonstrate the effectiveness of incorporating the scene complexity.

Latent Diffusion Model for Medical Image Standardization and Enhancement

  • paper_url: http://arxiv.org/abs/2310.05237
  • repo_url: None
  • paper_authors: Md Selim, Jie Zhang, Faraneh Fathi, Michael A. Brooks, Ge Wang, Guoqiang Yu, Jin Chen
  • for: This paper proposes DiffusionCT, a new generative model for standardizing computed tomography (CT) images so that texture features remain consistent across scanners and acquisition protocols, improving the comparability of subsequent medical research.
  • methods: DiffusionCT is a score-based DDPM that operates in the latent space, combining a U-Net-based encoder-decoder with a DDPM integrated at the bottleneck; the encoder-decoder is trained first, and the latent DDPM is then trained with the encoder-decoder parameters fixed, transforming disparate non-standard distributions into a standardized form.
  • results: Experiments on patient CT images show notable improvements in image standardization, and the model also significantly reduces noise in SPAD images, further validating its effectiveness.
    Abstract Computed tomography (CT) serves as an effective tool for lung cancer screening, diagnosis, treatment, and prognosis, providing a rich source of features to quantify temporal and spatial tumor changes. Nonetheless, the diversity of CT scanners and customized acquisition protocols can introduce significant inconsistencies in texture features, even when assessing the same patient. This variability poses a fundamental challenge for subsequent research that relies on consistent image features. Existing CT image standardization models predominantly utilize GAN-based supervised or semi-supervised learning, but their performance remains limited. We present DiffusionCT, an innovative score-based DDPM model that operates in the latent space to transform disparate non-standard distributions into a standardized form. The architecture comprises a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the bottleneck position. First, the encoder-decoder is trained independently, without embedding DDPM, to capture the latent representation of the input data. Second, the latent DDPM model is trained while keeping the encoder-decoder parameters fixed. Finally, the decoder uses the transformed latent representation to generate a standardized CT image, providing a more consistent basis for downstream analysis. Empirical tests on patient CT images indicate notable improvements in image standardization using DiffusionCT. Additionally, the model significantly reduces image noise in SPAD images, further validating the effectiveness of DiffusionCT for advanced imaging tasks.

Enhancing Cross-Dataset Performance of Distracted Driving Detection With Score-Softmax Classifier

  • paper_url: http://arxiv.org/abs/2310.05202
  • repo_url: https://github.com/congduan-hnu/ssoftmax
  • paper_authors: Cong Duan, Zixuan Liu, Jiahao Xia, Minghai Zhang, Jiacai Liao, Libo Cao
  • for: This work targets real-time monitoring of in-vehicle drivers to predict distraction, fatigue, and potential hazards.
  • methods: The authors introduce a Score-Softmax classifier to address cross-dataset shortcut learning, designing a two-dimensional supervisory matrix based on marginal Gaussian distributions motivated by human rating patterns, together with a multi-channel information fusion method.
  • results: Cross-dataset experiments on the SFD, AUCDD-V1, and 100-Driver datasets show that Score-Softmax improves cross-dataset performance without modifying the model architecture, and the proposed fusion approach outperforms traditional methods.
    Abstract Deep neural networks enable real-time monitoring of in-vehicle driver, facilitating the timely prediction of distractions, fatigue, and potential hazards. This technology is now integral to intelligent transportation systems. Recent research has exposed unreliable cross-dataset end-to-end driver behavior recognition due to overfitting, often referred to as ``shortcut learning", resulting from limited data samples. In this paper, we introduce the Score-Softmax classifier, which addresses this issue by enhancing inter-class independence and Intra-class uncertainty. Motivated by human rating patterns, we designed a two-dimensional supervisory matrix based on marginal Gaussian distributions to train the classifier. Gaussian distributions help amplify intra-class uncertainty while ensuring the Score-Softmax classifier learns accurate knowledge. Furthermore, leveraging the summation of independent Gaussian distributed random variables, we introduced a multi-channel information fusion method. This strategy effectively resolves the multi-information fusion challenge for the Score-Softmax classifier. Concurrently, we substantiate the necessity of transfer learning and multi-dataset combination. We conducted cross-dataset experiments using the SFD, AUCDD-V1, and 100-Driver datasets, demonstrating that Score-Softmax improves cross-dataset performance without modifying the model architecture. This provides a new approach for enhancing neural network generalization. Additionally, our information fusion approach outperforms traditional methods.
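One speculative way to picture the Gaussian-based two-dimensional supervisory matrix mentioned above (bin count, means, and variance are hypothetical choices, not the paper's values): each class receives a soft score distribution over bins instead of a hard 0/1 label, amplifying intra-class uncertainty while keeping classes independent.

```python
import torch

def gaussian_supervisory_matrix(num_classes, num_bins, true_class,
                                mu_pos=0.9, mu_neg=0.1, sigma=0.1):
    """Hypothetical soft supervision: one score distribution per class, discretized from a
    marginal Gaussian that peaks high for the true class and low for the others."""
    bins = torch.linspace(0.0, 1.0, num_bins)
    target = torch.empty(num_classes, num_bins)
    for c in range(num_classes):
        mu = mu_pos if c == true_class else mu_neg
        logits = -((bins - mu) ** 2) / (2 * sigma ** 2)
        target[c] = torch.softmax(logits, dim=0)       # normalized per-class score distribution
    return target                                      # (num_classes, num_bins), replaces one-hot
```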

Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models

  • paper_url: http://arxiv.org/abs/2310.05193
  • repo_url: None
  • paper_authors: Chenzhuang Du, Yue Zhao, Chonghua Liao, Jiacheng You, Jie Fu, Hang Zhao
  • for: This work investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning.
  • methods: The proposed Multi-Modal Low-Rank Adaptation learning (MMLoRA) freezes the weights of the uni-modal fine-tuned models, adds extra trainable rank decomposition matrices to them, and then performs multi-modal joint training to enhance adaptation between modalities.
  • results: The method boosts overall multi-modal performance, demonstrated on three dataset categories: audio-visual (AVE, Kinetics-Sound, CREMA-D), vision-language (MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
    Abstract This paper investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning. Even when fine-tuned with only uni-modal data, these models can outperform previous multi-modal models in certain tasks. It's clear that their incorporation into multi-modal learning would significantly improve performance. However, multi-modal learning with these models still suffers from insufficient learning of uni-modal features, which weakens the resulting multi-modal model's generalization ability. While fine-tuning uni-modal models separately and then aggregating their predictions is straightforward, it doesn't allow for adequate adaptation between modalities, also leading to sub-optimal results. To this end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By freezing the weights of uni-modal fine-tuned models, adding extra trainable rank decomposition matrices to them, and subsequently performing multi-modal joint training, our method enhances adaptation between modalities and boosts overall performance. We demonstrate the effectiveness of MMLoRA on three dataset categories: audio-visual (e.g., AVE, Kinetics-Sound, CREMA-D), vision-language (e.g., MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
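A minimal sketch of the rank-decomposition building block that MMLoRA-style training relies on (layer granularity, rank, and scaling here are illustrative; the paper applies the idea to frozen uni-modal fine-tuned models during multi-modal joint training):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a trainable low-rank update (B @ A)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep the uni-modal weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Only A and B receive gradients during joint training, which lets the modalities adapt to each other without disturbing the frozen uni-modal representations.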

HOD: A Benchmark Dataset for Harmful Object Detection

  • paper_url: http://arxiv.org/abs/2310.05192
  • repo_url: https://github.com/poori-nuna/hod-benchmark-dataset
  • paper_authors: Eungyeom Ha, Heemook Kim, Sung Chul Hong, Dongbin Na
  • for: The goal of this work is to develop systems that automatically detect harmful objects, preventing the spread of harmful content on online service platforms.
  • methods: The study builds a new benchmark dataset of more than 10,000 images across 6 potentially harmful categories, including hard cases that are difficult to detect, and trains recent state-of-the-art object detection architectures on it.
  • results: Extensive experiments show that the proposed dataset is greatly useful for the real-time harmful object detection task, with effective detection results on online service platforms.
    Abstract Recent multi-media data such as images and videos have been rapidly spread out on various online services such as social network services (SNS). With the explosive growth of online media services, the number of image content that may harm users is also growing exponentially. Thus, most recent online platforms such as Facebook and Instagram have adopted content filtering systems to prevent the prevalence of harmful content and reduce the possible risk of adverse effects on users. Unfortunately, computer vision research on detecting harmful content has not yet attracted attention enough. Users of each platform still manually click the report button to recognize patterns of harmful content they dislike when exposed to harmful content. However, the problem with manual reporting is that users are already exposed to harmful content. To address these issues, our research goal in this work is to develop automatic harmful object detection systems for online services. We present a new benchmark dataset for harmful object detection. Unlike most related studies focusing on a small subset of object categories, our dataset addresses various categories. Specifically, our proposed dataset contains more than 10,000 images across 6 categories that might be harmful, consisting of not only normal cases but also hard cases that are difficult to detect. Moreover, we have conducted extensive experiments to evaluate the effectiveness of our proposed dataset. We have utilized the recently proposed state-of-the-art (SOTA) object detection architectures and demonstrated our proposed dataset can be greatly useful for the real-time harmful object detection task. The whole source codes and datasets are publicly accessible at https://github.com/poori-nuna/HOD-Benchmark-Dataset.

AANet: Aggregation and Alignment Network with Semi-hard Positive Sample Mining for Hierarchical Place Recognition

  • paper_url: http://arxiv.org/abs/2310.05184
  • repo_url: https://github.com/Lu-Feng/AANet
  • paper_authors: Feng Lu, Lijun Zhang, Shuting Dong, Baifan Chen, Chun Yuan
  • for: The paper proposes an efficient visual place recognition (VPR) method for localizing robots.
  • methods: It adopts a hierarchical two-stage VPR pipeline in which global features retrieve the top-k candidates and local features re-rank them; a Dynamically Aligning Local Features (DALF) algorithm aligns local features under spatial constraints without costly geometric consistency verification, and a Semi-hard Positive Sample Mining (ShPSM) strategy selects harder positive images for training.
  • results: Extensive experiments on four benchmark VPR datasets show that the proposed AANet outperforms several state-of-the-art methods with less time consumption.
    Abstract Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to locate robots. Recently, the hierarchical two-stage VPR methods have become popular in this field due to the trade-off between accuracy and efficiency. These methods retrieve the top-k candidate images using the global features in the first stage, then re-rank the candidates by matching the local features in the second stage. However, they usually require additional algorithms (e.g. RANSAC) for geometric consistency verification in re-ranking, which is time-consuming. Here we propose a Dynamically Aligning Local Features (DALF) algorithm to align the local features under spatial constraints. It is significantly more efficient than the methods that need geometric consistency verification. We present a unified network capable of extracting global features for retrieving candidates via an aggregation module and aligning local features for re-ranking via the DALF alignment module. We call this network AANet. Meanwhile, many works use the simplest positive samples in triplet for weakly supervised training, which limits the ability of the network to recognize harder positive pairs. To address this issue, we propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks. Extensive experiments on four benchmark VPR datasets show that the proposed AANet can outperform several state-of-the-art methods with less time consumption. The code is released at https://github.com/Lu-Feng/AANet.
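A rough sketch of one common reading of semi-hard positive mining (the selection rule and fallback below are assumptions, not necessarily the paper's exact ShPSM criterion): among the positives of each anchor, pick the hardest one that is still closer than the nearest negative.

```python
import torch

def semi_hard_positive_mining(anchor, positives, negatives):
    """Pick, for each anchor embedding (B, D), the hardest positive (from P candidates)
    that is still closer to the anchor than its nearest negative."""
    d_pos = torch.cdist(anchor, positives)                      # (B, P)
    d_neg = torch.cdist(anchor, negatives).min(dim=1).values    # (B,) nearest negative
    valid = d_pos < d_neg.unsqueeze(1)                          # positives still inside the margin
    d_masked = torch.where(valid, d_pos, torch.full_like(d_pos, -1.0))
    idx = d_masked.argmax(dim=1)       # hardest valid positive; falls back to index 0 if none valid
    return positives[idx]
```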

ITRE: Low-light Image Enhancement Based on Illumination Transmission Ratio Estimation

  • paper_url: http://arxiv.org/abs/2310.05158
  • repo_url: None
  • paper_authors: Yu Wang, Yihong Wang, Tong Liu, Xiubao Sui, Qian Chen
  • for: Enhancing the quality of low-light images.
  • methods: A Retinex-based method, ITRE, that clusters pixels in RGB color space to estimate an Illumination Transmission Ratio (ITR) matrix, uses the ITR as the initial illumination transmission map to build a base model for a refined transmission map, and adds an over-exposure module and a Robust-Guard module to suppress noise and artifacts while controlling over-exposure.
  • results: Compared with state-of-the-art methods, the approach performs better at suppressing noise, preventing artifacts, and controlling the exposure level simultaneously.
    Abstract Noise, artifacts, and over-exposure are significant challenges in the field of low-light image enhancement. Existing methods often struggle to address these issues simultaneously. In this paper, we propose a novel Retinex-based method, called ITRE, which suppresses noise and artifacts from the origin of the model, prevents over-exposure throughout the enhancement process. Specifically, we assume that there must exist a pixel which is least disturbed by low light within pixels of same color. First, clustering the pixels on the RGB color space to find the Illumination Transmission Ratio (ITR) matrix of the whole image, which determines that noise is not over-amplified easily. Next, we consider ITR of the image as the initial illumination transmission map to construct a base model for refined transmission map, which prevents artifacts. Additionally, we design an over-exposure module that captures the fundamental characteristics of pixel over-exposure and seamlessly integrate it into the base model. Finally, there is a possibility of weak enhancement when inter-class distance of pixels with same color is too small. To counteract this, we design a Robust-Guard module that safeguards the robustness of the image enhancement process. Extensive experiments demonstrate the effectiveness of our approach in suppressing noise, preventing artifacts, and controlling over-exposure level simultaneously. Our method performs superiority in qualitative and quantitative performance evaluations by comparing with state-of-the-art methods.

LocoNeRF: A NeRF-based Approach for Local Structure from Motion for Precise Localization

  • paper_url: http://arxiv.org/abs/2310.05134
  • repo_url: None
  • paper_authors: Artem Nenashev, Mikhail Kurenkov, Andrei Potapov, Iana Zhura, Maksim Katerishich, Dzmitry Tsetserukou
  • for: Improving the accuracy of visual localization while addressing the high latency of global SfM and the large image databases required by local SfM.
  • methods: Using Neural Radiance Fields (NeRF) instead of image databases for storage, and sampling reference images around the prior query position for further improvements.
  • results: The method achieves an accuracy of 0.068 meters against ground truth obtained with LIDAR and A-LOAM, slightly below COLMAP's 0.022 meters, while reducing the database size to 160 MB compared with 400 MB for COLMAP. Additionally, an ablation study shows the impact of using reference images from the NeRF reconstruction.
    Abstract Visual localization is a critical task in mobile robotics, and researchers are continuously developing new approaches to enhance its efficiency. In this article, we propose a novel approach to improve the accuracy of visual localization using Structure from Motion (SfM) techniques. We highlight the limitations of global SfM, which suffers from high latency, and the challenges of local SfM, which requires large image databases for accurate reconstruction. To address these issues, we propose utilizing Neural Radiance Fields (NeRF), as opposed to image databases, to cut down on the space required for storage. We suggest that sampling reference images around the prior query position can lead to further improvements. We evaluate the accuracy of our proposed method against ground truth obtained using LIDAR and Advanced Lidar Odometry and Mapping in Real-time (A-LOAM), and compare its storage usage against local SfM with COLMAP in the conducted experiments. Our proposed method achieves an accuracy of 0.068 meters compared to the ground truth, which is slightly lower than the most advanced method COLMAP, which has an accuracy of 0.022 meters. However, the size of the database required for COLMAP is 400 megabytes, whereas the size of our NeRF model is only 160 megabytes. Finally, we perform an ablation study to assess the impact of using reference images from the NeRF reconstruction.

Geometry Aware Field-to-field Transformations for 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.05133
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Dominik Hollidt, Clinton Wang, Polina Golland, Marc Pollefeys
  • for: Performing 3D semantic segmentation solely from 2D supervision by leveraging Neural Radiance Fields (NeRFs).
  • methods: Features are extracted along a surface point cloud, yielding a compact, sample-efficient representation of the scene that is conducive to 3D reasoning; this feature space is learned in an unsupervised manner via masked autoencoding.
  • results: The learned representation enables few-shot segmentation and is agnostic to the scene parameterization, working on scenes fit with any type of NeRF.
    Abstract We present a novel approach to perform 3D semantic segmentation solely from 2D supervision by leveraging Neural Radiance Fields (NeRFs). By extracting features along a surface point cloud, we achieve a compact representation of the scene which is sample-efficient and conducive to 3D reasoning. Learning this feature space in an unsupervised manner via masked autoencoding enables few-shot segmentation. Our method is agnostic to the scene parameterization, working on scenes fit with any type of NeRF.

Bidirectional Knowledge Reconfiguration for Lightweight Point Cloud Analysis

  • paper_url: http://arxiv.org/abs/2310.05125
  • repo_url: None
  • paper_authors: Peipei Li, Xing Cui, Yibo Hu, Man Zhang, Ting Yao, Tao Mei
  • for: This work addresses the computational overhead of point cloud analysis so that it can be applied on mobile or edge devices.
  • methods: The paper explores feature distillation for lightweight point cloud models. Specifically, it proposes bidirectional knowledge reconfiguration (BKR) to distill informative contextual knowledge from the teacher to the student, together with a feature mover's distance (FMD) loss based on optimal transportation to handle misaligned intermediate features.
  • results: Experiments on shape classification, part segmentation, and semantic segmentation benchmarks demonstrate the universality and superiority of the method.
    Abstract Point cloud analysis faces computational system overhead, limiting its application on mobile or edge devices. Directly employing small models may result in a significant drop in performance since it is difficult for a small model to adequately capture local structure and global shape information simultaneously, which are essential clues for point cloud analysis. This paper explores feature distillation for lightweight point cloud models. To mitigate the semantic gap between the lightweight student and the cumbersome teacher, we propose bidirectional knowledge reconfiguration (BKR) to distill informative contextual knowledge from the teacher to the student. Specifically, a top-down knowledge reconfiguration and a bottom-up knowledge reconfiguration are developed to inherit diverse local structure information and consistent global shape knowledge from the teacher, respectively. However, due to the farthest point sampling in most point cloud models, the intermediate features between teacher and student are misaligned, deteriorating the feature distillation performance. To eliminate it, we propose a feature mover's distance (FMD) loss based on optimal transportation, which can measure the distance between unordered point cloud features effectively. Extensive experiments conducted on shape classification, part segmentation, and semantic segmentation benchmarks demonstrate the universality and superiority of our method.
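A toy version of a feature mover's distance between two unordered, equal-sized feature sets, using a hard optimal assignment via the Hungarian algorithm; the paper's optimal-transport formulation may differ:

```python
import torch
from scipy.optimize import linear_sum_assignment

def feature_movers_distance(f_student, f_teacher):
    """Toy distance between two unordered feature sets of shape (N, C): find the
    one-to-one matching with minimal total cost and average the matched costs."""
    cost = torch.cdist(f_student, f_teacher)                    # (N, N) pairwise distances
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    # The assignment is treated as a constant; gradients flow through the selected costs.
    return cost[row, col].mean()
```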

Cross-domain Robust Deepfake Bias Expansion Network for Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2310.05124
  • repo_url: None
  • paper_authors: Weihua Liu, Lin Li, Chaochao Lin, Said Boumaraf
  • for: The paper aims to strengthen face forgery detection against deepfakes, particularly when malicious users manipulate forged faces to obscure deepfake traces or when test samples come from unseen domains.
  • methods: It proposes a Cross-Domain Robust Bias Expansion Network (BENet) that reconstructs input faces with an auto-encoder, keeping real faces invariant while selectively enhancing the difference between reconstructed fake faces and their originals; a bias expansion loss with contrastive concepts, a Latent-Space Attention (LSA) module, and a cross-domain detector with a threshold further amplify forgery clues and defend against unknown cross-domain attacks.
  • results: Extensive experiments show that BENet outperforms state-of-the-art methods in both intra-database and cross-database evaluations.
    Abstract The rapid advancement of deepfake technologies raises significant concerns about the security of face recognition systems. While existing methods leverage the clues left by deepfake techniques for face forgery detection, malicious users may intentionally manipulate forged faces to obscure the traces of deepfake clues and thereby deceive detection tools. Meanwhile, attaining cross-domain robustness for data-based methods poses a challenge due to potential gaps in the training data, which may not encompass samples from all relevant domains. Therefore, in this paper, we introduce a solution - a Cross-Domain Robust Bias Expansion Network (BENet) - designed to enhance face forgery detection. BENet employs an auto-encoder to reconstruct input faces, maintaining the invariance of real faces while selectively enhancing the difference between reconstructed fake faces and their original counterparts. This enhanced bias forms a robust foundation upon which dependable forgery detection can be built. To optimize the reconstruction results in BENet, we employ a bias expansion loss infused with contrastive concepts to attain the aforementioned objective. In addition, to further heighten the amplification of forged clues, BENet incorporates a Latent-Space Attention (LSA) module. This LSA module effectively captures variances in latent features between the auto-encoder's encoder and decoder, placing emphasis on inconsistent forgery-related information. Furthermore, BENet incorporates a cross-domain detector with a threshold to determine whether the sample belongs to a known distribution. The correction of classification results through the cross-domain detector enables BENet to defend against unknown deepfake attacks from cross-domain. Extensive experiments demonstrate the superiority of BENet compared with state-of-the-art methods in intra-database and cross-database evaluations.

Dynamic Multi-Domain Knowledge Networks for Chest X-ray Report Generation

  • paper_url: http://arxiv.org/abs/2310.05119
  • repo_url: None
  • paper_authors: Weihua Liu, Youyuan Xue, Chaochao Lin, Said Boumaraf
  • for: This paper aims to address the challenges of automatically generating radiology diagnostic reports, particularly the imbalance in data distribution between normal and abnormal samples, by proposing a Dynamic Multi-Domain Knowledge (DMDK) network.
  • methods: The proposed DMDK network consists of four modules: Chest Feature Extractor (CFE), Dynamic Knowledge Extractor (DKE), Specific Knowledge Extractor (SKE), and Multi-knowledge Integrator (MKI) module. The network utilizes dynamic disease topic labels, domain-specific dynamic knowledge graphs, and multi-knowledge integration to mitigate data biases and enhance interpretability.
  • results: The proposed method was extensively evaluated on two widely used datasets (IU X-Ray and MIMIC-CXR) and achieved state-of-the-art performance in all evaluation metrics, outperforming previous models.
    Abstract The automated generation of radiology diagnostic reports helps radiologists make timely and accurate diagnostic decisions while also enhancing clinical diagnostic efficiency. However, the significant imbalance in the distribution of data between normal and abnormal samples (including visual and textual biases) poses significant challenges for a data-driven task like automatically generating diagnostic radiology reports. Therefore, we propose a Dynamic Multi-Domain Knowledge(DMDK) network for radiology diagnostic report generation. The DMDK network consists of four modules: Chest Feature Extractor(CFE), Dynamic Knowledge Extractor(DKE), Specific Knowledge Extractor(SKE), and Multi-knowledge Integrator(MKI) module. Specifically, the CFE module is primarily responsible for extracting the unprocessed visual medical features of the images. The DKE module is responsible for extracting dynamic disease topic labels from the retrieved radiology diagnostic reports. We then fuse the dynamic disease topic labels with the original visual features of the images to highlight the abnormal regions in the original visual features to alleviate the visual data bias problem. The SKE module expands upon the conventional static knowledge graph to mitigate textual data biases and amplify the interpretability capabilities of the model via domain-specific dynamic knowledge graphs. The MKI distills all the knowledge and generates the final diagnostic radiology report. We performed extensive experiments on two widely used datasets, IU X-Ray and MIMIC-CXR. The experimental results demonstrate the effectiveness of our method, with all evaluation metrics outperforming previous state-of-the-art models.

Lightweight In-Context Tuning for Multimodal Unified Models

  • paper_url: http://arxiv.org/abs/2310.05109
  • repo_url: None
  • paper_authors: Yixin Chen, Shuai Zhang, Boran Han, Jiaya Jia
  • For: The paper aims to address the challenges of in-context learning (ICL) in multimodal tasks, specifically the difficulty of extrapolating from contextual examples to perform ICL as more modalities are added.
  • Methods: The proposed solution is called MultiModal In-conteXt Tuning (M$^2$IXT), a lightweight module that incorporates an expandable context window to include various labeled examples of multiple modalities. The module can be prepended to various multimodal unified models and trained via a mixed-tasks strategy to enable rapid few-shot adaptation on multiple tasks and datasets.
  • Results: The paper shows that M$^2$IXT can significantly boost the few-shot ICL performance (e.g., 18% relative increase for OFA) and achieve state-of-the-art results across various tasks, including visual question answering, image captioning, visual grounding, and visual entailment, while being considerably small in terms of model parameters.
    Abstract In-context learning (ICL) involves reasoning from given contextual examples. As more modalities comes, this procedure is becoming more challenging as the interleaved input modalities convolutes the understanding process. This is exemplified by the observation that multimodal models often struggle to effectively extrapolate from contextual examples to perform ICL. To address these challenges, we introduce MultiModal In-conteXt Tuning (M$^2$IXT), a lightweight module to enhance the ICL capabilities of multimodal unified models. The proposed M$^2$IXT module perceives an expandable context window to incorporate various labeled examples of multiple modalities (e.g., text, image, and coordinates). It can be prepended to various multimodal unified models (e.g., OFA, Unival, LLaVA) of different architectures and trained via a mixed-tasks strategy to enable rapid few-shot adaption on multiple tasks and datasets. When tuned on as little as 50K multimodal data, M$^2$IXT can boost the few-shot ICL performance significantly (e.g., 18\% relative increase for OFA), and obtained state-of-the-art results across an array of tasks including visual question answering, image captioning, visual grounding, and visual entailment, while being considerably small in terms of model parameters (e.g., $\sim$$20\times$ smaller than Flamingo or MMICL), highlighting the flexibility and effectiveness of M$^2$IXT as a multimodal in-context learner.

Enhancing Representations through Heterogeneous Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2310.05108
  • repo_url: None
  • paper_authors: Zhong-Yu Li, Bo-Wen Yin, Shanghua Gao, Yongxiang Liu, Li Liu, Ming-Ming Cheng
  • for: Improving the representation quality of a base model in self-supervised learning by having it learn from an auxiliary head whose architecture differs from the base model.
  • methods: The proposed Heterogeneous Self-Supervised Learning (HSSL) enforces the base model to learn from a heterogeneous auxiliary head, endowing the base model with new characteristics through representation learning without structural changes.
  • results: Experiments on various base-model/auxiliary-head pairs show that the representation quality of the base model improves as the architecture discrepancy grows. The paper also proposes a search strategy that quickly determines the most suitable auxiliary head for a given base model, along with simple but effective methods to enlarge the model discrepancy, achieving superior performance on image classification, semantic segmentation, instance segmentation, and object detection.
    Abstract Incorporating heterogeneous representations from different architectures has facilitated various vision tasks, e.g., some hybrid networks combine transformers and convolutions. However, complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture is heterogeneous from the base model. In this process, HSSL endows the base model with new characteristics in a representation learning way without structural changes. To comprehensively understand the HSSL, we conduct experiments on various heterogeneous pairs containing a base model and an auxiliary head. We discover that the representation quality of the base model moves up as their architecture discrepancy grows. This observation motivates us to propose a search strategy that quickly determines the most suitable auxiliary head for a specific base model to learn and several simple but effective methods to enlarge the model discrepancy. The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection. Our source code will be made publicly available.
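A toy pairing in the spirit of the heterogeneous auxiliary head described above (a convolutional base with a transformer-style head and a simple contrastive objective; the actual heads, losses, and training recipe are assumptions here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousSSLToy(nn.Module):
    """Convolutional base model paired with a transformer-style auxiliary head; the
    self-supervised loss is computed after the head and back-propagated into the base."""
    def __init__(self, dim=128):
        super().__init__()
        self.base = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.aux_head = nn.TransformerEncoder(layer, num_layers=1)   # heterogeneous architecture

    def forward(self, view1, view2):                                 # two augmented views
        z1 = self.aux_head(self.base(view1).unsqueeze(1)).squeeze(1)
        z2 = self.aux_head(self.base(view2).unsqueeze(1)).squeeze(1)
        logits = F.normalize(z1, dim=-1) @ F.normalize(z2, dim=-1).T / 0.1
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)                       # simple contrastive objective
```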

OV-PARTS: Towards Open-Vocabulary Part Segmentation

  • paper_url: http://arxiv.org/abs/2310.05107
  • repo_url: https://github.com/openrobotlab/ov_parts
  • paper_authors: Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang
  • for: This work introduces an Open-Vocabulary Part Segmentation (OV-PARTS) benchmark to investigate the challenges posed by the diverse and often ambiguous definitions of parts in the open world.
  • methods: The benchmark includes refined versions of two publicly available datasets, Pascal-Part-116 and ADE20K-Part-234, and covers three tasks, Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation, probing the analogical reasoning, open granularity, and few-shot adapting abilities of models.
  • results: The study analyzes and adapts two prevailing paradigms of existing object-level OVSS methods for OV-PARTS, with extensive experimental analysis, and releases the code and dataset to support future research.
    Abstract Segmenting and recognizing diverse object parts is a crucial ability in applications spanning various computer vision and robotic tasks. While significant progress has been made in object-level Open-Vocabulary Semantic Segmentation (OVSS), i.e., segmenting objects with arbitrary text, the corresponding part-level research poses additional challenges. Firstly, part segmentation inherently involves intricate boundaries, while limited annotated data compounds the challenge. Secondly, part segmentation introduces an open granularity challenge due to the diverse and often ambiguous definitions of parts in the open world. Furthermore, the large-scale vision and language models, which play a key role in the open vocabulary setting, struggle to recognize parts as effectively as objects. To comprehensively investigate and tackle these challenges, we propose an Open-Vocabulary Part Segmentation (OV-PARTS) benchmark. OV-PARTS includes refined versions of two publicly available datasets: Pascal-Part-116 and ADE20K-Part-234. And it covers three specific tasks: Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation, providing insights into analogical reasoning, open granularity and few-shot adapting abilities of models. Moreover, we analyze and adapt two prevailing paradigms of existing object-level OVSS methods for OV-PARTS. Extensive experimental analysis is conducted to inspire future research in leveraging foundational models for OV-PARTS. The code and dataset are available at https://github.com/OpenRobotLab/OV_PARTS.

Cross-head mutual Mean-Teaching for semi-supervised medical image segmentation

  • paper_url: http://arxiv.org/abs/2310.05082
  • repo_url: https://github.com/leesoon1984/cmmt-net
  • paper_authors: Wei Li, Ruifeng Bian, Wenyi Zhao, Weijin Xu, Huihua Yang
  • for: Improving the accuracy and consistency of semi-supervised medical image segmentation.
  • methods: A novel Cross-head mutual mean-teaching Network (CMMT-Net) that combines teacher-student peer networks, cross-head pseudo-label supervision, mutual virtual adversarial training, and Cross-Set CutMix augmentation to benefit both self-training and consistency learning.
  • results: Experiments on three publicly available datasets show remarkable improvements over previous state-of-the-art methods across various semi-supervised scenarios.
    Abstract Semi-supervised medical image segmentation (SSMIS) has witnessed substantial advancements by leveraging limited labeled data and abundant unlabeled data. Nevertheless, existing state-of-the-art (SOTA) methods encounter challenges in accurately predicting labels for the unlabeled data, giving rise to disruptive noise during training and susceptibility to erroneous information overfitting. Moreover, applying perturbations to inaccurate predictions further reduces consistent learning. To address these concerns, we propose a novel Cross-head mutual mean-teaching Network (CMMT-Net) incorporated strong-weak data augmentation, thereby benefitting both self-training and consistency learning. Specifically, our CMMT-Net consists of both teacher-student peer networks with a share encoder and dual slightly different decoders, and the pseudo labels generated by one mean teacher head are adopted to supervise the other student branch to achieve a mutual consistency. Furthermore, we propose mutual virtual adversarial training (MVAT) to smooth the decision boundary and enhance feature representations. To diversify the consistency training samples, we employ Cross-Set CutMix strategy, which also helps address distribution mismatch issues. Notably, CMMT-Net simultaneously implements data, feature, and network perturbations, amplifying model diversity and generalization performance. Experimental results on three publicly available datasets indicate that our approach yields remarkable improvements over previous SOTA methods across various semi-supervised scenarios. Code and logs will be available at https://github.com/Leesoon1984/CMMT-Net.
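Two small pieces in the spirit of the method above, shown as a simplified sketch: a standard mean-teacher EMA update and a cross-head pseudo-label term. In the paper the pseudo labels come from the mean-teacher heads, and additional MVAT and Cross-Set CutMix terms are used.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Mean-teacher update: teacher weights track an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1 - momentum)

def cross_head_consistency(logits_a, logits_b):
    """Simplified cross supervision on unlabeled images: each head is trained against the
    pseudo labels produced by the other head. Logits have shape (B, C, H, W)."""
    pseudo_a = logits_a.argmax(dim=1).detach()
    pseudo_b = logits_b.argmax(dim=1).detach()
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)
```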

Language-driven Open-Vocabulary Keypoint Detection for Animal Body and Face

  • paper_url: http://arxiv.org/abs/2310.05056
  • repo_url: None
  • paper_authors: Hao Zhang, Kaipeng Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao
  • for: Introduces the Open-Vocabulary Keypoint Detection (OVKD) task, which uses text prompts to localize arbitrary keypoints of any species.
  • methods: Proposes Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM), which uses vision and language models to exploit the relationship between text and vision, associating text prompts with relevant keypoint features.
  • results: Experiments show that the proposed components bring significant performance gains, and the overall method achieves impressive results on OVKD, even surpassing state-of-the-art few-shot keypoint detection methods in a zero-shot fashion.
    Abstract Current approaches for image-based keypoint detection on animal (including human) body and face are limited to specific keypoints and species. We address the limitation by proposing the Open-Vocabulary Keypoint Detection (OVKD) task. It aims to use text prompts to localize arbitrary keypoints of any species. To accomplish this objective, we propose Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM), which utilizes both vision and language models to harness the relationship between text and vision and thus achieve keypoint detection through associating text prompt with relevant keypoint features. Additionally, KDSM integrates domain distribution matrix matching and some special designs to reinforce the relationship between language and vision, thereby improving the model's generalizability and performance. Extensive experiments show that our proposed components bring significant performance improvements, and our overall method achieves impressive results in OVKD. Remarkably, our method outperforms the state-of-the-art few-shot keypoint detection methods using a zero-shot fashion. We will make the source code publicly accessible.
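
As a rough illustration of the text-prompt matching idea, keypoint heatmaps can be formed by comparing text embeddings of keypoint descriptions with dense visual features at every spatial location. The encoders are replaced with random tensors here, and the temperature, dimensions, and function names are assumptions rather than the actual KDSM implementation.

```python
# Illustrative sketch of matching text prompts to keypoint features (not the official KDSM code).
import torch
import torch.nn.functional as F

def keypoint_heatmaps(visual_feats, text_embeds, temperature=0.07):
    """
    visual_feats: (B, C, H, W) dense image features from a vision encoder.
    text_embeds:  (K, C) embeddings of prompts such as "left eye of a cat".
    Returns (B, K, H, W) similarity heatmaps, one per keypoint prompt.
    """
    v = F.normalize(visual_feats, dim=1)          # normalize channels for cosine similarity
    t = F.normalize(text_embeds, dim=1)
    sim = torch.einsum("bchw,kc->bkhw", v, t)     # dot product at every spatial location
    return (sim / temperature).softmax(dim=1)     # soft assignment over keypoint prompts

def decode_keypoints(heatmaps):
    """Take the argmax location of each heatmap as the predicted keypoint position."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1).argmax(dim=-1)
    rows = torch.div(flat, w, rounding_mode="floor")
    cols = flat % w
    return torch.stack((rows, cols), dim=-1)      # (B, K, 2) as (row, col)

if __name__ == "__main__":
    feats = torch.randn(1, 512, 32, 32)   # stand-in for vision-encoder features
    prompts = torch.randn(17, 512)        # stand-in for text embeddings of 17 keypoint names
    hm = keypoint_heatmaps(feats, prompts)
    print(decode_keypoints(hm).shape)     # torch.Size([1, 17, 2])
```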

FairTune: Optimizing Parameter Efficient Fine Tuning for Fairness in Medical Image Analysis

  • paper_url: http://arxiv.org/abs/2310.05055
  • repo_url: None
  • paper_authors: Raman Dutt, Ondrej Bohdal, Sotirios A. Tsaftaris, Timothy Hospedales
  • for: Improving the robustness and group fairness of machine learning models in ethically sensitive application areas such as medical diagnosis.
  • methods: Frames fair learning as a bi-level optimization problem, choosing the learning strategy (which parameter-efficient fine-tuning parameters to update) based on fairness measured on a validation set.
  • results: Experiments show that the FairTune framework improves fairness on a range of medical imaging datasets.
    Abstract Training models with robust group fairness properties is crucial in ethically sensitive application areas such as medical diagnosis. Despite the growing body of work aiming to minimise demographic bias in AI, this problem remains challenging. A key reason for this challenge is the fairness generalisation gap: High-capacity deep learning models can fit all training data nearly perfectly, and thus also exhibit perfect fairness during training. In this case, bias emerges only during testing when generalisation performance differs across subgroups. This motivates us to take a bi-level optimisation perspective on fair learning: Optimising the learning strategy based on validation fairness. Specifically, we consider the highly effective workflow of adapting pre-trained models to downstream medical imaging tasks using parameter-efficient fine-tuning (PEFT) techniques. There is a trade-off between updating more parameters, enabling a better fit to the task of interest vs. fewer parameters, potentially reducing the generalisation gap. To manage this tradeoff, we propose FairTune, a framework to optimise the choice of PEFT parameters with respect to fairness. We demonstrate empirically that FairTune leads to improved fairness on a range of medical imaging datasets.
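
A toy sketch of the bi-level idea follows: the inner step fine-tunes under a candidate PEFT configuration, and the outer step scores each candidate by a subgroup accuracy gap on validation data, keeping the fairest one. The search space, fairness metric, and the stubbed training/evaluation functions are illustrative assumptions, not FairTune's actual procedure.

```python
# Toy sketch of bi-level PEFT selection driven by validation fairness (not the FairTune code).

def inner_finetune(peft_mask):
    """Stand-in for parameter-efficient fine-tuning with only the masked layers trainable."""
    return {"model": f"model_tuned_with_{peft_mask}"}

def validation_metrics(model, group):
    """Stand-in for per-subgroup validation accuracy (deterministic toy numbers)."""
    seed = sum(ord(c) for c in model["model"] + group)
    return 0.7 + (seed % 25) / 100.0

def fairness_gap(model, groups=("group_a", "group_b")):
    """Outer objective: a smaller accuracy gap across subgroups means a fairer model."""
    accs = [validation_metrics(model, g) for g in groups]
    return max(accs) - min(accs)

def outer_search(candidate_masks):
    """Pick the PEFT configuration whose fine-tuned model is fairest on validation data."""
    best_mask, best_gap = None, float("inf")
    for mask in candidate_masks:
        model = inner_finetune(mask)   # inner optimisation: fit the downstream task
        gap = fairness_gap(model)      # outer optimisation: fairness on the validation split
        if gap < best_gap:
            best_mask, best_gap = mask, gap
    return best_mask, best_gap

if __name__ == "__main__":
    # Each mask says which blocks of a pre-trained ViT receive PEFT updates (illustrative).
    masks = ["blocks_0_3", "blocks_4_7", "blocks_8_11", "all_blocks"]
    print(outer_search(masks))
```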

Low-Resolution Self-Attention for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.05026
  • repo_url: https://github.com/yuhuan-wu/LRFormer
  • paper_authors: Yu-Huan Wu, Shi-Chen Zhang, Yun Liu, Le Zhang, Xin Zhan, Daquan Zhou, Jiashi Feng, Ming-Ming Cheng, Liangli Zhen
  • for: LRFormer is designed for semantic segmentation tasks, specifically to improve the efficiency of vision transformers while maintaining performance.
  • methods: LRFormer uses a Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a reduced computational cost, along with 3x3 depth-wise convolutions to capture fine details in the high-resolution space.
  • results: LRFormer outperforms state-of-the-art models on the ADE20K, COCO-Stuff, and Cityscapes datasets.
    Abstract Semantic segmentation tasks naturally require high-resolution information for pixel-wise segmentation and global context information for class prediction. While existing vision transformers demonstrate promising performance, they often utilize high resolution context modeling, resulting in a computational bottleneck. In this work, we challenge conventional wisdom and introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost. Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution, with additional 3x3 depth-wise convolutions to capture fine details in the high-resolution space. We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure. Extensive experiments on the ADE20K, COCO-Stuff, and Cityscapes datasets demonstrate that LRFormer outperforms state-of-the-art models. The code will be made available at https://github.com/yuhuan-wu/LRFormer.
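
A minimal PyTorch-style sketch of the low-resolution self-attention idea is shown below: features are pooled to a fixed low resolution where self-attention is cheap, the attended context is upsampled back, and a 3x3 depth-wise convolution preserves fine detail at full resolution. The dimensions, pooling choice, and residual wiring are assumptions for illustration, not the exact LRFormer block.

```python
# Illustrative sketch of Low-Resolution Self-Attention (LRSA); not the official LRFormer block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LRSA(nn.Module):
    def __init__(self, dim=64, heads=4, pool_size=16):
        super().__init__()
        self.pool_size = pool_size                       # attention always runs on pool_size^2 tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # 3x3 depth-wise conv for detail

    def forward(self, x):                                # x: (B, C, H, W) at full resolution
        b, c, h, w = x.shape
        # 1) Global context at a fixed, input-independent low resolution.
        low = F.adaptive_avg_pool2d(x, self.pool_size)   # (B, C, P, P) regardless of H, W
        tokens = low.flatten(2).transpose(1, 2)          # (B, P*P, C)
        ctx, _ = self.attn(tokens, tokens, tokens)       # self-attention over few tokens -> cheap
        ctx = ctx.transpose(1, 2).reshape(b, c, self.pool_size, self.pool_size)
        ctx = F.interpolate(ctx, size=(h, w), mode="bilinear", align_corners=False)
        # 2) Fine detail captured in the high-resolution space.
        return x + ctx + self.dwconv(x)

if __name__ == "__main__":
    block = LRSA()
    y = block(torch.randn(1, 64, 128, 96))
    print(y.shape)  # torch.Size([1, 64, 128, 96])
```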

Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn

  • paper_url: http://arxiv.org/abs/2310.05024
  • repo_url: None
  • paper_authors: Sanhita Pathak, Vinay Kaushik, Brejesh Lall
  • for: An image-based virtual try-on system that fits an in-shop garment onto a clothed person image.
  • methods: Proposes a novel single-stage framework that implicitly learns garment warping and body synthesis from target pose keypoints, using a semantic-contextual fusion attention module and a lightweight linear attention framework to address misalignment and artifacts.
  • results: Achieves higher efficiency and quality than existing methods, providing a more reliable and realistic virtual try-on experience.
    Abstract Image-based virtual try-on aims to fit an in-shop garment onto a clothed person image. Garment warping, which aligns the target garment with the corresponding body parts in the person image, is a crucial step in achieving this goal. Existing methods often use multi-stage frameworks to handle clothes warping, person body synthesis and tryon generation separately or rely on noisy intermediate parser-based labels. We propose a novel single-stage framework that implicitly learns the same without explicit multi-stage learning. Our approach utilizes a novel semantic-contextual fusion attention module for garment-person feature fusion, enabling efficient and realistic cloth warping and body synthesis from target pose keypoints. By introducing a lightweight linear attention framework that attends to garment regions and fuses multiple sampled flow fields, we also address misalignment and artifacts present in previous methods. To achieve simultaneous learning of warped garment and try-on results, we introduce a Warped Cloth Learning Module. WCLM uses segmented warped garments as ground truth, operating within a single-stage paradigm. Our proposed approach significantly improves the quality and efficiency of virtual try-on methods, providing users with a more reliable and realistic virtual try-on experience. We evaluate our method on the VITON dataset and demonstrate its state-of-the-art performance in terms of both qualitative and quantitative metrics.

Detecting Abnormal Health Conditions in Smart Home Using a Drone

  • paper_url: http://arxiv.org/abs/2310.05012
  • repo_url: None
  • paper_authors: Pronob Kumar Barman
  • for: Developing a smart fall-detection system that helps elderly people and others with health conditions live independently.
  • methods: Uses vision-based fall monitoring from a camera-equipped drone, applying image/video segmentation and object detection to recognize falls.
  • results: The system identifies falling objects with a precision of 0.9948.
    Abstract Nowadays, detecting aberrant health issues is a difficult process. Falling, especially among the elderly, is a severe concern worldwide. Falls can result in deadly consequences, including unconsciousness, internal bleeding, and often death. A practical, smart approach to detecting falls is therefore needed. Vision-based fall monitoring is becoming more common among researchers as it enables senior citizens and those with other health conditions to live independently. For tracking, surveillance, and rescue, unmanned aerial vehicles use video or image segmentation and object detection methods. The Tello drone is equipped with a camera, and with this device we determined normal and abnormal behaviors among our participants. Falling objects are classified using a convolutional neural network (CNN) classifier. The results demonstrate that the system can identify falling objects with a precision of 0.9948.
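
As a generic illustration of the classification setup described above (frames from the drone camera classified as normal vs. falling), a minimal CNN classifier might look like the sketch below; it is not the model or data pipeline used in the paper.

```python
# Generic fall / no-fall frame classifier sketch (illustrative; not the paper's model).
import torch
import torch.nn as nn

class FallClassifier(nn.Module):
    def __init__(self, num_classes=2):  # classes: {normal, falling}
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, frames):            # frames: (B, 3, H, W) from the drone camera
        f = self.features(frames).flatten(1)
        return self.classifier(f)         # logits over {normal, falling}

if __name__ == "__main__":
    model = FallClassifier()
    logits = model(torch.randn(4, 3, 224, 224))
    print(logits.argmax(dim=1))           # predicted class per frame
```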

Data Augmentation through Pseudolabels in Automatic Region Based Coronary Artery Segmentation for Disease Diagnosis

  • paper_url: http://arxiv.org/abs/2310.05990
  • repo_url: None
  • paper_authors: Sandesh Pokhrel, Sanjay Bhandari, Eduard Vazquez, Yash Raj Shrestha, Binod Bhattarai
  • for: Diagnosing coronary artery disease (CAD) is difficult and resource-intensive; segmenting arteries in angiographic images helps clinicians make accurate diagnoses.
  • methods: Uses pseudolabels as a data augmentation technique to improve the performance of a baseline YOLO model.
  • results: Pseudolabel augmentation raises the baseline YOLO model's F1 score by 9% on the validation set and by 3% on the test set.
    Abstract Coronary Artery Diseases (CADs), though preventable, are among the leading causes of death and disability. Diagnosis of these diseases is often difficult and resource-intensive. Segmentation of arteries in angiographic images has evolved as a tool for assistance, helping clinicians make accurate diagnoses. However, due to the limited amount of data and the difficulty of curating a dataset, the task of segmentation has proven challenging. In this study, we introduce the idea of using pseudolabels as a data augmentation technique to improve the performance of the baseline YOLO model. This method increases the F1 score of the baseline by 9% on the validation dataset and by 3% on the test dataset.
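
The pseudolabel augmentation loop can be sketched as: train the detector on the labelled angiograms, run it on unlabelled images, keep predictions above a confidence threshold as extra labels, and retrain on the enlarged set. The `ToyDetector` interface below is a hypothetical stand-in, not the actual YOLO API or the authors' filtering rule.

```python
# Sketch of pseudolabel-based data augmentation (interfaces are hypothetical stand-ins).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Prediction:
    boxes: List[Tuple[float, float, float, float]]  # (x, y, w, h) artery-region boxes
    scores: List[float]                              # per-box confidence

class ToyDetector:
    """Stand-in for a YOLO-style detector; replace with a real model in practice."""
    def train(self, images, labels):
        print(f"training on {len(images)} images")
    def predict(self, image) -> Prediction:
        return Prediction(boxes=[(0.4, 0.4, 0.2, 0.2)], scores=[0.8])

def pseudolabel_round(detector, labeled, unlabeled, conf_threshold=0.5):
    """One round: fit on labelled data, then convert confident predictions into pseudolabels."""
    images, labels = zip(*labeled)
    detector.train(list(images), list(labels))
    pseudo = []
    for image in unlabeled:
        pred = detector.predict(image)
        kept = [b for b, s in zip(pred.boxes, pred.scores) if s >= conf_threshold]
        if kept:                                     # keep only confidently labelled images
            pseudo.append((image, kept))
    # Retrain on the union of real labels and pseudolabels (the augmentation step).
    all_images, all_labels = zip(*(list(labeled) + pseudo))
    detector.train(list(all_images), list(all_labels))
    return pseudo

if __name__ == "__main__":
    labeled = [("angiogram_01.png", [(0.5, 0.5, 0.3, 0.3)])]
    unlabeled = ["angiogram_02.png", "angiogram_03.png"]
    print(len(pseudolabel_round(ToyDetector(), labeled, unlabeled)))
```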

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

  • paper_url: http://arxiv.org/abs/2310.05010
  • repo_url: https://github.com/wengzejia1/open-vclip
  • paper_authors: Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang
  • for: The paper aims to adapt Contrastive Language-Image Pretraining (CLIP) for zero-shot video recognition, with the goal of identifying novel actions and events in videos.
  • methods: The proposed method, called Open-VCLIP++, modifies CLIP to capture spatial-temporal relationships in videos, and leverages a technique called Interpolated Weight Optimization to improve generalization. The method also utilizes large language models to produce fine-grained video descriptions, which are aligned with video features to facilitate a better transfer of CLIP to the video domain.
  • results: The proposed method achieves zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets respectively, outperforming the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. It also delivers competitive video-to-text and text-to-video retrieval performance on the MSR-VTT video-text retrieval dataset, with substantially less fine-tuning data compared to other methods.
    Abstract Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made exploring its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP to a strong zero-shot video classifier, capable of identifying novel actions and events during testing. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, thereby creating a specialized video classifier while striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models to produce fine-grained video descriptions. These detailed descriptions are further aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance, while utilizing substantially less fine-tuning data compared to other methods. Code is released at https://github.com/wengzejia1/Open-VCLIP.
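
Interpolated Weight Optimization builds on weight interpolation between the original and the adapted model; a minimal sketch of that blending step is shown below. The helper name, the single interpolation coefficient, and the use of plain state dicts are assumptions for illustration, not the exact Open-VCLIP++ procedure.

```python
# Sketch of weight interpolation between a pretrained and a fine-tuned model (illustrative).
import torch
import torch.nn as nn

def interpolate_weights(pretrained: nn.Module, finetuned: nn.Module, alpha: float) -> dict:
    """Return a state dict with theta = (1 - alpha) * theta_pretrained + alpha * theta_finetuned."""
    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    return {k: (1.0 - alpha) * sd_pre[k] + alpha * sd_ft[k] for k in sd_pre}

if __name__ == "__main__":
    pretrained = nn.Linear(8, 4)   # stand-in for the original CLIP encoder
    finetuned = nn.Linear(8, 4)    # stand-in for the video-adapted encoder
    finetuned.load_state_dict({k: v + 0.1 for k, v in pretrained.state_dict().items()})

    merged = nn.Linear(8, 4)
    merged.load_state_dict(interpolate_weights(pretrained, finetuned, alpha=0.5))
    # The merged weights sit halfway between the two models.
    print(torch.allclose(merged.weight, pretrained.weight + 0.05))
```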

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2310.04999
  • repo_url: https://github.com/wzx99/clipocr
  • paper_authors: Zixiao Wang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Boqiang Zhang, Yongdong Zhang
  • for: Explores the potential of the CLIP model for scene text recognition (STR) and proposes a Symmetrical Linguistic Feature Distillation framework (CLIP-OCR) that leverages both the visual and the linguistic knowledge in CLIP.
  • methods: Proposes a symmetrical distillation strategy (SDS) that better captures the linguistic knowledge in the CLIP text encoder: cascading the CLIP image encoder with the reversed CLIP text encoder builds a symmetrical structure whose image-to-text feature flow carries both visual and linguistic information.
  • results: Experiments show that CLIP-OCR achieves 93.8% average accuracy on six popular STR benchmarks.
    Abstract In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at https://github.com/wzx99/CLIPOCR.
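
Since the LCL is described as using second-order statistics, a plausible (assumed) form is a Gram-matrix consistency term added to a first-order feature-distillation loss, as sketched below; the exact formulation in CLIP-OCR may differ.

```python
# Sketch of first-order feature distillation plus a second-order (Gram) consistency term.
# Illustrative only; the exact Linguistic Consistency Loss in CLIP-OCR may differ.
import torch
import torch.nn.functional as F

def gram(feats):
    """Second-order statistics of a feature sequence: (B, T, C) -> (B, C, C) channel Gram matrix."""
    return torch.einsum("btc,btd->bcd", feats, feats) / feats.shape[1]

def distill_loss(student_feats, teacher_feats, lambda_second_order=1.0):
    """First-order L2 matching plus matching of channel-wise second-order statistics."""
    first = F.mse_loss(student_feats, teacher_feats)
    second = F.mse_loss(gram(student_feats), gram(teacher_feats))
    return first + lambda_second_order * second

if __name__ == "__main__":
    student = torch.randn(2, 25, 512, requires_grad=True)  # STR model features (B, T, C)
    teacher = torch.randn(2, 25, 512)                       # guidance features from the CLIP encoders
    loss = distill_loss(student, teacher)
    loss.backward()
    print(float(loss))
```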

SemST: Semantically Consistent Multi-Scale Image Translation via Structure-Texture Alignment

  • paper_url: http://arxiv.org/abs/2310.04995
  • repo_url: None
  • paper_authors: Ganning Zhao, Wenhui Cui, Suya You, C. -C. Jay Kuo
  • for: Proposes an unsupervised image-to-image (I2I) translation method that preserves semantic consistency, addressing the content discrepancy (semantic distortion) problem in I2I translation.
  • methods: Uses contrastive learning and maximizes mutual information between input and output to reduce semantic distortion; a multi-scale approach further improves translation performance.
  • results: Experiments show that the method effectively reduces semantic distortion and achieves state-of-the-art performance; preliminary experiments also show that SemST is a useful pre-training step for domain adaptation (DA) in semantic segmentation.
    Abstract Unsupervised image-to-image (I2I) translation learns cross-domain image mapping that transfers input from the source domain to output in the target domain while preserving its semantics. One challenge is that different semantic statistics in source and target domains result in content discrepancy known as semantic distortion. To address this problem, a novel I2I method that maintains semantic consistency in translation is proposed and named SemST in this work. SemST reduces semantic distortion by employing contrastive learning and aligning the structural and textural properties of input and output by maximizing their mutual information. Furthermore, a multi-scale approach is introduced to enhance translation performance, thereby enabling the applicability of SemST to domain adaptation in high-resolution images. Experiments show that SemST effectively mitigates semantic distortion and achieves state-of-the-art performance. Also, the application of SemST to domain adaptation (DA) is explored. It is demonstrated by preliminary experiments that SemST can be utilized as a beneficial pre-training for the semantic segmentation task.
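
The contrastive, mutual-information-maximizing objective can be illustrated with a patch-wise InfoNCE loss, where each patch of the translated output must match the corresponding patch of the input against all other patches. The feature shapes and temperature below are placeholders, not SemST's exact design.

```python
# Patch-wise InfoNCE sketch for content preservation in I2I translation (illustrative).
import torch
import torch.nn.functional as F

def patch_infonce(src_feats, out_feats, temperature=0.07):
    """
    src_feats, out_feats: (N, C) features of N corresponding patches from the input
    and the translated output. Matching patches are positives; all others are negatives.
    """
    src = F.normalize(src_feats, dim=1)
    out = F.normalize(out_feats, dim=1)
    logits = out @ src.t() / temperature   # (N, N) similarity of every output/input patch pair
    targets = torch.arange(src.shape[0])   # the i-th output patch should match the i-th input patch
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    n_patches, channels = 256, 128
    src = torch.randn(n_patches, channels)
    out = src + 0.05 * torch.randn(n_patches, channels)   # translated output keeps content
    print(float(patch_infonce(src, out)))                  # low loss when content is preserved
```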

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2310.04992
  • repo_url: None
  • paper_authors: Jianing Qiu, Jian Wu, Hao Wei, Peilun Shi, Minqing Zhang, Yunyun Sun, Lin Li, Hanruo Liu, Hongyi Liu, Simeng Hou, Yuyang Zhao, Xuehui Shi, Junfang Xian, Xiaoxia Qu, Sirui Zhu, Lijie Pan, Xiaoniao Chen, Xiaojia Zhang, Shuai Jiang, Kebing Wang, Chenlong Yang, Mingqiang Chen, Sujie Fan, Jianhua Hu, Aiguo Lv, Hui Miao, Li Guo, Shujun Zhang, Cheng Pei, Xiaojuan Fan, Jianqin Lei, Ting Wei, Junguo Duan, Chun Liu, Xiaobo Xia, Siqi Xiong, Junhong Li, Benny Lo, Yih Chung Tham, Tien Yin Wong, Ningli Wang, Wu Yuan
  • for: Building a foundation model pre-trained on 3.4 million ophthalmic images to support a broad range of ophthalmic artificial-intelligence applications.
  • methods: Pre-trains on 3.4 million ophthalmic images covering a wide range of ophthalmic diseases, imaging modalities, devices, and demographics.
  • results: The model outperforms ophthalmologists with basic and intermediate expertise in jointly diagnosing 12 common ophthalmic diseases, and surpasses strong baselines on new large-scale ophthalmic diagnosis, segmentation, and detection benchmarks.
    Abstract We present VisionFM, a foundation model pre-trained with 3.4 million ophthalmic images from 560,457 individuals, covering a broad range of ophthalmic diseases, modalities, imaging devices, and demography. After pre-training, VisionFM provides a foundation to foster multiple ophthalmic artificial intelligence (AI) applications, such as disease screening and diagnosis, disease prognosis, subclassification of disease phenotype, and systemic biomarker and disease prediction, with each application enhanced with expert-level intelligence and accuracy. The generalist intelligence of VisionFM outperformed ophthalmologists with basic and intermediate levels in jointly diagnosing 12 common ophthalmic diseases. Evaluated on a new large-scale ophthalmic disease diagnosis benchmark database, as well as a new large-scale segmentation and detection benchmark database, VisionFM outperformed strong baseline deep neural networks. The ophthalmic image representations learned by VisionFM exhibited noteworthy explainability, and demonstrated strong generalizability to new ophthalmic modalities, disease spectrum, and imaging devices. As a foundation model, VisionFM has a large capacity to learn from diverse ophthalmic imaging data and disparate datasets. To be commensurate with this capacity, in addition to the real data used for pre-training, we also generated and leveraged synthetic ophthalmic imaging data. Experimental results revealed that synthetic data that passed visual Turing tests, can also enhance the representation learning capability of VisionFM, leading to substantial performance gains on downstream ophthalmic AI tasks. Beyond the ophthalmic AI applications developed, validated, and demonstrated in this work, substantial further applications can be achieved in an efficient and cost-effective manner using VisionFM as the foundation.

Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling

  • paper_url: http://arxiv.org/abs/2310.04991
  • repo_url: None
  • paper_authors: Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
  • for: Proposes a video-language foundation model for video-to-text generation.
  • methods: Uses multi-modal fusion and fine-grained modality alignment, building on frozen pretrained vision and language modules and large language models to generate both concise and elaborate video descriptions.
  • results: Experiments show that the model understands video content accurately and generates coherent, precise descriptions; the fine-grained alignment objective improves performance (a 4% CIDEr gain on MSR-VTT) with only 13% extra training parameters and no additional inference cost.
    Abstract This paper proposes Video-Teller, a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment to significantly enhance the video-to-text generation task. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions. To effectively integrate visual and auditory information, Video-Teller builds upon the image-based BLIP-2 model and introduces a cascaded Q-Former which fuses information across frames and ASR texts. To better guide video summarization, we introduce a fine-grained modality alignment objective, where the cascaded Q-Former's output embedding is trained to align with the caption/summary embedding created by a pretrained text auto-encoder. Experimental results demonstrate the efficacy of our proposed video-language foundation model in accurately comprehending videos and generating coherent and precise language descriptions. It is worth noting that the fine-grained alignment enhances the model's capabilities (4% improvement of CIDEr score on MSR-VTT) with only 13% extra parameters in training and zero additional cost in inference.
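
The fine-grained alignment objective can be sketched as pulling the video branch's output embedding toward the summary embedding produced by the frozen, pretrained text auto-encoder. The projection layer, dimensions, and cosine-distance loss below are illustrative assumptions rather than Video-Teller's exact design.

```python
# Sketch of aligning a video-branch embedding with a frozen text auto-encoder embedding.
# Illustrative; layer sizes and the cosine loss are assumptions, not Video-Teller's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Projects the video-branch output into the text auto-encoder's embedding space."""
    def __init__(self, video_dim=768, text_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim, text_dim)

    def forward(self, video_embed, caption_embed):
        v = F.normalize(self.proj(video_embed), dim=-1)
        t = F.normalize(caption_embed, dim=-1)      # caption embedding comes from a frozen module
        return (1.0 - (v * t).sum(dim=-1)).mean()   # cosine-distance alignment loss

if __name__ == "__main__":
    head = AlignmentHead()
    video_embed = torch.randn(4, 768, requires_grad=True)   # e.g. pooled video-branch output
    with torch.no_grad():
        caption_embed = torch.randn(4, 512)                  # frozen text auto-encoder embedding
    loss = head(video_embed, caption_embed)
    loss.backward()
    print(float(loss))
```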

Compositional Semantics for Open Vocabulary Spatio-semantic Representations

  • paper_url: http://arxiv.org/abs/2310.04981
  • repo_url: None
  • paper_authors: Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda
  • for: Working toward general-purpose mobile robots that complete tasks without exact human instructions, using large language models (LLMs) and vision-language models (VLMs) for commonsense world knowledge and reasoning-based planning.
  • methods: Proposes latent compositional semantic embeddings z* as a queryable spatio-semantic representation; proves that z* can always be found, that the optimal z* is the centroid of any set Z, and that z* is discoverable by gradient descent from visual appearance and singular descriptions.
  • results: Experiments on four embedding spaces, including CLIP and SBERT, show that z* can represent up to 10 semantics encoded by SBERT and up to 100 semantics for ideal uniformly distributed high-dimensional embeddings; a simple dense VLM trained on COCO-Stuff learns z* for 181 overlapping semantics at 42.23 mIoU while improving conventional non-overlapping open-vocabulary segmentation by +3.48 mIoU.
    Abstract General-purpose mobile robots need to complete tasks without exact human instructions. Large language models (LLMs) is a promising direction for realizing commonsense world knowledge and reasoning-based planning. Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating separability of related and unrelated semantics. We prove that z* is discoverable by iterative optimization by gradient descent from visual appearance and singular descriptions. We experimentally verify our findings on four embedding spaces incl. CLIP and SBERT. Our results show that z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics for ideal uniformly distributed high-dimensional embeddings. We demonstrate that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181 overlapping semantics by 42.23 mIoU, while improving conventional non-overlapping open-vocabulary segmentation performance by +3.48 mIoU compared with a popular SOTA model.
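
The claim that the optimal z* is the centroid of any set Z follows directly if the objective is the summed squared Euclidean distance to the elements of Z (an assumption consistent with, but not spelled out in, the abstract):

```latex
% Optimal z* as the centroid of Z = {z_1, ..., z_n} under squared Euclidean distance.
\[
z^\ast \;=\; \arg\min_{z} \sum_{i=1}^{n} \lVert z - z_i \rVert_2^2 ,
\qquad
\nabla_z \sum_{i=1}^{n} \lVert z - z_i \rVert_2^2 \;=\; 2\sum_{i=1}^{n} \left( z - z_i \right) \;=\; 0
\;\;\Longrightarrow\;\;
z^\ast \;=\; \frac{1}{n}\sum_{i=1}^{n} z_i .
\]
```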

Learning Many-to-Many Mapping for Unpaired Real-World Image Super-resolution and Downscaling

  • paper_url: http://arxiv.org/abs/2310.04964
  • repo_url: None
  • paper_authors: Wanjie Sun, Zhenzhong Chen
  • for: Proposes an unpaired single image super-resolution (SISR) method for real-world images, avoiding the common two-stage strategy of first synthesizing low-resolution images from high-resolution ones and then training the super-resolution model in a supervised manner.
  • methods: Proposes SDFlow, a joint image downscaling and super-resolution model that learns a bidirectional many-to-many mapping between real-world LR and HR images without paired supervision, by decoupling image content and degradation information in a latent space where the content distributions of LR and HR images are matched.
  • results: Experiments show that SDFlow generates diverse, realistic LR and SR images, both quantitatively and qualitatively.
    Abstract Learning based single image super-resolution (SISR) for real-world images has been an active research topic yet a challenging task, due to the lack of paired low-resolution (LR) and high-resolution (HR) training images. Most of the existing unsupervised real-world SISR methods adopt a two-stage training strategy by synthesizing realistic LR images from their HR counterparts first, then training the super-resolution (SR) models in a supervised manner. However, the training of image degradation and SR models in this strategy are separate, ignoring the inherent mutual dependency between downscaling and its inverse upscaling process. Additionally, the ill-posed nature of image degradation is not fully considered. In this paper, we propose an image downscaling and SR model dubbed as SDFlow, which simultaneously learns a bidirectional many-to-many mapping between real-world LR and HR images unsupervisedly. The main idea of SDFlow is to decouple image content and degradation information in the latent space, where content information distribution of LR and HR images is matched in a common latent space. Degradation information of the LR images and the high-frequency information of the HR images are fitted to an easy-to-sample conditional distribution. Experimental results on real-world image SR datasets indicate that SDFlow can generate diverse realistic LR and SR images both quantitatively and qualitatively.