results: 对两个广泛使用的多人解析(MHP)benchmark 进行了广泛的实验,结果表明,提出的 CIParsing 方法可以提高 MHP 模型的泛化能力和鲁棒性,并且可以适应不同的图像样式和外部干扰。Abstract
Existing methods of multiple human parsing (MHP) apply statistical models to acquire underlying associations between images and labeled body parts. However, acquired associations often contain many spurious correlations that degrade model generalization, leading statistical models to be vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causality inspired parsing paradigm termed CIParsing, which follows fundamental causal principles involving two causal properties for human parsing (i.e., the causal diversity and the causal invariance). Specifically, we assume that an input image is constructed by a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former ones cause the generation process of human parsing. Since causal/non-causal factors are unobservable, a human parser in the proposed CIParsing is required to construct latent representations of causal factors and to enforce these representations to satisfy the causal properties. In this way, the human parser is able to rely on causal factors w.r.t. relevant evidence rather than non-causal factors w.r.t. spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, CIParsing is designed in a plug-and-play fashion and can be integrated into any existing MHP models. Extensive experiments conducted on two widely used benchmarks demonstrate the effectiveness and generalizability of our method.
摘要
SG-Former: Self-guided Transformer with Evolving Token Reallocation
results: 在 ImageNet-1K、COCO 和 ADE20K 等任务上达到了当前最佳水平,比 Swin Transformer 高出 \textbf{+1.3% / +2.7 mAP / +3 mIoU},同时具有较低的计算成本和参数数量。Abstract
Vision Transformer has demonstrated impressive success across various vision tasks. However, its heavy computation cost, which grows quadratically with respect to the token sequence length, largely limits its power in handling large feature maps. To alleviate the computation cost, previous works rely on either fine-grained self-attentions restricted to local small regions, or global self-attentions but to shorten the sequence length resulting in coarse granularity. In this paper, we propose a novel model, termed as Self-guided Transformer~(SG-Former), towards effective global self-attention with adaptive fine granularity. At the heart of our approach is to utilize a significance map, which is estimated through hybrid-scale self-attention and evolves itself during training, to reallocate tokens based on the significance of each region. Intuitively, we assign more tokens to the salient regions for achieving fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields. The proposed SG-Former achieves performance superior to state of the art: our base size model achieves \textbf{84.7\%} Top-1 accuracy on ImageNet-1K, \textbf{51.2mAP} bbAP on CoCo, \textbf{52.7mIoU} on ADE20K surpassing the Swin Transformer by \textbf{+1.3\% / +2.7 mAP/ +3 mIoU}, with lower computation costs and fewer parameters. The code is available at \href{https://github.com/OliverRensu/SG-Former}{https://github.com/OliverRensu/SG-Former}
摘要
视觉 Transformer 已在各类视觉任务上表现出色。然而,其计算成本随 token 序列长度呈二次增长,这在很大程度上限制了它处理大特征图的能力。为降低计算成本,以往工作要么将细粒度自注意力限制在局部小区域内,要么采用全局自注意力但缩短序列长度,从而导致粒度变粗。在这篇论文中,我们提出了一种新模型,称为自引导 Transformer(SG-Former),以实现具有自适应细粒度的高效全局自注意力。我们方法的核心是利用一张通过混合尺度自注意力估计、并在训练中不断演化的显著性图(significance map),根据每个区域的重要性重新分配 token:对显著区域分配更多 token 以获得细粒度注意力,对次要区域分配较少 token,以换取效率和全局感受野。SG-Former 在 ImageNet-1K 上达到 \textbf{84.7\%} Top-1 准确率,在 COCO 上达到 \textbf{51.2mAP},在 ADE20K 上达到 \textbf{52.7mIoU},分别超过 Swin Transformer \textbf{+1.3\% / +2.7 mAP / +3 mIoU},同时计算成本更低、参数更少。代码见 \href{https://github.com/OliverRensu/SG-Former}{https://github.com/OliverRensu/SG-Former}。
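As a rough illustration of the token-reallocation idea above, the sketch below keeps the top-scoring tokens at full resolution and pools the remainder into coarse context tokens. The function name, keep ratio, square-grid assumption, and pooling policy are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reallocate_tokens(feat, significance, keep_ratio=0.5):
    """Toy significance-guided token reallocation.

    feat:         (B, N, C) tokens laid out on a sqrt(N) x sqrt(N) grid
    significance: (B, N) non-negative importance score per token
    Salient tokens are kept at full resolution; all tokens are additionally
    pooled to a coarse grid that supplies global context for minor regions.
    """
    B, N, C = feat.shape
    k = max(1, int(N * keep_ratio))
    idx = significance.topk(k, dim=1).indices                      # (B, k)
    fine = torch.gather(feat, 1, idx.unsqueeze(-1).expand(-1, -1, C))

    s = int(N ** 0.5)                                              # assumes square grid, s >= 4
    grid = feat.transpose(1, 2).reshape(B, C, s, s)
    coarse = F.adaptive_avg_pool2d(grid, s // 4).flatten(2).transpose(1, 2)

    return torch.cat([fine, coarse], dim=1)                        # (B, k + (s//4)**2, C)
```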
Towards Real-Time Analysis of Broadcast Badminton Videos
paper_authors: Nitin Nilesh, Tushar Sharma, Anurag Ghosh, C. V. Jawahar
for: 这个研究旨在实时分析羽毛球比赛中玩家的运动动作。
methods: 该方法使用直播启用的视频输入,并从视频中提取玩家的运动轨迹。
results: 该方法可以实时计算玩家在场上覆盖的距离和速度,以及在场上覆盖的区域。Abstract
Analysis of player movements is a crucial subset of sports analysis. Existing player movement analysis methods use recorded videos after the match is over. In this work, we propose an end-to-end framework for player movement analysis for badminton matches on live broadcast match videos. We only use the visual inputs from the match and, unlike other approaches which use multi-modal sensor data, our approach uses only visual cues. We propose a method to calculate the on-court distance covered by both the players from the video feed of a live broadcast badminton match. To perform this analysis, we focus on the gameplay by removing replays and other redundant parts of the broadcast match. We then perform player tracking to identify and track the movements of both players in each frame. Finally, we calculate the distance covered by each player and the average speed with which they move on the court. We further show a heatmap of the areas covered by the player on the court which is useful for analyzing the gameplay of the player. Our proposed framework was successfully used to analyze live broadcast matches in real-time during the Premier Badminton League 2019 (PBL 2019), with commentators and broadcasters appreciating the utility.
摘要
玩家运动分析是体育分析的一个重要子领域。现有的玩家运动分析方法使用比赛结束后的录像。在这项工作中,我们提出了一个端到端框架,用于在羽毛球比赛直播视频上进行玩家运动分析。与其他使用多模态传感器数据的方法不同,我们只使用比赛中的视觉输入。我们提出了一种方法,从羽毛球比赛直播视频中计算双方球员在场上覆盖的距离。为了进行这种分析,我们去除回放和转播中的其他冗余片段,只保留实际对局画面;然后对每帧中的两名球员进行识别和跟踪;最后计算每名球员覆盖的距离及其在场上移动的平均速度。我们还给出了球员在场上活动区域的热图,有助于分析球员的打法。我们提出的框架已在 2019 年 Premier Badminton League(PBL 2019)的直播比赛中成功用于实时分析,解说员和转播方都对其实用性表示认可。
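The distance/speed computation described above reduces to mapping tracked foot positions into court coordinates and summing per-frame displacements. The sketch below assumes a pre-estimated court homography and a fixed broadcast frame rate; it is not the authors' code.

```python
import numpy as np
import cv2

def court_distance_and_speed(track_px, H, fps):
    """Distance covered and average speed for one player.

    track_px: (T, 2) foot position of the player in image pixels per frame
    H:        3x3 homography from image pixels to court coordinates in metres
              (estimated beforehand from the court lines)
    fps:      frames per second of the broadcast feed
    """
    pts = cv2.perspectiveTransform(track_px.reshape(-1, 1, 2).astype(np.float32), H)
    pts = pts.reshape(-1, 2)                          # (T, 2) court coordinates in metres
    step = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    distance = float(step.sum())                      # metres covered
    speed = distance / (len(pts) / fps)               # average speed in m/s
    return distance, speed
```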
Sign Language Translation with Iterative Prototype
results: 实验表明,IP-SLT可以提供更加流畅和合适的翻译结果,并且可以轻松地整合到现有的SLT系统中。Abstract
This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). Our IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video via an iterative refinement manner. Our idea mimics the behavior of human reading, where a sentence can be digested repeatedly, till reaching accurate understanding. Technically, IP-SLT consists of feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype based on the visual feature extracted by the feature extraction module. Then, the iterative refinement module leverages the cross-attention mechanism to polish the previous prototype by aggregating it with the original video feature. Through repeated refinement, the prototype finally converges to a more stable and accurate state, leading to a fluent and appropriate translation. In addition, to leverage the sequential dependence of prototypes, we further propose an iterative distillation loss to compress the knowledge of the final iteration into previous ones. As the autoregressive decoding process is executed only once in inference, our IP-SLT is ready to improve various SLT systems with acceptable overhead. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of the IP-SLT.
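A minimal sketch of the iterative refinement step described above: the prototype repeatedly attends to the video features and is updated residually. The layer sizes, iteration count, and single-attention-layer design are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class IterativeRefiner(nn.Module):
    """Toy iterative prototype refinement via cross-attention."""

    def __init__(self, dim=512, heads=8, iters=3):
        super().__init__()
        self.iters = iters
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prototype, video_feat):
        # prototype:  (B, L, D) initial semantic prototype of the sign video
        # video_feat: (B, T, D) frame-level visual features
        for _ in range(self.iters):
            polished, _ = self.attn(prototype, video_feat, video_feat)
            prototype = self.norm(prototype + polished)   # residual refinement step
        return prototype
```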
摘要
Tumor-Centered Patching for Enhanced Medical Image Segmentation
results: 实验结果显示,这个方法可以改善类别不均衡问题,整体、核心和增强肿瘤的分割分数分别为 0.78、0.76 和 0.71。这些结果显示这种方法具有潜力,可以帮助改善医疗影像诊断系统的效能。Abstract
The realm of medical image diagnosis has advanced significantly with the integration of computer-aided diagnosis and surgical systems. However, challenges persist, particularly in achieving precise image segmentation. While deep learning techniques show potential, obstacles like limited resources, slow convergence, and class imbalance impede their effectiveness. Traditional patch-based methods, though common, struggle to capture intricate tumor boundaries and often lead to redundant samples, compromising computational efficiency and feature quality. To tackle these issues, this research introduces an innovative approach centered on the tumor itself for patch-based image analysis. This novel tumor-centered patching method aims to address the class imbalance and boundary deficiencies, enabling focused and accurate tumor segmentation. By aligning patches with the tumor's anatomical context, this technique enhances feature extraction accuracy and reduces computational load. Experimental results demonstrate improved class imbalance, with segmentation scores of 0.78, 0.76, and 0.71 for whole, core, and enhancing tumors, respectively using a lightweight simple U-Net. This approach shows potential for enhancing medical image segmentation and improving computer-aided diagnosis systems.
摘要
医学影像诊断领域随着计算机辅助诊断和手术系统的结合取得了显著进展。然而,挑战依然存在,尤其是精确的图像分割。深度学习技术展现出潜力,但有限的资源、缓慢的收敛和类别不均衡等问题限制了其效果。传统的基于图块(patch)的方法虽然常用,却往往难以捕捉肿瘤边界的细节,并且经常产生冗余样本,从而降低计算效率和特征质量。为了解决这些问题,本研究提出了一种以肿瘤为中心的创新图块划分方法。该方法通过将图块与肿瘤的解剖位置对齐,提高特征提取精度并减少计算负担。实验结果表明,这种方法可以更好地处理类别不均衡问题:使用一个轻量级的简单 U-Net,整体、核心和增强肿瘤的分割分数分别为 0.78、0.76 和 0.71。这种方法显示出改善医学影像分割的潜力,并有望提升计算机辅助诊断系统。
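A toy version of tumor-centered patching, assuming a binary tumor mask is available: patches are sampled on a grid restricted to the tumor bounding box rather than over the whole scan. The patch size, stride, and gridding policy are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def tumor_centered_patches(volume, mask, patch=64, stride=32):
    """Extract patches around the tumour region of a 3D scan.

    volume: (D, H, W) image; mask: (D, H, W) binary tumour mask.
    Sampling is restricted to the tumour bounding box, so background-only
    patches (a main source of class imbalance) are avoided.
    """
    zs, ys, xs = np.nonzero(mask)
    if len(zs) == 0:
        return []
    lo = np.array([zs.min(), ys.min(), xs.min()])
    hi = np.array([zs.max(), ys.max(), xs.max()])
    half = patch // 2
    patches = []
    for z in range(lo[0], hi[0] + 1, stride):
        for y in range(lo[1], hi[1] + 1, stride):
            for x in range(lo[2], hi[2] + 1, stride):
                c = np.clip([z, y, x], half, np.array(volume.shape) - half)
                sl = tuple(slice(ci - half, ci + half) for ci in c)
                patches.append(volume[sl])
    return patches
```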
NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos
results: 研究结果表明,NPSNet 可以达到当前最佳(state-of-the-art)的性能水平,并且在不同频域和多模态设置下展示了优异的性能。此外,研究还揭示了多模态网络设计和多域训练的优势与不足,为未来的研究提供了有前景的方向。Abstract
Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (\ie, saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and videos are of high-quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating the state-of-the-art performance of our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. {Our dataset and code can be found at \url{https://github.com/Yangziyu/NPF200}.
摘要
随着元宇宙浪潮,非真实感(non-photorealistic)视频的需求日益增长,但相关研究仍然不足。这项工作旨在通过眼动注视(即显著性检测)来理解人类如何感知非真实感视频,这对媒体制作、艺术设计和游戏用户体验都至关重要。为填补该研究方向缺乏合适数据集的空白,我们提出了 NPF-200,这是首个纯非真实感视频的大规模多模态眼动注视数据集。该数据集具有三个特点:1)包含视觉与心理学研究认为必不可少的音轨;2)语义内容多样,视频质量高;3)视频间和视频内都具有丰富的运动。我们进行了一系列分析以深入理解该任务,并比较了多种最新方法,以探索自然图像与非真实感数据之间的差距。此外,由于人类注意系统倾向于以不同频率提取视觉和音频特征,我们提出了一种通用的频率感知多模态非真实感显著性检测模型 NPSNet,在该任务上取得了当前最佳性能。结果揭示了多模态网络设计和多域训练的优势与不足,为后续工作指明了有前景的方向。数据集和代码见 https://github.com/Yangziyu/NPF200。
Mesh Conflation of Oblique Photogrammetric Models using Virtual Cameras and Truncated Signed Distance Field
for: 高精度场地模型的网格合并 (conflation of full-3D oblique photogrammetric meshes for high-resolution site modeling)
methods: 基于虚拟相机场与截断符号距离场(TSDF)的网格合并 (conflation using virtual panoramic cameras and Truncated Signed Distance Fields)
results: 提高了传统方法的精度和效率 (improved accuracy and efficiency compared to traditional methods)Abstract
Conflating/stitching 2.5D raster digital surface models (DSM) into a large one has been a running practice in geoscience applications, however, conflating full-3D mesh models, such as those from oblique photogrammetry, is extremely challenging. In this letter, we propose a novel approach to address this challenge by conflating multiple full-3D oblique photogrammetric models into a single, and seamless mesh for high-resolution site modeling. Given two or more individually collected and created photogrammetric meshes, we first propose to create a virtual camera field (with a panoramic field of view) to incubate virtual spaces represented by Truncated Signed Distance Field (TSDF), an implicit volumetric field friendly for linear 3D fusion; then we adaptively leverage the truncated bound of meshes in TSDF to conflate them into a single and accurate full 3D site model. With drone-based 3D meshes, we show that our approach significantly improves upon traditional methods for model conflations, to drive new potentials to create excessively large and accurate full 3D mesh models in support of geoscience and environmental applications.
摘要
将 2.5D 栅格数字表面模型(DSM)拼接合并为一个大模型是地球科学应用中的常见做法,但合并完整的 3D 网格模型(例如倾斜摄影测量得到的模型)却极具挑战。在这篇快报中,我们提出了一种新方法来应对这一挑战:将多个完整 3D 倾斜摄影测量模型合并为一个无缝网格,用于高分辨率场地建模。给定两个或多个分别采集和构建的摄影测量网格,我们首先构建一个具有全景视场的虚拟相机场,用以生成由截断符号距离场(TSDF)表示的虚拟空间——TSDF 是一种便于线性 3D 融合的隐式体素场;随后我们自适应地利用网格在 TSDF 中的截断边界,将它们合并为一个精确的完整 3D 场地模型。基于无人机获取的 3D 网格,我们证明了该方法显著优于传统的模型合并方法,为构建超大规模、精确的完整 3D 网格模型以支持地球科学与环境应用提供了新的可能。
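The TSDF-based conflation relies on the standard weighted running average over voxels, sketched below. The per-voxel signed distances from a virtual camera and the weighting scheme are assumed inputs; the paper's adaptive handling of truncation bounds is not reproduced here.

```python
import numpy as np

def fuse_tsdf(tsdf, weight, depth_dist, trunc=0.1):
    """Integrate one set of signed distances into a running TSDF volume.

    tsdf, weight: (N,) running truncated signed distance and weight per voxel
    depth_dist:   (N,) signed distance of each voxel to the surface seen from
                  the current (virtual) camera; np.nan where unobserved
    trunc:        truncation band; values are clamped to [-trunc, trunc]
    """
    valid = ~np.isnan(depth_dist)
    d = np.clip(depth_dist[valid], -trunc, trunc)
    w_old = weight[valid]
    tsdf[valid] = (tsdf[valid] * w_old + d) / (w_old + 1.0)   # weighted running average
    weight[valid] = w_old + 1.0
    return tsdf, weight
```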
Select-and-Combine (SAC): A Novel Multi-Stereo Depth Fusion Algorithm for Point Cloud Generation via Efficient Local Markov Netlets
results: 对比 existed方法,该方法提高了F1分数(考虑准确性和完整性)的提升2.07%,并生成了18%更加简洁的点云,同时保持高度准确性。Abstract
Many practical systems for image-based surface reconstruction employ a stereo/multi-stereo paradigm, due to its ability to scale for large scenes and its ease of implementation for out-of-core operations. In this process, multiple and abundant depth maps from stereo matching must be combined and fused into a single, consistent, and clean point cloud. However, the noises and outliers caused by stereo matching and the heterogeneous geometric errors of the poses present a challenge for existing fusion algorithms, since they mostly assume Gaussian errors and predict fused results based on data from local spatial neighborhoods, which may inherit uncertainties from multiple depths resulting in lowered accuracy. In this paper, we propose a novel depth fusion paradigm, that instead of numerically fusing points from multiple depth maps, selects the best depth map per point, and combines them into a single and clean point cloud. This paradigm, called select-and-combine (SAC), is achieved through modeling the point level fusion using local Markov Netlets, a micro-network over point across neighboring views for depth/view selection, followed by a Netlets collapse process for point combination. The Markov Netlets are optimized such that they can inherently leverage spatial consistencies among depth maps of neighboring views, thus they can address errors beyond Gaussian ones. Our experiment results show that our approach outperforms existing depth fusion approaches by increasing the F1 score that considers both accuracy and completeness by 2.07% compared to the best existing method. Finally, our approach generates clearer point clouds that are 18% less redundant while retaining a higher accuracy before fusion.
摘要
许多实用的基于图像的表面重建系统采用立体/多视立体(stereo/multi-stereo)范式,因为它能够扩展到大场景,并且便于实现核外(out-of-core)运算。在这一流程中,必须将立体匹配得到的大量深度图合并融合为单一、一致且干净的点云。然而,立体匹配带来的噪声和离群点,以及位姿中异质的几何误差,给现有的融合算法带来了挑战:这些算法大多假设高斯误差,并基于局部空间邻域的数据预测融合结果,因而可能继承多张深度图的不确定性,导致精度下降。在本文中,我们提出了一种新的深度融合范式:不再对多张深度图中的点做数值融合,而是为每个点选择最佳的深度图,并将它们组合成单一且干净的点云。这一范式称为选择并组合(select-and-combine,SAC),它通过局部马尔可夫 Netlets(一种跨相邻视角、作用于点级的微网络)对深度/视角进行选择来建模点级融合,随后通过 Netlets 坍缩过程完成点的组合。马尔可夫 Netlets 经过优化,能够天然地利用相邻视角深度图之间的空间一致性,从而处理超出高斯假设的误差。实验结果表明,与现有最佳方法相比,我们的方法将兼顾准确性与完整性的 F1 分数提高了 2.07%;同时生成的点云冗余减少 18%,并保持更高的准确率。
Lite-HRNet Plus: Fast and Accurate Facial Landmark Detection
results: 这个方法在两个颜面特征检测 dataset 上进行了实验,并证明了它在比较于传统方法更高的精度下,并且在10M FLOPs的计算复杂度范围内实现了顶尖的性能。Abstract
Facial landmark detection is an essential technology for driver status tracking and has been in demand for real-time estimations. As a landmark coordinate prediction, heatmap-based methods are known to achieve a high accuracy, and Lite-HRNet can achieve a fast estimation. However, with Lite-HRNet, the problem of a heavy computational cost of the fusion block, which connects feature maps with different resolutions, has yet to be solved. In addition, the strong output module used in HRNetV2 is not applied to Lite-HRNet. Given these problems, we propose a novel architecture called Lite-HRNet Plus. Lite-HRNet Plus achieves two improvements: a novel fusion block based on a channel attention and a novel output module with less computational intensity using multi-resolution feature maps. Through experiments conducted on two facial landmark datasets, we confirmed that Lite-HRNet Plus further improved the accuracy in comparison with conventional methods, and achieved a state-of-the-art accuracy with a computational complexity with the range of 10M FLOPs.
摘要
人脸关键点检测是驾驶员状态跟踪的重要技术,对实时估计有很高需求。在关键点坐标预测中,基于热图的方法以高精度著称,而 Lite-HRNet 能够实现快速估计。然而,Lite-HRNet 中连接不同分辨率特征图的融合模块计算代价高昂的问题尚未解决;此外,HRNetV2 中使用的强输出模块也没有应用到 Lite-HRNet 中。针对这些问题,我们提出了一种新架构 Lite-HRNet Plus。它包含两项改进:一种基于通道注意力的新型融合模块,以及一种利用多分辨率特征图、计算量更低的新型输出模块。通过在两个人脸关键点数据集上的实验,我们证实 Lite-HRNet Plus 相比传统方法进一步提升了精度,并在约 10M FLOPs 的计算复杂度下达到了当前最佳精度。
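A sketch of a channel-attention fusion block in the spirit described above: each resolution branch is resized and re-weighted per channel before summation. It assumes all branches share the same channel count; the reduction ratio and gating form are illustrative, not the paper's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFusion(nn.Module):
    """Toy channel-attention fusion of multi-resolution feature maps."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, feats):
        # feats: list of (B, C, Hi, Wi) maps at different resolutions
        h, w = feats[0].shape[-2:]
        out = 0
        for f in feats:
            f = F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
            a = self.gate(f.mean(dim=(2, 3)))          # (B, C) channel attention
            out = out + f * a[:, :, None, None]
        return out
```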
The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures
results: 这篇论文发布了105张带密集标注的高分辨率明场显微图像,包括约19000个实例掩码。此外,还发布了261个精选视频剪辑,包含1293张高分辨率显微图像,以便无监督地理解细胞的运动和形态。Abstract
Segmenting cells and tracking their motion over time is a common task in biomedical applications. However, predicting accurate instance-wise segmentation and cell motions from microscopy imagery remains a challenging task. Using microstructured environments for analyzing single cells in a constant flow of media adds additional complexity. While large-scale labeled microscopy datasets are available, we are not aware of any large-scale dataset, including both cells and microstructures. In this paper, we introduce the trapped yeast cell (TYC) dataset, a novel dataset for understanding instance-level semantics and motions of cells in microstructures. We release $105$ dense annotated high-resolution brightfield microscopy images, including about $19$k instance masks. We also release $261$ curated video clips composed of $1293$ high-resolution microscopy images to facilitate unsupervised understanding of cell motions and morphology. TYC offers ten times more instance annotations than the previously largest dataset, including cells and microstructures. Our effort also exceeds previous attempts in terms of microstructure variability, resolution, complexity, and capturing device (microscopy) variability. We facilitate a unified comparison on our novel dataset by introducing a standardized evaluation strategy. TYC and evaluation code are publicly available under CC BY 4.0 license.
摘要
细胞分割与随时间跟踪其运动是生物医学应用中的常见任务。然而,从显微图像中准确预测实例级分割和细胞运动仍然具有挑战性;在持续流动的培养介质中利用微结构环境分析单个细胞,又进一步增加了复杂度。尽管已有大规模标注的显微图像数据集,但据我们所知,尚无同时包含细胞和微结构的大规模数据集。在这篇论文中,我们介绍了受困酵母细胞(TYC)数据集,这是一个用于理解微结构中细胞实例级语义与运动的新数据集。我们发布了105张带密集标注的高分辨率明场显微图像,包含约1.9万个实例掩码;还发布了261个精选视频剪辑,由1293张高分辨率显微图像组成,以支持对细胞运动和形态的无监督理解。TYC 提供的实例标注数量是此前最大的、同时包含细胞和微结构的数据集的十倍,并且在微结构多样性、分辨率、复杂度以及采集设备(显微镜)多样性方面都超过了以往工作。我们还提出了一种标准化评估策略,以便在这一新数据集上进行统一比较。TYC 数据集和评估代码以 CC BY 4.0 许可公开发布。
Less is More – Towards parsimonious multi-task models using structured sparsity
methods: 本研究在共享层中使用了通道级 l1/l2 组稀疏(channel-wise l1/l2 group sparsity),这种方法不仅可以消除多余的通道(组),还对权重施加惩罚,从而促进所有任务的学习。
results: 对比单任务和多任务实验,在组稀疏约束下,模型的性能保持在相近或更高的水平,同时可以降低模型的内存占用、计算量和推理时间。此外,研究还考察了稀疏化程度的变化对模型性能和组稀疏度的影响。
Group sparsity in Machine Learning (ML) encourages simpler, more interpretable models with fewer active parameter groups. This work aims to incorporate structured group sparsity into the shared parameters of a Multi-Task Learning (MTL) framework, to develop parsimonious models that can effectively address multiple tasks with fewer parameters while maintaining comparable or superior performance to a dense model. Sparsifying the model during training helps decrease the model's memory footprint, computation requirements, and prediction time during inference. We use channel-wise l1/l2 group sparsity in the shared layers of the Convolutional Neural Network (CNN). This approach not only facilitates the elimination of extraneous groups (channels) but also imposes a penalty on the weights, thereby enhancing the learning of all tasks. We compare the outcomes of single-task and multi-task experiments under group sparsity on two publicly available MTL datasets, NYU-v2 and CelebAMask-HQ. We also investigate how changing the sparsification degree impacts both the performance of the model and the sparsity of groups.
摘要
机器学习(ML)中的组稀疏(group sparsity)鼓励更简单、更可解释的模型,使激活的参数组更少。这项工作旨在将结构化组稀疏引入多任务学习(MTL)框架的共享参数中,以构建更简约的模型:用更少的参数有效地处理多个任务,同时保持与稠密模型相当或更优的性能。在训练过程中对模型进行稀疏化,可以降低模型的内存占用、计算需求和推理时的预测时间。我们在卷积神经网络(CNN)的共享层中使用通道级 l1/l2 组稀疏,这种方法不仅有助于剔除多余的组(通道),还对权重施加惩罚,从而提升所有任务的学习效果。我们在两个公开的 MTL 数据集 NYU-v2 和 CelebAMask-HQ 上比较了组稀疏约束下的单任务和多任务实验结果,并研究了稀疏化程度的变化如何影响模型性能与组的稀疏度。
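The channel-wise l1/l2 group penalty described above can be added to the multi-task loss as sketched below; the layer selection and the coefficient are assumptions for illustration.

```python
import torch

def group_sparsity_penalty(model, lam=1e-4):
    """Channel-wise l1/l2 (group lasso) penalty over shared conv layers:
    the l2 norm of each output channel's weights is summed (an l1 over groups),
    which drives whole channels of the shared backbone towards zero."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            w = module.weight                        # (out_ch, in_ch, kH, kW)
            penalty = penalty + w.flatten(1).norm(dim=1).sum()
    return lam * penalty

# usage inside the MTL training loop (sketch):
# loss = sum(task_losses) + group_sparsity_penalty(shared_encoder)
```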
Advancements in Point Cloud Data Augmentation for Deep Learning: A Survey
for: This paper focuses on point cloud data augmentation methods for tasks such as detection, segmentation, and classification in computer vision.
methods: The paper surveys and discusses various point cloud data augmentation methods, categorizing them into a taxonomy framework, and evaluates their potentials and limitations.
results: The paper provides a comprehensive understanding of the current status of point cloud data augmentation and suggests possible future research directions, promoting the wider application and development of point cloud processing techniques.Abstract
Point cloud has a wide range of applications in areas such as autonomous driving, mapping, navigation, scene reconstruction, and medical imaging. Due to its great potentials in these applications, point cloud processing has gained great attention in the field of computer vision. Among various point cloud processing techniques, deep learning (DL) has become one of the mainstream and effective methods for tasks such as detection, segmentation and classification. To reduce overfitting during training DL models and improve model performance especially when the amount and/or diversity of training data are limited, augmentation is often crucial. Although various point cloud data augmentation methods have been widely used in different point cloud processing tasks, there are currently no published systematic surveys or reviews of these methods. Therefore, this article surveys and discusses these methods and categorizes them into a taxonomy framework. Through the comprehensive evaluation and comparison of the augmentation methods, this article identifies their potentials and limitations and suggests possible future research directions. This work helps researchers gain a holistic understanding of the current status of point cloud data augmentation and promotes its wider application and development.
摘要
点云在自动驾驶、测绘、导航、场景重建和医学影像等领域有着广泛的应用。由于其在这些应用中的巨大潜力,点云处理技术在计算机视觉领域受到了广泛关注。在各种点云处理技术中,深度学习(DL)已成为检测、分割和分类等任务的主流且有效的方法之一。在训练 DL 模型时,为了减少过拟合并在训练数据数量和/或多样性有限时提升模型性能,数据增强往往至关重要。尽管各种点云数据增强方法已广泛应用于不同的点云处理任务,但目前尚无针对这些方法的系统性综述。因此,本文对这些方法进行了梳理和讨论,并将其归入一个分类框架中。通过对各种增强方法的全面评估和比较,本文指出了它们的潜力与局限性,并提出了可能的未来研究方向。这项工作有助于研究人员全面了解点云数据增强的现状,并推动其更广泛的应用和发展。
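For context, a few classic input-level point cloud augmentations (rotation about the up-axis, scaling, jitter) are sketched below with typical default parameters; the survey covers many more advanced methods beyond these.

```python
import numpy as np

def augment_point_cloud(points, jitter=0.01, scale=(0.9, 1.1)):
    """Three classic augmentations on an (N, 3) point cloud."""
    theta = np.random.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0, 0, 1]])                        # rotation about the z (up) axis
    points = points @ rot.T
    points = points * np.random.uniform(*scale)        # random uniform scaling
    points = points + np.random.normal(0, jitter, points.shape)   # Gaussian jitter
    return points
```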
results: 实验表明,现有的 CL 方法无法在包含无标注新类样本的后续任务中积累知识,而我们提出的方法通过结合监督信号与无监督信号,使学习智能体能够更好地积累知识并缓解遗忘。此外,我们的方法在该设定下优于为 GCD 改造的强 CL 方法。Abstract
Most of Continual Learning (CL) methods push the limit of supervised learning settings, where an agent is expected to learn new labeled tasks and not forget previous knowledge. However, these settings are not well aligned with real-life scenarios, where a learning agent has access to a vast amount of unlabeled data encompassing both novel (entirely unlabeled) classes and examples from known classes. Drawing inspiration from Generalized Category Discovery (GCD), we introduce a novel framework that relaxes this assumption. Precisely, in any task, we allow for the existence of novel and known classes, and one must use continual version of unsupervised learning methods to discover them. We call this setting Generalized Continual Category Discovery (GCCD). It unifies CL and GCD, bridging the gap between synthetic benchmarks and real-life scenarios. With a series of experiments, we present that existing methods fail to accumulate knowledge from subsequent tasks in which unlabeled samples of novel classes are present. In light of these limitations, we propose a method that incorporates both supervised and unsupervised signals and mitigates the forgetting through the use of centroid adaptation. Our method surpasses strong CL methods adopted for GCD techniques and presents a superior representation learning performance.
摘要
大多数持续学习(CL)方法都是在监督学习设定下不断突破极限:智能体需要学习新的带标注任务,同时不遗忘先前的知识。然而,这种设定与现实场景并不吻合——在现实中,学习智能体可以接触到海量的无标注数据,其中既包含全新(完全无标注)的类别,也包含已知类别的样本。受广义类别发现(GCD)的启发,我们提出了一个放宽上述假设的新框架:在任意任务中,我们都允许同时存在新类别和已知类别,并且必须使用持续版本的无监督学习方法来发现它们。我们将这一设定称为广义持续类别发现(GCCD)。它统一了 CL 与 GCD,弥合了合成基准与现实场景之间的差距。通过一系列实验,我们发现当后续任务中出现新类别的无标注样本时,现有方法无法有效积累知识。针对这些不足,我们提出了一种同时利用监督信号和无监督信号的方法,并通过质心自适应(centroid adaptation)来缓解遗忘。我们的方法超越了为 GCD 改造的强 CL 方法,并表现出更优的表示学习性能。
Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection
paper_authors: Chao Tian, Zikun Zhou, Yuqing Huang, Gaojun Li, Zhenyu He
for: 这篇论文的目的是提出一种新的非配准 RGB-T 行人检测方法,以解决实际中 RGB-T 图像对不匹配的问题。
methods: 这种方法使用了跨模态提议引导特征挖掘(CPFM)机制,提取 RGB 和热图像中的行人特征,并将其组合为精准的行人检测结果。
results: 实验结果表明,这种方法可以有效地处理不匹配的 RGB-T 图像,并且能够提高行人检测的精度和稳定性。Abstract
RGB-Thermal (RGB-T) pedestrian detection aims to locate the pedestrians in RGB-T image pairs to exploit the complementation between the two modalities for improving detection robustness in extreme conditions. Most existing algorithms assume that the RGB-T image pairs are well registered, while in the real world they are not aligned ideally due to parallax or different field-of-view of the cameras. The pedestrians in misaligned image pairs may locate at different positions in two images, which results in two challenges: 1) how to achieve inter-modality complementation using spatially misaligned RGB-T pedestrian patches, and 2) how to recognize the unpaired pedestrians at the boundary. To deal with these issues, we propose a new paradigm for unregistered RGB-T pedestrian detection, which predicts two separate pedestrian locations in the RGB and thermal images, respectively. Specifically, we propose a cross-modality proposal-guided feature mining (CPFM) mechanism to extract the two precise fusion features for representing the pedestrian in the two modalities, even if the RGB-T image pair is unaligned. It enables us to effectively exploit the complementation between the two modalities. With the CPFM mechanism, we build a two-stream dense detector; it predicts the two pedestrian locations in the two modalities based on the corresponding fusion feature mined by the CPFM mechanism. Besides, we design a data augmentation method, named Homography, to simulate the discrepancy in scales and views between images. We also investigate two non-maximum suppression (NMS) methods for post-processing. Favorable experimental results demonstrate the effectiveness and robustness of our method in dealing with unregistered pedestrians with different shifts.
摘要
DISGAN: Wavelet-informed Discriminator Guides GAN to MRI Super-resolution with Noise Cleaning
paper_authors: Qi Wang, Lucas Mahler, Julius Steiglechner, Florian Birk, Klaus Scheffler, Gabriele Lohmann
for: 这篇论文旨在解决 MRI 超分辨率(SR)与去噪这两个基础任务的挑战。
methods: 该论文提出了一种新方法,用单一的深度学习模型同时解决 SR 和去噪两个任务,训练时无需成对的含噪/干净图像。
results: 该模型能够同时实现高质量的 SR 和去噪效果。模型性能在多个 MRI 数据集上进行了评估,包括 Human Connectome Project(HCP)数据集,以及脑肿瘤和癫痫患者的 MRI 数据。Abstract
MRI super-resolution (SR) and denoising tasks are fundamental challenges in the field of deep learning, which have traditionally been treated as distinct tasks with separate paired training data. In this paper, we propose an innovative method that addresses both tasks simultaneously using a single deep learning model, eliminating the need for explicitly paired noisy and clean images during training. Our proposed model is primarily trained for SR, but also exhibits remarkable noise-cleaning capabilities in the super-resolved images. Instead of conventional approaches that introduce frequency-related operations into the generative process, our novel approach involves the use of a GAN model guided by a frequency-informed discriminator. To achieve this, we harness the power of the 3D Discrete Wavelet Transform (DWT) operation as a frequency constraint within the GAN framework for the SR task on magnetic resonance imaging (MRI) data. Specifically, our contributions include: 1) a 3D generator based on residual-in-residual connected blocks; 2) the integration of the 3D DWT with $1\times 1$ convolution into a DWT+conv unit within a 3D Unet for the discriminator; 3) the use of the trained model for high-quality image SR, accompanied by an intrinsic denoising process. We dub the model "Denoising Induced Super-resolution GAN (DISGAN)" due to its dual effects of SR image generation and simultaneous denoising. Departing from the traditional approach of training SR and denoising tasks as separate models, our proposed DISGAN is trained only on the SR task, but also achieves exceptional performance in denoising. The model is trained on 3D MRI data from dozens of subjects from the Human Connectome Project (HCP) and further evaluated on previously unseen MRI data from subjects with brain tumours and epilepsy to assess its denoising and SR performance.
摘要
MRI 超分辨率(SR)和去噪是深度学习领域的基础挑战,传统上被视为两个独立的任务,各自需要成对的训练数据。在这篇论文中,我们提出了一种创新方法,用单一的深度学习模型同时解决这两个任务,训练时无需显式成对的含噪图像和干净图像。我们的模型主要针对 SR 任务训练,但在超分辨结果中同样表现出显著的去噪能力。与在生成过程中引入频率相关操作的传统做法不同,我们的新方法使用一个由频率信息引导的判别器来指导 GAN 模型。为此,我们在 GAN 框架中将 3D 离散小波变换(DWT)作为频率约束,用于 MRI 数据的 SR 任务。具体而言,我们的贡献包括:1)基于残差嵌套残差(residual-in-residual)连接块的 3D 生成器;2)在判别器的 3D U-Net 中,将 3D DWT 与 $1\times 1$ 卷积组合为 DWT+conv 单元;3)利用训练好的模型进行高质量图像 SR,并伴随内在的去噪过程。由于其兼具 SR 图像生成与同步去噪的双重效果,我们将该模型命名为“去噪诱导超分辨率 GAN(DISGAN)”。不同于将 SR 与去噪分别训练为独立模型的传统做法,我们提出的 DISGAN 仅在 SR 任务上训练,却在去噪方面同样取得了出色的表现。模型在 Human Connectome Project(HCP)数十名受试者的 3D MRI 数据上训练,并在此前未见过的脑肿瘤和癫痫患者的 MRI 数据上进一步评估其去噪与 SR 性能。
Understanding Dark Scenes by Contrasting Multi-Modal Observations
results: 在多种任务上,包括不同的光照条件和图像模式,实验显示我们的方法可以效果地提高基于多Modal Image的黑色场景理解,并且与之前的方法进行比较而显示我们的状态精度性能。Abstract
Understanding dark scenes based on multi-modal image data is challenging, as both the visible and auxiliary modalities provide limited semantic information for the task. Previous methods focus on fusing the two modalities but neglect the correlations among semantic classes when minimizing losses to align pixels with labels, resulting in inaccurate class predictions. To address these issues, we introduce a supervised multi-modal contrastive learning approach to increase the semantic discriminability of the learned multi-modal feature spaces by jointly performing cross-modal and intra-modal contrast under the supervision of the class correlations. The cross-modal contrast encourages same-class embeddings from across the two modalities to be closer and pushes different-class ones apart. The intra-modal contrast forces same-class or different-class embeddings within each modality to be together or apart. We validate our approach on a variety of tasks that cover diverse light conditions and image modalities. Experiments show that our approach can effectively enhance dark scene understanding based on multi-modal images with limited semantics by shaping semantic-discriminative feature spaces. Comparisons with previous methods demonstrate our state-of-the-art performance. Code and pretrained models are available at https://github.com/palmdong/SMMCL.
摘要
基于多模态图像数据理解暗场景具有挑战性,因为可见光模态和辅助模态都只能为该任务提供有限的语义信息。以往的方法主要关注两种模态的融合,却在通过最小化损失将像素与标签对齐时忽略了语义类别之间的相关性,导致类别预测不准确。为解决这些问题,我们提出了一种有监督的多模态对比学习方法:在类别相关性的监督下,联合进行跨模态对比和模态内对比,以增强所学多模态特征空间的语义判别能力。跨模态对比使来自两种模态的同类嵌入彼此靠近,不同类嵌入彼此远离;模态内对比则使每种模态内部的同类嵌入聚拢、不同类嵌入分开。我们在覆盖多种光照条件和图像模态的多项任务上验证了该方法。实验表明,通过塑造具有语义判别性的特征空间,我们的方法能够有效提升基于语义有限的多模态图像的暗场景理解。与以往方法的比较表明我们达到了当前最佳性能。代码和预训练模型见 https://github.com/palmdong/SMMCL。
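A simplified stand-in for the supervised cross-modal contrast described above: embeddings from the two modalities are pulled together when they share a class label and pushed apart otherwise. The temperature and exact loss form are assumptions; the paper additionally uses intra-modal contrast.

```python
import torch
import torch.nn.functional as F

def cross_modal_supervised_contrast(emb_a, emb_b, labels, tau=0.1):
    """Toy supervised cross-modal contrastive loss.

    emb_a, emb_b: (N, D) embeddings of corresponding samples from two modalities
    labels:       (N,) class labels supervising which pairs are positives
    """
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    logits = a @ b.t() / tau                                   # (N, N) similarities
    pos = labels[:, None].eq(labels[None, :]).float()          # same-class mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```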
SILT: Shadow-aware Iterative Label Tuning for Learning to Detect Shadows from Noisy Labels
paper_authors: Han Yang, Tianyu Wang, Xiaowei Hu, Chi-Wing Fu
for: 提高阴影检测模型的性能, addresses the issue of missing or mislabeled shadows in existing shadow detection datasets.
methods: 提出了一种名为SILT的shadow-aware iterative label tuning框架, which explicitly considers noise in shadow labels and trains the deep model in a self-training manner.
results: 通过对SBU数据集的测试集重新标注和多种实验,our results show that even a simple U-Net trained with SILT can outperform all state-of-the-art methods by a large margin. When trained on SBU / UCF / ISTD, our network can successfully reduce the Balanced Error Rate by 25.2% / 36.9% / 21.3% over the best state-of-the-art method.Abstract
Existing shadow detection datasets often contain missing or mislabeled shadows, which can hinder the performance of deep learning models trained directly on such data. To address this issue, we propose SILT, the Shadow-aware Iterative Label Tuning framework, which explicitly considers noise in shadow labels and trains the deep model in a self-training manner. Specifically, we incorporate strong data augmentations with shadow counterfeiting to help the network better recognize non-shadow regions and alleviate overfitting. We also devise a simple yet effective label tuning strategy with global-local fusion and shadow-aware filtering to encourage the network to make significant refinements on the noisy labels. We evaluate the performance of SILT by relabeling the test set of the SBU dataset and conducting various experiments. Our results show that even a simple U-Net trained with SILT can outperform all state-of-the-art methods by a large margin. When trained on SBU / UCF / ISTD, our network can successfully reduce the Balanced Error Rate by 25.2% / 36.9% / 21.3% over the best state-of-the-art method.
摘要
现有的阴影检测数据集中常常存在缺失或错误标注的阴影,这会妨碍直接在这类数据上训练的深度学习模型的性能。为解决这一问题,我们提出了 SILT(Shadow-aware Iterative Label Tuning,阴影感知迭代标签调优)框架,它显式考虑阴影标签中的噪声,并以自训练的方式训练深度模型。具体来说,我们将强数据增强与阴影伪造(shadow counterfeiting)相结合,帮助网络更好地识别非阴影区域并缓解过拟合。我们还设计了一种简单而有效的标签调优策略,结合全局-局部融合与阴影感知过滤,促使网络对噪声标签做出显著修正。我们通过对 SBU 数据集的测试集重新标注并开展多种实验来评估 SILT 的性能。结果显示,即使是用 SILT 训练的简单 U-Net,也能大幅超越所有最新方法;在 SBU / UCF / ISTD 上训练时,我们的网络相对于最佳已有方法,将平衡错误率(Balanced Error Rate)分别降低了 25.2% / 36.9% / 21.3%。
HarvestNet: A Dataset for Detecting Smallholder Farming Activity Using Harvest Piles and Remote Sensing
results: 研究结果表明,使用最新的模型可以在手动标注数据上达到80%的分类性能,并在真实数据上达到90%、98%的准确率。此外,研究还发现了一些已有的覆盖地图中的偏差,并在提供了56,621公顷的新的农田面积。Abstract
Small farms contribute to a large share of the productive land in developing countries. In regions such as sub-Saharan Africa, where 80% of farms are small (under 2 ha in size), the task of mapping smallholder cropland is an important part of tracking sustainability measures such as crop productivity. However, the visually diverse and nuanced appearance of small farms has limited the effectiveness of traditional approaches to cropland mapping. Here we introduce a new approach based on the detection of harvest piles characteristic of many smallholder systems throughout the world. We present HarvestNet, a dataset for mapping the presence of farms in the Ethiopian regions of Tigray and Amhara during 2020-2023, collected using expert knowledge and satellite images, totaling 7k hand-labeled images and 2k ground collected labels. We also benchmark a set of baselines including SOTA models in remote sensing with our best models having around 80% classification performance on hand labelled data and 90%, 98% accuracy on ground truth data for Tigray, Amhara respectively. We also perform a visual comparison with a widely used pre-existing coverage map and show that our model detects an extra 56,621 hectares of cropland in Tigray. We conclude that remote sensing of harvest piles can contribute to more timely and accurate cropland assessments in food insecure region.
摘要
小型农场在发展中国家的生产用地中占有很大比重。在撒哈拉以南非洲等地区,80% 的农场面积不足 2 公顷,因此绘制小农户耕地地图是跟踪作物生产力等可持续性指标的重要环节。然而,小型农场外观多样且差异细微,限制了传统耕地制图方法的有效性。本文提出了一种新方法,基于检测世界各地许多小农户系统所特有的收割堆(harvest piles)。我们发布了 HarvestNet 数据集,用于绘制 2020-2023 年埃塞俄比亚提格雷(Tigray)和阿姆哈拉(Amhara)地区的农场分布;该数据集结合专家知识与卫星图像采集,包含 7000 张人工标注图像和 2000 个实地采集标签。我们还对一组基线方法(包括遥感领域的最新模型)进行了基准测试:最佳模型在人工标注数据上的分类性能约为 80%,在提格雷和阿姆哈拉的实地真值数据上的准确率分别达到 90% 和 98%。我们与一份广泛使用的既有覆盖地图进行了可视化比较,结果显示我们的模型在提格雷额外检测出 56,621 公顷耕地。我们的结论是,对收割堆的遥感检测有助于在粮食短缺地区实现更及时、更准确的耕地评估。
Manipulating Embeddings of Stable Diffusion Prompts
results: 实验结果表明该方法的可行性。Abstract
Generative text-to-image models such as Stable Diffusion allow users to generate images based on a textual description, the prompt. Changing the prompt is still the primary means for the user to change a generated image as desired. However, changing the image by reformulating the prompt remains a difficult process of trial and error, which has led to the emergence of prompt engineering as a new field of research. We propose and analyze methods to change the embedding of a prompt directly instead of the prompt text. It allows for more fine-grained and targeted control that takes into account user intentions. Our approach treats the generative text-to-image model as a continuous function and passes gradients between the image space and the prompt embedding space. By addressing different user interaction problems, we can apply this idea in three scenarios: (1) Optimization of a metric defined in image space that could measure, for example, image style. (2) Assistance of users in creative tasks by enabling them to navigate the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that the user has seen in a particular seed but finds difficult to describe in the prompt. Our experiments demonstrate the feasibility of the described methods.
摘要
以 Stable Diffusion 为代表的文本到图像生成模型允许用户根据文本描述(即提示词,prompt)生成图像。修改提示词仍然是用户按需改变生成图像的主要手段。然而,通过改写提示词来修改图像仍是一个反复试错的困难过程,这也催生了提示工程(prompt engineering)这一新的研究方向。我们提出并分析了直接修改提示词嵌入(embedding)而非提示词文本的方法,从而实现更细粒度、更有针对性、并能顾及用户意图的控制。我们的方法将文本到图像生成模型视为连续函数,在图像空间与提示词嵌入空间之间传递梯度。针对不同的用户交互问题,这一思想可应用于三种场景:(1)优化定义在图像空间中的某个指标,例如图像风格;(2)辅助用户完成创意任务,让其沿若干“邻近”提示词嵌入的方向在图像空间中漫游;(3)将用户在某个特定随机种子中看到、却难以用文字描述的信息写入提示词嵌入。实验结果表明了上述方法的可行性。
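The gradient flow between image space and embedding space (scenario 1 above) can be sketched as a plain optimization loop over the prompt embedding. The differentiable `generate` and `metric` callables are assumed to be available (e.g., a few denoising steps kept on the autograd graph); this is not the authors' implementation.

```python
import torch

def optimize_prompt_embedding(prompt_emb, generate, metric, steps=20, lr=0.05):
    """Gradient-based editing of a prompt embedding (instead of the prompt text).

    prompt_emb: text-encoder output for the prompt, e.g. shape (1, 77, 768)
    generate:   assumed differentiable function mapping an embedding to an image
    metric:     scalar image-space objective to maximise (e.g. a style score)
    """
    emb = prompt_emb.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        img = generate(emb)
        loss = -metric(img)          # maximise the metric
        opt.zero_grad()
        loss.backward()              # gradients flow from image space to embedding space
        opt.step()
    return emb.detach()
```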
DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration
methods: 该论文提出了一种名为分布调整 WITH semantic calibration(DR-Tune)的新方法,该方法通过对下游任务头部进行分布调整,来防止过拟合而允许下游Encoder sufficient training。此外,该方法还提出了一种名为semantic calibration(SC)模块,用于对先验知识和下游知识之间的差异进行填充。
results: 该论文通过对多个图像分类 datasets 进行广泛的实验,证明了 DR-Tune 可以在不同的预训练策略下提高表达力的性能。Abstract
The visual models pretrained on large-scale benchmarks encode general knowledge and prove effective in building more powerful representations for downstream tasks. Most existing approaches follow the fine-tuning paradigm, either by initializing or regularizing the downstream model based on the pretrained one. The former fails to retain the knowledge in the successive fine-tuning phase, thereby prone to be over-fitting, and the latter imposes strong constraints to the weights or feature maps of the downstream model without considering semantic drift, often incurring insufficient optimization. To deal with these issues, we propose a novel fine-tuning framework, namely distribution regularization with semantic calibration (DR-Tune). It employs distribution regularization by enforcing the downstream task head to decrease its classification error on the pretrained feature distribution, which prevents it from over-fitting while enabling sufficient training of downstream encoders. Furthermore, to alleviate the interference by semantic drift, we develop the semantic calibration (SC) module to align the global shape and class centers of the pretrained and downstream feature distributions. Extensive experiments on widely used image classification datasets show that DR-Tune consistently improves the performance when combing with various backbones under different pretraining strategies. Code is available at: https://github.com/weeknan/DR-Tune.
摘要
在大规模基准上预训练的视觉模型编码了通用知识,被证明有助于为下游任务构建更强大的表示。现有方法大多遵循微调范式:要么用预训练模型初始化下游模型,要么以预训练模型对下游模型进行正则化。前者无法在后续微调阶段保留预训练知识,因此容易过拟合;后者对下游模型的权重或特征图施加强约束,却没有考虑语义漂移,往往导致优化不足。为解决这些问题,我们提出了一种新的微调框架——带语义校准的分布正则化(DR-Tune)。它通过要求下游任务头在预训练特征分布上降低分类误差来实施分布正则化,从而在防止过拟合的同时保证下游编码器得到充分训练。此外,为缓解语义漂移带来的干扰,我们设计了语义校准(SC)模块,用于对齐预训练与下游特征分布的整体形态和类中心。在多个常用图像分类数据集上的大量实验表明,DR-Tune 在不同的预训练策略和各种骨干网络上都能稳定提升性能。代码见 https://github.com/weeknan/DR-Tune。
Head-Tail Cooperative Learning Network for Unbiased Scene Graph Generation
paper_authors: Lei Wang, Zejian Yuan, Yao Lu, Badong Chen
for: solves the challenge of head-biased prediction in scene graph generation by proposing a model-agnostic Head-Tail Collaborative Learning (HTCL) network.
methods: includes head-prefer and tail-prefer feature representation branches that collaborate to achieve accurate recognition of both head and tail predicates, and a self-supervised learning approach to enhance the prediction ability of the tail-prefer feature representation branch.
results: achieves higher mean Recall with a minimal sacrifice in Recall and achieves a new state-of-the-art overall performance on various SGG models on VG150, Open Images V6 and GQA200 datasets.Abstract
Scene Graph Generation (SGG) as a critical task in image understanding, facing the challenge of head-biased prediction caused by the long-tail distribution of predicates. However, current unbiased SGG methods can easily prioritize improving the prediction of tail predicates while ignoring the substantial sacrifice in the prediction of head predicates, leading to a shift from head bias to tail bias. To address this issue, we propose a model-agnostic Head-Tail Collaborative Learning (HTCL) network that includes head-prefer and tail-prefer feature representation branches that collaborate to achieve accurate recognition of both head and tail predicates. We also propose a self-supervised learning approach to enhance the prediction ability of the tail-prefer feature representation branch by constraining tail-prefer predicate features. Specifically, self-supervised learning converges head predicate features to their class centers while dispersing tail predicate features as much as possible through contrast learning and head center loss. We demonstrate the effectiveness of our HTCL by applying it to various SGG models on VG150, Open Images V6 and GQA200 datasets. The results show that our method achieves higher mean Recall with a minimal sacrifice in Recall and achieves a new state-of-the-art overall performance. Our code is available at https://github.com/wanglei0618/HTCL.
摘要
场景图生成(SGG)是图像理解中的关键任务,却面临由谓词长尾分布引起的头部偏置预测问题。然而,现有的去偏 SGG 方法往往优先提升尾部谓词的预测,而忽视了头部谓词预测所付出的巨大代价,导致模型从头部偏置转向尾部偏置。为解决这一问题,我们提出了与模型无关的头尾协同学习(HTCL)网络,它包含偏向头部和偏向尾部的两个特征表示分支,二者协作以实现对头部和尾部谓词的准确识别。我们还提出了一种自监督学习方法,通过约束尾部偏好的谓词特征来增强尾部分支的预测能力:借助对比学习和头部中心损失,使头部谓词特征向其类中心收敛,同时让尾部谓词特征尽可能分散。我们将 HTCL 应用于多种 SGG 模型,并在 VG150、Open Images V6 和 GQA200 数据集上验证了其有效性。结果显示,我们的方法以极小的 Recall 代价换取更高的平均召回率(mean Recall),达到了新的整体最佳性能。代码见 https://github.com/wanglei0618/HTCL。
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
results: 实验结果表明,通过结合现状最佳2D referring表达理解模型和对象跟踪算法,可以实现视频中referenced对象的跟踪,包括对象离屏或多个相似对象出现在视频中。Abstract
Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and behave following intuitive text instructions. Such capability is of necessity for glass-devices or autonomous robots to localize referred objects in the real-world. In the conventional referring expression comprehension tasks of images, however, datasets are mostly constructed based on the web-crawled data and don't reflect diverse real-world structures on the task of grounding textual expressions in diverse objects in the real world. Recently, a massive-scale egocentric video dataset of Ego4D was proposed. Ego4D covers around the world diverse real-world scenes including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, manufacturing, etc. Based on egocentric videos of Ego4D, we constructed a broad coverage of the video-based referring expression comprehension dataset: RefEgo. Our dataset includes more than 12k video clips and 41 hours for video-based referring expression comprehension annotation. In experiments, we combine the state-of-the-art 2D referring expression comprehension models with the object tracking algorithm, achieving the video-wise referred object tracking even in difficult conditions: the referred object becomes out-of-frame in the middle of the video or multiple similar objects are presented in the video.
摘要
从第一人称视角将文本表述对应到场景物体上,是构建能感知周围环境、并按照直观文本指令行动的智能体时一项极具挑战的能力。这种能力对于眼镜类设备或自主机器人在真实世界中定位被指代物体来说是必需的。然而,在传统的图像指代表达理解任务中,数据集大多基于网络爬取数据构建,未能反映在真实世界中对各类物体进行文本表述定位时的多样化结构。近期,大规模第一人称视角视频数据集 Ego4D 被提出,它覆盖了世界各地多样的真实场景,包括购物、烹饪、散步、交谈、制造等大量室内外情境。基于 Ego4D 的第一人称视频,我们构建了覆盖面广的视频指代表达理解数据集 RefEgo,包含超过 12k 个视频剪辑和 41 小时的视频指代表达理解标注。在实验中,我们将最新的 2D 指代表达理解模型与目标跟踪算法结合,即使在困难条件下(被指代物体在视频中途移出画面,或视频中出现多个相似物体),也能实现整段视频上对被指代物体的跟踪。
Distribution-Aware Calibration for Object Detection with Noisy Bounding Boxes
results: 在大规模含噪图像数据集(Pascal VOC 和 MS-COCO)上进行了广泛的实验,结果表明,DISCO 能够达到当前最佳的检测性能,在高噪声水平下尤为突出。Abstract
Large-scale well-annotated datasets are of great importance for training an effective object detector. However, obtaining accurate bounding box annotations is laborious and demanding. Unfortunately, the resultant noisy bounding boxes could cause corrupt supervision signals and thus diminish detection performance. Motivated by the observation that the real ground-truth is usually situated in the aggregation region of the proposals assigned to a noisy ground-truth, we propose DIStribution-aware CalibratiOn (DISCO) to model the spatial distribution of proposals for calibrating supervision signals. In DISCO, spatial distribution modeling is performed to statistically extract the potential locations of objects. Based on the modeled distribution, three distribution-aware techniques, i.e., distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est), are developed to improve classification, localization, and interpretability, respectively. Extensive experiments on large-scale noisy image datasets (i.e., Pascal VOC and MS-COCO) demonstrate that DISCO can achieve state-of-the-art detection performance, especially at high noise levels.
摘要
Motivated by the observation that real ground truth is usually located in the aggregation region of proposals assigned to noisy ground truth, we propose Distribution-aware Calibration (DISCO) to model the spatial distribution of proposals for calibrating supervision signals. DISCO uses spatial distribution modeling to statistically extract potential object locations. Based on the modeled distribution, we develop three distribution-aware techniques: distribution-aware proposal augmentation (DA-Aug), distribution-aware box refinement (DA-Ref), and distribution-aware confidence estimation (DA-Est) to improve classification, localization, and interpretability, respectively.Extensive experiments on large-scale noisy image datasets (Pascal VOC and MS-COCO) show that DISCO achieves state-of-the-art detection performance, especially at high noise levels.
results: 对于 six state-of-the-art 方法进行了比较,结果表明 StofNet 具有较高的精度、可靠性和模型复杂度。 我们还发布了 SToF-Chirp 数据集, captured by 一架飞行式 ultrasound 探测器,并提供了相关的代码。Abstract
Time of Flight (ToF) is a prevalent depth sensing technology in the fields of robotics, medical imaging, and non-destructive testing. Yet, ToF sensing faces challenges from complex ambient conditions making an inverse modelling from the sparse temporal information intractable. This paper highlights the potential of modern super-resolution techniques to learn varying surroundings for a reliable and accurate ToF detection. Unlike existing models, we tailor an architecture for sub-sample precise semi-global signal localization by combining super-resolution with an efficient residual contraction block to balance between fine signal details and large scale contextual information. We consolidate research on ToF by conducting a benchmark comparison against six state-of-the-art methods for which we employ two publicly available datasets. This includes the release of our SToF-Chirp dataset captured by an airborne ultrasound transducer. Results showcase the superior performance of our proposed StofNet in terms of precision, reliability and model complexity. Our code is available at https://github.com/hahnec/stofnet.
摘要
飞行时间(ToF)是机器人、医学成像和无损检测等领域中广泛应用的深度感知技术。然而,复杂的环境条件使得从稀疏的时间信息中进行逆建模变得难以处理,给 ToF 感知带来挑战。本文强调了现代超分辨率技术在学习多变环境方面的潜力,以实现可靠且准确的 ToF 检测。与现有模型不同,我们为亚采样精度的半全局信号定位量身定制了网络结构:将超分辨率与高效的残差收缩模块相结合,在精细信号细节与大尺度上下文信息之间取得平衡。我们在两个公开数据集上与六种最新方法进行了基准比较,以整合 ToF 方向的研究;其中包括我们发布的、由机载超声换能器采集的 SToF-Chirp 数据集。结果表明,我们提出的 StofNet 在精度、可靠性和模型复杂度方面均表现更优。代码见 https://github.com/hahnec/stofnet。
Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition
results: 该模型在RGB-D动作和姿势识别任务上表现出色,比之前的方法更高效。Abstract
RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi- Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.
摘要
MFST模型包括3D中心差异卷积核心(CDC-Stem)模块和多个因子化空间时间阶段。CDC-Stem模块使得细节的时间感知更加细化,而多个层次的因子化空间时间阶段通过多 scales卷积和Transformer(MSC-Trans)混合块和Weight-shared Multi-Scale Transformer(WMS-Trans)块来构建维度独立的高级semantic primitives。具体来说,CDC-Stem模块首先捕捉最低级的空间时间特征,然后将其传递给以下因子化空间时间阶段,以 capture层次的空间和时间特征。这些创新的设计结合使得RGB-D动作和姿势识别达到了state-of-the-art的表现。
Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment
paper_authors: Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, Weisi Lin
for: This paper focuses on improving image quality assessment (IQA) tasks, which are challenging due to diverse image contents and limited data availability.
methods: The authors use a combination of pre-trained convolutional neural networks (CNNs) and a local distortion extractor/injector to extract and inject local distortion features into a large-scale pre-trained vision transformer (ViT) model.
results: The proposed method achieves state-of-the-art performance on popular IQA datasets, indicating that IQA can benefit from stronger high-level features drawn from large-scale pre-trained models.
Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge, owing to the intricate distortion conditions, diverse image contents, and limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which greatly benefit from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law in high-level tasks is also applicable to IQA task which is closely related to low-level clues. In this paper, we demonstrate that with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better in IQA tasks. Specifically, for the lack of local distortion structure and inductive bias of vision transformer (ViT), alongside the large-scale pretrained ViT, we use another pretrained convolution neural network (CNN), which is well known for capturing the local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into ViT. By only training the extractor and injector, our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
摘要
图像质量评估(IQA)是计算机视觉领域的一项基础任务,但由于失真条件复杂、图像内容多样以及数据有限,它仍然是一个尚未解决的挑战。近期,社区见证了大量大规模预训练基础模型的出现,这些模型从大幅增加的数据量和参数容量中获益良多。然而,高层任务中的扩展规律(scaling law)是否同样适用于与低层线索密切相关的 IQA 任务,仍是一个开放问题。在这篇论文中,我们证明:在恰当注入局部失真特征的前提下,规模更大、参数固定的预训练基础模型在 IQA 任务上表现更好。具体来说,针对视觉 Transformer(ViT)缺乏局部失真结构和归纳偏置的问题,我们在大规模预训练 ViT 之外,另外使用一个擅长捕捉局部结构的预训练卷积神经网络(CNN)来提取多尺度图像特征。进一步地,我们提出局部失真提取器,从预训练 CNN 中获得局部失真特征,并提出局部失真注入器,将这些特征注入到 ViT 中。仅需训练提取器和注入器,我们的方法即可受益于强大基础模型中的丰富知识,并在常用的 IQA 数据集上取得当前最佳性能。这表明 IQA 不仅是低层问题,也能从大规模预训练模型提供的更强高层特征中获益。
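A sketch of the extractor/injector idea: multi-scale CNN features are projected and injected into the frozen ViT tokens via cross-attention, and only this module would be trained. The dimensions and single-layer design are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LocalDistortionInjector(nn.Module):
    """Toy injection of CNN local-distortion features into frozen ViT tokens."""

    def __init__(self, vit_dim=768, cnn_dim=256, heads=8):
        super().__init__()
        self.proj = nn.Linear(cnn_dim, vit_dim)
        self.attn = nn.MultiheadAttention(vit_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vit_dim)

    def forward(self, vit_tokens, cnn_feat):
        # vit_tokens: (B, N, vit_dim) from the frozen, pretrained ViT
        # cnn_feat:   (B, M, cnn_dim) flattened multi-scale CNN features
        local = self.proj(cnn_feat)
        injected, _ = self.attn(vit_tokens, local, local)   # tokens attend to local cues
        return self.norm(vit_tokens + injected)
```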
Progressive Feature Mining and External Knowledge-Assisted Text-Pedestrian Image Retrieval
results: 对三个挑战性数据集进行了广泛的实验,并证明了提议方法的有效性和超越性,甚至在大规模数据集上超越大规模模型基础方法。Abstract
Text-Pedestrian Image Retrieval aims to use the text describing pedestrian appearance to retrieve the corresponding pedestrian image. This task involves not only modality discrepancy, but also the challenge of the textual diversity of pedestrians with the same identity. Although progress has been made in text-pedestrian image retrieval, existing methods do not comprehensively address these problems. To this end, this paper proposes a progressive feature mining and external knowledge-assisted feature purification method. Specifically, we use a progressive mining mode to enable the model to mine discriminative features from neglected information, thereby avoiding the loss of discriminative information and improving the expression ability of features. In addition, to further reduce the negative impact of modality discrepancy and text diversity on cross-modal matching, we propose to use other sample knowledge of the same modality, i.e., external knowledge, to enhance identity-consistent features and weaken identity-inconsistent features. This process purifies features and alleviates the interference caused by textual diversity and negatively correlated sample features within the same modality. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, and the retrieval performance even surpasses that of large-scale model-based methods on large-scale datasets.
摘要
文本行人图像检索的目标是使用描述行人外观的文本来检索对应的行人图像。这个任务不仅面临模态差异问题,还面临同一身份行人文本描述多样性的挑战。现有研究已取得一定进展,但这些方法并未全面考虑上述问题。为此,本文提出了一种渐进式特征挖掘和外部知识辅助特征纯化方法。具体来说,我们使用渐进式挖掘模式,让模型从被忽略的信息中挖掘判别性特征,以避免判别信息的丢失并提高特征的表达能力。此外,为了进一步减少模态差异和文本多样性对跨模态匹配的负面影响,我们提出使用同一模态其他样本的知识(即外部知识)来增强身份一致特征并弱化身份不一致特征。这个过程纯化了特征,缓解了文本多样性和同模态负相关样本特征带来的干扰。我们在三个挑战性 dataset 上进行了广泛的实验,结果表明我们的方法有效且优于已有方法,甚至在大规模dataset上超越了基于大规模模型的方法。
RankMixup: Ranking-Based Mixup Training for Network Calibration
paper_authors: Jongyoun Noh, Hyekang Park, Junghyup Lee, Bumsub Ham for:这篇论文的目的是为了精确地估算深度神经网络的置信度,特别是在实际应用中使用深度神经网络时。methods:这篇论文使用了mixup来帮助网络进行训练,并且提出了一个新的框架,即RankMixup,以解决mixup中标签混合的问题。results:实验结果显示,RankMixup可以对网络进行更好的校准,并且可以实现更好的置信度估算。Abstract
Network calibration aims to accurately estimate the level of confidences, which is particularly important for employing deep neural networks in real-world systems. Recent approaches leverage mixup to calibrate the network's predictions during training. However, they do not consider the problem that mixtures of labels in mixup may not accurately represent the actual distribution of augmented samples. In this paper, we present RankMixup, a novel mixup-based framework alleviating the problem of the mixture of labels for network calibration. To this end, we propose to use an ordinal ranking relationship between raw and mixup-augmented samples as an alternative supervisory signal to the label mixtures for network calibration. We hypothesize that the network should estimate a higher level of confidence for the raw samples than the augmented ones (Fig.1). To implement this idea, we introduce a mixup-based ranking loss (MRL) that encourages lower confidences for augmented samples compared to raw ones, maintaining the ranking relationship. We also propose to leverage the ranking relationship among multiple mixup-augmented samples to further improve the calibration capability. Augmented samples with larger mixing coefficients are expected to have higher confidences and vice versa (Fig.1). That is, the order of confidences should be aligned with that of mixing coefficients. To this end, we introduce a novel loss, M-NDCG, in order to reduce the number of misaligned pairs of the coefficients and confidences. Extensive experimental results on standard benchmarks for network calibration demonstrate the effectiveness of RankMixup.
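A minimal sketch of the ranking idea described above follows; the margin value, the mixing distribution, and the loss weighting are illustrative assumptions rather than the paper's exact MRL/M-NDCG formulation.

```python
import torch
import torch.nn.functional as F

def mixup_ranking_loss(model, x, alpha=1.0, margin=0.1):
    """Minimal sketch of an MRL-style term: the max softmax confidence of a raw
    sample should exceed that of its mixup-augmented counterpart by a margin.
    (Margin value and mixing distribution are illustrative assumptions.)"""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]

    conf_raw = F.softmax(model(x), dim=1).max(dim=1).values
    conf_mix = F.softmax(model(x_mix), dim=1).max(dim=1).values

    # hinge: penalize whenever conf_mix + margin > conf_raw
    return F.relu(conf_mix - conf_raw + margin).mean()

# Usage sketch: total loss = cross-entropy on raw samples + weighted ranking term.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y) + 0.5 * mixup_ranking_loss(model, x)
```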
摘要
for: 这 paper 的目的是提出一种cost-effective和高度准确的道路分割方法,使用多modal sensor数据,包括RGB和LiDAR深度图像,以及IMU/GNSS惯性导航系统。
methods: 该方法使用raw sensor输入,而不像许多SOTA工作那样依赖预处理成本较高的表面法向量或dense depth prediction。它使用一种低成本模型,以同时降低预处理和模型计算成本。
results: 该方法在 KITTI 数据集上进行了实验,并达到了fast和高性能的解决方案。同时,对 Cityscapes 数据集进行了实验,并证明了该方法可以使用不同的感知模式。 segmentation 结果对于全分辨率和半分辨率图像均是与现有方法竞争力的。Abstract
Multi-modal systems have the capacity of producing more reliable results than systems with a single modality in road detection due to perceiving different aspects of the scene. We focus on using raw sensor inputs instead of, as it is typically done in many SOTA works, leveraging architectures that require high pre-processing costs such as surface normals or dense depth predictions. By using raw sensor inputs, we aim to utilize a low-cost model that minimizes both the pre-processing and model computation costs. This study presents a cost-effective and highly accurate solution for road segmentation by integrating data from multiple sensors within a multi-task learning architecture. A fusion architecture is proposed in which RGB and LiDAR depth images constitute the inputs of the network. Another contribution of this study is to use an IMU/GNSS (inertial measurement unit/global navigation satellite system) inertial navigation system whose data is collected synchronously and calibrated with a LiDAR-camera to compute aggregated dense LiDAR depth images. It has been demonstrated by experiments on the KITTI dataset that the proposed method offers fast and high-performance solutions. We have also shown the performance of our method on Cityscapes where raw LiDAR data is not available. The segmentation results obtained for both full and half resolution images are competitive with existing methods. Therefore, we conclude that our method is not dependent only on raw LiDAR data; rather, it can be used with different sensor modalities. The inference times obtained in all experiments are very promising for real-time applications.
摘要
多模式系统可以感知场景的不同方面,因此在道路检测中能够生成比单模态系统更可靠的结果。我们集中在使用原始感知输入,而不是像多数State-of-the-Art工作一样,利用需要高预处理成本的架构,例如表面法向量或密集深度预测。通过使用原始感知输入,我们希望实现低成本模型,以同时降低预处理和模型计算成本。这个研究提出了一种高性价比且高精度的道路分割解决方案,通过将多种传感器的数据集成到多任务学习架构中。我们提议的融合架构以RGB和LiDAR深度图像作为网络的输入。此外,我们还使用IMU/GNSS(惯性测量单元/全球导航卫星系统)惯性导航系统的数据,该数据与LiDAR-Camera同步采集并进行标定,以计算聚合的密集LiDAR深度图像。实验表明,我们的方法可以在KITTI数据集上提供快速和高性能的解决方案。此外,我们还在没有原始LiDAR数据的Cityscapes数据集上进行了实验,证明我们的方法可以使用不同的感知模态。在全分辨率和半分辨率图像上得到的分割结果均与现有方法相当,因此我们可以得出结论:我们的方法并不只依赖原始LiDAR数据,而是可以使用不同的传感器模态。在所有实验中得到的推理时间非常有前景,适用于实时应用。
results: 通过对MVP数据集上进行随机变换,对Point cloud completion方法进行比较,研究发现RICNet在不同pose下的Point cloud completion性能较高,超过了现有方法Abstract
Real-world point clouds usually suffer from incompleteness and display different poses. While current point cloud completion methods excel in reproducing complete point clouds with consistent poses as seen in the training set, their performance tends to be unsatisfactory when handling point clouds with diverse poses. We propose a network named Rotation-Invariant Completion Network (RICNet), which consists of two parts: a Dual Pipeline Completion Network (DPCNet) and an enhancing module. Firstly, DPCNet generates a coarse complete point cloud. The feature extraction module of DPCNet can extract consistent features, no matter if the input point cloud has undergone rotation or translation. Subsequently, the enhancing module refines the fine-grained details of the final generated point cloud. RICNet achieves better rotation invariance in feature extraction and incorporates structural relationships in man-made objects. To assess the performance of RICNet and existing methods on point clouds with various poses, we applied random transformations to the point clouds in the MVP dataset and conducted experiments on them. Our experiments demonstrate that RICNet exhibits superior completion performance compared to existing methods.
摘要
Real-world point clouds 通常会受到不完整性和不同姿态的影响。当前的点云完成方法能够很好地重建与训练集姿态一致的完整点云,但是它们在处理不同姿态的点云时表现不佳。我们提议一种名为Rotation-Invariant Completion Network(RICNet)的网络,它包括两部分:一个双管道完成网络(DPCNet)和一个优化模块。首先,DPCNet生成一个粗略的完整点云。DPCNet的特征提取模块可以无论输入点云是否经历了旋转或平移,都提取出一致的特征。然后,优化模块对最终生成的点云进行细粒度细节的优化。RICNet在特征提取中实现了更好的旋转不变性,并融入了人造物体中的结构关系。为了评估RICNet和现有方法在不同姿态的点云上的表现,我们对MVP数据集中的点云应用了随机变换,并在其上进行了实验。我们的实验结果表明,RICNet在不同姿态的点云上的完成性能比现有方法更高。
Anisotropic Hybrid Networks for liver tumor segmentation with uncertainty quantification
paper_authors: Benjamin Lambert, Pauline Roca, Florence Forbes, Senan Doyle, Michel Dojat
for: Liver and tumor segmentation for treatment strategy guidance.
methods: Two different pipelines based on anisotropic models were used for segmentation: a baseline multi-class model and two distinct binary models.
results: Both pipelines exhibited different strengths and weaknesses, and an uncertainty quantification strategy was proposed to identify potential false positive tumor lesions.Abstract
The burden of liver tumors is substantial, ranking as the fourth leading cause of cancer mortality. In the case of hepatocellular carcinoma (HCC), the delineation of liver and tumor on contrast-enhanced magnetic resonance imaging (CE-MRI) is performed to guide the treatment strategy. As this task is time-consuming, needs high expertise and could be subject to inter-observer variability, there is a strong need for automatic tools. However, challenges arise from the lack of available training data, as well as the high variability in terms of image resolution and MRI sequence. In this work, we propose to compare two different pipelines based on anisotropic models to obtain the segmentation of the liver and tumors. The first pipeline corresponds to a baseline multi-class model that performs the simultaneous segmentation of the liver and tumor classes. In the second approach, we train two distinct binary models, one segmenting the liver only and the other the tumors. Our results show that both pipelines exhibit different strengths and weaknesses. Moreover, we propose an uncertainty quantification strategy allowing the identification of potential false positive tumor lesions. Both solutions were submitted to the MICCAI 2023 Atlas challenge regarding liver and tumor segmentation.
摘要
liver tumors 负担沉重,是癌症死亡的第四大原因。在hepatocellular carcinoma(HCC)的情况下,需要在对比增强磁共振成像(CE-MRI)图像上勾画肝脏和肿瘤,以便指导治疗策略。然而,这项工作耗时长、需要很高的专业技能,并且存在观察者间的差异,因此对自动工具有强烈需求。然而,由于可用训练数据不足,以及图像分辨率和MRI序列的高变化性,这些挑战是非常大的。在这项工作中,我们比较了两个基于各向异性模型的不同管道,以获得肝脏和肿瘤的分割。第一个管道是一个基线多类模型,同时分割肝脏和肿瘤类别。在第二个方法中,我们训练了两个不同的二元模型,一个用于分割肝脏,另一个用于分割肿瘤。我们的结果表明,这两个管道具有不同的优劣点。此外,我们还提出了一种不确定性量化策略,以便识别潜在的假阳性肿瘤病灶。这两个解决方案都被提交到了MICCAI 2023 Atlas challenge关于肝脏和肿瘤分割的竞赛。
results: 研究发现这个系统具有低延迟和低MAC/周期,并且能够在实时进行眼动估计。Abstract
Gaze estimation is a valuable technology with numerous applications in fields such as human-computer interaction, virtual reality, and medicine. This report presents the implementation of a gaze estimation system using the Sony Spresense microcontroller board and explores its performance in latency, MAC/cycle, and power consumption. The report also provides insights into the system's architecture, including the gaze estimation model used. Additionally, a demonstration of the system is presented, showcasing its functionality and performance. Our lightweight model TinyTrackerS is a mere 169Kb in size, using 85.8k parameters and runs on the Spresense platform at 3 FPS.
摘要
gaze estimation 是一种有价值的技术,它在人机交互、虚拟现实和医疗领域有很多应用。这份报告介绍了使用索尼 Spresense 微控制器板实现的 gaze estimation 系统,并评估了它的延迟、MAC/周期和电力消耗。报告还提供了系统的架构设计,包括使用的 gaze estimation 模型。此外,报告还提供了系统的演示,展示了它的功能和性能。我们的轻量级模型 TinyTrackerS 仅有 169 KB 大小,使用 85.8k 参数,在 Spresense 平台上以 3 FPS 运行。
Efficient Transfer Learning in Diffusion Models via Adversarial Noise
results: 在几个数据少的图像生成任务中进行了广泛的实验,结果表明,我们的方法不仅高效,而且在图像质量和多样性方面也比现有的GAN-based和DPM-based方法出色。Abstract
Diffusion Probabilistic Models (DPMs) have demonstrated substantial promise in image generation tasks but heavily rely on the availability of large amounts of training data. Previous works on GANs have tackled the limited data problem by transferring pre-trained models learned with sufficient data. However, those methods are hard to utilize in DPMs because of the distinct differences between DPM-based and GAN-based methods, reflected in the unique iterative denoising process and the need for many timesteps with untargeted noise in DPMs. In this paper, we propose a novel DPMs-based transfer learning method, TAN, to address the limited data problem. It includes two strategies: similarity-guided training, which boosts transfer with a classifier, and adversarial noise selection, which adaptively chooses targeted noise based on the input image. Extensive experiments in the context of few-shot image generation tasks demonstrate that our method is not only efficient but also excels in terms of image quality and diversity when compared to existing GAN-based and DDPM-based methods.
摘要
扩散概率模型(DPM)在图像生成任务中表现出了巨大潜力,但严重依赖大量训练数据。先前的工作(如GANs)通过迁移在充足数据上预训练的模型来解决有限数据问题。然而,这些方法在DPM中很难使用,因为DPM和GAN之间存在显著差异,即DPM中独特的迭代去噪过程,以及需要许多步骤和无目标噪声。在这篇论文中,我们提出了一种基于DPM的迁移学习方法,称为TAN,以解决有限数据问题。该方法包括两个策略:相似性引导的训练(借助分类器促进迁移),以及根据输入图像自适应选择目标噪声的对抗噪声选择。我们在少样本图像生成任务中进行了广泛的实验,并证明了我们的方法不仅高效,还能够在图像质量和多样性方面超越现有的基于GAN和基于DPM的方法。
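The abstract does not spell out the selection criterion, but one plausible reading of "adversarial noise selection" is sketched below: among a few candidate noises, keep the one the current denoiser handles worst for the given image. Everything here (the simplified noising, the number of candidates) is an assumption for illustration only, not the paper's exact scheme.

```python
import torch

def select_adversarial_noise(denoiser, x0, t, n_candidates=4):
    """Illustrative sketch only: among a few sampled noises, keep the one the current
    denoiser handles worst for this image (highest prediction error). This mimics the
    spirit of 'adversarial noise selection'; the paper's exact criterion may differ."""
    best_noise, worst_err = None, -1.0
    for _ in range(n_candidates):
        eps = torch.randn_like(x0)
        x_t = x0 + t * eps                     # simplified noising; real DPMs use alpha-bar schedules
        err = (denoiser(x_t, t) - eps).pow(2).mean().item()
        if err > worst_err:
            best_noise, worst_err = eps, err
    return best_noise

# Usage sketch with a stand-in denoiser that ignores the timestep.
denoiser = lambda x, t: torch.zeros_like(x)
x0 = torch.randn(1, 3, 32, 32)
eps = select_adversarial_noise(denoiser, x0, t=0.5)
```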
Boosting Diffusion Models with an Adaptive Momentum Sampler
for: This paper aims to improve the sampling process in Diffusion Probabilistic Models (DPMs) to generate high-quality images.
methods: The proposed method is a novel reverse sampler for DPMs, inspired by the Adam optimizer, which uses momentum mechanisms and adaptive updating to smooth the reverse sampling process and ensure stable generation.
results: The proposed reverse sampler achieves remarkable improvements over different baselines, yielding enhanced quality outputs.Here’s the full summary in Simplified Chinese:
results: 提议的reverse抽样器在多个 benchmark 上得到了显著的改善,生成的输出质量得到了提升。Abstract
Diffusion probabilistic models (DPMs) have been shown to generate high-quality images without the need for delicate adversarial training. However, the current sampling process in DPMs is prone to violent shaking. In this paper, we present a novel reverse sampler for DPMs inspired by the widely-used Adam optimizer. Our proposed sampler can be readily applied to a pre-trained diffusion model, utilizing momentum mechanisms and adaptive updating to smooth the reverse sampling process and ensure stable generation, resulting in outputs of enhanced quality. By implicitly reusing update directions from early steps, our proposed sampler achieves a better balance between high-level semantics and low-level details. Additionally, this sampler is flexible and can be easily integrated into pre-trained DPMs regardless of the sampler used during training. Our experimental results on multiple benchmarks demonstrate that our proposed reverse sampler yields remarkable improvements over different baselines. We will make the source code available.
摘要
扩散概率模型(DPMs)已被证明能生成高质量图像,无需精细的 adversarial training。然而,当前 DPMs 的采样过程存在剧烈震荡的问题。在这篇论文中,我们提出了一种新的 reverse sampler for DPMs,其灵感来自广泛使用的 Adam 优化器。我们提议的抽样器可以直接应用到预训练的 diffusion model,使用动量机制和自适应更新来平稳反向抽样过程,保证生成的稳定性,从而提升输出质量。通过隐式地重用早期步骤的更新方向,我们的抽样器在高层语义和低层细节之间实现了更好的平衡。此外,这种抽样器十分灵活,可以轻松地集成到预训练的 DPMs 中,无论训练时使用的是哪种抽样器。我们的实验结果表明,我们提议的 reverse sampler 在多个benchmark上相比不同基线取得了显著提升。我们将会公开源代码。
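The following is a rough, assumption-laden sketch of an Adam-inspired reverse sampler in the spirit described above: the per-step update direction toward the predicted clean image is smoothed with first/second moments before being applied. The paper's exact coefficients, step sizes, and noise handling are not reproduced here.

```python
import torch

def momentum_reverse_sampling(eps_model, x_T, alphas_bar, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal sketch (not the paper's exact sampler): smooth the per-step DDPM update
    direction with Adam-style first/second moments before applying it."""
    x = x_T
    m = torch.zeros_like(x)
    v = torch.zeros_like(x)
    T = len(alphas_bar)
    for i, t in enumerate(reversed(range(T)), start=1):
        a_bar = alphas_bar[t]
        pred_eps = eps_model(x, t)
        x0_hat = (x - (1 - a_bar).sqrt() * pred_eps) / a_bar.sqrt()
        update = x0_hat - x                                  # raw step toward the predicted clean image
        m = beta1 * m + (1 - beta1) * update                 # momentum on the update direction
        v = beta2 * v + (1 - beta2) * update.pow(2)          # adaptive per-pixel scaling
        m_hat, v_hat = m / (1 - beta1 ** i), v / (1 - beta2 ** i)
        x = x + (1.0 / T) * m_hat / (v_hat.sqrt() + eps)     # small smoothed step
    return x

# Usage sketch with a dummy noise predictor and a linear alpha-bar schedule.
alphas_bar = torch.linspace(0.999, 0.01, steps=50)
sample = momentum_reverse_sampling(lambda x, t: torch.zeros_like(x), torch.randn(1, 3, 32, 32), alphas_bar)
```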
Synergistic Multiscale Detail Refinement via Intrinsic Supervision for Underwater Image Enhancement
results: 与state-of-the-art方法相比,SMDR-IS表现出色。Abstract
Visual restoration of underwater scenes is crucial for visual tasks, and avoiding interference from underwater media has become a prominent concern. In this work, we present a synergistic multiscale detail refinement via intrinsic supervision (SMDR-IS) to recover underwater scene details. The low-degradation stage provides multiscale detail for the original stage, achieving synergistic multiscale detail refinement through feature propagation via the adaptive selective intrinsic supervised feature module (ASISF). ASISF is developed using intrinsic supervision to precisely control and guide feature transmission in the multi-degradation stages. ASISF improves the multiscale detail refinement while reducing interference from irrelevant scene information from the low-degradation stage. Additionally, within the multi-degradation encoder-decoder of SMDR-IS, we introduce a bifocal intrinsic-context attention module (BICA). This module is designed to effectively leverage multi-scale scene information found in images, using intrinsic supervision principles as its foundation. BICA facilitates the guidance of higher-resolution spaces by leveraging lower-resolution spaces, considering the significant dependency of underwater image restoration on spatial contextual relationships. During the training process, the network gains advantages from the integration of a multi-degradation loss function. This function serves as a constraint, enabling the network to effectively exploit information across various scales. When compared with state-of-the-art methods, SMDR-IS demonstrates its outstanding performance. Code will be made publicly available.
摘要
视觉修复水下场景是重要的视觉任务之一,避免水下媒体干扰已成为一项显著问题。在这项工作中,我们提出了一种基于内在监督的协同多尺度细节优化方法(SMDR-IS),用于恢复水下场景细节。低损阶段为原始阶段提供多尺度细节,通过自适应选择性内在监督特征模块(ASISF)的特征传播实现协同多尺度细节优化。ASISF利用内在监督,精准控制和引导多损阶段中的特征传输,在提升多尺度细节优化的同时,减少来自低损阶段的不相关场景信息的干扰。此外,在SMDR-IS的多损阶段Encoder-Decoder中,我们引入了一种双焦内在-上下文注意力模块(BICA)。这个模块基于内在监督原则设计,用于有效利用图像中的多尺度场景信息。在训练过程中,网络通过多损阶段损失函数的集成获得了优势。这个函数作为约束,使网络能够有效利用多个尺度的信息。与现有技术相比,SMDR-IS表现出色。代码将公开。
OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes
paper_authors: Tao Xie, Kun Dai, Siyi Lu, Ke Wang, Zhiqiang Jiang, Jinghan Gao, Dedong Liu, Jie Xu, Lijun Zhao, Ruifeng Li
for: 本 paper 的目的是 Predict camera poses across scenes with a multi-task learning manner.
methods: 本 paper 提出了 OFVL-MS 框架,一种可以高效地存储和精确地visual localization 的框架,通过自适应分享策略和权重学习来解决多场景集合学习中的梯度冲突问题。
results: 在多个benchmark上和新发布的室内 dataset LIVL 上,OFVL-MS 家族的模型显著超越了现有的状态网络,并且可以在新场景中训练 fewer parameters 而 achieve superior localization performance。Abstract
In this work, we seek to predict camera poses across scenes with a multi-task learning manner, where we view the localization of each scene as a new task. We propose OFVL-MS, a unified framework that dispenses with the traditional practice of training a model for each individual scene and relieves gradient conflict induced by optimizing multiple scenes collectively, enabling efficient storage yet precise visual localization for all scenes. Technically, in the forward pass of OFVL-MS, we design a layer-adaptive sharing policy with a learnable score for each layer to automatically determine whether the layer is shared or not. Such sharing policy empowers us to acquire task-shared parameters for a reduction of storage cost and task-specific parameters for learning scene-related features to alleviate gradient conflict. In the backward pass of OFVL-MS, we introduce a gradient normalization algorithm that homogenizes the gradient magnitude of the task-shared parameters so that all tasks converge at the same pace. Furthermore, a sparse penalty loss is applied on the learnable scores to facilitate parameter sharing for all tasks without performance degradation. We conduct comprehensive experiments on multiple benchmarks and our new released indoor dataset LIVL, showing that OFVL-MS families significantly outperform the state-of-the-arts with fewer parameters. We also verify that OFVL-MS can generalize to a new scene with much few parameters while gaining superior localization performance.
摘要
在这项工作中,我们尝试以多任务学习的方式预测跨场景的相机pose,并将每个场景的定位视为一个新任务。我们提出了OFVL-MS框架,它摒弃了传统的每个场景都需要单独训练模型的做法,并缓解了多场景联合优化induced的梯度冲突,从而实现高效的存储和对所有场景的精确视觉定位。技术上,在OFVL-MS的前向传播中,我们设计了层自适应共享策略,通过每层可学习的得分来自动决定该层是否共享。这种共享策略使得我们可以获得任务共享参数,以降低存储成本,同时获得任务特定参数,以学习场景相关特征,缓解梯度冲突。在OFVL-MS的反向传播中,我们引入了梯度 normalization算法,使得任务共享参数的梯度大小均匀,从而使所有任务以相同的速度收敛。此外,我们还在可学习得分上应用了稀疏惩罚损失,以在不损失性能的前提下促进各任务间的参数共享。我们在多个benchmark以及我们新发布的室内 dataset LIVL 上进行了广泛的实验,结果显示OFVL-MS家族以更少的参数显著超越了state-of-the-art方法。我们还证明OFVL-MS只需很少的新增参数就能泛化到新场景,并获得更优的定位性能。
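A small sketch of the gradient-homogenization idea follows (not the authors' exact algorithm): each task's gradient on the shared parameters is rescaled to unit norm before the per-task contributions are combined, so that no single scene dominates the shared update.

```python
import torch

def normalized_shared_grads(per_task_losses, shared_params):
    """Illustrative sketch (not the authors' exact algorithm): scale each task's gradient
    on the shared parameters to unit norm before summing, so no single scene dominates."""
    combined = [torch.zeros_like(p) for p in shared_params]
    for loss in per_task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True, allow_unused=True)
        grads = [g if g is not None else torch.zeros_like(p) for g, p in zip(grads, shared_params)]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        combined = [c + g / norm for c, g in zip(combined, grads)]
    return combined

# Usage sketch: two "scenes" sharing one linear layer.
shared = torch.nn.Linear(8, 3)
losses = [shared(torch.randn(4, 8)).pow(2).mean() for _ in range(2)]
for p, g in zip(shared.parameters(), normalized_shared_grads(losses, list(shared.parameters()))):
    p.grad = g   # then call optimizer.step()
```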
Recovering a Molecule’s 3D Dynamics from Liquid-phase Electron Microscopy Movies
results: 研究表明,TEMPOR算法可以从液相电子显微镜电影中恢复不同的运动动态,并且是首次直接从液相电子显微镜电影中恢复3D结构。这提供了一种有前途的新方法,用于结构生物学中研究分子的3D动态。Abstract
The dynamics of biomolecules are crucial for our understanding of their functioning in living systems. However, current 3D imaging techniques, such as cryogenic electron microscopy (cryo-EM), require freezing the sample, which limits the observation of their conformational changes in real time. The innovative liquid-phase electron microscopy (liquid-phase EM) technique allows molecules to be placed in the native liquid environment, providing a unique opportunity to observe their dynamics. In this paper, we propose TEMPOR, a Temporal Electron MicroscoPy Object Reconstruction algorithm for liquid-phase EM that leverages an implicit neural representation (INR) and a dynamical variational auto-encoder (DVAE) to recover time series of molecular structures. We demonstrate its advantages in recovering different motion dynamics from two simulated datasets, 7bcq and Cas9. To our knowledge, our work is the first attempt to directly recover 3D structures of a temporally-varying particle from liquid-phase EM movies. It provides a promising new approach for studying molecules' 3D dynamics in structural biology.
摘要
生物分子的动力学是我们理解它们在生物系统中功能的关键。然而,现有的3D成像技术,如低温电子显微镜(cryo-EM),需要将样本冻结,这限制了对分子构象变化的实时观察。创新的液相电子显微镜(液相EM)技术使分子能够置于原生液体环境中,提供了一个独特的机会来观察它们的动态。在这篇论文中,我们提出了TEMPOR,一种基于隐式神经表示(INR)和动态变分自编码器(DVAE)的时序电子显微镜对象重建算法,用于从液相EM中恢复分子结构的时间序列。我们在两个模拟数据集(7bcq和Cas9)上展示了它在恢复不同运动动态方面的优势。据我们所知,我们的工作是直接从液相EM电影中恢复随时间变化粒子3D结构的首次尝试。它为结构生物学中研究分子的3D动力学提供了一个有前途的新方法。
MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild
results: 在多个Scene文字检测 dataset 上,MixNet 已经实现了State-of-the-art 的结果,并且在不 ideal 的光线和不规则位置下具有较高的准确率和效率。Abstract
Detecting small scene text instances in the wild is particularly challenging, where the influence of irregular positions and nonideal lighting often leads to detection errors. We present MixNet, a hybrid architecture that combines the strengths of CNNs and Transformers, capable of accurately detecting small text from challenging natural scenes, regardless of the orientations, styles, and lighting conditions. MixNet incorporates two key modules: (1) the Feature Shuffle Network (FSNet) to serve as the backbone and (2) the Central Transformer Block (CTBlock) to exploit the 1D manifold constraint of the scene text. We first introduce a novel feature shuffling strategy in FSNet to facilitate the exchange of features across multiple scales, generating high-resolution features superior to popular ResNet and HRNet. The FSNet backbone has achieved significant improvements over many existing text detection methods, including PAN, DB, and FAST. Then we design a complementary CTBlock to leverage center line based features similar to the medial axis of text regions and show that it can outperform contour-based approaches in challenging cases when small scene texts appear closely. Extensive experimental results show that MixNet, which mixes FSNet with CTBlock, achieves state-of-the-art results on multiple scene text detection datasets.
摘要
在野外检测小尺寸场景文本实例尤其困难,不规则的位置和非理想的光照常常导致检测错误。我们提出了 MixNet,一种结合 CNN 与 Transformer 优势的混合架构,无论文本方向、风格或照明条件如何,都能准确地检测具有挑战性的自然场景中的小文本。MixNet 包含两个关键模块:(1)特征混洗网络(FSNet)作为骨干,以及(2)中心Transformer块(CTBlock)来利用场景文本的1D manifold约束。我们首先在FSNet中引入了一种新的特征混洗策略,以便在多个尺度之间交换特征,生成优于流行的 ResNet 和 HRNet 的高分辨率特征。FSNet 骨干网络已经超越了许多现有的文本检测方法,包括 PAN、DB 和 FAST。然后,我们设计了一种互补的 CTBlock,以利用类似文本区域中轴线的中心线特征,并证明在小场景文本彼此靠近的挑战性情况下,它可以超越基于轮廓的方法。广泛的实验结果表明,结合 FSNet 与 CTBlock 的 MixNet 在多个场景文本检测数据集上取得了state-of-the-art的结果。
AMSP-UOD: When Vortex Convolution and Stochastic Perturbation Meet Underwater Object Detection
methods: 我们提出了AMSP Vortex Convolution(AMSP-VConv)来破坏噪声分布,提高特征提取能力,减少参数,提高网络的可靠性。此外,我们设计了Feature Association Decoupling Cross Stage Partial(FAD-CSP)模块,增强了长和短距离特征之间的强相关性,提高网络在复杂水下环境中的性能。
results: 我们对URPC和RUOD数据集进行了广泛的实验,结果显示,我们的方法在精度和噪声抗性方面比现有的state-of-the-art方法更高。AMSP-UOD提出了一种创新的解决方案,具有实际应用潜力。代码将公开发布。Abstract
In this paper, we present a novel Amplitude-Modulated Stochastic Perturbation and Vortex Convolutional Network, AMSP-UOD, designed for underwater object detection. AMSP-UOD specifically addresses the impact of non-ideal imaging factors on detection accuracy in complex underwater environments. To mitigate the influence of noise on object detection performance, we propose AMSP Vortex Convolution (AMSP-VConv) to disrupt the noise distribution, enhance feature extraction capabilities, effectively reduce parameters, and improve network robustness. We design the Feature Association Decoupling Cross Stage Partial (FAD-CSP) module, which strengthens the association of long and short-range features, improving the network performance in complex underwater environments. Additionally, our sophisticated post-processing method, based on non-maximum suppression with aspect-ratio similarity thresholds, optimizes detection in dense scenes, such as waterweed and schools of fish, improving object detection accuracy. Extensive experiments on the URPC and RUOD datasets demonstrate that our method outperforms existing state-of-the-art methods in terms of accuracy and noise immunity. AMSP-UOD proposes an innovative solution with the potential for real-world applications. Code will be made publicly available.
摘要
在这篇论文中,我们提出了一种新的幅度调制随机扰动与涡旋卷积网络(AMSP-UOD),用于水下物体检测。AMSP-UOD专门解决复杂水下环境中非理想成像因素对检测精度的影响。为了减少噪声对物体检测性能的影响,我们提议AMSP涡旋卷积(AMSP-VConv),以扰乱噪声分布,提高特征提取能力,减少参数,提高网络的鲁棒性。我们设计了Feature Association Decoupling Cross Stage Partial(FAD-CSP)模块,强化长距离与短距离特征之间的关联,提高网络在复杂水下环境中的性能。此外,我们提出了一种基于长宽比相似度阈值的非极大值抑制后处理方法,以优化在水草和鱼群等密集场景中的检测,提高物体检测精度。广泛的实验表明,我们的方法在URPC和RUOD数据集上的准确率和噪声抗性均超过现有的state-of-the-art方法。AMSP-UOD提出了一种创新的解决方案,具有实际应用前景。代码将公开发布。
Semantic-Aware Implicit Template Learning via Part Deformation Consistency
results: 对baseline方法进行了广泛的实验,并证明了提议的方法在不同任务中(包括键点传输、部分标签传输和 Texture传输)具有更高的性能。此外,我们还提供了质量分析,以验证semantic-aware减杂码的效果。代码可以在https://github.com/mlvlab/PDC上下载。Abstract
Learning implicit templates as neural fields has recently shown impressive performance in unsupervised shape correspondence. Despite the success, we observe current approaches, which solely rely on geometric information, often learn suboptimal deformation across generic object shapes, which have high structural variability. In this paper, we highlight the importance of part deformation consistency and propose a semantic-aware implicit template learning framework to enable semantically plausible deformation. By leveraging semantic prior from a self-supervised feature extractor, we suggest local conditioning with novel semantic-aware deformation code and deformation consistency regularizations regarding part deformation, global deformation, and global scaling. Our extensive experiments demonstrate the superiority of the proposed method over baselines in various tasks: keypoint transfer, part label transfer, and texture transfer. More interestingly, our framework shows a larger performance gain under more challenging settings. We also provide qualitative analyses to validate the effectiveness of semantic-aware deformation. The code is available at https://github.com/mlvlab/PDC.
摘要
学习隐式模板作为神经场,最近在无监督形状对应任务中显示了卓越表现。尽管取得成功,我们发现当前仅依赖几何信息的方法,在结构变化较大的通用物体形状上经常学习到次优的形变。在这篇论文中,我们强调部件形变一致性的重要性,并提出了semantic-aware implicit template learning框架,以实现semantically plausible的形变。通过利用自监督特征提取器提供的semantic prior,我们提出了基于新的semantic-aware deformation code的局部条件化,以及关于part deformation、global deformation和global scaling的deformation consistency正则化。我们的广泛实验表明我们的方法在各种任务中均优于基线方法:键点传输、部件标签传输和 Texture传输。更有趣的是,我们的框架在更具挑战性的设置下表现出更大的性能提升。我们还提供了定性分析,以验证semantic-aware deformation的效果。代码可以在https://github.com/mlvlab/PDC上获取。
ACLS: Adaptive and Conditional Label Smoothing for Network Calibration
results: 研究人员通过实验证明,新引入的损失函数ACLS可以兼顾现有正则化方法的优点,并避免其缺点。这种损失函数在图像分类和 semantic segmentation 中具有广泛的应用前景。Abstract
We address the problem of network calibration adjusting miscalibrated confidences of deep neural networks. Many approaches to network calibration adopt a regularization-based method that exploits a regularization term to smooth the miscalibrated confidences. Although these approaches have shown the effectiveness on calibrating the networks, there is still a lack of understanding on the underlying principles of regularization in terms of network calibration. We present in this paper an in-depth analysis of existing regularization-based methods, providing a better understanding on how they affect to network calibration. Specifically, we have observed that 1) the regularization-based methods can be interpreted as variants of label smoothing, and 2) they do not always behave desirably. Based on the analysis, we introduce a novel loss function, dubbed ACLS, that unifies the merits of existing regularization methods, while avoiding the limitations. We show extensive experimental results for image classification and semantic segmentation on standard benchmarks, including CIFAR10, Tiny-ImageNet, ImageNet, and PASCAL VOC, demonstrating the effectiveness of our loss function.
摘要
我们研究网络校准问题,即调整深度神经网络失准的置信度。许多网络校准方法采用基于正则化的方式,利用正则项来平滑失准的置信度。虽然这些方法能够有效地校准网络,但人们对正则化在网络校准中的底层原理仍缺乏理解。在这篇论文中,我们对现有基于正则化的方法进行了深入分析,从而更好地理解它们如何影响网络校准。具体来说,我们观察到:1)正则化方法可以被视为标签平滑的变体;2)它们的行为并不总是符合预期。基于分析,我们提出了一种新的损失函数,名为 ACLS,它结合了现有正则化方法的优点,同时避免了其局限。我们在标准的benchmark上,包括CIFAR10、Tiny-ImageNet、ImageNet和PASCAL VOC,对图像分类和语义分割进行了广泛的实验,并证明了我们的损失函数的效果。
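Since the paper frames existing regularization-based calibration as variants of label smoothing, the baseline mechanism it builds on can be sketched as below; ACLS itself replaces the fixed smoothing with the adaptive and conditional rules described in the paper, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, targets, smoothing=0.1):
    """Baseline regularizer the paper analyzes: uniform label smoothing.
    ACLS makes the smoothing adaptive and conditional (details in the paper);
    this sketch only shows the underlying mechanism."""
    n_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    soft_targets = torch.full_like(log_probs, smoothing / (n_classes - 1))
    soft_targets.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return -(soft_targets * log_probs).sum(dim=1).mean()

logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(label_smoothing_ce(logits, targets))
```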
Edge-aware Hard Clustering Graph Pooling for Brain Imaging Data
paper_authors: Cheng Zhu, Jiayi Zhu, Lijuan Zhang, Xi Wu, Shuqi Yang, Ping Liang, Honghan Chen, Ying Tan
for: This paper aims to develop a deep learning method for probing different types of abnormal functional brain networks from a data-driven perspective.
methods: The proposed method, called Edge-aware hard clustering graph pooling (EHCPool), uses a clustering graph pooling method that supports multidimensional edge features and assesses node feature significance based on edge features. It also uses a novel Iteration n-top strategy to adaptively learn sparse hard clustering assignments for graphs, and an innovative N-E Aggregation strategy to aggregate node and edge feature information in each independent subgraph.
results: The proposed model was evaluated on multi-site brain imaging public datasets and yielded state-of-the-art performance.Abstract
Graph Convolutional Networks (GCNs) can capture non-Euclidean spatial dependence between different brain regions, and the graph pooling operator in GCNs is key to enhancing the representation learning capability and acquiring abnormal brain maps. However, the majority of existing research designs graph pooling operators only from the perspective of nodes while disregarding the original edge features, in a way that not only confines graph pooling application scenarios, but also diminishes its ability to capture critical substructures. In this study, a clustering graph pooling method that first supports multidimensional edge features, called Edge-aware hard clustering graph pooling (EHCPool), is developed. EHCPool proposes the first 'Edge-to-node' score evaluation criterion based on edge features to assess node feature significance. To more effectively capture the critical subgraphs, a novel Iteration n-top strategy is further designed to adaptively learn sparse hard clustering assignments for graphs. Subsequently, an innovative N-E Aggregation strategy is presented to aggregate node and edge feature information in each independent subgraph. The proposed model was evaluated on multi-site brain imaging public datasets and yielded state-of-the-art performance. We believe this method is the first deep learning tool with the potential to probe different types of abnormal functional brain networks from data-driven perspective.
摘要
图卷积网络(GCNs)可以捕捉不同脑区之间的非欧几里得空间依赖关系,而GCNs中的图 pooling 算子是提高表示学习能力和获得异常脑图的关键。然而,现有大多数研究只从节点的角度设计图 pooling 算子,而忽视原始边特征,这不仅限制了图 pooling 的应用场景,而且削弱了其捕捉关键子结构的能力。在这项研究中,我们开发了一种首个支持多维边特征的聚类图 pooling 方法,称为 Edge-aware hard clustering graph pooling(EHCPool)。EHCPool 提出了基于边特征的首个 'Edge-to-node' 分数评价标准,来评估节点特征的重要性。为更好地捕捉关键子图,我们还设计了一种新的 Iteration n-top 策略,以自适应地学习图的稀疏硬聚类分配。随后,我们提出了一种创新的 N-E Aggregation 策略,用于在每个独立子图中聚合节点和边特征信息。我们的模型在多站点脑成像公共数据集上进行了评估,并实现了state-of-the-art的性能。我们认为这是首个有潜力以数据驱动方式探索不同类型异常功能脑网络的深度学习工具。
Rethinking Data Perturbation and Model Stabilization for Semi-supervised Medical Image Segmentation
results: DPMS 可以实现新的顶尖性能在公共 2D ACDC 和 3D LA 数据集上,在不同的半supervised 设定下。例如,DPMS 在 ACDC 上比前一代 SOTA 提高了22.62%。Abstract
Studies on semi-supervised medical image segmentation (SSMIS) have seen fast progress recently. Due to the limited labelled data, SSMIS methods mainly focus on effectively leveraging unlabeled data to enhance the segmentation performance. However, despite their promising performance, current state-of-the-art methods often prioritize integrating complex techniques and loss terms rather than addressing the core challenges of semi-supervised scenarios directly. We argue that the key to SSMIS lies in generating substantial and appropriate prediction disagreement on unlabeled data. To this end, we emphasize the crucial roles of data perturbation and model stabilization in semi-supervised segmentation, and propose a simple yet effective approach to boost SSMIS performance significantly, dubbed DPMS. Specifically, we first revisit SSMIS from three distinct perspectives: the data, the model, and the loss, and conduct a comprehensive study of corresponding strategies to examine their effectiveness. Based on these examinations, we then propose DPMS, which adopts a plain teacher-student framework with a standard supervised loss and unsupervised consistency loss. To produce appropriate prediction disagreements, DPMS perturbs the unlabeled data via strong augmentations to enlarge prediction disagreements considerably. On the other hand, using an EMA teacher when strong augmentation is applied does not necessarily improve performance. DPMS further utilizes forwarding-twice and momentum updating strategies for normalization statistics to stabilize the training on unlabeled data effectively. Despite its simplicity, DPMS can obtain new state-of-the-art performance on the public 2D ACDC and 3D LA datasets across various semi-supervised settings, e.g. obtaining a remarkable 22.62% improvement against previous SOTA on ACDC with 5% labels.
摘要
研究 semi-supervised medical image segmentation (SSMIS) 在近期已经进展很快。由于有限的标签数据,SSMIS 方法主要是利用无标注数据来提高 segmentation 性能。然而,当前state-of-the-art方法往往侧重于堆叠复杂的技术和损失项,而不是直接面对 semi-supervised 场景的核心挑战。我们认为,SSMIS 的关键在于在无标注数据上生成足够且适当的预测差异。为此,我们强调了数据扰动和模型稳定在 semi-supervised 分割中的重要性,并提出了一种简单 yet effective 的方法,称为 DPMS。具体来说,我们首先从数据、模型和损失三个不同的角度重新审视 SSMIS,并对相应策略进行了全面的研究以检验其有效性。基于这些分析,我们提出了 DPMS,它采用一个简单的 teacher-student 框架,结合标准的监督损失和无监督一致性损失。为了产生适当的预测差异,DPMS 通过强增强来扰动无标注数据,从而显著放大预测差异。另一方面,在使用强增强时,采用 EMA teacher 并不一定能提升性能。DPMS 进一步利用 forwarding-twice 和动量更新策略来处理归一化统计量,从而有效地稳定在无标注数据上的训练。尽管方法简单,DPMS 在公开的 2D ACDC 和 3D LA 数据集上、在多种半监督设置下均取得了新的state-of-the-art性能,例如在仅有 5% 标签的 ACDC 上比之前的 SOTA 提升了 22.62%。
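A minimal sketch of the plain teacher-student recipe described above follows: a supervised loss on labeled data plus a consistency loss between teacher predictions on unperturbed unlabeled data and student predictions on strongly augmented versions. The loss weighting, the specific augmentations, and the forwarding-twice normalization trick are omitted or assumed.

```python
import torch
import torch.nn.functional as F

def dpms_style_losses(student, teacher, x_l, y_l, x_u, strong_aug):
    """Minimal sketch of the plain teacher-student recipe (loss weights, augmentation
    choice, and the forward-twice BN trick are omitted or assumed; this is not the
    authors' exact implementation)."""
    sup = F.cross_entropy(student(x_l), y_l)                     # standard supervised loss

    with torch.no_grad():
        pseudo = teacher(x_u).softmax(dim=1)                     # teacher on unperturbed unlabeled data
    pred_strong = student(strong_aug(x_u))                       # student on strongly perturbed data
    cons = F.kl_div(pred_strong.log_softmax(dim=1), pseudo, reduction="batchmean")
    return sup + cons

# Usage sketch: additive Gaussian noise stands in for a strong augmentation.
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, 4))
x_l, y_l, x_u = torch.randn(2, 3, 16, 16), torch.randint(0, 4, (2,)), torch.randn(4, 3, 16, 16)
loss = dpms_style_losses(net, net, x_l, y_l, x_u, lambda x: x + 0.5 * torch.randn_like(x))
```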
Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification
results: 实验结果表明,该方法可以在real-to-real和synthetic-to-real场景中实现高效的人员重识别,并且可以缓解摄像头偏置问题。Abstract
We present a novel unsupervised domain adaption method for person re-identification (reID) that generalizes a model trained on a labeled source domain to an unlabeled target domain. We introduce a camera-driven curriculum learning (CaCL) framework that leverages camera labels of person images to transfer knowledge from source to target domains progressively. To this end, we divide target domain dataset into multiple subsets based on the camera labels, and initially train our model with a single subset (i.e., images captured by a single camera). We then gradually exploit more subsets for training, according to a curriculum sequence obtained with a camera-driven scheduling rule. The scheduler considers maximum mean discrepancies (MMD) between each subset and the source domain dataset, such that the subset closer to the source domain is exploited earlier within the curriculum. For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way. We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs. To address the camera bias problem, we also introduce a camera-diversity (CD) loss encouraging person images of the same pseudo label, but captured across various cameras, to involve more for discriminative feature learning, providing person representations robust to inter-camera variations. Experimental results on standard benchmarks, including real-to-real and synthetic-to-real scenarios, demonstrate the effectiveness of our framework.
摘要
我们提出了一种新的无监督领域适应方法,用于人重识别(reID),可以将在有标签源域上学习的模型推广到无标签目标域。我们引入了一个摄像头驱动的课程学习(CaCL)框架,利用人物图像的摄像头标签,逐步地将知识从源域传递到目标域。为此,我们将目标域数据集按摄像头标签分成多个子集,先用单个子集(即由单个摄像头拍摄的图像)训练模型,再按照由摄像头驱动的调度规则得到的课程顺序,逐步引入更多子集进行训练。调度器根据每个子集与源域数据集之间的最大均值差异(MMD)进行排序,离源域更近的子集在课程中更早被使用。在每个课程阶段,我们为目标域的人物图像生成pseudo标签,以有监督的方式训练reID模型。我们发现pseudo标签具有强烈的摄像头偏置,这意味着来自同一摄像头的人物图像很可能被赋予相同的pseudo标签,即使它们属于不同的ID。为了解决摄像头偏置问题,我们还引入了摄像头多样性(CD)损失,鼓励具有相同pseudo标签但来自不同摄像头的人物图像更多地参与判别特征学习,从而得到对摄像头变化robust的人物表示。我们的实验结果表明,我们的框架在标准benchmark上(包括real-to-real和synthetic-to-real场景)都取得了显著效果。
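The camera-driven scheduling rule can be sketched as follows: compute an MMD between each camera subset's features and the source features, then order the subsets so the closest one enters the curriculum first. The RBF kernel and bandwidth below are illustrative assumptions.

```python
import torch

def rbf_mmd(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel between two feature sets of shape (N, D)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def camera_curriculum(source_feats, target_subsets):
    """Sketch of the camera-driven scheduling rule: subsets whose features are closest
    to the source domain (smallest MMD) enter the curriculum first."""
    gaps = {cam: rbf_mmd(source_feats, feats).item() for cam, feats in target_subsets.items()}
    return sorted(gaps, key=gaps.get)

# Usage sketch with random features for a source domain and three target cameras.
source = torch.randn(128, 64)
subsets = {f"cam{i}": torch.randn(64, 64) + i * 0.5 for i in range(3)}
print(camera_curriculum(source, subsets))   # e.g. ['cam0', 'cam1', 'cam2']
```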
HashReID: Dynamic Network with Binary Codes for Efficient Person Re-identification
results: 我们的方法使Market1501数据集上超过70%的样本能够借助紧凑的哈希码提前退出,从而降低了80%的网络计算成本,并与其他基于哈希码的方法相比提高了60%。这些结果表明我们的方法有显著改善,并且与传统ReID方法具有相当的精度表现。Abstract
Biometric applications, such as person re-identification (ReID), are often deployed on energy-constrained devices. While recent ReID methods prioritize high retrieval performance, they often come with large computational costs and high search time, rendering them less practical in real-world settings. In this work, we propose an input-adaptive network with multiple exit blocks that can terminate computation early if the retrieval is straightforward or noisy, saving a lot of computation. To assess the complexity of the input, we introduce a temporal-based classifier driven by a new training strategy. Furthermore, we adopt a binary hash code generation approach instead of relying on continuous-valued features, which significantly improves the search process by a factor of 20. To ensure similarity preservation, we utilize a new ranking regularizer that bridges the gap between continuous and binary features. Extensive analysis of our proposed method is conducted on three datasets: Market1501, MSMT17 (Multi-Scene Multi-Time), and the BGC1 (BRIAR Government Collection). Using our approach, more than 70% of the samples with compact hash codes exit early on the Market1501 dataset, saving 80% of the network's computational cost and improving over other hash-based methods by 60%. These results demonstrate a significant improvement over dynamic networks and showcase comparable accuracy performance to conventional ReID methods. Code will be made available.
摘要
“生物特征识别应用(如人员重识别,ReID)经常在能源有限的设备上部署。虽然近期的ReID方法优先考虑高检索性能,但是这些方法往往具有较高的计算成本和搜寻时间,使其在实际应用中不太实用。在这个工作中,我们提议一个带有多个出口分支的输入自适应网络,该网络可在检索简单或噪声较大时提前终止计算,从而节省大量计算。为了评估输入的复杂度,我们引入了一个由新训练策略驱动的基于时间的分类器。此外,我们还采用了一个二进制哈希码生成方法,而不是依赖连续值的特征,这使搜寻过程提速约20倍。为了确保相似性保持,我们利用一个新的排名正则项,它可以弥合连续特征和二进制特征之间的差异。我们对 Market1501、MSMT17(多个场景多个时间)和 BGC1(BRIAR政府收集)三个数据集进行了广泛的分析。使用我们的方法,在Market1501上超过70%的样本可以凭借紧凑的哈希码提前退出,实现80%的网络计算成本的减少,并且与其他基于哈希码的方法相比提高了60%。这些结果表明我们的方法相比动态网络有显著提升,并且与传统ReID方法具有相当的准确性表现。我们将会公开代码。”
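The search speed-up from binary codes comes from replacing continuous-feature comparisons with Hamming distances; a generic sketch of that mechanism follows (not the paper's full pipeline, which adds exit blocks and a ranking regularizer).

```python
import numpy as np

def to_hash(features):
    """Binarize continuous features into packed uint8 hash codes (sign thresholding)."""
    return np.packbits(features > 0, axis=1)

def hamming_rank(query_code, gallery_codes):
    """Rank gallery entries by Hamming distance, computed with XOR + popcount.
    This is the generic mechanism behind hash-based retrieval speed-ups."""
    xor = np.bitwise_xor(gallery_codes, query_code)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)

gallery = to_hash(np.random.randn(1000, 512))   # 512-bit codes for 1000 gallery images
query = to_hash(np.random.randn(1, 512))
print(hamming_rank(query, gallery)[:5])         # indices of the 5 closest matches
```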
Age Prediction From Face Images Via Contrastive Learning
paper_authors: Yeongnam Chae, Poulami Raha, Mijung Kim, Bjorn Stenger
for: accurately estimating age from face images
methods: contrastive learning to extract age-related features, combining cosine similarity and triplet margin losses to suppress identity-related features
results: achieved state-of-the-art performance on two public datasets, FG-NET and MORPH-IIAbstract
This paper presents a novel approach for accurately estimating age from face images, which overcomes the challenge of collecting a large dataset of individuals with the same identity at different ages. Instead, we leverage readily available face datasets of different people at different ages and aim to extract age-related features using contrastive learning. Our method emphasizes these relevant features while suppressing identity-related features using a combination of cosine similarity and triplet margin losses. We demonstrate the effectiveness of our proposed approach by achieving state-of-the-art performance on two public datasets, FG-NET and MORPH-II.
摘要
这篇论文提出了一种新的方法,用于准确地从面像中估算年龄,克服了难以收集同一个人在不同年龄的大规模数据的挑战。我们转而利用易于获得的不同人在不同年龄的面像数据,并使用对比学习提取年龄相关特征。我们的方法强调这些相关特征,同时结合余弦相似度和三元组margin损失来抑制身份相关特征。我们在两个公共数据集FG-NET和MORPH-II上进行了实验,并达到了state-of-the-art的表现。
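A minimal sketch of the combined objective described above; the pair-mining strategy, margin, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def age_contrastive_loss(anchor, positive, negative, margin=0.2, w_cos=1.0, w_tri=1.0):
    """Sketch of the combined objective: a cosine-similarity term pulls together
    embeddings of same-age faces (from different people), and a triplet margin term
    pushes away a different-age negative. Weights, margin, and the pair-mining
    strategy are illustrative assumptions."""
    cos_term = 1.0 - F.cosine_similarity(anchor, positive).mean()
    tri_term = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return w_cos * cos_term + w_tri * tri_term

# Usage sketch with random 256-d embeddings from an (assumed) face encoder.
a, p, n = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
print(age_contrastive_loss(a, p, n))
```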
Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack
results: 我们的研究结果显示,现有的攻击方法无法实现系统级别的攻击效果(如违反交通规则)在实际AD上。我们还发现了两个设计限制:1)物理模型与像素抽象不匹配,2)缺乏车辆植物模型和AD系统模型考虑。我们提出了SysAdv,一种基于系统的攻击设计,并证明了它可以显著提高攻击效果,即违反率提高约70%。Abstract
In autonomous driving (AD), accurate perception is indispensable to achieving safe and secure driving. Due to its safety-criticality, the security of AD perception has been widely studied. Among different attacks on AD perception, the physical adversarial object evasion attacks are especially severe. However, we find that all existing literature only evaluates the attack effect at the targeted AI component level but not at the system level, i.e., with the entire system semantics and context such as the full AD pipeline. This raises a critical research question: can these existing designs effectively achieve system-level attack effects (e.g., traffic rule violations) in the real-world AD context? In this work, we conduct the first measurement study on whether and how effectively the existing designs can lead to system-level effects, especially for the STOP sign-evasion attacks due to their popularity and severity. Our evaluation results show that all the representative prior works cannot achieve any system-level effects. We observe two design limitations in the prior works: 1) physical model-inconsistent object size distribution in pixel sampling and 2) lack of vehicle plant model and AD system model consideration. Then, we propose SysAdv, a novel system-driven attack design in the AD context, and our evaluation results show that the system-level effects can be significantly improved, i.e., the violation rate increases by around 70%.
摘要
自动驾驶(AD)中精准感知是安全驾驶的关键。由于其安全性的重要性,AD感知的安全性已经得到了广泛的研究。 amongst different AD感知攻击,物理对抗对象逃脱攻击最为严重。然而,我们发现所有的文献都只评估了这些攻击的目标AI组件级别的影响,而不是整个系统的 semantics和context,例如整个AD管道。这引出了一个关键的研究问题:现有的研究是否可以在实际的AD上实现系统级别的效果?在这种工作中,我们进行了首次的测量研究,以确定现有的设计是否可以在AD上实现系统级别的效果,特别是STOP标志逃脱攻击的情况。我们的评估结果表明,所有代表性的先前工作都无法实现任何系统级别的效果。我们发现了两个设计 limitation:1)物理模型不一致的对象大小分布在像素抽样中,2)缺乏车辆植物模型和AD系统模型考虑。然后,我们提出了SysAdv,一种基于系统的攻击设计在AD上。我们的评估结果表明,可以显著提高系统级别的效果,即违反率提高约70%。
A Unified Framework for 3D Point Cloud Visual Grounding
results: 实验结果显示,在ScanRefer数据集上,3DRefTR的mIoU比state-of-the-art的3DRES方法高12.43%,并比state-of-the-art的3DREC方法在Acc@0.25IoU上提高0.6%。Abstract
3D point cloud visual grounding plays a critical role in 3D scene comprehension, encompassing 3D referring expression comprehension (3DREC) and segmentation (3DRES). We argue that 3DREC and 3DRES should be unified in one framework, which is also a natural progression in the community. To explain, 3DREC can help 3DRES locate the referent, while 3DRES can also facilitate 3DREC via more finegrained language-visual alignment. To achieve this, this paper takes the initiative step to integrate 3DREC and 3DRES into a unified framework, termed 3D Referring Transformer (3DRefTR). Its key idea is to build upon a mature 3DREC model and leverage ready query embeddings and visual tokens from the 3DREC model to construct a dedicated mask branch. Specially, we propose Superpoint Mask Branch, which serves a dual purpose: i) By leveraging the heterogeneous CPU-GPU parallelism, while the GPU is occupied generating visual tokens, the CPU concurrently produces superpoints, equivalently accomplishing the upsampling computation; ii) By harnessing on the inherent association between the superpoints and point cloud, it eliminates the heavy computational overhead on the high-resolution visual features for upsampling. This elegant design enables 3DRefTR to achieve both well-performing 3DRES and 3DREC capacities with only a 6% additional latency compared to the original 3DREC model. Empirical evaluations affirm the superiority of 3DRefTR. Specifically, on the ScanRefer dataset, 3DRefTR surpasses the state-of-the-art 3DRES method by 12.43% in mIoU and improves upon the SOTA 3DREC method by 0.6% Acc@0.25IoU.
摘要
三维点云视觉定位在3D场景理解中起着关键作用,涵盖3D指代表达理解(3DREC)和分割(3DRES)。我们认为3DREC和3DRES应当统一到一个框架中,这也是该领域的自然发展方向:3DREC可以帮助3DRES定位被指代的目标,而3DRES也可以通过更细粒度的语言-视觉对齐来促进3DREC。为此,本文迈出第一步,将3DREC和3DRES整合到一个统一框架中,称为3D Referring Transformer(3DRefTR)。其核心思想是在一个成熟的3DREC模型基础上,利用其现成的查询嵌入和视觉token来构建专门的mask分支。特别地,我们提出了Superpoint Mask Branch,它具有双重作用:i)利用CPU-GPU异构并行,在GPU生成视觉token的同时,CPU并行生成superpoint,等效地完成上采样计算;ii)利用superpoint与点云之间的内在关联,消除了在高分辨率视觉特征上进行上采样的大量计算开销。这种设计使3DRefTR在仅比原始3DREC模型增加6%延迟的情况下,同时具备出色的3DRES和3DREC能力。实验验证了3DRefTR的优越性:在ScanRefer数据集上,3DRefTR的mIoU比state-of-the-art的3DRES方法高12.43%,并在Acc@0.25IoU上比SOTA的3DREC方法提升0.6%。
SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets
results: 这篇论文的实验结果显示,这个方法可以在七个问题中 achieved results comparable to, 和在一些情况下甚至超过了可以接触到源数据的方法。实验结果显示,这个方法可以提高 mIoU 的表现,最高提高幅度达12%。Abstract
Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion -- agreement filtering and entropy weighting -- based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT.
摘要
在许多应用(例如自主导航)中,使用多Modal数据进行场景理解是必需的。为了在各种情况下实现这一点,现有模型必须能够适应数据分布的变化,而不需要费力的数据标注。现有的方法假设源数据在适应过程中可以获得,并且假设源数据是paired multiModal数据。这两个假设可能会成为许多应用的问题。出于隐私、安全或经济方面的考虑,源数据可能不可用。假设存在paired multiModal数据用于训练也会带来高昂的数据收集成本,并且无法利用广泛可得、可自由分发的预训练uniModal模型。在这项工作中,我们放宽了这两个假设,通过一个切换框架,将一组在uniModal数据上独立训练的模型适应到由无标注多Modal数据构成的目标域,而无需访问原始源数据集。我们提出的方法通过在 Agreement filtering和Entropy weighting两种互补的跨Modal pseudo-label融合方法之间、根据估计的域差距自动切换来解决这个问题。我们在Semantic segmentation问题上进行了实验,并在七个困难的适应场景中证明了我们方法的效果,其结果与可以访问源数据的方法相当,甚至在一些场景下超越了这些方法。我们的方法相比竞争基线在mIoU上最多提高了12%。我们的代码可以在https://github.com/csimo005/SUMMIT上获取。
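A compact sketch of the switching idea between the two fusion modes follows; the domain-gap estimate and the switching threshold are assumptions for illustration.

```python
import torch

def fuse_pseudo_labels(probs_2d, probs_3d, domain_gap, gap_threshold=0.5):
    """Sketch of the switching idea (the actual gap estimate and threshold are
    assumptions): under a small gap, keep only points where the two uni-modal
    predictions agree; under a large gap, blend them weighted by inverse entropy."""
    if domain_gap < gap_threshold:
        # agreement filtering: pseudo-label only the points both models agree on
        labels_2d, labels_3d = probs_2d.argmax(1), probs_3d.argmax(1)
        mask = labels_2d == labels_3d
        return labels_2d, mask
    # entropy weighting: the lower-entropy (more confident) modality gets a larger weight
    ent = lambda p: -(p * p.clamp_min(1e-8).log()).sum(1, keepdim=True)
    w2d, w3d = 1.0 / (ent(probs_2d) + 1e-8), 1.0 / (ent(probs_3d) + 1e-8)
    fused = (w2d * probs_2d + w3d * probs_3d) / (w2d + w3d)
    return fused.argmax(1), torch.ones(probs_2d.size(0), dtype=torch.bool)

p2d = torch.softmax(torch.randn(100, 13), dim=1)   # per-point class probabilities (2D branch)
p3d = torch.softmax(torch.randn(100, 13), dim=1)   # per-point class probabilities (3D branch)
labels, mask = fuse_pseudo_labels(p2d, p3d, domain_gap=0.3)
```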
Motion-to-Matching: A Mixed Paradigm for 3D Single Object Tracking
results: 广泛的实验表明,MTM-Tracker 在大规模数据集(KITTI和NuScenes)上达到了竞争性表现(70.9% 和 51.70%)。Abstract
3D single object tracking with LiDAR points is an important task in the computer vision field. Previous methods usually adopt the matching-based or motion-centric paradigms to estimate the current target status. However, the former is sensitive to the similar distractors and the sparseness of point cloud due to relying on appearance matching, while the latter usually focuses on short-term motion clues (eg. two frames) and ignores the long-term motion pattern of target. To address these issues, we propose a mixed paradigm with two stages, named MTM-Tracker, which combines motion modeling with feature matching into a single network. Specifically, in the first stage, we exploit the continuous historical boxes as motion prior and propose an encoder-decoder structure to locate target coarsely. Then, in the second stage, we introduce a feature interaction module to extract motion-aware features from consecutive point clouds and match them to refine target movement as well as regress other target states. Extensive experiments validate that our paradigm achieves competitive performance on large-scale datasets (70.9% in KITTI and 51.70% in NuScenes). The code will be open soon at https://github.com/LeoZhiheng/MTM-Tracker.git.
摘要
“使用LiDAR点云的3D单目标追踪是计算机视觉领域中的重要任务。先前的方法通常采用基于匹配或以运动为中心的范式来估算当前目标状态。然而,前者由于依赖外观匹配,对相似干扰物和点云稀疏较为敏感;而后者通常只关注短期运动线索(如两帧),忽略了目标的长期运动模式。为了解决这些问题,我们提出了一种两阶段的混合范式方法,名为 MTM-Tracker,它将运动建模与特征匹配结合到单一的网络中。具体来说,在第一阶段,我们利用连续的历史框作为运动先验,并提出了Encoder-Decoder结构来粗略定位目标。然后,在第二阶段,我们引入了特征交互模块,从连续点云中提取运动感知特征并进行匹配,以精细化目标运动,同时回归其他目标状态。广泛的实验证明了我们的方法在大规模数据集上(KITTI 70.9%和NuScenes 51.70%)具有竞争性的表现。代码将在https://github.com/LeoZhiheng/MTM-Tracker.git 开源。”
Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch
results: 实验结果显示,WAD比五种现有的SSL方法和一个基准方法在CIFAR10和CIFAR100类别Dataset上表现更好,并且在人工跨dataset上也获得了良好的结果。Abstract
Semi-Supervised Learning (SSL) under class distribution mismatch aims to tackle a challenging problem wherein unlabeled data contain lots of unknown categories unseen in the labeled ones. In such mismatch scenarios, traditional SSL suffers severe performance damage due to the harmful invasion of the instances with unknown categories into the target classifier. In this study, by strict mathematical reasoning, we reveal that the SSL error under class distribution mismatch is composed of pseudo-labeling error and invasion error, both of which jointly bound the SSL population risk. To alleviate the SSL error, we propose a robust SSL framework called Weight-Aware Distillation (WAD) that, by weights, selectively transfers knowledge beneficial to the target task from unsupervised contrastive representation to the target classifier. Specifically, WAD captures adaptive weights and high-quality pseudo labels to target instances by exploring point mutual information (PMI) in representation space to maximize the role of unlabeled data and filter unknown categories. Theoretically, we prove that WAD has a tight upper bound of population risk under class distribution mismatch. Experimentally, extensive results demonstrate that WAD outperforms five state-of-the-art SSL approaches and one standard baseline on two benchmark datasets, CIFAR10 and CIFAR100, and an artificial cross-dataset. The code is available at https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master.
摘要
半监督学习(SSL)在类分布不匹配场景下面临一个具有挑战性的问题,即无标签数据中含有很多未在标签数据中出现过的未知类别。在这种场景下,传统的SSL表现严重受损,这是因为未知类别的实例侵入了目标分类器。在这项研究中,通过严格的数学推理,我们揭示了类分布不匹配情况下的SSL错误由pseudo-labeling错误和入侵错误组成,二者共同约束SSL的总体风险。为了减轻SSL错误,我们提出了一种鲁棒的SSL框架,称为Weight-Aware Distillation(WAD)。WAD通过权重,有选择地将无监督对比表示中有助于目标任务的知识传输到目标分类器。具体来说,WAD通过在表示空间中探索点互信息(PMI),为目标实例捕捉自适应权重和高质量pseudo标签,以最大化无标签数据的作用并过滤未知类别。理论上,我们证明WAD在类分布不匹配情况下具有紧致的总体风险上界。实验证明,WAD在CIFAR10和CIFAR100两个benchmark数据集以及一个人工跨数据集上,比五种state-of-the-art SSL方法和一个标准基线表现更好。代码可以在https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master中下载。
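Very roughly, the weighting idea can be sketched with class prototypes in a frozen contrastive representation space; the paper's actual PMI-based scores and unknown-category filtering are more involved, so treat the following only as an illustration.

```python
import torch
import torch.nn.functional as F

def weighted_pseudo_labels(unlabeled_emb, labeled_emb, labeled_y, n_classes, tau=0.1):
    """Rough sketch: build class prototypes in a (frozen) contrastive representation
    space, pseudo-label unlabeled points by their nearest prototype, and attach a
    weight from the soft assignment. Low weight ~ likely an unknown category."""
    protos = torch.stack([labeled_emb[labeled_y == c].mean(0) for c in range(n_classes)])
    sims = F.cosine_similarity(unlabeled_emb.unsqueeze(1), protos.unsqueeze(0), dim=2)
    soft = F.softmax(sims / tau, dim=1)
    weights, pseudo = soft.max(dim=1)
    return pseudo, weights

emb_u, emb_l = torch.randn(32, 128), torch.randn(64, 128)
y_l = torch.randint(0, 5, (64,))
pseudo, w = weighted_pseudo_labels(emb_u, emb_l, y_l, n_classes=5)
```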
CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation
results: 实验表明,该方法无需使用卷积或注意力机制,却可以达到出色的性能。此外,该方法的可解释性也使得可以在实验中进行可视化。这些结果证明了该方法的可行性,并促使未来对Context Clustering在更多的图像生成 task中进行进一步研究。Abstract
Image generation tasks are traditionally undertaken using Convolutional Neural Networks (CNN) or Transformer architectures for feature aggregating and dispatching. Despite the frequent application of convolution and attention structures, these structures are not fundamentally required to solve the problem of instability and the lack of interpretability in image generation. In this paper, we propose a unique image generation process premised on the perspective of converting images into a set of point clouds. In other words, we interpret an image as a set of points. As such, our methodology leverages simple clustering methods named Context Clustering (CoC) to generate images from unordered point sets, which defies the convention of using convolution or attention mechanisms. Hence, we exclusively depend on this clustering technique, combined with the multi-layer perceptron (MLP) in a generative model. Furthermore, we implement the integration of a module termed the 'Point Increaser' for the model. This module is just an MLP tasked with generating additional points for clustering, which are subsequently integrated within the paradigm of the Generative Adversarial Network (GAN). We introduce this model with the novel structure as the Context Clustering Generative Adversarial Network (CoC-GAN), which offers a distinctive viewpoint in the domain of feature aggregating and dispatching. Empirical evaluations affirm that our CoC-GAN, devoid of convolution and attention mechanisms, exhibits outstanding performance. Its interpretability, endowed by the CoC module, also allows for visualization in our experiments. The promising results underscore the feasibility of our method and thus warrant future investigations of applying Context Clustering to more novel and interpretable image generation.
摘要
Image 生成任务通常使用 Convolutional Neural Networks (CNN) 或 Transformer 架构来进行特征聚合和派发。尽管这些结构频繁应用,但它们并不是解决图像生成中的不稳定和不可解释性问题的基本要求。在这篇论文中,我们提出了一种独特的图像生成过程,基于将图像转换为一组点云的思想,即我们将图像视为一组点。因此,我们的方法利用名为 Context Clustering (CoC) 的简单聚类方法,从无序点集生成图像,而不需要使用 convolution 或 attention 机制。因此,我们在生成模型中几乎完全依赖这种聚类技术与多层感知机 (MLP)。此外,我们还实现了一个称为 'Point Increaser' 的模块,它本身就是一个MLP,用于生成更多供聚类使用的点,并将其集成到生成对抗网络(GAN)的范式中。我们将这种具有新颖结构的模型称为 Context Clustering Generative Adversarial Network (CoC-GAN),它在特征聚合和派发领域提供了一种新的视角。我们的实验结果表明,我们的 CoC-GAN 模型在不使用 convolution 或 attention 机制的情况下表现出色。此外,CoC 模块赋予的可解释性也允许我们在实验中进行可视化。这些有前景的结果证明了我们方法的可行性,因此值得未来进一步探索将 Context Clustering 应用于更多新颖且可解释的图像生成。
Compressed Models Decompress Race Biases: What Quantized Models Forget for Fair Face Recognition
results: 研究发现,使用synthetic数据可以减少大多数测试场景中的偏见,并且对不同的种族背景进行了分析。Abstract
With the ever-growing complexity of deep learning models for face recognition, it becomes hard to deploy these systems in real life. Researchers have two options: 1) use smaller models; 2) compress their current models. Since the usage of smaller models might lead to concerning biases, compression gains relevance. However, compression might also be responsible for an increase in the bias of the final model. We investigate the overall performance, the performance on each ethnicity subgroup, and the racial bias of a State-of-the-Art quantization approach when used with synthetic and real data. This analysis provides a few more details on potential benefits of performing quantization with synthetic data, for instance, the reduction of biases on the majority of test scenarios. We tested five distinct architectures and three different training datasets. The models were evaluated on a fourth dataset which was collected to infer and compare the performance of face recognition models across different ethnicities.
摘要
随着深度学习人脸识别模型的复杂度不断增加,实际部署变得越来越困难。研究人员有两个选择:1)使用更小的模型;2)压缩当前模型。由于使用更小的模型可能导致令人担忧的偏见,压缩因此更具意义。然而,压缩也可能导致最终模型的偏见增加。我们研究了一种最先进的量化方法在使用 synthetic 数据和真实数据时的总体性能、每个种族 subgroup 的性能以及种族偏见。这一分析提供了使用 synthetic 数据进行量化的潜在好处的更多细节,例如在大多数测试场景中减少偏见。我们测试了五种不同的架构和三个不同的训练数据集。模型在第四个数据集上进行评估,该数据集专门用于推断和比较人脸识别模型在不同种族上的性能。
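Since the paper studies how a quantized face-recognition model behaves across ethnicity subgroups, a generic sketch of weight quantization may help make the compression step concrete. This is plain symmetric per-tensor int8 quantization, an illustrative stand-in rather than the specific state-of-the-art approach evaluated in the paper:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 weights -> int8 values plus a scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 128).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean absolute quantization error: {err:.5f}")
```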
PatchBackdoor: Backdoor Attack against Deep Neural Networks without Model Modification
results: 实验结果显示,该攻击方法可以在常见的深度学习模型(VGG、MobileNet、ResNet)上实现攻击成功率为93%~99%。此外,作者还在实际应用中实现了该攻击方法,并证明了其仍然具有威胁性。Abstract
Backdoor attack is a major threat to deep learning systems in safety-critical scenarios, which aims to trigger misbehavior of neural network models under attacker-controlled conditions. However, most backdoor attacks have to modify the neural network models through training with poisoned data and/or direct model editing, which leads to a common but false belief that backdoor attacks can be easily avoided by properly protecting the model. In this paper, we show that backdoor attacks can be achieved without any model modification. Instead of injecting backdoor logic into the training data or the model, we propose to place a carefully-designed patch (namely backdoor patch) in front of the camera, which is fed into the model together with the input images. The patch can be trained to behave normally most of the time, while producing wrong predictions when the input image contains an attacker-controlled trigger object. Our main techniques include an effective training method to generate the backdoor patch and a digital-physical transformation modeling method to enhance the feasibility of the patch in real deployments. Extensive experiments show that PatchBackdoor can be applied to common deep learning models (VGG, MobileNet, ResNet) with an attack success rate of 93% to 99% on classification tasks. Moreover, we implement PatchBackdoor in real-world scenarios and show that the attack is still threatening.
摘要
深度学习系统中的后门攻击是安全关键场景下的一大威胁,其目的是在攻击者控制的条件下触发神经网络模型的异常行为。然而,大多数后门攻击需要通过投毒数据训练和/或直接编辑模型来修改神经网络模型,这导致了一种常见但错误的观念,即只要妥善保护模型就能轻易避免后门攻击。在这篇论文中,我们展示了无需任何模型修改即可实现后门攻击。我们不是将后门逻辑注入训练数据或模型,而是提议在摄像头前放置一个精心设计的贴片(称为后门贴片),该贴片与输入图像一起被送入模型。贴片可以被训练为在大多数情况下保持正常行为,而当输入图像中出现攻击者控制的触发对象时产生错误预测。我们的主要技术包括生成后门贴片的有效训练方法,以及提高贴片在实际部署中可行性的数字-物理变换建模方法。大量实验显示,PatchBackdoor 可应用于常见的深度学习模型(VGG、MobileNet、ResNet),在分类任务上的攻击成功率为93%到99%。此外,我们在实际场景中实现了 PatchBackdoor,并证明该攻击仍然具有威胁性。
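To make the patch-training idea concrete, here is a hedged sketch of the two-objective optimization the abstract describes (behave normally on clean frames, misclassify when the trigger object appears). The frozen model, trigger insertion, patch location, and loss weighting are all illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def apply_patch(images, patch, mask):
    """Overlay the patch on the region selected by the binary mask."""
    return images * (1 - mask) + patch * mask

def patch_loss(model, clean, triggered, labels, target, patch, mask):
    # clean frames should keep their true labels; triggered frames should flip to `target`
    out_clean = model(apply_patch(clean, patch, mask))
    out_trig = model(apply_patch(triggered, patch, mask))
    return F.cross_entropy(out_clean, labels) + F.cross_entropy(
        out_trig, torch.full_like(labels, target))

# the patch occupies the top rows of a 3x224x224 camera frame (illustrative shapes)
patch = torch.rand(1, 3, 224, 224, requires_grad=True)
mask = torch.zeros(1, 1, 224, 224)
mask[:, :, :32, :] = 1.0
optimizer = torch.optim.Adam([patch], lr=1e-2)

# training loop sketch (frozen_model and the data loader are placeholders):
# for clean, triggered, labels in loader:
#     loss = patch_loss(frozen_model, clean, triggered, labels, target=0,
#                       patch=patch, mask=mask)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```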
results: 与最先进方法相比,CLIPMH 可以显著提高多媒体检索性能(最大提升 8.38%),CLIP 也比以往常用的文本和视觉骨干网络更有优势。Abstract
The multi-modal hashing method is widely used in multimedia retrieval. It can fuse multi-source data to generate binary hash codes. However, current multi-modal methods suffer from low retrieval accuracy. The reason is that the individual backbone networks have limited feature expression capabilities and are not jointly pre-trained on large-scale unsupervised multi-modal data. To solve this problem, we propose a new baseline CLIP Multi-modal Hashing (CLIPMH) method. It uses the CLIP model to extract text and image features, and then fuses them to generate the hash code. CLIP improves the expressiveness of each modality's features. In this way, it can greatly improve the retrieval performance of multi-modal hashing methods. In comparison to state-of-the-art unsupervised and supervised multi-modal hashing methods, experiments reveal that the proposed CLIPMH can significantly enhance performance (a maximum increase of 8.38%). CLIP also has great advantages over the text and visual backbone networks commonly used previously.
摘要
多模态哈希方法广泛应用于多媒体检索,它可以融合多源数据生成二进制哈希码。然而,现有的多模态方法存在检索精度低的问题,原因在于各个骨干网络的特征表达能力有限,且未在大规模无监督多模态数据上进行联合预训练。为解决这个问题,我们提出了一个新的基线方法:CLIP 多模态哈希(CLIPMH)。它使用 CLIP 模型提取文本和图像特征,然后融合生成哈希码。CLIP 提高了每个模态特征的表达能力,从而可以大幅提高多模态哈希方法的检索性能。与最先进的无监督和有监督多模态哈希方法相比,实验表明所提出的 CLIPMH 可以显著提升性能(最大提升 8.38%)。相比以往常用的文本和视觉骨干网络,CLIP 也具有很大优势。
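The fusion step described above can be sketched as a small head that maps concatenated CLIP image/text embeddings to a relaxed code during training and a signed binary hash at retrieval time. The layer sizes and the concatenation-based fusion are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FusionHashHead(nn.Module):
    def __init__(self, dim=512, bits=64):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, bits))

    def forward(self, img_feat, txt_feat):
        # tanh gives a differentiable relaxation; sign() gives the binary code
        relaxed = torch.tanh(self.fuse(torch.cat([img_feat, txt_feat], dim=-1)))
        return relaxed, torch.sign(relaxed)

head = FusionHashHead()
img_feat, txt_feat = torch.randn(8, 512), torch.randn(8, 512)  # stand-ins for CLIP outputs
relaxed, code = head(img_feat, txt_feat)
print(code.shape)   # torch.Size([8, 64])
```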
Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations
paper_authors: Mohammadreza Salehi, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
for: 本研究的目的是提出一种 incorporating temporal consistency in dense self-supervised learning 的方法,以提高视频和图像表示质量。
methods: 该方法从图像预训练模型开始,并使用一种新的自监督 temporal-alignment clustering loss 在无标注视频上进行微调。这有助于将视频中的高层信息传递到图像表示中。
results: 该方法将视频上无监督 semantic segmentation 的最新水平提高了8-10%,并在图像上与之持平。由于视频数据非常丰富,这种方法有望推动进一步的自监督扩展。代码可以在这里找到:https://github.com/SMSD75/Timetuning。Abstract
Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos but also for images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found here: https://github.com/SMSD75/Timetuning
摘要
空间密集的自监督学习是一个迅速发展的问题领域,在无监督分割和密集下游任务预训练方面具有广阔的应用前景。尽管视频形式的时间数据非常丰富,这一信息丰富的来源却在很大程度上被忽略。我们的论文旨在弥补这一空白,提出了一种在密集自监督学习中引入时间一致性的新方法。仅为图像设计的方法在视频上甚至难以达到同等性能,而我们的方法不仅提高了视频的表示质量,也提高了图像的表示质量。我们的方法称为时间调整(time-tuning),它从图像预训练模型出发,使用一种新的自监督时间对齐聚类损失在无标注视频上进行微调,从而有效地将视频中的高层信息传递到图像表示中。时间调整将视频上无监督 semantic segmentation 的最新水平提高了8-10%,并在图像上与之持平。我们认为这种方法将通过充分利用海量视频数据,为进一步的自监督扩展开辟道路。实现可以在以下链接中找到:https://github.com/SMSD75/Timetuning
Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts
results: 实验表明,使用 GNT-MOVE 模型可以在不同场景下生成高质量的视图合成结果,并且在 zero-shot 和 few-shot 设置中表现出色,表明模型具有出色的跨场景泛化能力。Abstract
Cross-scene generalizable NeRF models, which can directly synthesize novel views of unseen scenes, have become a new spotlight of the NeRF field. Several existing attempts rely on increasingly end-to-end "neuralized" architectures, i.e., replacing scene representation and/or rendering modules with performant neural networks such as transformers, and turning novel view synthesis into a feed-forward inference pipeline. While those feedforward "neuralized" architectures still do not fit diverse scenes well out of the box, we propose to bridge them with the powerful Mixture-of-Experts (MoE) idea from large language models (LLMs), which has demonstrated superior generalization ability by balancing between larger overall model capacity and flexible per-instance specialization. Starting from a recent generalizable NeRF architecture called GNT, we first demonstrate that MoE can be neatly plugged in to enhance the model. We further customize a shared permanent expert and a geometry-aware consistency loss to enforce cross-scene consistency and spatial smoothness respectively, which are essential for generalizable view synthesis. Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes, indicating remarkably better cross-scene generalization in both zero-shot and few-shot settings. Our codes are available at https://github.com/VITA-Group/GNT-MOVE.
摘要
跨场景可泛化的 NeRF 模型可以直接合成未见场景的新视图,已成为 NeRF 领域的新焦点。现有的一些尝试依赖于越来越端到端的"神经化"架构,即用 transformer 等高性能神经网络替换场景表示和/或渲染模块,并将新视图合成变成前馈推理管道。然而,这些前馈"神经化"架构仍然无法很好地适应多样化的场景,因此我们提议用大型语言模型(LLM)中强大的混合专家(MoE)思想来弥补它们:MoE 通过在更大的整体模型容量与灵活的逐实例特化之间取得平衡,展现出了更好的泛化能力。我们从最近的可泛化 NeRF 架构 GNT 出发,首先证明 MoE 可以方便地插入以增强模型。我们进一步定制了一个共享的永久专家和一个几何感知的一致性损失,分别用于保证跨场景一致性和空间平滑性,这两者对可泛化视图合成至关重要。我们提出的模型命名为 GNT with Mixture-of-View-Experts(GNT-MOVE),在迁移到未见场景时取得了最先进的实验结果,表明其在 zero-shot 和 few-shot 设置下都具有明显更好的跨场景泛化能力。我们的代码可以在 https://github.com/VITA-Group/GNT-MOVE 找到。
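As a rough sketch of how a Mixture-of-Experts block with an always-active shared ("permanent") expert might look when plugged into a transformer-style model, consider the following. The dense soft gating and layer sizes are simplifying assumptions; GNT-MOVE's actual routing and expert design follow the paper:

```python
import torch
import torch.nn as nn

class MoEWithPermanentExpert(nn.Module):
    def __init__(self, dim=128, hidden=256, num_experts=4):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))
        self.permanent = make_ffn()                        # shared across all scenes
        self.experts = nn.ModuleList([make_ffn() for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                  # x: (tokens, dim)
        weights = torch.softmax(self.gate(x), dim=-1)      # (tokens, num_experts)
        routed = torch.stack([e(x) for e in self.experts], dim=-1)  # (tokens, dim, E)
        mixed = (routed * weights.unsqueeze(1)).sum(-1)
        return x + self.permanent(x) + mixed               # residual + permanent + routed

layer = MoEWithPermanentExpert()
print(layer(torch.randn(10, 128)).shape)   # torch.Size([10, 128])
```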
An extensible point-based method for data chart value detection
results: 研究表明,在复杂的柱状图上,该模型可以准确地检测出关键点,性能为0.8705 F1(最大偏差1.5个单元格),并在合成生成的图表上达到0.9810 F1。此外,通过专门在带有新型数据增强的 synthetic 数据上训练,该模型在实际图表上也能达到良好的性能(0.6621 F1)。Abstract
We present an extensible method for identifying semantic points to reverse engineer (i.e. extract the values of) data charts, particularly those in scientific articles. Our method uses a point proposal network (akin to region proposal networks for object detection) to directly predict the position of points of interest in a chart, and it is readily extensible to multiple chart types and chart elements. We focus on complex bar charts in the scientific literature, on which our model is able to detect salient points with an accuracy of 0.8705 F1 (@1.5-cell max deviation); it achieves 0.9810 F1 on synthetically-generated charts similar to those used in prior works. We also explore training exclusively on synthetic data with novel augmentations, reaching surprisingly competent performance in this way (0.6621 F1) on real charts with widely varying appearance, and we further demonstrate our unchanged method applied directly to synthetic pie charts (0.8343 F1). Datasets, trained models, and evaluation code are available at https://github.com/BNLNLP/PPN_model.
摘要
我们提出了一种可扩展的方法,用于识别数据图表中的语义点,以便逆向还原(即提取)图表中的数值,特别是科学文献中的图表。我们的方法使用一种点提议网络(类似于目标检测中的区域提议网络)直接预测图表中感兴趣点的位置,并且可以方便地扩展到多种图表类型和图表元素。我们主要关注科学文献中的复杂柱状图,我们的模型能够以0.8705 F1(最大偏差1.5个单元格)的准确率检测出这些图表中的显著点;在与先前工作类似的合成生成图表上,它可以达到0.9810 F1。此外,我们还探索了仅使用带有新型数据增强的 synthetic 数据进行训练,在外观差异很大的实际图表上达到了出乎意料的良好性能(0.6621 F1)。我们还展示了同一方法无需改动即可直接应用于合成饼图(0.8343 F1)。数据集、训练模型和评估代码可以在 https://github.com/BNLNLP/PPN_model 上获取。
Coarse-to-Fine Multi-Scene Pose Regression with Transformers
methods: 该方法使用 encoder 通过 self-attention 聚合激活图,并将多个场景编码并行嵌入;decoder 则将 latent features 和场景编码转换成姿态预测。
results: 该方法在常用的 indoor 和 outdoor 数据集上进行评估,并与多场景和单场景绝对相机姿态回归器进行比较,结果表明其在定位精度方面具有优势。Abstract
Absolute camera pose regressors estimate the position and orientation of a camera given the captured image alone. Typically, a convolutional backbone with a multi-layer perceptron (MLP) head is trained using images and pose labels to embed a single reference scene at a time. Recently, this scheme was extended to learn multiple scenes by replacing the MLP head with a set of fully connected layers. In this work, we propose to learn multi-scene absolute camera pose regression with Transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features and scenes encoding into pose predictions. This allows our model to focus on general features that are informative for localization, while embedding multiple scenes in parallel. We extend our previous MS-Transformer approach \cite{shavit2021learning} by introducing a mixed classification-regression architecture that improves the localization accuracy. Our method is evaluated on commonly benchmark indoor and outdoor datasets and has been shown to exceed both multi-scene and state-of-the-art single-scene absolute pose regressors.
摘要
绝对相机姿态回归器可以仅根据捕捉到的图像估计相机的位置和朝向。通常的做法是使用卷积骨干网络加多层感知机(MLP)头,利用图像和姿态标签进行训练,一次嵌入单个参考场景。最近,这一方案被扩展为用一组全连接层替换 MLP 头,从而学习多个场景。在这项工作中,我们提议使用 Transformer 来学习多场景绝对相机姿态回归:编码器通过自注意力聚合激活图,解码器将潜在特征和场景编码转换为姿态预测。这使得我们的模型能够关注对定位有用的通用特征,同时并行嵌入多个场景。我们在先前的 MS-Transformer 方法 \cite{shavit2021learning} 基础上进行扩展,引入了混合分类-回归架构,以提高定位精度。我们的方法在常用的 indoor 和 outdoor 数据集上进行了评估,结果表明其超过了多场景以及最先进的单场景绝对姿态回归器。
Understanding Hessian Alignment for Domain Generalization
results: 本文的分析表明,在领域泛化中,梯度和 Hessian 的匹配可以提高 OOD 泛化的性能。此外,本文还提出了两种简单 yet effective 的方法来匹配梯度和 Hessian,不需要直接计算 Hessian。这些方法在不同的 OOD 场景中都达到了良好的性能。Abstract
Out-of-distribution (OOD) generalization is a critical ability for deep learning models in many real-world scenarios including healthcare and autonomous vehicles. Recently, different techniques have been proposed to improve OOD generalization. Among these methods, gradient-based regularizers have shown promising performance compared with other competitors. Despite this success, our understanding of the role of Hessian and gradient alignment in domain generalization is still limited. To address this shortcoming, we analyze the role of the classifier's head Hessian matrix and gradient in domain generalization using recent OOD theory of transferability. Theoretically, we show that spectral norm between the classifier's head Hessian matrices across domains is an upper bound of the transfer measure, a notion of distance between target and source domains. Furthermore, we analyze all the attributes that get aligned when we encourage similarity between Hessians and gradients. Our analysis explains the success of many regularizers like CORAL, IRM, V-REx, Fish, IGA, and Fishr as they regularize part of the classifier's head Hessian and/or gradient. Finally, we propose two simple yet effective methods to match the classifier's head Hessians and gradients in an efficient way, based on the Hessian Gradient Product (HGP) and Hutchinson's method (Hutchinson), and without directly calculating Hessians. We validate the OOD generalization ability of proposed methods in different scenarios, including transferability, severe correlation shift, label shift and diversity shift. Our results show that Hessian alignment methods achieve promising performance on various OOD benchmarks. The code is available at \url{https://github.com/huawei-noah/Federated-Learning/tree/main/HessianAlignment}.
摘要
分布外(OOD)泛化是深度学习模型在许多实际场景(包括医疗和自动驾驶)中的关键能力。近来,人们提出了多种技术来改进 OOD 泛化,其中基于梯度的正则化方法相比其他方法表现出色。然而,我们对 Hessian 和梯度对齐在领域泛化中所起作用的理解仍然有限。为了解决这一不足,我们利用最近的 OOD 可迁移性理论,分析分类器头部的 Hessian 矩阵和梯度在领域泛化中的作用。理论上,我们证明了不同域之间分类器头部 Hessian 矩阵之间的谱范数是迁移度量(衡量目标域与源域之间距离的一种概念)的上界。此外,我们分析了在鼓励 Hessian 和梯度相似时会被对齐的所有属性。我们的分析解释了 CORAL、IRM、V-REx、Fish、IGA 和 Fishr 等许多正则化器的成功原因,因为它们正则化了分类器头部 Hessian 和/或梯度的一部分。最后,我们提出了两种简单而有效的方法,基于 Hessian-梯度乘积(HGP)和 Hutchinson 方法,在不直接计算 Hessian 的情况下高效地匹配分类器头部的 Hessian 和梯度。我们在不同的场景(包括可迁移性、严重相关性偏移、标签偏移和多样性偏移)中验证了所提方法的 OOD 泛化能力。结果显示,Hessian 对齐方法在多种 OOD 基准上取得了可观的性能。代码可以在 \url{https://github.com/huawei-noah/Federated-Learning/tree/main/HessianAlignment} 中找到。
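The two Hessian-free tools named above, the Hessian-gradient product (HGP) and Hutchinson-style probing, both reduce to Hessian-vector products computed with double backpropagation. A minimal sketch on a toy classifier head (the alignment losses built on top of these quantities are not shown and follow the paper):

```python
import torch

def flat_grad(outputs, params, create_graph=False):
    grads = torch.autograd.grad(outputs, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hessian_vector_product(loss_fn, params, vec):
    """H @ vec via double backprop, never materializing the Hessian."""
    loss = loss_fn()
    g = flat_grad(loss, params, create_graph=True)
    return flat_grad(g @ vec, params).detach()

# toy classifier head and loss
w = torch.randn(5, requires_grad=True)
loss_fn = lambda: (torch.sigmoid(w).sum() - 1.0) ** 2

g = flat_grad(loss_fn(), [w]).detach()
hgp = hessian_vector_product(loss_fn, [w], g)            # Hessian-gradient product
v = torch.randint(0, 2, (5,)).float() * 2.0 - 1.0        # Rademacher probe vector
trace_est = (v * hessian_vector_product(loss_fn, [w], v)).sum()  # Hutchinson estimate of tr(H)
print(hgp, trace_est.item())
```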
SAMSNeRF: Segment Anything Model (SAM) Guides Dynamic Surgical Scene Reconstruction by Neural Radiance Field (NeRF)
methods: combining Segment Anything Model (SAM)和Neural Radiance Field (NeRF)技术,通过生成高精度的分割mask来导引NeRF进行高精度动态场景重建。
results: 实验结果表明,我们的方法可以成功重建高精度的动态外科场景,并准确地反映外科工具的空间信息。Abstract
The accurate reconstruction of surgical scenes from surgical videos is critical for various applications, including intraoperative navigation and image-guided robotic surgery automation. However, previous approaches, mainly relying on depth estimation, have limited effectiveness in reconstructing surgical scenes with moving surgical tools. To address this limitation and provide accurate 3D position prediction for surgical tools in all frames, we propose a novel approach called SAMSNeRF that combines Segment Anything Model (SAM) and Neural Radiance Field (NeRF) techniques. Our approach generates accurate segmentation masks of surgical tools using SAM, which guides the refinement of the dynamic surgical scene reconstruction by NeRF. Our experimental results on public endoscopy surgical videos demonstrate that our approach successfully reconstructs high-fidelity dynamic surgical scenes and accurately reflects the spatial information of surgical tools. Our proposed approach can significantly enhance surgical navigation and automation by providing surgeons with accurate 3D position information of surgical tools during surgery.The source code will be released soon.
摘要
从手术视频中准确重建手术场景对多种应用至关重要,包括术中导航和图像引导的机器人手术自动化。然而,以往的方法主要依赖深度估计,在重建包含运动手术器械的手术场景时效果有限。为了解决这一局限并为所有帧中的手术器械提供准确的3D位置预测,我们提出了一种新的方法 SAMSNeRF,它结合了 Segment Anything Model(SAM)和 Neural Radiance Field(NeRF)技术。我们的方法使用 SAM 生成高精度的手术器械分割掩码,以此引导 NeRF 对动态手术场景重建进行细化。我们在公开的内窥镜手术视频上的实验结果表明,我们的方法能够成功重建高保真的动态手术场景,并准确反映手术器械的空间信息。我们提出的方法可以在手术过程中为外科医生提供手术器械的准确3D位置信息,从而显著提升手术导航和自动化水平。源代码即将发布。
Weakly Supervised Face and Whole Body Recognition in Turbulent Environments
results: 在 LRFID 和 BGC1 两个数据集上,我们的方法都可以提高 rank-1 精度,尤其是在不同的大气湍流和拍摄距离下。Abstract
Face and person recognition have recently achieved remarkable success under challenging scenarios, such as off-pose and cross-spectrum matching. However, long-range recognition systems are often hindered by atmospheric turbulence, leading to spatially and temporally varying distortions in the image. Current solutions rely on generative models to reconstruct a turbulent-free image, but often preserve photo-realism instead of discriminative features that are essential for recognition. This can be attributed to the lack of large-scale datasets of turbulent and pristine paired images, necessary for optimal reconstruction. To address this issue, we propose a new weakly supervised framework that employs a parameter-efficient self-attention module to generate domain agnostic representations, aligning turbulent and pristine images into a common subspace. Additionally, we introduce a new tilt map estimator that predicts geometric distortions observed in turbulent images. This estimate is used to re-rank gallery matches, resulting in up to 13.86\% improvement in rank-1 accuracy. Our method does not require synthesizing turbulent-free images or ground-truth paired images, and requires significantly fewer annotated samples, enabling more practical and rapid utility of increasingly large datasets. We analyze our framework using two datasets -- Long-Range Face Identification Dataset (LRFID) and BRIAR Government Collection 1 (BGC1) -- achieving enhanced discriminability under varying turbulence and standoff distance.
摘要
人脸与行人识别最近在非正面姿态和跨谱匹配等具有挑战性的场景下取得了显著成功。然而,远距离识别系统经常受到大气湍流的影响,导致图像中出现随空间和时间变化的畸变。现有的解决方案依赖生成模型来重建无湍流图像,但往往保留的是照片级真实感,而不是识别所必需的判别性特征。这可以归因于缺乏大规模的湍流图像与清晰图像成对数据集,而这类数据集是实现最优重建所必需的。为解决这一问题,我们提出了一个新的弱监督框架,它使用参数高效的自注意力模块生成与域无关的表示,将湍流图像和清晰图像对齐到一个公共子空间中。此外,我们还引入了一个新的倾斜图(tilt map)估计器,用于预测湍流图像中观察到的几何畸变。该估计被用于对图库匹配结果进行重新排序,使 rank-1 准确率最高提升13.86%。我们的方法不需要合成无湍流图像或真实成对图像,且所需标注样本显著更少,从而能够更实用、更快速地利用日益庞大的数据集。我们在两个数据集——远距离人脸识别数据集(LRFID)和 BRIAR Government Collection 1(BGC1)——上分析了我们的框架,在不同的湍流强度和拍摄距离下均获得了更强的判别能力。
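One way to picture the tilt-map-based re-ranking is as a blend between the raw appearance similarity and a distortion-compensated similarity, weighted by the estimated turbulence strength. The blending rule below is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def rerank(appearance_sim, compensated_sim, tilt_magnitude, alpha=1.0):
    """appearance_sim, compensated_sim: (num_gallery,) scores; tilt_magnitude in [0, 1].
    The blend leans more on the distortion-compensated score when the estimated
    turbulence (tilt magnitude) is high."""
    w = alpha * tilt_magnitude / (1.0 + alpha * tilt_magnitude)
    fused = (1 - w) * appearance_sim + w * compensated_sim
    return np.argsort(-fused), fused

appearance = np.array([0.62, 0.75, 0.40, 0.71])
compensated = np.array([0.70, 0.60, 0.45, 0.80])
order, fused = rerank(appearance, compensated, tilt_magnitude=0.8)
print(order)   # gallery indices, best match first
```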
results: 我们的方法可以提高总准确率,同时提供高质量的缩放下的子架构,而无需重新训练和存储不同场景的模型。在三个多任务benchmark上(PASCALContext、NYUDv2和CIFAR100-MTL),我们的方法可以提高控制性by ~33.5%,而且计算成本较低。Abstract
We aim to train a multi-task model such that users can adjust the desired compute budget and relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models for various scenarios. To this end, we propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable. Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost by jointly adjusting the encoder capacity. This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures based on user's constraints. Our training strategy involves a novel 'Configuration-Invariant Knowledge Distillation' loss that enforces backbone representations to be invariant under different runtime width configurations to enhance accuracy. Further, we present a simple but effective search algorithm that translates user constraints to runtime width configurations of both the shared encoder and task decoders, for sampling the sub-architectures. The key rule for the search algorithm is to provide a larger computational budget to the higher preferred task decoder, while searching a shared encoder configuration that enhances the overall MTL performance. Various experiments on three multi-task benchmarks (PASCALContext, NYUDv2, and CIFAR100-MTL) with diverse backbone architectures demonstrate the advantage of our approach. For example, our method shows a higher controllability by ~33.5% in the NYUD-v2 dataset over prior methods, while incurring much less compute cost.
摘要
我们的目标是训练一个多任务模型,使用户在部署后无需重新训练即可调整期望的计算预算和各任务性能的相对重要性。这样可以在用户需求动态变化时优化性能,而无需为各种场景训练和保存模型所带来的沉重计算开销。为此,我们提出了一个由共享 encoder 和任务特定 decoder 组成的多任务模型,其中 encoder 和 decoder 的通道宽度都是可伸缩(slimmable)的。我们的核心思想是通过改变任务特定 decoder 的容量来控制任务的重要性,同时通过联合调整 encoder 的容量来控制总计算成本。这使得在给定预算下可以使用更强的 encoder,从而提高总体精度,增强对计算成本的控制,并根据用户约束给出高质量的精简子架构。我们的训练策略包含一种新的"配置不变知识蒸馏"损失,它强制骨干表示在不同的运行时宽度配置下保持不变,以提高精度。此外,我们还提出了一种简单而有效的搜索算法,将用户约束转换为共享 encoder 和任务 decoder 的运行时宽度配置,用于采样子架构。搜索算法的关键规则是为优先级更高的任务 decoder 分配更大的计算预算,同时搜索能够提升整体 MTL 性能的共享 encoder 配置。我们在三个多任务基准(PASCALContext、NYUDv2 和 CIFAR100-MTL)上使用多种骨干架构进行的实验证明了我们方法的优势。例如,在 NYUD-v2 数据集上,我们的方法比先前方法的可控性高出约33.5%,同时计算成本大幅降低。
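The constraint-to-configuration search can be sketched as a greedy rule: give wider decoders to higher-preference tasks, then spend the remaining budget on the largest affordable shared encoder. The cost model and width grid below are illustrative assumptions, not the paper's algorithm:

```python
def search_config(budget, task_preference, decoder_widths=(0.25, 0.5, 0.75, 1.0),
                  decoder_cost_per_width=10.0, encoder_widths=(0.25, 0.5, 0.75, 1.0),
                  encoder_cost_per_width=40.0):
    # allocate decoder widths proportionally to normalized task preference
    total_pref = sum(task_preference.values())
    config, spent = {}, 0.0
    for task, pref in sorted(task_preference.items(), key=lambda kv: -kv[1]):
        target = pref / total_pref
        width = max(w for w in decoder_widths if w <= max(target, decoder_widths[0]))
        config[task] = width
        spent += width * decoder_cost_per_width
    # remaining budget goes to the largest affordable shared encoder
    remaining = budget - spent
    feasible = [w for w in encoder_widths if w * encoder_cost_per_width <= remaining]
    config["encoder"] = max(feasible) if feasible else encoder_widths[0]
    return config

print(search_config(budget=60.0,
                    task_preference={"seg": 0.6, "depth": 0.3, "normal": 0.1}))
```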
Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape
paper_authors: Jiacong Xu, Yi Zhang, Jiawei Peng, Wufei Ma, Artur Jesslen, Pengliang Ji, Qixin Hu, Jiehua Zhang, Qihao Liu, Jiahao Wang, Wei Ji, Chen Wang, Xiaoding Yuan, Prakhar Kaushik, Guofeng Zhang, Jie Liu, Yushan Xie, Yawen Cui, Alan Yuille, Adam Kortylewski
for: This paper aims to provide a comprehensive dataset for mammal animal 3D pose and shape estimation, which can potentially benefit many downstream applications such as wildlife conservation.
methods: The paper proposes a dataset called Animal3D, which consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and the pose and shape parameters of the SMAL model.
results: The paper benchmarks representative shape and pose estimation models on the Animal3D dataset and demonstrates that synthetic pre-training is a viable strategy to boost the model performance. However, predicting the 3D shape and pose of animals across species remains a very challenging task.Abstract
Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dataset for mammal animal 3D pose and shape estimation. Animal3D consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and importantly the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure highest quality results. Based on the Animal3D dataset, we benchmark representative shape and pose estimation models at: (1) supervised learning from only the Animal3D data, (2) synthetic to real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models. Our experimental results demonstrate that predicting the 3D shape and pose of animals across species remains a very challenging task, despite significant advances in human pose estimation. Our results further demonstrate that synthetic pre-training is a viable strategy to boost the model performance. Overall, Animal3D opens new directions for facilitating future research in animal 3D pose and shape estimation, and is publicly available.
摘要
准确估计动物的3D姿态和形状是理解动物行为的重要一步,并有望惠及野生动物保护等许多下游应用。然而,这一领域的研究受制于缺乏包含高质量3D姿态和形状标注的全面且多样的数据集。本文提出了 Animal3D,这是首个面向哺乳动物3D姿态和形状估计的综合数据集。Animal3D 包含从40个哺乳动物物种收集的3379张图像、26个关键点的高质量标注,以及 SMAL 模型的姿态和形状参数。所有标注都经过多阶段的人工标注和检查,以确保最高质量。基于 Animal3D 数据集,我们对代表性的形状和姿态估计模型进行了基准测试:(1)仅使用 Animal3D 数据进行监督学习,(2)从合成生成的图像进行合成到真实的迁移,以及(3)对人体姿态和形状估计模型进行微调。实验结果表明,尽管人体姿态估计取得了显著进展,跨物种预测动物的3D形状和姿态仍然是一项非常具有挑战性的任务。我们的结果还表明,合成数据预训练是提升模型性能的可行策略。总体而言,Animal3D 为未来的动物3D姿态和形状估计研究开辟了新方向,并已公开可用。
(Un)fair Exposure in Deep Face Rankings at a Distance
results: 经过对两个数据集的重复和识别任务的广泛实验,论文显示了这个领域中的偏见问题仍然存在,需要采取特殊的政策和 corrected measures 来解决。Abstract
Law enforcement regularly faces the challenge of ranking suspects from their facial images. Deep face models aid this process but frequently introduce biases that disproportionately affect certain demographic segments. While bias investigation is common in domains like job candidate ranking, the field of forensic face rankings remains underexplored. In this paper, we propose a novel experimental framework, encompassing six state-of-the-art face encoders and two public data sets, designed to scrutinize the extent to which demographic groups suffer from biases in exposure in the context of forensic face rankings. Through comprehensive experiments that cover both re-identification and identification tasks, we show that exposure biases within this domain are far from being countered, demanding attention towards establishing ad-hoc policies and corrective measures. The source code is available at https://github.com/atzoriandrea/ijcb2023-unfair-face-rankings
摘要
执法部门经常面临根据面部图像对嫌疑人进行排序的挑战。深度人脸模型有助于这一过程,但经常引入偏见,使某些人口群体受到不成比例的影响。虽然在求职者排名等领域偏见调查十分常见,但法医人脸排名领域仍缺乏深入研究。在这篇论文中,我们提出了一个新的实验框架,涵盖六种最新的人脸编码器和两个公共数据集,用于检验在法医人脸排名中不同人口群体在曝光方面受偏见影响的程度。通过覆盖重识别和识别两类任务的全面实验,我们表明该领域中的曝光偏见远未得到抑制,需要着手制定专门的政策和纠正措施。源代码可以在 https://github.com/atzoriandrea/ijcb2023-unfair-face-rankings 获取。
GRIP: Generating Interaction Poses Using Latent Consistency and Spatial Cues
results: 我们的GRIP方法可以在不同的运动捕捉数据集上对手部运动进行升级,并且在不同的物体和运动方式下保持高度的一致性和普适性。量化实验和感知研究表明,GRIP方法在比基eline方法更高效,并且可以扩展到未看过的物体和运动。Abstract
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to enforce motion temporal consistency in the latent space (LTC), and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.
摘要
手是灵巧且高度通用的操作器官,是人类与物体及环境交互的核心。因此,对真实的手-物交互(包括各个手指的细微运动)进行建模,对计算机图形学、计算机视觉和混合现实等应用至关重要。以往关于捕捉和建模人与物体3D交互的工作主要关注身体和物体的运动,往往忽略手部姿态。与此不同,我们提出了 GRIP,一种基于学习的方法,它以身体和物体的3D运动作为输入,合成物体交互之前、期间和之后双手的真实运动。在合成手部运动之前,我们首先使用 ANet 网络对手臂运动进行去噪。然后,我们利用身体与物体之间的时空关系提取两类新的时间交互线索,并在一个两阶段推理管道中使用它们来生成手部运动。在第一阶段,我们提出了一种在潜在空间中强制运动时间一致性(LTC)的新方法,以生成一致的交互运动。在第二阶段,GRIP 生成经过细化的手部姿态,以避免手与物体的穿插。给定带噪声的身体和物体运动序列,GRIP 可以将其升级为包含手-物交互的序列。定量实验和感知研究表明,GRIP 优于基线方法,并能泛化到来自不同动作捕捉数据集的未见物体和运动。
Delving into Motion-Aware Matching for Monocular 3D Object Tracking
results: 提出了一种基于运动感知的monocular 3D MOT框架,在nuScenes和KITTI datasets上进行了广泛的实验,并达到了与状态当前方法的竞争性表现。 Code和模型在https://github.com/kuanchihhuang/MoMA-M3T上提供。Abstract
Recent advances of monocular 3D object detection facilitate the 3D multi-object tracking task based on low-cost camera sensors. In this paper, we find that the motion cue of objects along different time frames is critical in 3D multi-object tracking, which is less explored in existing monocular-based approaches. In this paper, we propose a motion-aware framework for monocular 3D MOT. To this end, we propose MoMA-M3T, a framework that mainly consists of three motion-aware components. First, we represent the possible movement of an object related to all object tracklets in the feature space as its motion features. Then, we further model the historical object tracklet along the time frame in a spatial-temporal perspective via a motion transformer. Finally, we propose a motion-aware matching module to associate historical object tracklets and current observations as final tracking results. We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T achieves competitive performance against state-of-the-art methods. Moreover, the proposed tracker is flexible and can be easily plugged into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T.
摘要
Our motion-aware framework, MoMA-M3T, mainly consists of three motion-aware components: 1. We represent the possible movement of an object in the feature space as its motion features. 2. We model the historical object tracklet in a spatial-temporal perspective using a motion transformer. 3. We propose a motion-aware matching module to associate historical object tracklets and current observations as final tracking results. We conduct extensive experiments on the nuScenes and KITTI datasets and show that our MoMA-M3T achieves competitive performance against state-of-the-art methods. Additionally, our proposed tracker is flexible and can be easily integrated into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T.
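The matching stage described above can be illustrated with a standard assignment step: build a cost matrix from motion-feature distances and solve the one-to-one association with the Hungarian algorithm. The Euclidean cost and gating threshold are assumptions; the motion transformer that produces the features is outside this sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(track_feats, det_feats, max_cost=1.5):
    """track_feats: (T, D); det_feats: (N, D). Returns matched (track, det) pairs."""
    cost = np.linalg.norm(track_feats[:, None, :] - det_feats[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # reject assignments whose cost exceeds the gating threshold
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]

tracks = np.random.rand(3, 16)
dets = np.random.rand(4, 16)
print(match(tracks, dets))
```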
GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning
methods: 使用提问学习方法,即使用CLIP和SSL的损失函数,并 introduce visual contrastive loss和prompt consistency loss
results: 在三个困难的领域泛化任务上,GOPro 比最先进的提示技术取得了显著提升,并且充分发挥了 CLIP 和 SSL 的优势。Abstract
Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically learn such prompts, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, considering the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms the state-of-the-art prompting techniques on three challenging domain generalization tasks across multiple benchmarks by a significant margin. Our code is available at https://github.com/mainaksingha01/GOPro.
摘要
大规模基础模型(如 CLIP)通过将图像嵌入语义丰富的空间,在视觉识别任务中表现出了惊人的成功。自监督学习(SSL)也展示了通过学习不变特征来提升视觉识别的潜力。然而,CLIP 与 SSL 的组合面临多任务框架带来的挑战——该框架需要混合 CLIP 的对比损失与 SSL 的损失,包括损失加权的困难以及图像不同视图在 CLIP 输出空间中的不一致。为解决这些挑战,我们提出了一种名为 GOPro 的提示学习模型,它是一个统一框架,通过在 CLIP 之上增加一对可学习的图像和文本投影器,使输入图像的各种增强视图在共享的图像-文本嵌入空间中保持相似,从而促进不变性和泛化能力。为了自动学习这样的提示,我们利用从预训练 CLIP 中提取的视觉内容和风格基元,并使其适配到目标任务。除了 CLIP 的跨域对比损失之外,我们还引入了视觉对比损失和一种新的提示一致性损失,以考虑图像的不同视图。GOPro 在三个损失目标上进行端到端训练,以合理的方式结合了 CLIP 和 SSL 的优势。实证评估表明,GOPro 在多个基准的三个具有挑战性的领域泛化任务上,以显著优势超越了最先进的提示技术。我们的代码可以在 https://github.com/mainaksingha01/GOPro 上下载。
G3Reg: Pyramid Graph-based Global Registration using Gaussian Ellipsoid Model
results: 研究表明G3Reg框架在三个公共可用数据集和一个自收集的多会话数据集上展现出了superior的Robustness和实时性,并且可以将个体GEM和PAGOR组件与其他算法框架结合以提高其效果。Abstract
This study introduces a novel framework, G3Reg, for fast and robust global registration of LiDAR point clouds. In contrast to conventional complex keypoints and descriptors, we extract fundamental geometric primitives including planes, clusters, and lines (PCL) from the raw point cloud to obtain low-level semantic segments. Each segment is formulated as a unified Gaussian Ellipsoid Model (GEM) by employing a probability ellipsoid to ensure the ground truth centers are encompassed with a certain degree of probability. Utilizing these GEMs, we then present a distrust-and-verify scheme based on a Pyramid Compatibility Graph for Global Registration (PAGOR). Specifically, we establish an upper bound, which can be traversed based on the confidence level for compatibility testing to construct the pyramid graph. Gradually, we solve multiple maximum cliques (MAC) for each level of the graph, generating numerous transformation candidates. In the verification phase, we adopt a precise and efficient metric for point cloud alignment quality, founded on geometric primitives, to identify the optimal candidate. The performance of the algorithm is extensively validated on three publicly available datasets and a self-collected multi-session dataset, without changing any parameter settings in the experimental evaluation. The results exhibit superior robustness and real-time performance of the G3Reg framework compared to state-of-the-art methods. Furthermore, we demonstrate the potential for integrating individual GEM and PAGOR components into other algorithmic frameworks to enhance their efficacy. To advance further research and promote community understanding, we have publicly shared the source code.
摘要
G3Reg first extracts fundamental geometric primitives -- planes, clusters, and lines (PCL) -- from the raw point cloud and formulates each segment as a unified Gaussian Ellipsoid Model (GEM). To perform the registration, the framework then uses a distrust-and-verify scheme based on a Pyramid Compatibility Graph for Global Registration (PAGOR). This involves establishing an upper bound for compatibility testing and gradually solving multiple maximum cliques (MAC) for each level of the graph. The verification phase then uses a precise and efficient metric for point cloud alignment quality, founded on geometric primitives, to identify the optimal candidate. The performance of the G3Reg framework is extensively validated on three publicly available datasets and a self-collected multi-session dataset, without changing any parameter settings in the experimental evaluation. The results show that the G3Reg framework exhibits superior robustness and real-time performance compared to state-of-the-art methods. Additionally, the framework has the potential to be integrated with other algorithmic frameworks to enhance their efficacy. To promote further research and community understanding, the source code has been publicly shared.
SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation
results: 实验结果表明,使用优化的token mixer可以提高模型性能,并且可以在多个计算机视觉任务上实现性能提升。Abstract
Recent studies show that self-attentions behave like low-pass filters (as opposed to convolutions) and enhancing their high-pass filtering capability improves model performance. Contrary to this idea, we investigate existing convolution-based models with spectral analysis and observe that improving the low-pass filtering in convolution operations also leads to performance improvement. To account for this observation, we hypothesize that utilizing optimal token mixers that capture balanced representations of both high- and low-frequency components can enhance the performance of models. We verify this by decomposing visual features into the frequency domain and combining them in a balanced manner. To handle this, we replace the balancing problem with a mask filtering problem in the frequency domain. Then, we introduce a novel token-mixer named SPAM and leverage it to derive a MetaFormer model termed as SPANet. Experimental results show that the proposed method provides a way to achieve this balance, and the balanced representations of both high- and low-frequency components can improve the performance of models on multiple computer vision tasks. Our code is available at \href{https://doranlyong.github.io/projects/spanet/}{https://doranlyong.github.io/projects/spanet/}.
摘要
近期研究表明,自注意力的行为类似于低通滤波器(而非卷积),增强其高通滤波能力可以提升模型性能。与这一观点相反,我们通过谱分析研究了现有的基于卷积的模型,发现改进卷积操作中的低通滤波同样能带来性能提升。为了解释这一观察,我们假设利用能够平衡捕捉高频与低频成分表示的最优 token 混合器可以提升模型性能。我们通过将视觉特征分解到频域并以平衡的方式重新组合来验证这一点。为此,我们将平衡问题转化为频域中的掩码滤波问题。接着,我们提出了一种新的 token 混合器 SPAM,并利用它构建了一个名为 SPANet 的 MetaFormer 模型。实验结果表明,所提出的方法提供了一条实现这种平衡的途径,而高频与低频成分的平衡表示能够在多个计算机视觉任务上提升模型性能。我们的代码可以在 \href{https://doranlyong.github.io/projects/spanet/}{https://doranlyong.github.io/projects/spanet/} 上找到。
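The frequency-domain mask filtering idea can be sketched as follows: move features to the Fourier domain, split them into low- and high-frequency bands with a radial mask, and reweight the two bands before transforming back. SPAM's actual modulation is more elaborate; the cutoff radius and band weights here are illustrative assumptions:

```python
import torch

def frequency_balance(x, cutoff=0.25, low_weight=1.0, high_weight=1.0):
    """x: (B, C, H, W) feature map; reweight low/high frequency bands."""
    B, C, H, W = x.shape
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing="ij")
    radius = torch.sqrt(yy ** 2 + xx ** 2)
    low_mask = (radius <= cutoff).float()
    weight_map = low_weight * low_mask + high_weight * (1.0 - low_mask)
    balanced = freq * weight_map.to(freq.dtype)
    return torch.fft.ifft2(torch.fft.ifftshift(balanced, dim=(-2, -1))).real

x = torch.randn(2, 8, 32, 32)
print(frequency_balance(x, cutoff=0.3, low_weight=1.2, high_weight=0.8).shape)
```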
EndoNet: model for automatic calculation of H-score on histological slides
paper_authors: Egor Ushakov, Anton Naumov, Vladislav Fomberg, Polina Vishnyakova, Aleksandra Asaturova, Alina Badlaeva, Anna Tregubova, Evgeny Karpulevich, Gennady Sukhikh, Timur Fatkhudinov
for: used to assess the presence and distribution of proteins in tissue samples
methods: combines intensity of staining and percentage of stained nuclei, with a computer-aided model (EndoNet) using neural networks to predict H-score values
results: 0.77 mAP on a test dataset, with the ability to adjust the model for specific specialists or laboratories to reproduce the manner of calculating H-scoresAbstract
H-score is a semi-quantitative method used to assess the presence and distribution of proteins in tissue samples by combining the intensity of staining and percentage of stained nuclei. It is widely used but time-consuming and can be limited in accuracy and precision. Computer-aided methods may help overcome these limitations and improve the efficiency of pathologists' workflows. In this work, we developed a model EndoNet for automatic calculation of H-score on histological slides. Our proposed method uses neural networks and consists of two main parts. The first is a detection model which predicts keypoints of centers of nuclei. The second is a H-score module which calculates the value of the H-score using mean pixel values of predicted keypoints. Our model was trained and validated on 1780 annotated tiles with a shape of 100x100 $\mu m$ and performed 0.77 mAP on a test dataset. Moreover, the model can be adjusted to a specific specialist or whole laboratory to reproduce the manner of calculating the H-score. Thus, EndoNet is effective and robust in the analysis of histology slides, which can improve and significantly accelerate the work of pathologists.
摘要
H-score 是一种半定量方法,通过结合染色强度和着色细胞核的百分比,来评估组织样本中蛋白的存在与分布。它被广泛使用,但耗时且在准确度和精密度上可能受限。计算机辅助方法有助于克服这些限制并提高病理医生的工作效率。在这项工作中,我们开发了用于在组织学切片上自动计算 H-score 的模型 EndoNet。我们提出的方法基于神经网络,由两个主要部分组成:第一部分是检测模型,用于预测细胞核中心的关键点;第二部分是 H-score 模块,利用预测关键点的平均像素值计算 H-score。我们的模型在1780个尺寸为100x100 $\mu m$ 的标注图块上进行了训练和验证,并在测试数据集上达到了0.77 mAP。此外,该模型可以针对特定专家或整个实验室进行调整,以复现其计算 H-score 的方式。因此,EndoNet 在组织学切片分析中既有效又稳健,能够显著改善并加速病理医生的工作。
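For reference, the H-score definition the model reproduces can be written as a short function: staining intensity is binned into weak/moderate/strong, and the score is the intensity-weighted sum of the percentage of nuclei in each bin, giving a value between 0 and 300. The intensity thresholds below are illustrative; EndoNet derives intensities from the predicted keypoints:

```python
def h_score(nucleus_intensities, weak_thr=0.33, strong_thr=0.66):
    """nucleus_intensities: list of per-nucleus staining intensities in [0, 1]."""
    n = len(nucleus_intensities)
    if n == 0:
        return 0.0
    weak = sum(1 for v in nucleus_intensities if 0 < v <= weak_thr)
    moderate = sum(1 for v in nucleus_intensities if weak_thr < v <= strong_thr)
    strong = sum(1 for v in nucleus_intensities if v > strong_thr)
    # 1x, 2x, 3x weighting of the percentage of nuclei in each intensity bin
    return 100.0 * (1 * weak + 2 * moderate + 3 * strong) / n

print(h_score([0.1, 0.2, 0.5, 0.7, 0.9]))   # 200.0
```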
Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
results: 本研究在 AVDN Challenge 2023 中获得了冠军,在 SPL 和 SR 指标上相比基线分别取得了2.2%和3.0%的绝对提升。代码可以在 https://github.com/yifeisu/avdn-challenge 中找到。Abstract
This report details the method of the winning entry of the AVDN Challenge in ICCV 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge 2023, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/avdn-challenge.
摘要
本报告详细介绍了 ICCV 2023 AVDN Challenge 获胜方案的方法。该比赛针对基于对话历史的空中导航(ANDH)任务,要求无人机智能体将对话历史与空中观测相关联,从而到达目的地。为了提升无人机智能体的跨模态对齐能力,我们提出了 Target-Grounded Graph-Aware Transformer(TG-GAT)框架。具体来说,TG-GAT 首先利用图感知 transformer 捕捉时空依赖关系,这有利于导航状态跟踪和鲁棒的动作规划。此外,我们设计了一个辅助的视觉定位任务,以增强智能体对所指地标的感知。我们还采用了基于大语言模型的混合数据增强策略,以缓解数据稀缺的限制。我们的 TG-GAT 框架赢得了 AVDN Challenge 2023,在 SPL 和 SR 指标上相比基线分别取得了2.2%和3.0%的绝对提升。代码可以在 https://github.com/yifeisu/avdn-challenge 上找到。
results: 这项研究的结果显示,基于度量学习的开集溯源方法可以很好地识别合成图像的来源生成器,并且能够应对不断涌现的新图像生成技术。Abstract
AI-generated images have become increasingly realistic and have garnered significant public attention. While synthetic images are intriguing due to their realism, they also pose an important misinformation threat. To address this new threat, researchers have developed multiple algorithms to detect synthetic images and identify their source generators. However, most existing source attribution techniques are designed to operate in a closed-set scenario, i.e. they can only be used to discriminate between known image generators. By contrast, new image-generation techniques are rapidly emerging. To contend with this, there is a great need for open-set source attribution techniques that can identify when synthetic images have originated from new, unseen generators. To address this problem, we propose a new metric learning-based approach. Our technique works by learning transferrable embeddings capable of discriminating between generators, even when they are not seen during training. An image is first assigned to a candidate generator, then is accepted or rejected based on its distance in the embedding space from known generators' learned reference points. Importantly, we identify that initializing our source attribution embedding network by pretraining it on image camera identification can improve our embeddings' transferability. Through a series of experiments, we demonstrate our approach's ability to attribute the source of synthetic images in open-set scenarios.
摘要
AI 生成的图像已变得越来越逼真,并引起了公众的广泛关注。合成图像虽然因其逼真而引人入胜,但也构成了重要的虚假信息威胁。为应对这一新威胁,研究人员开发了多种算法来检测合成图像并识别其来源生成器。然而,大多数现有的溯源技术都是为闭集场景设计的,即它们只能用于区分已知的图像生成器。与此相反,新的图像生成技术正在迅速涌现。因此,我们迫切需要开集溯源技术,能够识别合成图像何时来自新的、未见过的生成器。为解决这个问题,我们提出了一种新的基于度量学习的方法。我们的技术学习可迁移的嵌入,即使生成器在训练时未出现,也能够对其加以区分。一张图像首先被分配给候选生成器,然后根据它在嵌入空间中与已知生成器学习到的参考点之间的距离被接受或拒绝。重要的是,我们发现通过在图像相机来源识别任务上预训练来初始化我们的溯源嵌入网络,可以提高嵌入的可迁移性。通过一系列实验,我们证明了我们的方法能够在开集场景中对合成图像进行溯源。
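The accept-or-reject rule described above can be sketched as a nearest-reference-point decision with a distance threshold. The embedding dimensionality, reference points, generator names, and threshold below are placeholders; the metric-learned, camera-ID-pretrained embedding network is not shown:

```python
import numpy as np

def attribute_source(embedding, reference_points, names, threshold=1.0):
    """embedding: (D,); reference_points: (K, D) learned per-generator reference points."""
    dists = np.linalg.norm(reference_points - embedding, axis=1)
    best = int(dists.argmin())
    if dists[best] > threshold:
        return "unknown generator", dists[best]   # open-set rejection
    return names[best], dists[best]

refs = np.array([[0.0, 0.0], [3.0, 3.0]])
print(attribute_source(np.array([0.2, -0.1]), refs, ["GAN-A", "Diffusion-B"]))
print(attribute_source(np.array([10.0, 10.0]), refs, ["GAN-A", "Diffusion-B"]))
```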
results: 实验表明,这种简单的模型在视频到文本和文本到视频两个任务中表现出色,超过了其他模型,并成为了MeVTR任务的robust基础。Abstract
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
摘要
在互联网上充斥着海量视频-文本数据的时代,视频-文本检索(VTR)是一项关键的多模态任务。大量工作采用双流视觉-语言模型架构来学习视频-文本对的联合表示,已成为 VTR 任务的主流方法。然而,这些模型建立在视频与文本一一对应的假设之上,忽略了更实际的情形:视频内容通常包含多个事件,而用户查询或网页元数据等文本往往是具体的、只对应单个事件。这造成了以往训练目标与真实应用之间的差距,可能导致早期模型在推理时性能下降。在这项研究中,我们引入了多事件视频-文本检索(MeVTR)任务,作为传统 VTR 任务的一个细分场景,处理每个视频包含多个不同事件的情形。我们提出了一个简单的模型 Me-Retriever,它结合了关键事件视频表示和为 MeVTR 任务设计的新损失函数。全面的实验表明,这一简洁的框架在视频到文本和文本到视频两个任务中均优于其他模型,为 MeVTR 任务建立了一个稳健的基线。我们相信这项工作为未来的研究奠定了坚实基础。代码可以在 https://github.com/gengyuanmax/MeVTR 上下载。