methods: 这篇论文采用混合主导(mixed-initiative)方法,提取坐标轴和图例,并将图表的布局分解为四个语义组件:标记组、空间关系、数据编码和图形约束。
results: 在 150 个基于矩形的 SVG 图表上,该方法在坐标轴和图例提取上达到 85% 以上的准确率,在布局分解上达到 96% 的准确率。在一项图表复现研究中,参与者能够轻松地将现有图表复用到新的数据集上。
Abstract
To facilitate the reuse of existing charts, previous research has examined how to obtain a semantic understanding of a chart by deconstructing its visual representation into reusable components, such as encodings. However, existing deconstruction approaches primarily focus on chart styles, handling only basic layouts. In this paper, we investigate how to deconstruct chart layouts, focusing on rectangle-based ones, as they cover not only 17 chart types but also advanced layouts (e.g., small multiples, nested layouts). We develop an interactive tool, called Mystique, adopting a mixed-initiative approach to extract the axes and legend, and deconstruct a chart's layout into four semantic components: mark groups, spatial relationships, data encodings, and graphical constraints. Mystique employs a wizard interface that guides chart authors through a series of steps to specify how the deconstructed components map to their own data. On 150 rectangle-based SVG charts, Mystique achieves above 85% accuracy for axis and legend extraction and 96% accuracy for layout deconstruction. In a chart reproduction study, participants could easily reuse existing charts on new datasets. We discuss the current limitations of Mystique and future research directions.
摘要
为了便于复用现有图表,前期研究探讨了如何通过将图表的视觉表示分解为可复用组件(例如编码)来获得对图表的语义理解。然而,现有的分解方法主要关注图表样式,只能处理基本布局。在这篇论文中,我们研究如何分解图表布局,重点关注基于矩形的布局,因为它们不仅涵盖 17 种图表类型,还包括高级布局(例如小倍数、嵌套布局)。我们开发了一个名为 Mystique 的交互式工具,采用混合主导方法提取坐标轴和图例,并将图表布局分解为四个语义组件:标记组、空间关系、数据编码和图形约束。Mystique 提供向导式界面,引导图表作者通过一系列步骤指定分解出的组件如何映射到他们自己的数据。在 150 个基于矩形的 SVG 图表上,Mystique 在坐标轴和图例提取上达到 85% 以上的准确率,在布局分解上达到 96% 的准确率。在一项图表复现研究中,参与者能够轻松地将现有图表复用到新的数据集上。我们还讨论了 Mystique 目前的局限性和未来的研究方向。
Model Calibration in Dense Classification with Adaptive Label Perturbation
results: 实验结果显示,ASLP 能显著改善密集二分类模型的校准程度,同时保持已知数据上的分类准确率。在分布内和分布外数据上,ASLP 均能提升模型的校准表现。
Abstract
For safety-related applications, it is crucial to produce trustworthy deep neural networks whose prediction is associated with confidence that can represent the likelihood of correctness for subsequent decision-making. Existing dense binary classification models are prone to being over-confident. To improve model calibration, we propose Adaptive Stochastic Label Perturbation (ASLP) which learns a unique label perturbation level for each training image. ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, which unifies label perturbation processes including stochastic approaches (like DisturbLabel), and label smoothing, to correct calibration while maintaining classification rates. ASLP follows Maximum Entropy Inference of classic statistical mechanics to maximise prediction entropy with respect to missing information. It performs this while: (1) preserving classification accuracy on known data as a conservative solution, or (2) specifically improves model calibration degree by minimising the gap between the prediction accuracy and expected confidence of the target training label. Extensive results demonstrate that ASLP can significantly improve calibration degrees of dense binary classification models on both in-distribution and out-of-distribution data. The code is available on https://github.com/Carlisle-Liu/ASLP.
摘要
对于安全相关应用,深度神经网络的预测需要附带能够反映其正确概率的置信度,以支撑后续决策。现有的密集二分类模型容易过度自信。为了改善模型校准,我们提出自适应随机标签扰动(Adaptive Stochastic Label Perturbation, ASLP),它为每张训练图像学习一个独有的标签扰动水平。ASLP 使用我们提出的自校准二元交叉熵(SC-BCE)损失,该损失统一了包括随机方法(如 DisturbLabel)和标签平滑在内的标签扰动过程,在保持分类率的同时校正校准。ASLP 遵循经典统计力学中的最大熵推断,在缺失信息下最大化预测熵,同时:(1) 作为保守方案保持已知数据上的分类准确率,或 (2) 通过最小化预测准确率与目标训练标签期望置信度之间的差距,专门提升模型的校准程度。大量实验表明,ASLP 能在分布内和分布外数据上显著提升密集二分类模型的校准程度。代码可以在 https://github.com/Carlisle-Liu/ASLP 获取。
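For a concrete feel of the loss described above, here is a minimal PyTorch sketch of a self-calibrating BCE in the spirit of SC-BCE: a per-image perturbation level blends the binary target toward the maximum-entropy label (0.5) before the BCE is computed. The function name, the uniform-blend form, and the way alpha is parameterized are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sc_bce_loss(logits, targets, alpha):
    """Self-calibrating BCE sketch: blend each binary target toward the
    maximum-entropy label (0.5) by a per-image perturbation level alpha.

    logits:  (B, 1, H, W) raw predictions of a dense binary classifier
    targets: (B, 1, H, W) binary ground-truth maps in {0, 1}
    alpha:   (B,) per-image perturbation levels in [0, 1]
    """
    a = alpha.view(-1, 1, 1, 1)
    # alpha = 0 recovers plain BCE; alpha = 1 pushes targets to 0.5 (max entropy)
    soft_targets = (1.0 - a) * targets + a * 0.5
    return F.binary_cross_entropy_with_logits(logits, soft_targets)

# toy usage with a learnable per-image perturbation level
logits = torch.randn(2, 1, 8, 8, requires_grad=True)
targets = torch.randint(0, 2, (2, 1, 8, 8)).float()
alpha = torch.sigmoid(torch.zeros(2, requires_grad=True))
loss = sc_bce_loss(logits, targets, alpha)
loss.backward()
```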
Not with my name! Inferring artists’ names of input strings employed by Diffusion Models
results: 实验结果表明,我们的方法是一个良好的起点,可以作为先验用于预测待检图像的完整输入字符串。
Abstract
Diffusion Models (DM) are highly effective at generating realistic, high-quality images. However, these models lack creativity and merely compose outputs based on their training data, guided by a textual input provided at creation time. Is it acceptable to generate images reminiscent of an artist, employing his name as input? This imply that if the DM is able to replicate an artist's work then it was trained on some or all of his artworks thus violating copyright. In this paper, a preliminary study to infer the probability of use of an artist's name in the input string of a generated image is presented. To this aim we focused only on images generated by the famous DALL-E 2 and collected images (both original and generated) of five renowned artists. Finally, a dedicated Siamese Neural Network was employed to have a first kind of probability. Experimental results demonstrate that our approach is an optimal starting point and can be employed as a prior for predicting a complete input string of an investigated image. Dataset and code are available at: https://github.com/ictlab-unict/not-with-my-name .
摘要
Diffusion Models (DM) 能够非常高效地生成逼真的高质量图像。然而,这些模型缺乏创造力,只是在创建时给定的文本输入引导下,基于训练数据组合生成输出。以某位艺术家的名字作为输入、生成令人联想到其风格的图像,这是否可以接受?这意味着,如果 DM 能够复现某位艺术家的作品,那么它很可能在该艺术家的部分或全部作品上训练过,从而涉嫌侵犯版权。在这篇论文中,我们提出一项初步研究,用于推断某张生成图像的输入字符串中使用了某位艺术家名字的概率。为此,我们仅关注由著名的 DALL-E 2 生成的图像,并收集了五位知名艺术家的图像(包括原作和生成图像)。最后,我们使用专门的 Siamese Neural Network 得到一个初步的概率估计。实验结果表明,我们的方法是一个良好的起点,可以作为预测待检图像完整输入字符串的先验。数据集和代码可在 https://github.com/ictlab-unict/not-with-my-name 获取。
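As a rough illustration of the comparison step, the sketch below builds a Siamese-style encoder with shared weights and turns the cosine similarity between a generated image and an artist's reference works into a score in [0, 1]. The ResNet-18 backbone, the averaging of reference embeddings, and the rescaling are assumptions; the paper's actual network and training procedure are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class SiameseEncoder(nn.Module):
    """Shared encoder branch; both images pass through the same weights."""
    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # backbone choice is an assumption
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)  # unit-norm embeddings

def artist_score(encoder, generated, references):
    """Similarity between a generated image and an artist's reference works,
    mapped to [0, 1] as a rough 'probability' that the artist's name was used."""
    with torch.no_grad():
        g = encoder(generated)                        # (1, D)
        r = encoder(references).mean(0, keepdim=True) # prototype of the artist
        cos = F.cosine_similarity(g, r).item()
    return 0.5 * (cos + 1.0)                          # rescale [-1, 1] -> [0, 1]

encoder = SiameseEncoder().eval()
score = artist_score(encoder, torch.randn(1, 3, 224, 224), torch.randn(5, 3, 224, 224))
```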
HeightFormer: Explicit Height Modeling without Extra Data for Camera-only 3D Object Detection in Bird’s Eye View
paper_authors: Yiming Wu, Ruixiang Li, Zequn Qin, Xinhai Zhao, Xi Li
for: height-based bird's eye view (BEV) representation for autonomous driving
methods: explicitly modeling heights in the BEV space using a self-recursive approach
results: achieves state-of-the-art (SOTA) performance compared to camera-only methods without using extra data like LiDAR
Abstract
Vision-based Bird's Eye View (BEV) representation is an emerging perception formulation for autonomous driving. The core challenge is to construct BEV space with multi-camera features, which is a one-to-many ill-posed problem. Diving into all previous BEV representation generation methods, we found that most of them fall into two types: modeling depths in image views or modeling heights in the BEV space, mostly in an implicit way. In this work, we propose to explicitly model heights in the BEV space, which needs no extra data like LiDAR and can fit arbitrary camera rigs and types compared to modeling depths. Theoretically, we give proof of the equivalence between height-based methods and depth-based methods. Considering the equivalence and some advantages of modeling heights, we propose HeightFormer, which models heights and uncertainties in a self-recursive way. Without any extra data, the proposed HeightFormer could estimate heights in BEV accurately. Benchmark results show that the performance of HeightFormer achieves SOTA compared with those camera-only methods.
摘要
基于视觉的鸟瞰图(BEV)表示是自动驾驶中一种新兴的感知形式。其核心挑战是利用多相机特征构建 BEV 空间,这是一个一对多的不适定问题。梳理此前所有的 BEV 表示生成方法,我们发现它们大多可归为两类:在图像视图中建模深度,或在 BEV 空间中建模高度,且多为隐式方式。在本工作中,我们提出在 BEV 空间中显式建模高度,与建模深度相比,它无需 LiDAR 等额外数据,并且可以适配任意相机装置和类型。在理论上,我们证明了基于高度的方法与基于深度的方法的等价性。考虑到这种等价性以及建模高度的一些优势,我们提出 HeightFormer,以自递归的方式对高度及其不确定性进行建模。在不使用任何额外数据的情况下,HeightFormer 能够在 BEV 空间中准确估计高度。基准测试结果表明,HeightFormer 的性能在仅使用相机的方法中达到了最先进(SOTA)水平。
NormAUG: Normalization-guided Augmentation for Domain Generalization
results: 在多个基准数据集上进行的大量实验验证了所提方法的有效性,并表明它能有效提升深度学习模型的泛化性能。
Abstract
Deep learning has made significant advancements in supervised learning. However, models trained in this setting often face challenges due to domain shift between training and test sets, resulting in a significant drop in performance during testing. To address this issue, several domain generalization methods have been developed to learn robust and domain-invariant features from multiple training domains that can generalize well to unseen test domains. Data augmentation plays a crucial role in achieving this goal by enhancing the diversity of the training data. In this paper, inspired by the observation that normalizing an image with different statistics generated by different batches with various domains can perturb its feature, we propose a simple yet effective method called NormAUG (Normalization-guided Augmentation). Our method includes two paths: the main path and the auxiliary (augmented) path. During training, the auxiliary path includes multiple sub-paths, each corresponding to batch normalization for a single domain or a random combination of multiple domains. This introduces diverse information at the feature level and improves the generalization of the main path. Moreover, our NormAUG method effectively reduces the existing upper boundary for generalization based on theoretical perspectives. During the test stage, we leverage an ensemble strategy to combine the predictions from the auxiliary path of our model, further boosting performance. Extensive experiments are conducted on multiple benchmark datasets to validate the effectiveness of our proposed method.
摘要
受到"用不同域、不同批次产生的统计量对图像进行归一化会扰动其特征"这一观察的启发,我们提出了一种简单而有效的方法,即 NormAUG(Normalization-guided Augmentation)。我们的方法包括两条路径:主路径和辅助(增强)路径。在训练阶段,辅助路径包含多个子路径,每个子路径对应单个域或多个域随机组合的批归一化。这在特征层面引入了多样化的信息,从而提升主路径的泛化能力。此外,从理论角度看,我们的 NormAUG 方法能有效降低现有的泛化上界。在测试阶段,我们利用集成策略融合模型辅助路径的预测结果,进一步提升性能。我们在多个基准数据集上进行了大量实验,验证了所提方法的有效性。
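The core intuition above — normalizing the same features with batch statistics tied to different source domains perturbs them at the feature level — can be sketched with per-domain BatchNorm layers as below. The module layout (one shared main-path BN plus one BN per domain, with a random sub-path chosen during training) is a simplified assumption rather than the exact NormAUG design.

```python
import random
import torch
import torch.nn as nn

class DomainBatchNorm2d(nn.Module):
    """Auxiliary-path normalization: one BatchNorm per source domain.
    Normalizing a feature map with the statistics of a randomly chosen
    domain perturbs it at the feature level; the main path uses a shared BN."""
    def __init__(self, num_features, num_domains):
        super().__init__()
        self.main_bn = nn.BatchNorm2d(num_features)   # main path
        self.domain_bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)]
        )

    def forward(self, x, use_aux=True):
        if self.training and use_aux:
            bn = random.choice(self.domain_bns)       # random domain statistics
            return bn(x)
        return self.main_bn(x)                        # test time: main path only

feat = torch.randn(8, 64, 32, 32)
layer = DomainBatchNorm2d(64, num_domains=3)
out_main = layer(feat, use_aux=False)
out_aux = layer(feat)   # feature-level perturbation from one auxiliary sub-path
```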
results: 该模型在 5-way ImageNet 少样本检测基准上取得了最佳结果,在在线 1/5/10-shot 场景下分别领先 8%/3%/1% 以上,并在在线 20-way 少样本 VOC 的所有 shot 设置下,对新类别最高提升达 20%。
Abstract
We propose Cos R-CNN, a simple exemplar-based R-CNN formulation that is designed for online few-shot object detection. That is, it is able to localise and classify novel object categories in images with few examples without fine-tuning. Cos R-CNN frames detection as a learning-to-compare task: unseen classes are represented as exemplar images, and objects are detected based on their similarity to these exemplars. The cosine-based classification head allows for dynamic adaptation of classification parameters to the exemplar embedding, and encourages the clustering of similar classes in embedding space without the need for manual tuning of distance-metric hyperparameters. This simple formulation achieves best results on the recently proposed 5-way ImageNet few-shot detection benchmark, beating the online 1/5/10-shot scenarios by more than 8/3/1%, as well as performing up to 20% better in online 20-way few-shot VOC across all shots on novel classes.
摘要
我们提出了 Cos R-CNN,一种简单的基于示例的 R-CNN 形式,用于在线少样本目标检测,即无需微调即可在仅有少量示例的情况下对图像中的新类别目标进行定位和分类。Cos R-CNN 将检测视为"学习比较"任务:未见类别用示例图像表示,并根据与这些示例的相似度来检测目标。基于余弦的分类头允许分类参数随示例嵌入动态自适应,并促进相似类别在嵌入空间中聚集,无需手动调整距离度量超参数。这种简单的形式在最新提出的 5-way ImageNet 少样本检测基准上取得了最佳结果,在在线 1/5/10-shot 场景下分别领先 8%/3%/1% 以上,并在在线 20-way 少样本 VOC 的所有 shot 设置下,对新类别最高提升达 20%。
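A cosine-based classification head of the kind described above can be written in a few lines: region features and exemplar embeddings are L2-normalized and their scaled dot products serve as logits. The temperature value is an assumed hyperparameter for illustration.

```python
import torch
import torch.nn.functional as F

def cosine_logits(region_feats, exemplar_embeds, scale=20.0):
    """Score each region against each exemplar class by cosine similarity.

    region_feats:     (R, D) features of R candidate regions
    exemplar_embeds:  (C, D) one embedding per exemplar (novel) class
    scale:            temperature so the softmax is not too flat (assumed value)
    """
    r = F.normalize(region_feats, dim=-1)
    e = F.normalize(exemplar_embeds, dim=-1)
    return scale * r @ e.t()              # (R, C) scaled cosine similarities

logits = cosine_logits(torch.randn(100, 256), torch.randn(5, 256))
probs = logits.softmax(dim=-1)            # per-region distribution over exemplar classes
```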
results: 我们在人体和动物数据集上进行了评估,与最先进的无监督方法相比取得了更高的性能,甚至与完全监督的方法相当。在更具挑战性的 Mixamo 数据集上的测试表明,我们的方法能够正确处理具有不同拓扑结构和复杂衣物的网格。跨数据集评估进一步表明了我们方法强大的泛化能力。
Abstract
The main challenges of 3D pose transfer are: 1) Lack of paired training data with different characters performing the same pose; 2) Disentangling pose and shape information from the target mesh; 3) Difficulty in applying to meshes with different topologies. We thus propose a novel weakly-supervised keypoint-based framework to overcome these difficulties. Specifically, we use a topology-agnostic keypoint detector with inverse kinematics to compute transformations between the source and target meshes. Our method only requires supervision on the keypoints, can be applied to meshes with different topologies and is shape-invariant for the target which allows extraction of pose-only information from the target meshes without transferring shape information. We further design a cycle reconstruction to perform self-supervised pose transfer without the need for ground truth deformed mesh with the same pose and shape as the target and source, respectively. We evaluate our approach on benchmark human and animal datasets, where we achieve superior performance compared to the state-of-the-art unsupervised approaches and even comparable performance with the fully supervised approaches. We test on the more challenging Mixamo dataset to verify our approach's ability in handling meshes with different topologies and complex clothes. Cross-dataset evaluation further shows the strong generalization ability of our approach.
摘要
3D 姿态迁移的主要挑战包括:1)缺乏不同角色做出同一姿态的成对训练数据;2)从目标网格中解耦姿态信息与形状信息;3)难以应用于拓扑结构不同的网格。为此,我们提出了一种新的弱监督关键点框架来克服这些困难。具体而言,我们使用一个与拓扑无关的关键点检测器,并结合逆运动学(inverse kinematics)计算源网格与目标网格之间的变换。我们的方法只需要关键点层面的监督,可以应用于拓扑结构不同的网格,并且对目标网格具有形状不变性,从而可以从目标网格中只提取姿态信息而不传递形状信息。我们还设计了循环重建,在不需要与目标和源分别具有相同姿态和形状的真实变形网格的情况下实现自监督姿态迁移。我们在人体和动物基准数据集上评估了该方法,相比最先进的无监督方法取得了更优的性能,甚至与完全监督方法相当。我们在更具挑战性的 Mixamo 数据集上进行测试,验证了该方法处理不同拓扑结构网格和复杂衣物的能力。跨数据集评估进一步表明了该方法强大的泛化能力。
An Explainable Model-Agnostic Algorithm for CNN-based Biometrics Verification
results: 该论文针对人脸生物特征识别中的两个 CNN 模型(分别基于 MobileNetv2 和 ResNet50)实现了可解释性。
Abstract
This paper describes an adaptation of the Local Interpretable Model-Agnostic Explanations (LIME) AI method to operate under a biometric verification setting. LIME was initially proposed for networks with the same output classes used for training, and it employs the softmax probability to determine which regions of the image contribute the most to classification. However, in a verification setting, the classes to be recognized have not been seen during training. In addition, instead of using the softmax output, face descriptors are usually obtained from a layer before the classification layer. The model is adapted to achieve explainability via cosine similarity between feature vectors of perturbated versions of the input image. The method is showcased for face biometrics with two CNN models based on MobileNetv2 and ResNet50.
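A hedged sketch of the perturbation idea: instead of explaining a softmax class, one can occlude image regions and measure how much the cosine similarity between the probe's descriptor and the enrolled reference drops. This grid-occlusion version is a simplification of the LIME-style procedure, not the paper's exact adaptation; `embed_fn` stands in for a face-descriptor network and is a placeholder.

```python
import numpy as np

def verification_saliency(embed_fn, probe_img, reference_vec, grid=7):
    """Occlusion-style sketch: gray out one cell of a grid at a time and
    record how much the cosine similarity between the probe's descriptor
    and the reference descriptor drops.

    embed_fn:      maps an HxWx3 float image to a 1-D face descriptor (placeholder)
    probe_img:     HxWx3 float array
    reference_vec: descriptor of the enrolled identity
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    base = cos(embed_fn(probe_img), reference_vec)
    h, w = probe_img.shape[:2]
    heat = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            occluded = probe_img.copy()
            ys = slice(i * h // grid, (i + 1) * h // grid)
            xs = slice(j * w // grid, (j + 1) * w // grid)
            occluded[ys, xs] = occluded[ys, xs].mean()   # flatten the region
            heat[i, j] = base - cos(embed_fn(occluded), reference_vec)
    return heat   # large values = regions that matter most for the match

# toy demo with a dummy per-channel-mean "descriptor"
heat = verification_saliency(lambda im: im.mean(axis=(0, 1)),
                             np.random.rand(112, 112, 3),
                             np.array([0.2, 0.5, 0.3]))
```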
A signal processing interpretation of noise-reduction convolutional neural networks
results: 通过这一新的理论框架,这篇论文有助于理解深度卷积神经网络的内部工作机制,并可用于设计更加稳健和高效的新型神经网络架构。
Abstract
Encoding-decoding CNNs play a central role in data-driven noise reduction and can be found within numerous deep-learning algorithms. However, the development of these CNN architectures is often done in ad-hoc fashion and theoretical underpinnings for important design choices is generally lacking. Up to this moment there are different existing relevant works that strive to explain the internal operation of these CNNs. Still, these ideas are either scattered and/or may require significant expertise to be accessible for a bigger audience. In order to open up this exciting field, this article builds intuition on the theory of deep convolutional framelets and explains diverse ED CNN architectures in a unified theoretical framework. By connecting basic principles from signal processing to the field of deep learning, this self-contained material offers significant guidance for designing robust and efficient novel CNN architectures.
摘要
编码-解码 CNN 在数据驱动降噪中扮演核心角色,出现在众多深度学习算法之中。然而,这类 CNN 架构的开发往往是临时性的,重要设计选择通常缺乏理论支撑。目前已有若干相关工作试图解释这类 CNN 的内部运作,但这些思想要么较为零散,要么需要相当的专业知识才能被更广泛的读者理解。为了让更多人进入这一领域,本文基于深度卷积 framelet 理论建立直观认识,并在统一的理论框架下解释多种编码-解码 CNN 架构。通过将信号处理的基本原理与深度学习相联系,这份自成体系的材料为设计稳健而高效的新型 CNN 架构提供了重要指导。
Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation
results: 研究结果显示,在相同功耗约束下,所提框架相比高度优化的 GPU 设计平均可获得 2.57 倍的性能效率提升,并且相比多种最先进的基于 FPGA 的 CNN 加速器,性能密度最高可提升 3.94 倍。
Abstract
The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In a pursuit for high-performance and energy-efficient inference, significant research effort has been invested in the design of FPGA-based CNN accelerators. In this context, single computation engines constitute a popular approach to support diverse CNN modes without the overhead of fabric reconfiguration. Nevertheless, this flexibility often comes with significantly degraded performance on memory-bound layers and resource underutilisation due to the suboptimal mapping of certain layers on the engine's fixed configuration. In this work, we investigate the implications in terms of CNN engine design for a class of models that introduce a pre-convolution stage to decompress the weights at run time. We refer to these approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference system that counteracts the limitations of existing CNN engines. The proposed framework comprises a novel CNN hardware architecture that introduces a weights generator module that enables the on-chip on-the-fly generation of weights, alleviating the negative impact of limited bandwidth on memory-bound layers. We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair, leading to an improved accuracy-performance balance. Finally, we introduce an input selective processing element (PE) design that balances the load between PEs in suboptimally mapped layers. The proposed framework yields hardware designs that achieve an average of 2.57x performance efficiency gain over highly optimised GPU designs for the same power constraints and up to 3.94x higher performance density over a diverse range of state-of-the-art FPGA-based CNN accelerators.
摘要
卷积神经网络(CNN)在各类人工智能任务中展现出前所未有的精度,使其在移动和嵌入式场景中得到广泛部署。为了实现高性能且节能的推断,研究人员在基于 FPGA 的 CNN 加速器设计上投入了大量工作。在这一背景下,单一计算引擎成为一种流行的方案,可在不重新配置硬件结构的情况下支持多种 CNN 模型。然而,这种灵活性往往导致受内存带宽限制的层性能显著下降,并因某些层与引擎固定配置的映射欠佳而造成资源利用不足。在本工作中,我们研究了一类在运行时对权重进行解压的模型(即 on-the-fly 方法)对 CNN 引擎设计的影响,并提出了一个新的 CNN 推断系统 unzipFPGA,以克服现有 CNN 引擎的局限。该框架包含一个新的 CNN 硬件架构,其中的权重生成模块能够在片上于运行时生成权重,从而缓解有限带宽对受内存限制层的负面影响。我们进一步为 unzipFPGA 引入了一种自动化的硬件感知方法,使权重生成机制适配目标 CNN 与器件组合,从而取得更好的精度-性能平衡。最后,我们引入了输入选择性的处理单元(PE)设计,在映射欠佳的层中平衡各 PE 的负载。所提框架得到的硬件设计在相同功耗约束下,相比高度优化的 GPU 设计平均获得 2.57 倍的性能效率提升,并相比多种最先进的基于 FPGA 的 CNN 加速器,性能密度最高可提升 3.94 倍。
Scoring Cycling Environments Perceived Safety using Pairwise Image Comparisons
results: 研究发现,城市环境和骑行情境对人们感知到的骑行安全性有重要影响。这一方法可以帮助城市规划者设计更有效的干预措施,以促进自行车出行方式的普及。此外,该方法支持对不断变化的骑行环境进行持续评估,并能快速评估措施的效果。
Abstract
Today, many cities seek to transition to more sustainable transportation systems. Cycling is critical in this transition for shorter trips, including first-and-last-mile links to transit. Yet, if individuals perceive cycling as unsafe, they will not cycle and choose other transportation modes. This study presents a novel approach to identifying how the perception of cycling safety can be analyzed and understood and the impact of the built environment and cycling contexts on such perceptions. We base our work on other perception studies and pairwise comparisons, using real-world images to survey respondents. We repeatedly show respondents two road environments and ask them to select the one they perceive as safer for cycling. We compare several methods capable of rating cycling environments from pairwise comparisons and classify cycling environments perceived as safe or unsafe. Urban planning can use this score to improve interventions' effectiveness and improve cycling promotion campaigns. Furthermore, this approach facilitates the continuous assessment of changing cycling environments, allows for a short-term evaluation of measures, and is efficiently deployed in different locations or contexts.
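One standard way to turn such pairwise "which looks safer" choices into a per-environment score is a Bradley-Terry model; the sketch below fits it with the classic minorization-maximization updates. This is one reasonable rating method for pairwise comparisons, not necessarily the exact model the study compares.

```python
import numpy as np

def bradley_terry_scores(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of times environment i was chosen as safer than j.
    Returns one latent 'perceived safety' score per environment.
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):                 # classic MM updates (Hunter-style)
        total = wins.sum(axis=1)            # total wins of each environment
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j:
                    denom[i] += (wins[i, j] + wins[j, i]) / (p[i] + p[j])
        p = total / np.maximum(denom, 1e-12)
        p /= p.sum()                        # fix the overall scale
    return np.log(p)                        # log-strengths are convenient scores

# toy example: 3 road environments compared by respondents
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry_scores(wins))
```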
Towards Unifying Anatomy Segmentation: Automated Generation of a Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines
paper_authors: Alexander Jaus, Constantin Seibold, Kelsey Hermann, Alexandra Walter, Kristina Giske, Johannes Haubold, Jens Kleesiek, Rainer Stiefelhagen
results: 我们的方法在标签聚合阶段不需要人工标注,并在 BTCV 数据集上达到 85% 的 Dice 分数。此外,我们还通过可扩展的自动检查和高质量的专家检查对数据集进行评估,以确保其可靠性和医学有效性。
Abstract
In this study, we present a method for generating automated anatomy segmentation datasets using a sequential process that involves nnU-Net-based pseudo-labeling and anatomy-guided pseudo-label refinement. By combining various fragmented knowledge bases, we generate a dataset of whole-body CT scans with $142$ voxel-level labels for 533 volumes providing comprehensive anatomical coverage which experts have approved. Our proposed procedure does not rely on manual annotation during the label aggregation stage. We examine its plausibility and usefulness using three complementary checks: Human expert evaluation which approved the dataset, a Deep Learning usefulness benchmark on the BTCV dataset in which we achieve 85% dice score without using its training dataset, and medical validity checks. This evaluation procedure combines scalable automated checks with labor-intensive high-quality expert checks. Besides the dataset, we release our trained unified anatomical segmentation model capable of predicting $142$ anatomical structures on CT data.
摘要
在这项研究中,我们提出了一种自动生成解剖分割数据集的方法,其流程包括基于 nnU-Net 的伪标注和解剖学引导的伪标签精修。通过整合多个碎片化的知识库,我们生成了一个全身 CT 扫描数据集,为 533 个体积提供了 142 个体素级标签,覆盖全面的解剖结构,并已获得专家认可。我们所提出的流程在标签聚合阶段不依赖人工标注。我们通过三项互补检查来评估其合理性与实用性:人类专家评估(已认可该数据集)、在 BTCV 数据集上的深度学习可用性基准(在不使用其训练集的情况下达到 85% 的 Dice 分数),以及医学有效性检查。该评估流程将可扩展的自动检查与高质量的专家检查相结合。除数据集外,我们还发布了训练好的统一解剖分割模型,能够在 CT 数据上预测 142 个解剖结构。
Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation
results: 在 R2R 和 UrbanWalk 数据集上,针对视觉-语言导航的导航指令生成性能达到了最先进水平。
Abstract
We introduce a novel speaker model \textsc{Kefa} for navigation instruction generation. The existing speaker models in Vision-and-Language Navigation suffer from the large domain gap of vision features between different environments and insufficient temporal grounding capability. To address the challenges, we propose a Knowledge Refinement Module to enhance the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method to enforce fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric SPICE-D for navigation instruction evaluation, which is aware of the correctness of direction phrases. The experimental results on R2R and UrbanWalk datasets show that the proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
摘要
我们介绍了一种新的说话者模型 KEFA,用于生成导航指令。现有的视觉-语言导航说话者模型受到不同环境间视觉特征巨大域差以及时间定位能力不足的影响。为了解决这些挑战,我们提出了知识精化模块,利用外部知识事实增强特征表示,并提出自适应时间对齐方法,强制生成的指令与观测序列之间的细粒度对齐。此外,我们提出了一个新的评价指标 SPICE-D,能够感知方向短语的正确性,用于评价导航指令。在 R2R 和 UrbanWalk 数据集上的实验结果表明,所提出的 KEFA 说话者模型在室内和室外场景下均达到了最先进的指令生成性能。
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
results: 我们的方法在三个基准(ScanRefer 和 Nr3D/Sr3D)上的总体性能优于所有现有的最先进方法。
Abstract
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description. Typically, the sentences describing the target object tend to provide information about its relative relation between other objects and its position within the whole scene. In this work, we propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net), which can effectively capture the relative spatial relationships between objects and enhance object attributes. Specifically, 1) we propose a 3D Relative Position Multi-head Attention (3DRP-MA) module to analyze relative relations from different directions in the context of object pairs, which helps the model to focus on the specific object relations mentioned in the sentence. 2) We designed a soft-labeling strategy to alleviate the spatial ambiguity caused by redundant points, which further stabilizes and enhances the learning process through a constant and discriminative distribution. Extensive experiments conducted on three benchmarks (i.e., ScanRefer and Nr3D/Sr3D) demonstrate that our method outperforms all the state-of-the-art methods in general. The source code will be released on GitHub.
摘要
三维视觉定位旨在根据自由形式的语言描述,在三维点云中定位目标对象。描述目标对象的句子通常会提供它与其他对象的相对关系及其在整个场景中的位置信息。在这项工作中,我们提出了一种感知相对关系的单阶段框架,名为三维相对位置感知网络(3DRP-Net),能够有效捕捉对象之间的相对空间关系并增强对象特征。具体来说:1)我们提出了三维相对位置多头注意力模块(3DRP-MA),在对象对的上下文中从不同方向分析相对关系,帮助模型聚焦于句子中提到的特定对象关系;2)我们设计了一种软标注策略,以缓解冗余点引起的空间歧义,通过恒定且有判别力的分布进一步稳定并增强学习过程。我们在三个基准(即 ScanRefer 和 Nr3D/Sr3D)上进行了大量实验,结果表明我们的方法总体上优于所有现有最先进方法。源代码将发布在 GitHub 上。
Of Mice and Pose: 2D Mouse Pose Estimation from Unlabelled Data and Synthetic Prior
methods: 我们提出了一种利用无标注图像估计二维鼠体姿态的方法。该方法基于最近的自监督人体姿态估计工作,使用单张图像和一组不成对的典型二维姿态,在 GAN 框架中构建二维姿态的经验先验。我们将该方法适配到鼠的肢体结构,并利用合成的三维鼠模型生成经验先验,从而避免人工标注。
results: 我们在一个新的鼠视频数据集上进行了实验,将姿态预测与人工获取的真值进行比较,并与一种有监督的最先进动物姿态估计方法对比。结果表明,即使没有成对训练数据,我们的方法也能取得有希望的结果。此外,基于马匹图像数据集的定性结果展示了该设置推广到其他动物物种的潜力。
Abstract
Numerous fields, such as ecology, biology, and neuroscience, use animal recordings to track and measure animal behaviour. Over time, a significant volume of such data has been produced, but some computer vision techniques cannot explore it due to the lack of annotations. To address this, we propose an approach for estimating 2D mouse body pose from unlabelled images using a synthetically generated empirical pose prior. Our proposal is based on a recent self-supervised method for estimating 2D human pose that uses single images and a set of unpaired typical 2D poses within a GAN framework. We adapt this method to the limb structure of the mouse and generate the empirical prior of 2D poses from a synthetic 3D mouse model, thereby avoiding manual annotation. In experiments on a new mouse video dataset, we evaluate the performance of the approach by comparing pose predictions to a manually obtained ground truth. We also compare predictions with those from a supervised state-of-the-art method for animal pose estimation. The latter evaluation indicates promising results despite the lack of paired training data. Finally, qualitative results using a dataset of horse images show the potential of the setting to adapt to other animal species.
摘要
生态学、生物学和神经科学等许多领域都利用动物影像记录来跟踪和测量动物行为。随着时间推移,此类数据已积累了相当大的体量,但由于缺乏标注,一些计算机视觉技术无法加以利用。为了解决这个问题,我们提出了一种方法,利用合成生成的经验姿态先验,从无标注图像中估计二维鼠体姿态。我们的方案基于最近的一种自监督二维人体姿态估计方法,该方法在 GAN 框架中使用单张图像和一组不成对的典型二维姿态。我们将其适配到鼠的肢体结构,并从合成的三维鼠模型生成二维姿态的经验先验,从而避免人工标注。在一个新的鼠视频数据集上的实验中,我们通过与人工获取的真值比较来评估该方法的性能,并与一种有监督的最先进动物姿态估计方法进行对比。后者的评估结果表明,尽管缺乏成对训练数据,我们的方法仍能取得有希望的结果。最后,基于马匹图像数据集的定性结果展示了该设置适配其他动物物种的潜力。
Prior Based Online Lane Graph Extraction from Single Onboard Camera Image
results: 对 NuScenes 和 Argoverse 两个标准数据集进行测试,结果显示提议方法与现有方法相比有显著改善。
Abstract
The local road network information is essential for autonomous navigation. This information is commonly obtained from offline HD-Maps in terms of lane graphs. However, the local road network at a given moment can be drastically different than the one given in the offline maps; due to construction works, accidents etc. Moreover, the autonomous vehicle might be at a location not covered in the offline HD-Map. Thus, online estimation of the lane graph is crucial for widespread and reliable autonomous navigation. In this work, we tackle online Bird's-Eye-View lane graph extraction from a single onboard camera image. We propose to use prior information to increase quality of the estimations. The prior is extracted from the dataset through a transformer based Wasserstein Autoencoder. The autoencoder is then used to enhance the initial lane graph estimates. This is done through optimization of the latent space vector. The optimization encourages the lane graph estimation to be logical by discouraging it to diverge from the prior distribution. We test the method on two benchmark datasets, NuScenes and Argoverse. The results show that the proposed method significantly improves the performance compared to state-of-the-art methods.
摘要
局部道路网络信息对自主导航至关重要。这类信息通常以车道图的形式从离线高精地图中获取。然而,由于施工、事故等原因,某一时刻的局部道路网络可能与离线地图中给出的情况相差很大。此外,自动驾驶车辆也可能处于离线高精地图未覆盖的位置。因此,在线估计车道图对于大规模、可靠的自主导航至关重要。在本工作中,我们研究如何从单张车载相机图像中在线提取鸟瞰视角下的车道图。我们提出利用先验信息来提高估计质量,该先验通过一个基于 Transformer 的 Wasserstein 自编码器从数据集中提取。随后,该自编码器被用于增强初始的车道图估计,具体做法是对隐空间向量进行优化:通过抑制车道图估计偏离先验分布,使其更加合理。我们在 NuScenes 和 Argoverse 两个基准数据集上测试了该方法,结果表明所提方法相比最先进的方法有显著提升。
Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks
paper_authors: Cheeun Hong, Kyoung Mu Lee
for: This paper aims to address the distribution mismatch problem in image super-resolution (SR) networks, which can lead to severe accuracy loss when using low-bit quantization.
methods: The proposed method, called ODM, uses a new quantization-aware training framework that regularizes the variance in features during training to reduce the distribution mismatch problem. ODM also introduces distribution offsets to layers with a significant mismatch, which either scales or shifts channel-wise features.
results: ODM effectively outperforms existing SR quantization approaches with similar or fewer computations, demonstrating the importance of reducing the distribution mismatch problem.
Abstract
Quantization is a promising approach to reduce the high computational complexity of image super-resolution (SR) networks. However, compared to high-level tasks like image classification, low-bit quantization leads to severe accuracy loss in SR networks. This is because feature distributions of SR networks are significantly divergent for each channel or input image, and is thus difficult to determine a quantization range. Existing SR quantization works approach this distribution mismatch problem by dynamically adapting quantization ranges to the variant distributions during test time. However, such dynamic adaptation incurs additional computational costs that limit the benefits of quantization. Instead, we propose a new quantization-aware training framework that effectively Overcomes the Distribution Mismatch problem in SR networks without the need for dynamic adaptation. Intuitively, the mismatch can be reduced by directly regularizing the variance in features during training. However, we observe that variance regularization can collide with the reconstruction loss during training and adversely impact SR accuracy. Thus, we avoid the conflict between two losses by regularizing the variance only when the gradients of variance regularization are cooperative with that of reconstruction. Additionally, to further reduce the distribution mismatch, we introduce distribution offsets to layers with a significant mismatch, which either scales or shifts channel-wise features. Our proposed algorithm, called ODM, effectively reduces the mismatch in distributions with minimal computational overhead. Experimental results show that ODM effectively outperforms existing SR quantization approaches with similar or fewer computations, demonstrating the importance of reducing the distribution mismatch problem. Our code is available at https://github.com/Cheeun/ODM.
摘要
量化是降低图像超分辨率(SR)网络高计算复杂度的一种有前景的方法。然而,与图像分类等高层任务相比,低位宽量化会给 SR 网络带来严重的精度损失。这是因为 SR 网络的特征分布在不同通道或不同输入图像之间差异很大,难以确定量化范围。现有的 SR 量化工作通过在测试时动态调整量化范围来应对这一分布失配问题,但这种动态调整会带来额外的计算开销,削弱量化的收益。与之不同,我们提出了一个新的量化感知训练框架,无需动态调整即可有效克服 SR 网络中的分布失配问题。直观地,可以通过在训练时直接对特征方差进行正则化来减小失配。然而,我们观察到方差正则化可能与重建损失相冲突,进而损害 SR 精度。因此,我们只在方差正则化的梯度与重建损失的梯度方向一致(相互协作)时才施加该正则化,从而避免两种损失之间的冲突。此外,为了进一步减小分布失配,我们对失配明显的层引入分布偏移,对通道级特征进行缩放或平移。我们提出的 ODM 算法能够以极小的计算开销有效减小分布失配。实验结果表明,ODM 在相同或更少的计算量下有效超越了现有的 SR 量化方法,说明了减小分布失配问题的重要性。我们的代码可以在 https://github.com/Cheeun/ODM 获取。
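A minimal sketch of the cooperative-gradient idea described above: the feature-variance regularizer is added to the reconstruction loss only when the inner product of the two gradients (here computed over the first layer's parameters) is positive, i.e., when the objectives do not conflict. The toy network, the variance definition, and the weighting are assumptions for illustration, not the ODM implementation.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lam = 0.1   # assumed regularization weight

def variance_reg(feats):
    # penalize channel-wise feature variance so ranges are easier to quantize
    return feats.var(dim=(0, 2, 3)).mean()

lr_img, hr_img = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
feats = model[1](model[0](lr_img))
recon = nn.functional.l1_loss(model[2](feats), hr_img)
var_loss = variance_reg(feats)

# check whether the two losses "cooperate" on the shared first-layer parameters
params = list(model[0].parameters())
g_rec = torch.autograd.grad(recon, params, retain_graph=True)
g_var = torch.autograd.grad(var_loss, params, retain_graph=True)
coop = sum((gr * gv).sum() for gr, gv in zip(g_rec, g_var))

loss = recon + (lam * var_loss if coop > 0 else 0.0)  # regularize only when cooperative
opt.zero_grad()
loss.backward()
opt.step()
```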
results: 该方法在多个基准上取得了新的最先进结果,尤其是在像素级和组件级评估中,将平均误报率相比此前最佳方法降低了 60%。
Abstract
Anomaly segmentation is a critical task for driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects' boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating an anomaly detection method in a mask-classification architecture. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies in masks: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; and iii) a mask refinement solution to reduce false positives. Mask2Anomaly achieves new state-of-the-art results across a range of benchmarks, both in the per-pixel and component-level evaluations. In particular, Mask2Anomaly reduces the average false positives rate by 60% wrt the previous state-of-the-art. Github page: https://github.com/shyam671/Mask2Anomaly-Unmasking-Anomalies-in-Road-Scene-Segmentation.
摘要
异常分割是驾驶应用中的一项关键任务,传统上被当作逐像素分类问题来处理。然而,对每个像素单独推理而不考虑其语义上下文,会导致对象边界附近的高度不确定性和大量误报。我们提出一种范式转变,即从逐像素分类转向掩码分类。我们基于掩码的方法 Mask2Anomaly 证明了将异常检测方法集成到掩码分类架构中的可行性。Mask2Anomaly 包含若干技术创新,用于改进掩码中的异常检测:1. 全局掩码注意力模块,分别聚焦前景和背景区域;2. 掩码对比学习,最大化异常类与已知类之间的间隔;3. 掩码精修方案,以减少误报。Mask2Anomaly 在一系列基准上、在像素级和组件级评估中都取得了新的最先进结果。具体而言,相比此前的最先进方法,Mask2Anomaly 将平均误报率降低了 60%。GitHub 页面:https://github.com/shyam671/Mask2Anomaly-Unmasking-Anomalies-in-Road-Scene-Segmentation。
Mitigating Cross-client GANs-based Attack in Federated Learning
For: The paper aims to improve the security of federated learning (FL) schemes by mitigating the cross-client generative adversarial networks (GANs) attack, which can reconstruct samples from other clients.
Methods: The paper proposes a technique called Federated Ensemble Data-free Knowledge Distillation (Fed-EDKD) to resist the C-GANs attack. Fed-EDKD involves each client submitting a local model to the server for obtaining an ensemble global model, and then using data-free knowledge distillation techniques to transfer knowledge from the ensemble global model to a compressed model.
Results: The experimental results demonstrate that Fed-EDKD significantly mitigates the C-GANs attack while only incurring a slight accuracy degradation of FL.
Abstract
Machine learning makes multimedia data (e.g., images) more attractive, however, multimedia data is usually distributed and privacy sensitive. Multiple distributed multimedia clients can resort to federated learning (FL) to jointly learn a global shared model without requiring to share their private samples with any third-party entities. In this paper, we show that FL suffers from the cross-client generative adversarial networks (GANs)-based (C-GANs) attack, in which a malicious client (i.e., adversary) can reconstruct samples with the same distribution as the training samples from other clients (i.e., victims). Since a benign client's data can be leaked to the adversary, this attack brings the risk of local data leakage for clients in many security-critical FL applications. Thus, we propose Fed-EDKD (i.e., Federated Ensemble Data-free Knowledge Distillation) technique to improve the current popular FL schemes to resist C-GANs attack. In Fed-EDKD, each client submits a local model to the server for obtaining an ensemble global model. Then, to avoid model expansion, Fed-EDKD adopts data-free knowledge distillation techniques to transfer knowledge from the ensemble global model to a compressed model. By this way, Fed-EDKD reduces the adversary's control capability over the global model, so Fed-EDKD can effectively mitigate C-GANs attack. Finally, the experimental results demonstrate that Fed-EDKD significantly mitigates C-GANs attack while only incurring a slight accuracy degradation of FL.
摘要
机器学习让多媒体数据(例如图像)更具价值,但多媒体数据通常是分布式存放且涉及隐私。多个分布式的多媒体客户端可以借助联邦学习(FL)共同训练一个全局共享模型,而无需把各自的私有样本交给任何第三方。在这篇论文中,我们表明 FL 易受基于跨客户端生成对抗网络(C-GANs)的攻击:恶意客户端(即攻击者)能够重建与其他客户端(即受害者)训练样本同分布的样本。由于良性客户端的数据可能因此泄露给攻击者,这种攻击给许多安全敏感的 FL 应用带来了本地数据泄露风险。因此,我们提出了 Fed-EDKD(联邦集成无数据知识蒸馏)技术,以改进当前流行的 FL 方案、抵御 C-GANs 攻击。在 Fed-EDKD 中,每个客户端将本地模型提交给服务器以获得一个集成的全局模型;随后,为避免模型规模膨胀,Fed-EDKD 采用无数据知识蒸馏技术,将集成全局模型中的知识迁移到一个压缩模型中。通过这种方式,Fed-EDKD 削弱了攻击者对全局模型的控制能力,从而能够有效缓解 C-GANs 攻击。最后,实验结果表明,Fed-EDKD 能显著缓解 C-GANs 攻击,而只带来 FL 精度的轻微下降。
CT-Net: Arbitrary-Shaped Text Detection via Contour Transformer
results: 在四个具有挑战性的数据集上的大量实验表明,CT-Net 兼具较高的准确率和效率,例如在 CTW1500 和 Total-Text 数据集上的 F-measure 分别达到 86.1 和 87.8。
Abstract
Contour based scene text detection methods have rapidly developed recently, but still suffer from inaccurate frontend contour initialization, multi-stage error accumulation, or deficient local information aggregation. To tackle these limitations, we propose a novel arbitrary-shaped scene text detection framework named CT-Net by progressive contour regression with contour transformers. Specifically, we first employ a contour initialization module that generates coarse text contours without any post-processing. Then, we adopt contour refinement modules to adaptively refine text contours in an iterative manner, which are beneficial for context information capturing and progressive global contour deformation. Besides, we propose an adaptive training strategy to enable the contour transformers to learn more potential deformation paths, and introduce a re-score mechanism that can effectively suppress false positives. Extensive experiments are conducted on four challenging datasets, which demonstrate the accuracy and efficiency of our CT-Net over state-of-the-art methods. Particularly, CT-Net achieves F-measure of 86.1 at 11.2 frames per second (FPS) and F-measure of 87.8 at 10.1 FPS for CTW1500 and Total-Text datasets, respectively.
摘要
近年来,基于轮廓的场景文本检测方法发展迅速,但仍存在前端轮廓初始化不准确、多阶段误差累积或局部信息聚合不足等问题。为了解决这些局限,我们提出了一种新的任意形状场景文本检测框架 CT-Net,通过轮廓 Transformer 进行渐进式轮廓回归。具体来说,我们首先使用轮廓初始化模块,在不需要任何后处理的情况下生成粗略的文本轮廓;然后采用轮廓精修模块,以迭代方式自适应地精修文本轮廓,这有利于捕捉上下文信息并实现渐进的全局轮廓变形。此外,我们提出一种自适应训练策略,使轮廓 Transformer 学习更多潜在的变形路径,并引入重打分机制,有效抑制误报。我们在四个具有挑战性的数据集上进行了大量实验,结果表明 CT-Net 在准确率和效率上均优于最先进的方法。特别地,CT-Net 在 CTW1500 和 Total-Text 数据集上分别以 11.2 FPS 和 10.1 FPS 的速度取得 86.1 和 87.8 的 F-measure。
Mini-PointNetPlus: a local feature descriptor in deep learning model for 3d environment perception
paper_authors: Chuanyu Luo, Nuo Cheng, Sikun Ma, Jun Xiang, Xiaohan Li, Shengguang Lei, Pu Li
for: The paper is written for improving the performance of deep learning models for 3D environment perception, specifically by proposing a novel local feature descriptor called mini-PointNetPlus.
methods: The paper uses pillarization/voxelization methods to convert point cloud data into pillars/voxels, and then applies a 2D/3D convolutional neural network (CNN) to process the data. The proposed descriptor, mini-PointNetPlus, separately projects the data points to individual features, leading to a permutation invariant and fully utilizing the features.
results: The proposed descriptor demonstrates a considerable performance improvement for 3D perception compared to the pioneer work PointNet, as proven in experiments.
Abstract
Common deep learning models for 3D environment perception often use pillarization/voxelization methods to convert point cloud data into pillars/voxels and then process it with a 2D/3D convolutional neural network (CNN). The pioneer work PointNet has been widely applied as a local feature descriptor, a fundamental component in deep learning models for 3D perception, to extract features of a point cloud. This is achieved by using a symmetric max-pooling operator which provides unique pillar/voxel features. However, by ignoring most of the points, the max-pooling operator causes an information loss, which reduces the model performance. To address this issue, we propose a novel local feature descriptor, mini-PointNetPlus, as an alternative for plug-and-play to PointNet. Our basic idea is to separately project the data points to the individual features considered, each leading to a permutation invariant. Thus, the proposed descriptor transforms an unordered point cloud to a stable order. The vanilla PointNet is proved to be a special case of our mini-PointNetPlus. Due to fully utilizing the features by the proposed descriptor, we demonstrate in experiment a considerable performance improvement for 3D perception.
摘要
用于三维环境感知的常见深度学习模型通常先用柱化/体素化方法把点云数据转换为柱体/体素,再用 2D/3D 卷积神经网络(CNN)进行处理。开创性工作 PointNet 作为局部特征描述子(深度学习三维感知模型的基础组件)被广泛用于提取点云特征:它通过对称的最大池化算子得到唯一的柱体/体素特征。然而,最大池化算子忽略了大部分点,造成信息损失,从而降低模型性能。为了解决这个问题,我们提出了一种新的局部特征描述子 mini-PointNetPlus,可即插即用地替换 PointNet。我们的基本思想是把数据点分别投影到所考虑的各个特征上,每个投影都具有置换不变性,因此所提描述子能把无序点云转换为稳定的排序。原始的 PointNet 被证明是 mini-PointNetPlus 的一个特例。由于所提描述子充分利用了特征,我们在实验中证明它为三维感知带来了可观的性能提升。
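For context, the sketch below shows the vanilla PointNet-style pillar encoder the passage refers to: a shared point-wise MLP followed by a symmetric max-pool, which keeps only one value per channel and therefore discards most points — the information loss the paper targets. The proposed mini-PointNetPlus descriptor itself is not reproduced here.

```python
import torch
import torch.nn as nn

class PillarPointNet(nn.Module):
    """Vanilla PointNet-style pillar encoder: a shared point-wise MLP followed
    by a symmetric max-pool. Only the per-channel maximum survives, which is
    the information loss discussed above."""
    def __init__(self, in_dim=4, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, points):                   # (P, N, in_dim): P pillars, N points each
        point_feats = self.mlp(points)           # (P, N, feat_dim)
        pillar_feat, _ = point_feats.max(dim=1)  # symmetric, permutation-invariant pool
        return pillar_feat                       # (P, feat_dim)

enc = PillarPointNet()
pillars = torch.randn(12, 32, 4)                 # 12 pillars, 32 (x, y, z, intensity) points each
print(enc(pillars).shape)                        # torch.Size([12, 64])
```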
High-Resolution Volumetric Reconstruction for Clothed Humans
results: 相比最先进的方法,将平均点到表面(P2S)误差降低 50% 以上,在 512 的体积分辨率下达到约 2mm 的精度;同时,从我们的纹理模型渲染出的图像获得了更高的 PSNR 值。
Abstract
We present a novel method for reconstructing clothed humans from a sparse set of, e.g., 1 to 6 RGB images. Despite impressive results from recent works employing deep implicit representation, we revisit the volumetric approach and demonstrate that better performance can be achieved with proper system design. The volumetric representation offers significant advantages in leveraging 3D spatial context through 3D convolutions, and the notorious quantization error is largely negligible with a reasonably large yet affordable volume resolution, e.g., 512. To handle memory and computation costs, we propose a sophisticated coarse-to-fine strategy with voxel culling and subspace sparse convolution. Our method starts with a discretized visual hull to compute a coarse shape and then focuses on a narrow band nearby the coarse shape for refinement. Once the shape is reconstructed, we adopt an image-based rendering approach, which computes the colors of surface points by blending input images with learned weights. Extensive experimental results show that our method significantly reduces the mean point-to-surface (P2S) precision of state-of-the-art methods by more than 50% to achieve approximately 2mm accuracy with a 512 volume resolution. Additionally, images rendered from our textured model achieve a higher peak signal-to-noise ratio (PSNR) compared to state-of-the-art methods.
摘要
我们提出了一种新方法,用于从稀疏的 RGB 图像(例如 1 到 6 张)重建着装人体。尽管近期采用深度隐式表示的工作取得了令人印象深刻的结果,我们重新审视了体积方法,并证明在合理的系统设计下可以获得更好的性能。体积表示的显著优势在于可以通过 3D 卷积利用三维空间上下文,而在合理且可承受的体积分辨率(例如 512)下,广为诟病的量化误差基本可以忽略。为了控制内存和计算开销,我们提出了一种精细的由粗到细策略,结合体素剔除和子空间稀疏卷积。我们的方法先用离散化的视觉外壳(visual hull)计算粗略形状,再聚焦于粗略形状附近的窄带区域进行精修。形状重建完成后,我们采用基于图像的渲染方法,通过以学习得到的权重混合输入图像来计算表面点的颜色。大量实验结果表明,我们的方法将最先进方法的平均点到表面(P2S)误差降低了 50% 以上,在 512 的体积分辨率下达到约 2mm 的精度。此外,从我们的纹理模型渲染出的图像相比最先进方法获得了更高的峰值信噪比(PSNR)。
GaitFormer: Revisiting Intrinsic Periodicity for Gait Recognition
results: 我们基于 TPA 策略提出了一种简单而有效的基线方法,并在三个常用的公开数据集(CASIA-B、OU-MVLP、GREW)上进行了大量实验。结果表明,所提方法在多个基准测试中达到了当前最佳性能。
Abstract
Gait recognition aims to distinguish different walking patterns by analyzing video-level human silhouettes, rather than relying on appearance information. Previous research on gait recognition has primarily focused on extracting local or global spatial-temporal representations, while overlooking the intrinsic periodic features of gait sequences, which, when fully utilized, can significantly enhance performance. In this work, we propose a plug-and-play strategy, called Temporal Periodic Alignment (TPA), which leverages the periodic nature and fine-grained temporal dependencies of gait patterns. The TPA strategy comprises two key components. The first component is Adaptive Fourier-transform Position Encoding (AFPE), which adaptively converts features and discrete-time signals into embeddings that are sensitive to periodic walking patterns. The second component is the Temporal Aggregation Module (TAM), which separates embeddings into trend and seasonal components, and extracts meaningful temporal correlations to identify primary components, while filtering out random noise. We present a simple and effective baseline method for gait recognition, based on the TPA strategy. Extensive experiments conducted on three popular public datasets (CASIA-B, OU-MVLP, and GREW) demonstrate that our proposed method achieves state-of-the-art performance on multiple benchmark tests.
摘要
步态识别的目标是通过分析视频级的人体剪影来区分不同的行走模式,而不是依赖外观信息。以往的步态识别研究主要集中在提取局部或全局的时空表示上,忽视了步态序列固有的周期性特征,而这些特征若被充分利用,可以显著提升性能。在这项工作中,我们提出了一种即插即用的策略,称为时间周期对齐(TPA),它利用步态模式的周期性以及细粒度的时间依赖关系。TPA 策略包含两个关键组件。第一个组件是自适应傅里叶变换位置编码(AFPE),能够将特征和离散时间信号自适应地转换为对周期性行走模式敏感的嵌入。第二个组件是时间聚合模块(TAM),它将嵌入分解为趋势分量和季节性分量,提取有意义的时间相关性以识别主要成分,同时滤除随机噪声。我们基于 TPA 策略给出了一种简单而有效的步态识别基线方法。在三个流行的公开数据集(CASIA-B、OU-MVLP 和 GREW)上进行的大量实验表明,所提方法在多个基准测试中达到了最先进的性能。
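As a rough illustration of the periodicity intuition behind AFPE, the snippet below extracts low-frequency Fourier magnitudes from a per-frame gait feature sequence and uses them as a periodicity descriptor. This is a simplified stand-in rather than the paper's module; the number of retained frequencies is an assumed choice.

```python
import torch

def periodic_encoding(seq_feats, k=8):
    """Extract a simple periodicity descriptor from a gait feature sequence.

    seq_feats: (B, T, D) per-frame silhouette features
    Returns the magnitudes of the first k temporal frequencies per channel,
    a crude stand-in for the Fourier-based encoding described above.
    """
    spec = torch.fft.rfft(seq_feats, dim=1)   # complex spectrum over the time axis
    mags = spec.abs()[:, :k, :]               # low frequencies carry the gait cycle
    return mags.flatten(1)                    # (B, k * D)

x = torch.randn(2, 30, 16)                    # 2 sequences, 30 frames, 16-d features
print(periodic_encoding(x).shape)             # torch.Size([2, 128])
```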
Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network
results: 我们的方法在多个标准 benchmark dataset上取得了稳定的state-of-the-art表现,包括 FashionAI、DARN、DeepFashion 和 Zappos50K。与之前的方法不同,我们的方法不受 benchmark dataset 的影响,表现一直优良。
Abstract
Many studies in vision tasks have aimed to create effective embedding spaces for single-label object prediction within an image. However, in reality, most objects possess multiple specific attributes, such as shape, color, and length, with each attribute composed of various classes. To apply models in real-world scenarios, it is essential to be able to distinguish between the granular components of an object. Conventional approaches to embedding multiple specific attributes into a single network often result in entanglement, where fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone. Firstly, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific attributes), and we demonstrate its effectiveness through a diverse visualization example. Secondly, we leverage the vision transformer for the first time to a fine-grained image retrieval task and present a simple yet effective framework compared to existing methods. Unlike previous studies where performance varied depending on the benchmark dataset, our proposed method achieved consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
摘要
视觉任务中的许多研究都致力于为图像中的单标签对象预测构建有效的嵌入空间。然而在现实中,大多数物体同时具有多个具体属性,如形状、颜色和长度,而每个属性又包含多个类别。要将模型应用到真实场景,必须能够区分物体的这些细粒度组成部分。传统做法把多个具体属性嵌入到单一网络中,往往导致纠缠,使得各属性的细粒度特征无法被分别识别。为解决这个问题,我们提出了条件交叉注意力网络(Conditional Cross-Attention Network),仅用单一骨干网络即可为不同的具体属性生成相互解耦的多空间嵌入。首先,我们利用交叉注意力机制来融合并切换条件(具体属性)信息,并通过多样的可视化示例展示了其有效性。其次,我们首次将视觉 Transformer 应用于细粒度图像检索任务,并提出了一个相比现有方法简单而有效的框架。与以往研究性能随基准数据集而波动不同,我们提出的方法在 FashionAI、DARN、DeepFashion 和 Zappos50K 基准数据集上都取得了一致的最先进性能。
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
results: 在 TGIF-QA、MSVD-QA 和 MSRVTT-QA 等多个数据集上进行的大量实验证明,KRST 优于多种最先进的方法。
Abstract
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the impact of spatial and temporal relation reasoning on each other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.
摘要
视频问答(VideoQA)的主要挑战在于根据给定问题捕捉并理解对象之间复杂的空间和时间关系。现有基于图的 VideoQA 方法通常忽略问题中的关键词,并使用简单的图来聚合特征,而不考虑对象之间的相对关系,这可能导致性能欠佳。在这篇论文中,我们提出了关键词感知的相对时空(KRST)图网络来解决这个问题。首先,为了让问题特征感知关键词,我们使用注意力机制在问题编码时为关键词分配更高的权重,再用关键词感知的问题特征引导视频图的构建。其次,由于关系是相对的,我们引入相对关系建模,以更好地捕捉对象节点之间的时空动态。此外,我们将时空推理解耦为对象级的空间图和帧级的时间图,减少空间关系推理和时间关系推理之间的相互影响。在 TGIF-QA、MSVD-QA 和 MSRVTT-QA 数据集上进行的大量实验证明,我们的 KRST 优于多种最先进方法。
Multi-Granularity Prediction with Learnable Fusion for Scene Text Recognition
For: The paper is written for scene text recognition (STR), which is an active research topic in computer vision. The authors aim to tackle the challenging problem of STR by incorporating linguistic knowledge into the model.
Methods: The authors use a vision STR model built upon the Vision Transformer (ViT) and a tailored Adaptive Addressing and Aggregation (A$^3$) module. They also propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model, using subword representations (BPE and WordPiece) in addition to the conventional character level representation.
Results: The proposed MGP-STR algorithm achieves an average recognition accuracy of 94% on standard benchmarks for scene text recognition, and also achieves state-of-the-art results on widely-used handwritten benchmarks and more challenging scene text datasets.
Abstract
Due to the enormous technical challenges and wide range of applications, scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this tough problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet functionally powerful vision STR model, which is built upon ViT and a tailored Adaptive Addressing and Aggregation (A$^3$) module. It already outperforms most previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, \ie, subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. To produce the final recognition results, two strategies for effectively fusing the multi-granularity predictions are devised. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, MGP-STR achieves an average recognition accuracy of $94\%$ on standard benchmarks for scene text recognition. Moreover, it also achieves state-of-the-art results on widely-used handwritten benchmarks as well as more challenging scene text datasets, demonstrating the generality of the proposed MGP-STR algorithm. The source code and models will be available at: \url{https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR}.
methods: 该论文提出了一个名为时尚矩阵(Fashion Matrix)的分层 AI 系统,只需通过对话即可编辑照片。该系统以 LLM 作为基础支撑,与用户进行迭代交互,支持多种由提示驱动的任务,包括服装或配饰的替换、重新上色、添加和移除。具体而言,系统利用多种语义分割模型(如 Grounded-SAM、MattingAnything 等)根据用户指令划定具体的编辑掩码,再借助视觉基础模型(如 Stable Diffusion、ControlNet 等)依据文本提示和掩码生成编辑后的图像,从而实现时尚编辑流程的自动化。
results: 实验表明,Fashion Matrix 能够充分发挥功能多样的预训练模型在时尚编辑领域的协作潜力。
Abstract
The utilization of Large Language Models (LLMs) for the construction of AI systems has garnered significant attention across diverse fields. The extension of LLMs to the domain of fashion holds substantial commercial potential but also inherent challenges due to the intricate semantic interactions in fashion-related generation. To address this issue, we developed a hierarchical AI system called Fashion Matrix dedicated to editing photos by just talking. This system facilitates diverse prompt-driven tasks, encompassing garment or accessory replacement, recoloring, addition, and removal. Specifically, Fashion Matrix employs LLM as its foundational support and engages in iterative interactions with users. It employs a range of Semantic Segmentation Models (e.g., Grounded-SAM, MattingAnything, etc.) to delineate the specific editing masks based on user instructions. Subsequently, Visual Foundation Models (e.g., Stable Diffusion, ControlNet, etc.) are leveraged to generate edited images from text prompts and masks, thereby facilitating the automation of fashion editing processes. Experiments demonstrate the outstanding ability of Fashion Matrix to explores the collaborative potential of functionally diverse pre-trained models in the domain of fashion editing.
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation
for: Audio-visual segmentation (AVS) task, specifically to segment sounding objects in video frames using audio cues.
methods: Introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features, as well as an audio-aware query-enhanced transformer decoder that explicitly focuses on the segmentation of pinpointed sounding objects based on audio signals.
results: Outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
Abstract
The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in the video frames using audio cues. However, current fusion-based methods have performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel \textbf{Au}dio-aware query-enhanced \textbf{TR}ansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on the segmentation of the pinpointed sounding objects based on audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
methods: The method leverages tensor decomposition, building on the recent TensoRF work; it uses a cloud of local tensors and the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into three vectors that express local feature distributions along the spatial axes and compactly encode a local neural field.
results: The method achieves better rendering quality while using significantly fewer parameters than previous methods such as TensoRF and Instant-NGP.
Abstract
We propose Strivec, a novel neural representation that models a 3D scene as a radiance field with sparsely distributed and compactly factorized local tensor feature grids. Our approach leverages tensor decomposition, following the recent work TensoRF, to model the tensor grids. In contrast to TensoRF which uses a global tensor and focuses on their vector-matrix decomposition, we propose to utilize a cloud of local tensors and apply the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple vectors that express local feature distributions along spatial axes and compactly encode a local neural field. We also apply multi-scale tensor grids to discover the geometry and appearance commonalities and exploit spatial coherence with the tri-vector factorization at multiple local scales. The final radiance field properties are regressed by aggregating neural features from multiple local tensors across all scales. Our tri-vector tensors are sparsely distributed around the actual scene surface, discovered by a fast coarse reconstruction, leveraging the sparsity of a 3D scene. We demonstrate that our model can achieve better rendering quality while using significantly fewer parameters than previous methods, including TensoRF and Instant-NGP.
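A toy sketch of the tri-vector (CP) idea: a rank-R local tensor is stored as three per-axis factor vectors, and the feature at a grid cell is the product of the three axis values summed over components. The rank, grid size, and channel mapping below are illustrative assumptions, not Strivec's actual configuration.

```python
# Minimal sketch: CP-factorized local feature volume (tri-vector representation).
import torch

R, C = 16, 8            # rank (number of components) and feature channels
X = Y = Z = 32          # local grid resolution

vx = torch.randn(R, X)  # per-axis factor vectors along x, y, z
vy = torch.randn(R, Y)
vz = torch.randn(R, Z)
basis = torch.randn(R, C)  # maps component activations to feature channels

def feature_at(ix: int, iy: int, iz: int) -> torch.Tensor:
    comp = vx[:, ix] * vy[:, iy] * vz[:, iz]  # (R,) component activations
    return comp @ basis                       # (C,) local neural feature

print(feature_at(3, 10, 20).shape)  # torch.Size([8])
```

Storage is three R-by-axis vectors per local tensor instead of a full X*Y*Z*C volume, which is where the parameter savings come from.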
Image Segmentation Keras : Implementation of Segnet, FCN, UNet, PSPNet and other models in Keras
results: The paper evaluates and compares these models on several datasets, providing reference results that help researchers and practitioners choose a suitable segmentation model.
Abstract
Semantic segmentation plays a vital role in computer vision tasks, enabling precise pixel-level understanding of images. In this paper, we present a comprehensive library for semantic segmentation, which contains implementations of popular segmentation models like SegNet, FCN, UNet, and PSPNet. We also evaluate and compare these models on several datasets, offering researchers and practitioners a powerful toolset for tackling diverse segmentation challenges.
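For readers unfamiliar with the model family, a minimal Keras encoder-decoder of the kind such a library wraps might look like the sketch below. The depth, class count, and input size are placeholders, and real SegNet/FCN/UNet/PSPNet implementations are substantially deeper and typically use pretrained backbones.

```python
# Minimal sketch: a tiny encoder-decoder for per-pixel classification in Keras.
from tensorflow.keras import layers, models

def tiny_segmenter(n_classes=21, input_shape=(224, 224, 3)):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D()(x)                      # encoder: downsample
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D()(x)                      # decoder: restore resolution
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)  # per-pixel class scores
    return models.Model(inp, out)

model = tiny_segmenter()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```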
GeoTransformer: Fast and Robust Point Cloud Registration with Geometric Transformer
results: Experiments show that GeoTransformer attains high matching accuracy without RANSAC, improving both correspondence quality and registration accuracy. On the challenging 3DLoMatch benchmark, the method improves the inlier ratio by 18-31 percentage points and the registration recall by over 7 points.
Abstract
We study the problem of extracting accurate correspondences for point cloud registration. Recent keypoint-free methods have shown great potential through bypassing the detection of repeatable keypoints which is difficult to do especially in low-overlap scenarios. They seek correspondences over downsampled superpoints, which are then propagated to dense points. Superpoints are matched based on whether their neighboring patches overlap. Such sparse and loose matching requires contextual features capturing the geometric structure of the point clouds. We propose Geometric Transformer, or GeoTransformer for short, to learn geometric feature for robust superpoint matching. It encodes pair-wise distances and triplet-wise angles, making it invariant to rigid transformation and robust in low-overlap cases. The simplistic design attains surprisingly high matching accuracy such that no RANSAC is required in the estimation of alignment transformation, leading to $100$ times acceleration. Extensive experiments on rich benchmarks encompassing indoor, outdoor, synthetic, multiway and non-rigid demonstrate the efficacy of GeoTransformer. Notably, our method improves the inlier ratio by $18{\sim}31$ percentage points and the registration recall by over $7$ points on the challenging 3DLoMatch benchmark. Our code and models are available at \url{https://github.com/qinzheng93/GeoTransformer}.
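The invariance argument behind encoding pair-wise distances and triplet-wise angles can be checked numerically: applying a random rigid transform to a set of superpoint coordinates leaves both quantities unchanged. The snippet below is a sanity check of that property, not GeoTransformer's actual embedding.

```python
# Sanity check: distances and angles are invariant to rigid transformations.
import numpy as np

pts = np.random.rand(5, 3)

q, _ = np.linalg.qr(np.random.randn(3, 3))
R = q * np.sign(np.linalg.det(q))      # proper rotation
t = np.random.randn(3)
pts_T = pts @ R.T + t                  # rigidly transformed copy

def pairwise_dist(p):
    return np.linalg.norm(p[:, None] - p[None, :], axis=-1)

def angle(p, i, j, k):                 # angle at point i formed by points j and k
    u, v = p[j] - p[i], p[k] - p[i]
    return np.arccos(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

assert np.allclose(pairwise_dist(pts), pairwise_dist(pts_T))
assert np.isclose(angle(pts, 0, 1, 2), angle(pts_T, 0, 1, 2))
print("distances and angles preserved under rigid transform")
```

This is why features built from such quantities remain stable when the two point clouds are arbitrarily rotated and translated relative to each other.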
An Investigation into Glomeruli Detection in Kidney H&E and PAS Images using YOLO
paper_authors: Kimia Hemmatirad, Morteza Babaie, Jeffrey Hodgin, Liron Pantanowitz, H. R. Tizhoosh
for: This paper aims to assist pathologists in detecting glomeruli in human kidney images using computerized solutions, specifically the YOLO-v4 object detector.
methods: The YOLO-v4 model was used to detect glomeruli in human kidney images and was trained on whole slide images. The model was fine-tuned on two public datasets and a private dataset from the University of Michigan, and tested on the private dataset using two different stains (H&E and PAS).
results: The results show that the YOLO-v4 model achieves high specificity and sensitivity in detecting glomeruli in human kidney images, with average specificity and sensitivity reported across all experiments, and its performance compares favorably with existing segmentation methods on the same datasets.
Abstract
Context: Analyzing digital pathology images is necessary to draw diagnostic conclusions by investigating tissue patterns and cellular morphology. However, manual evaluation can be time-consuming, expensive, and prone to inter- and intra-observer variability. Objective: To assist pathologists using computerized solutions, automated tissue structure detection and segmentation must be proposed. Furthermore, generating pixel-level object annotations for histopathology images is expensive and time-consuming. As a result, detection models with bounding box labels may be a feasible solution. Design: This paper studies YOLO-v4 (You-Only-Look-Once), a real-time object detector for microscopic images. YOLO uses a single neural network to predict several bounding boxes and class probabilities for objects of interest. YOLO can enhance detection performance by training on whole slide images. YOLO-v4 has been used in this paper for glomeruli detection in human kidney images. Multiple experiments have been designed and conducted based on different training data of two public datasets and a private dataset from the University of Michigan for fine-tuning the model. The model was tested on the private dataset from the University of Michigan, serving as an external validation of two different stains, namely hematoxylin and eosin (H&E) and periodic acid-Schiff (PAS). Results: Average specificity and sensitivity for all experiments, and comparison with existing segmentation methods on the same datasets, are discussed. Conclusions: Automated glomeruli detection in human kidney images is possible using modern AI models. The design and validation for different stains still depends on the variability of public multi-stain datasets.
Does Progress On Object Recognition Benchmarks Improve Real-World Generalization?
methods: The paper uses two datasets of objects from households across the globe and conducts an extensive empirical study, evaluating nearly 100 vision models up to recent foundation models.
results: The study finds that models exhibit large geographic performance disparities, and even foundation CLIP models show substantial gaps across regions, with progress on standard benchmarks failing to close these gaps. It also finds that simply retraining the last layer on more representative, curated data reduces the geographic disparity.
Abstract
For more than a decade, researchers have measured progress in object recognition on ImageNet-based generalization benchmarks such as ImageNet-A, -C, and -R. Recent advances in foundation models, trained on orders of magnitude more data, have begun to saturate these standard benchmarks, but remain brittle in practice. This suggests standard benchmarks, which tend to focus on predefined or synthetic changes, may not be sufficient for measuring real world generalization. Consequently, we propose studying generalization across geography as a more realistic measure of progress using two datasets of objects from households across the globe. We conduct an extensive empirical evaluation of progress across nearly 100 vision models up to most recent foundation models. We first identify a progress gap between standard benchmarks and real-world, geographical shifts: progress on ImageNet results in up to 2.5x more progress on standard generalization benchmarks than real-world distribution shifts. Second, we study model generalization across geographies by measuring the disparities in performance across regions, a more fine-grained measure of real world generalization. We observe all models have large geographic disparities, even foundation CLIP models, with differences of 7-20% in accuracy between regions. Counter to modern intuition, we discover progress on standard benchmarks fails to improve geographic disparities and often exacerbates them: geographic disparities between the least performant models and today's best models have more than tripled. Our results suggest scaling alone is insufficient for consistent robustness to real-world distribution shifts. Finally, we highlight in early experiments how simple last layer retraining on more representative, curated data can complement scaling as a promising direction of future work, reducing geographic disparity on both benchmarks by over two-thirds.
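The last-layer retraining probe mentioned at the end can be sketched as follows: freeze a pretrained backbone and fit only the final linear head on a curated, more geographically representative subset. The backbone choice, class count, and optimizer settings below are placeholders, not the paper's protocol.

```python
# Minimal sketch: freeze a pretrained backbone and retrain only the final linear layer.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")     # placeholder backbone
for p in model.parameters():
    p.requires_grad = False                          # freeze all backbone weights
model.fc = nn.Linear(model.fc.in_features, 100)      # new head (100 classes assumed)

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    # images/labels would come from the curated, geographically representative subset
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
    return loss.item()
```

Because only the head is optimized, this probe is cheap relative to full fine-tuning, which is what makes it attractive as a complement to scaling.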
simPLE: a visuotactile method learned in simulation to precisely pick, localize, regrasp, and place objects
paper_authors: Maria Bauza, Antonia Bronars, Yifan Hou, Ian Taylor, Nikhil Chavan-Dafle, Alberto Rodriguez
for: This paper aims to solve the problem of precise robotic pick-and-place.
methods: The paper proposes a simulation-based visuotactile method, called simPLE, that enables a robot to precisely pick, regrasp, and place objects of many different shapes given only the object CAD model and no prior experience.
results: In experiments on a dual-arm robot equipped with visuotactile sensing, simPLE successfully placed 15 diverse objects into structured arrangements with 1 mm clearance, succeeding over 90% of the time for 6 objects and over 80% of the time for 11 objects.
Abstract
Existing robotic systems have a clear tension between generality and precision. Deployed solutions for robotic manipulation tend to fall into the paradigm of one robot solving a single task, lacking precise generalization, i.e., the ability to solve many tasks without compromising on precision. This paper explores solutions for precise and general pick-and-place. In precise pick-and-place, i.e. kitting, the robot transforms an unstructured arrangement of objects into an organized arrangement, which can facilitate further manipulation. We propose simPLE (simulation to Pick Localize and PLacE) as a solution to precise pick-and-place. simPLE learns to pick, regrasp and place objects precisely, given only the object CAD model and no prior experience. We develop three main components: task-aware grasping, visuotactile perception, and regrasp planning. Task-aware grasping computes affordances of grasps that are stable, observable, and favorable to placing. The visuotactile perception model relies on matching real observations against a set of simulated ones through supervised learning. Finally, we compute the desired robot motion by solving a shortest path problem on a graph of hand-to-hand regrasps. On a dual-arm robot equipped with visuotactile sensing, we demonstrate pick-and-place of 15 diverse objects with simPLE. The objects span a wide range of shapes and simPLE achieves successful placements into structured arrangements with 1mm clearance over 90% of the time for 6 objects, and over 80% of the time for 11 objects. Videos are available at http://mcube.mit.edu/research/simPLE.html .
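The regrasp-planning step ("solving a shortest path problem on a graph of hand-to-hand regrasps") reduces to standard graph search once candidate grasps and transition costs are given. The node names and edge weights below are invented purely for illustration.

```python
# Minimal sketch: shortest-path planning over a graph of candidate regrasps.
import networkx as nx

G = nx.Graph()
# Nodes are grasp states; edge weights score the cost/risk of a hand-to-hand regrasp.
G.add_edge("pick_grasp_A", "regrasp_1", weight=0.4)
G.add_edge("pick_grasp_A", "regrasp_2", weight=0.9)
G.add_edge("regrasp_1", "place_grasp", weight=0.5)
G.add_edge("regrasp_2", "place_grasp", weight=0.2)

plan = nx.shortest_path(G, "pick_grasp_A", "place_grasp", weight="weight")
print(plan)  # e.g. ['pick_grasp_A', 'regrasp_1', 'place_grasp']
```

In the actual system the edge costs would be derived from the task-aware grasp affordances and perception estimates rather than fixed constants.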
Deep Learning Approaches for Data Augmentation in Medical Imaging: A Review
results: The paper reviews the performance of these models across different downstream tasks, including classification, segmentation, and cross-modal translation, evaluates their strengths and limitations, and suggests directions for future research.
Abstract
Deep learning has become a popular tool for medical image analysis, but the limited availability of training data remains a major challenge, particularly in the medical field where data acquisition can be costly and subject to privacy regulations. Data augmentation techniques offer a solution by artificially increasing the number of training samples, but these techniques often produce limited and unconvincing results. To address this issue, a growing number of studies have proposed the use of deep generative models to generate more realistic and diverse data that conform to the true distribution of the data. In this review, we focus on three types of deep generative models for medical image augmentation: variational autoencoders, generative adversarial networks, and diffusion models. We provide an overview of the current state of the art in each of these models and discuss their potential for use in different downstream tasks in medical imaging, including classification, segmentation, and cross-modal translation. We also evaluate the strengths and limitations of each model and suggest directions for future research in this field. Our goal is to provide a comprehensive review about the use of deep generative models for medical image augmentation and to highlight the potential of these models for improving the performance of deep learning algorithms in medical image analysis.
Automatic Infant Respiration Estimation from Video: A Deep Flow-based Algorithm and a Novel Public Benchmark
paper_authors: Sai Kumar Reddy Manne, Shaotong Zhu, Sarah Ostadabbas, Michael Wan
for: This paper aims to develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings, with the goal of providing fully automatic, continuous, and contactless respiratory monitoring for infants.
methods: The proposed method, called AIRFlowNet, combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. The model is trained using a novel spectral bandpass loss function and a public annotated infant respiration dataset (AIR-125) with 125 videos drawn from eight infant subjects.
results: AIRFlowNet significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of $\sim$2.9 breaths per minute.
Abstract
Respiration is a critical vital sign for infants, and continuous respiratory monitoring is particularly important for newborns. However, neonates are sensitive and contact-based sensors present challenges in comfort, hygiene, and skin health, especially for preterm babies. As a step toward fully automatic, continuous, and contactless respiratory monitoring, we develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings. Our automated infant respiration flow-based network (AIRFlowNet) combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. We support our model with the first public annotated infant respiration dataset with 125 videos (AIR-125), drawn from eight infant subjects in varied pose, lighting, and camera conditions. We include manual respiration annotations and optimize AIRFlowNet training on them using a novel spectral bandpass loss function. When trained and tested on the AIR-125 infant data, our method significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of $\sim$2.9 breaths per minute, compared to $\sim$4.7--6.2 for other public models designed for adult subjects and more uniform environments.
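The optical-flow front end can be approximated with a standard dense flow routine. The Farneback parameters and file name below are generic placeholders, not the preprocessing actually used for AIRFlowNet.

```python
# Minimal sketch: dense optical flow between consecutive grayscale video frames.
import cv2

cap = cv2.VideoCapture("infant.mp4")          # placeholder file name
ok, prev = cap.read()
assert ok, "could not read video"
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) per frame pair
    flows.append(flow)
    prev_gray = gray
cap.release()
print(len(flows), "flow fields extracted")
```

A network like the one described above would then consume these flow fields (rather than raw RGB) to pick up the subtle chest and abdominal motion associated with breathing.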
for: The goal of this work is to simultaneously detect multiple different OOD scenarios in a fine-grained manner, improving the safety and reliability of ML systems.
methods: We propose a general-purpose weakly-supervised OOD detection framework, called WOOD, which combines a binary classifier with a contrastive learning component to reap the benefits of both. A hinge loss is adopted to constrain the similarity between the latent representations of ID and OOD samples.
results: We evaluate the proposed WOOD model on multiple real-world datasets and achieve higher OOD detection accuracy than state-of-the-art methods. In particular, our approach attains high accuracy in three different OOD scenarios simultaneously.
Abstract
Out-of-distribution (OOD) detection identifies test samples that differ from the training data, which is critical to ensuring the safety and reliability of machine learning (ML) systems. While a plethora of methods have been developed to detect uni-modal OOD samples, only a few have focused on multi-modal OOD detection. Current contrastive learning-based methods primarily study multi-modal OOD detection in a scenario where both a given image and its corresponding textual description come from a new domain. However, real-world deployments of ML systems may face more anomaly scenarios caused by multiple factors like sensor faults, bad weather, and environmental changes. Hence, the goal of this work is to simultaneously detect from multiple different OOD scenarios in a fine-grained manner. To reach this goal, we propose a general-purpose weakly-supervised OOD detection framework, called WOOD, that combines a binary classifier and a contrastive learning component to reap the benefits of both. In order to better distinguish the latent representations of in-distribution (ID) and OOD samples, we adopt the Hinge loss to constrain their similarity. Furthermore, we develop a new scoring metric to integrate the prediction results from both the binary classifier and contrastive learning for identifying OOD samples. We evaluate the proposed WOOD model on multiple real-world datasets, and the experimental results demonstrate that the WOOD model outperforms the state-of-the-art methods for multi-modal OOD detection. Importantly, our approach is able to achieve high accuracy in OOD detection in three different OOD scenarios simultaneously. The source code will be made publicly available upon publication.
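A hedged sketch of the two ingredients described above: a hinge-style loss that pushes ID/OOD embedding similarity below a margin, and a score that mixes the binary classifier's output with a contrastive similarity term. The margin, prototype-based similarity, and weighting below are assumptions, not the paper's exact formulation.

```python
# Minimal sketch: hinge-constrained ID/OOD similarity and a combined OOD score.
import torch
import torch.nn.functional as F

def hinge_similarity_loss(id_emb: torch.Tensor, ood_emb: torch.Tensor, margin: float = 0.2):
    # cosine similarity between every ID/OOD embedding pair, penalized above the margin
    sim = F.cosine_similarity(id_emb.unsqueeze(1), ood_emb.unsqueeze(0), dim=-1)
    return F.relu(sim - margin).mean()

def combined_ood_score(cls_logit, test_emb, id_prototype, alpha: float = 0.5):
    p_id = torch.sigmoid(cls_logit)                       # binary classifier's ID probability
    sim = F.cosine_similarity(test_emb, id_prototype, dim=-1)
    return alpha * (1 - p_id) + (1 - alpha) * (1 - sim)   # higher score = more likely OOD

loss = hinge_similarity_loss(torch.randn(8, 128), torch.randn(8, 128))
score = combined_ood_score(torch.randn(4), torch.randn(4, 128), torch.randn(128))
print(loss.item(), score.shape)
```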
On the characteristics of natural hydraulic dampers: An image-based approach to study the fluid flow behaviour inside the human meniscal tissue
results: The study finds statistically significant correlations between fluid flow inside the meniscus and structural parameters (tortuosity, connectivity, porosity, pore size). Some channels reach Re values of 1400 at an inlet velocity of 1.6 m/s, and a transition from Darcy's regime to a non-Darcian regime occurs around an inlet velocity of 0.02 m/s. Location-dependent permeability ranges from 20-32 Darcy. Fluid velocity correlates strongly with tortuosity at high inlet velocities, and with channel diameter at low inlet velocities.
Abstract
The meniscal tissue is a layered material with varying properties influenced by collagen content and arrangement. Understanding the relationship between structure and properties is crucial for disease management, treatment development, and biomaterial design. The internal layer of the meniscus is softer and more deformable than the outer layers, thanks to interconnected collagen channels that guide fluid flow. To investigate these relationships, we propose a novel approach that combines Computational Fluid Dynamics (CFD) with Image Analysis (CFD-IA). We analyze fluid flow in the internal architecture of the human meniscus across a range of inlet velocities (0.1mm/s to 1.6m/s) using high-resolution 3D micro-computed tomography scans. Statistical correlations are observed between architectural parameters (tortuosity, connectivity, porosity, pore size) and fluid flow parameters (Re number distribution, permeability). Some channels exhibit Re values of 1400 at an inlet velocity of 1.6m/s, and a transition from Darcy's regime to a non-Darcian regime occurs around an inlet velocity of 0.02m/s. Location-dependent permeability ranges from 20-32 Darcy. Regression modelling reveals a strong correlation between fluid velocity and tortuosity at high inlet velocities, as well as with channel diameter at low inlet velocities. At higher inlet velocities, flow paths deviate more from the preferential direction, resulting in a decrease in the concentration parameter by an average of 0.4. This research provides valuable insights into the fluid flow behaviour within the meniscus and its structural influences.
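The two flow quantities reported above reduce to textbook formulas. The sketch below computes a channel Reynolds number and a Darcy permeability, using water-like fluid properties and a 1 mm channel diameter as placeholder values rather than the paper's actual parameters.

```python
# Back-of-the-envelope: channel Reynolds number and Darcy-law permeability.
RHO = 1000.0   # fluid density, kg/m^3 (water-like placeholder)
MU = 1.0e-3    # dynamic viscosity, Pa*s (water-like placeholder)

def reynolds(velocity_m_s: float, channel_diameter_m: float) -> float:
    return RHO * velocity_m_s * channel_diameter_m / MU

def permeability_darcy(flux_m_s: float, pressure_gradient_pa_m: float) -> float:
    # Darcy's law: q = -(k / mu) * dP/dx  ->  k = q * mu / |dP/dx|
    k_m2 = flux_m_s * MU / abs(pressure_gradient_pa_m)
    return k_m2 / 9.869e-13   # convert m^2 to darcy

print(reynolds(1.6, 1e-3))    # ~1600 for a 1 mm channel at 1.6 m/s
print(permeability_darcy(1e-4, 1e5))
```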
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
paper_authors: Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr
for: This paper provides a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models, image-text matching models, and text-to-image generation models.
methods: The paper discusses various prompting methods for vision-language models, including manually created natural language instructions and automatically generated prompts in the form of natural language instructions or vector representations.
results: The paper summarizes and discusses the results of prompt engineering on vision-language models, including the ability to perform predictions based solely on prompts without updating model parameters and the easier application of large pre-trained models to real-world tasks. It also discusses the commonalities and differences between prompting on vision-language models, language models, and vision models, as well as the challenges, future directions, and research opportunities in this field.
Abstract
Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering enables the ability to perform predictions based solely on prompts without updating model parameters, and the easier application of large pre-trained models in real-world tasks. In past years, Prompt engineering has been well-studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently a lack of a systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g. Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation models (e.g. Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.
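As a concrete instance of prompting an image-text matching model without updating parameters, the sketch below runs zero-shot classification with CLIP using a manual prompt template. The checkpoint, labels, and template are common defaults chosen for illustration, not tied to any particular method in the survey.

```python
# Minimal sketch: zero-shot classification via prompting a CLIP model.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "horse"]
prompts = [f"a photo of a {c}" for c in labels]   # manually created natural-language prompts

inputs = processor(text=prompts, images=Image.open("test.jpg"),
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Automated prompting methods replace the hand-written template with learned text or with continuous vectors prepended to the text encoder's input, but the inference pattern stays the same.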
DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting
results: Experimental results show that DFA3D improves mAP on the nuScenes dataset by +1.41% on average, and by up to +15.1% when high-quality depth information is available.
Abstract
In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based, either use estimated depth to get pseudo LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then utilizes DFA3D to aggregate features from the expanded 3D feature maps. With the help of DFA3D, the depth ambiguity problem can be effectively alleviated from the root, and the lifted features can be progressively refined layer by layer, thanks to the Transformer-like architecture. In addition, we propose a mathematically equivalent implementation of DFA3D which can significantly improve its memory efficiency and computational speed. We integrate DFA3D into several methods that use 2D attention-based feature lifting with only a few modifications in code and evaluate on the nuScenes dataset. The experiment results show a consistent improvement of +1.41\% mAP on average, and up to +15.1\% mAP improvement when high-quality depth information is available, demonstrating the superiority, applicability, and huge potential of DFA3D. The code is available at https://github.com/IDEA-Research/3D-deformable-attention.git.
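The depth-based expansion step that DFA3D starts from (shared with Lift-Splat-style lifting) can be written as an outer product between a 2D feature map and a per-pixel depth distribution. The shapes below are arbitrary, and the 3D deformable attention itself is not reproduced here.

```python
# Minimal sketch: expand a 2D feature map along D depth bins using estimated depth.
import torch

B, C, H, W, D = 1, 64, 32, 88, 48
feat_2d = torch.randn(B, C, H, W)        # per-view image features
depth_logits = torch.randn(B, D, H, W)   # estimated per-pixel depth distribution
depth_prob = depth_logits.softmax(dim=1)

# outer product over the depth axis -> expanded 3D feature map of shape (B, C, D, H, W)
feat_3d = feat_2d.unsqueeze(2) * depth_prob.unsqueeze(1)
print(feat_3d.shape)  # torch.Size([1, 64, 48, 32, 88])
```

DFA3D then aggregates from such expanded maps with deformable attention and refines the lifted features layer by layer, rather than splatting them once.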
Volcanic ash delimitation using Artificial Intelligence based on Pix2Pix
results: The evaluation shows that the method accurately delineates ash clouds and is applicable to any area of the world, making it a useful tool for the prevention and mitigation of the impact of volcanic eruptions and for risk management.
Abstract
Volcanic eruptions emit ash that can be harmful to human health and cause damage to infrastructure, economic activities and the environment. Delimiting ash clouds makes it possible to track their behavior and dispersion, which helps in the prevention and mitigation of this phenomenon. Traditional methods take advantage of specialized software programs to process the bands or channels that compose the satellite images. However, their use is limited to experts and demands a lot of time and significant computational resources. In recent years, Artificial Intelligence has been a milestone in the computational treatment of complex problems in different areas. In particular, Deep Learning techniques allow automatic, fast and accurate processing of digital images. The present work proposes the use of the Pix2Pix model, a type of generative adversarial network that, once trained, learns the mapping from input images to output images. The architecture of such a network, consisting of a generator and a discriminator, provides the versatility needed to produce black and white ash cloud images from multispectral satellite images. The evaluation of the model, based on loss and accuracy plots, a confusion matrix, and visual inspection, indicates a satisfactory solution for accurate ash cloud delineation that is applicable in any area of the world and can become a useful tool in risk management.
Learning Dense Correspondences between Photos and Sketches
paper_authors: Xuanchen Lu, Xiaolong Wang, Judith E Fan
for: The paper aims to support the ability of artificial systems to understand visual images at different levels of abstraction, with a focus on sketch-photo correspondence.
methods: The paper introduces a new sketch-photo correspondence benchmark called $\textit{PSC6k}$, which contains 150K annotations of 6250 sketch-photo pairs across 125 object categories. The authors also propose a self-supervised method for learning dense correspondences between sketch-photo pairs, using a spatial transformer network to estimate the warp flow between latent representations of a sketch and photo.
results: The authors found that their approach outperformed several strong baselines and produced predictions that were quantitatively consistent with other warp-based methods. However, their benchmark also revealed systematic differences between predictions of the suite of models they tested and those of humans.
Abstract
Humans effortlessly grasp the connection between sketches and real-world objects, even when these sketches are far from realistic. Moreover, human sketch understanding goes beyond categorization -- critically, it also entails understanding how individual elements within a sketch correspond to parts of the physical world it represents. What are the computational ingredients needed to support this ability? Towards answering this question, we make two contributions: first, we introduce a new sketch-photo correspondence benchmark, $\textit{PSC6k}$, containing 150K annotations of 6250 sketch-photo pairs across 125 object categories, augmenting the existing Sketchy dataset with fine-grained correspondence metadata. Second, we propose a self-supervised method for learning dense correspondences between sketch-photo pairs, building upon recent advances in correspondence learning for pairs of photos. Our model uses a spatial transformer network to estimate the warp flow between latent representations of a sketch and photo extracted by a contrastive learning-based ConvNet backbone. We found that this approach outperformed several strong baselines and produced predictions that were quantitatively consistent with other warp-based methods. However, our benchmark also revealed systematic differences between predictions of the suite of models we tested and those of humans. Taken together, our work suggests a promising path towards developing artificial systems that achieve more human-like understanding of visual images at different levels of abstraction. Project page: https://photo-sketch-correspondence.github.io
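Applying an estimated dense warp to resample one image toward another is a one-liner with grid_sample. In the sketch below the flow field is random noise standing in for the spatial transformer's prediction, and the resolution is arbitrary.

```python
# Minimal sketch: warp an image (or feature map) with a predicted dense flow field.
import torch
import torch.nn.functional as F

B, C, H, W = 1, 3, 64, 64
photo = torch.randn(B, C, H, W)
flow = 0.05 * torch.randn(B, H, W, 2)   # predicted offsets in normalized coords (placeholder)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)   # (1, H, W, 2), x before y for grid_sample

warped = F.grid_sample(photo, identity + flow, align_corners=True)
print(warped.shape)  # torch.Size([1, 3, 64, 64])
```

Dense correspondences then fall out of the flow field itself: each output location records which source location it was sampled from.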
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
for: This paper focuses on the task of text-to-video retrieval, specifically addressing the issue of neglecting audio information in previous methods.
methods: The proposed method, TEFAL, uses two independent cross-modal attention blocks to enable the text to attend to the audio and video representations separately, producing both audio and video representations conditioned on the text query.
results: The proposed method achieves better than state-of-the-art performance consistently across four benchmark datasets, including MSR-VTT, LSMDC, VATEX, and Charades, demonstrating its efficacy in capturing complementary audio and video information pertinent to the text query.
Abstract
Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding the audio signal for this task. Nevertheless, a recent advancement by ECLIPSE has improved long-range text-to-video retrieval by developing an audiovisual video representation. Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. Our proposed method's efficacy is demonstrated on four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and Charades, and achieves better than state-of-the-art performance consistently across the four datasets. This is attributed to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation.
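The two independent text-conditioned cross-attention blocks can be sketched with standard multi-head attention: the text acts as the query against audio keys/values and, separately, against video keys/values. The dimensions and the mean-pool-and-concatenate fusion below are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch: text query attends to audio and video features in two separate blocks.
import torch
import torch.nn as nn

class TextConditionedFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_audio = nn.MultiheadAttention(dim, heads, batch_first=True)  # text -> audio
        self.attn_video = nn.MultiheadAttention(dim, heads, batch_first=True)  # text -> video

    def forward(self, text, audio, video):  # (B, L_text/L_audio/L_video, dim)
        a_cond, _ = self.attn_audio(query=text, key=audio, value=audio)
        v_cond, _ = self.attn_video(query=text, key=video, value=video)
        # pool over the text tokens and concatenate the two conditioned representations
        return torch.cat([a_cond.mean(dim=1), v_cond.mean(dim=1)], dim=-1)

fusion = TextConditionedFusion()
out = fusion(torch.randn(2, 8, 512), torch.randn(2, 30, 512), torch.randn(2, 12, 512))
print(out.shape)  # torch.Size([2, 1024])
```

Keeping the two blocks separate avoids letting strong video evidence drown out the audio signal relevant to the text query, which is the motivation given above.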
HOOD: Real-Time Robust Human Presence and Out-of-Distribution Detection with Low-Cost FMCW Radar
results: On a dataset collected with a 60 GHz short-range FMCW radar, the method achieves an average AUROC of 94.36%. In addition, HOOD outperforms previous state-of-the-art (SOTA) OOD detection methods on common OOD detection metrics. Real-time experiments are available at https://muskahya.github.io/HOOD.
Abstract
Human presence detection in indoor environments using millimeter-wave frequency-modulated continuous-wave (FMCW) radar is challenging due to the presence of moving and stationary clutters in indoor places. This work proposes "HOOD" as a real-time robust human presence and out-of-distribution (OOD) detection method by exploiting 60 GHz short-range FMCW radar. We approach the presence detection application as an OOD detection problem and solve the two problems simultaneously using a single pipeline. Our solution relies on a reconstruction-based architecture and works with radar macro and micro range-Doppler images (RDIs). HOOD aims to accurately detect the "presence" of humans in the presence or absence of moving and stationary disturbers. Since it is also an OOD detector, it aims to detect moving or stationary clutters as OOD in humans' absence and predicts the current scene's output as "no presence." HOOD is an activity-free approach that performs well in different human scenarios. On our dataset collected with a 60 GHz short-range FMCW Radar, we achieve an average AUROC of 94.36%. Additionally, our extensive evaluations and experiments demonstrate that HOOD outperforms state-of-the-art (SOTA) OOD detection methods in terms of common OOD detection metrics. Our real-time experiments are available at: https://muskahya.github.io/HOOD
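A reconstruction-based presence/OOD detector of the kind described can be sketched as an autoencoder over range-Doppler images whose reconstruction error serves as the score. The toy architecture, input size, and scoring rule below are placeholders, not HOOD's actual pipeline.

```python
# Minimal sketch: reconstruction error of an autoencoder as a presence/OOD score.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                      # toy encoder-decoder for 64x64 RDIs
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
)

def ood_score(rdi: torch.Tensor) -> torch.Tensor:
    # rdi: (B, 1, 64, 64) range-Doppler images; the model is assumed trained on "human present" data
    recon = autoencoder(rdi)
    return ((recon - rdi) ** 2).mean(dim=(1, 2, 3))  # high error -> OOD / "no presence"

print(ood_score(torch.randn(4, 1, 64, 64)).shape)    # torch.Size([4])
```

Because clutter-only scenes reconstruct poorly under a model trained on human-present data, thresholding this score yields both the presence decision and the OOD flag from a single pipeline.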
Dyn-E: Local Appearance Editing of Dynamic Neural Radiance Fields
results: Extensive evaluation shows that the method can accurately edit the appearance of dynamic NeRFs while maintaining spatial and temporal consistency.
Abstract
Recently, the editing of neural radiance fields (NeRFs) has gained considerable attention, but most prior works focus on static scenes while research on the appearance editing of dynamic scenes is relatively lacking. In this paper, we propose a novel framework to edit the local appearance of dynamic NeRFs by manipulating pixels in a single frame of training video. Specifically, to locally edit the appearance of dynamic NeRFs while preserving unedited regions, we introduce a local surface representation of the edited region, which can be inserted into and rendered along with the original NeRF and warped to arbitrary other frames through a learned invertible motion representation network. By employing our method, users without professional expertise can easily add desired content to the appearance of a dynamic scene. We extensively evaluate our approach on various scenes and show that our approach achieves spatially and temporally consistent editing results. Notably, our approach is versatile and applicable to different variants of dynamic NeRF representations.