cs.CV - 2023-10-06

An Algorithm to Train Unrestricted Sequential Discrete Morphological Neural Networks

  • paper_url: http://arxiv.org/abs/2310.04584
  • repo_url: None
  • paper_authors: Diego Marcondes, Mariana Feldman, Junior Barrera
  • for: This paper concerns inserting mathematical morphology (MM) operators into convolutional neural networks (CNN) within a deep learning framework, and the application of such operators to binary image transformation.
  • methods: The approach builds on Discrete Morphological Neural Networks (DMNN), which can represent specific classes of W-operators, and learns the parameters of these operators with a machine learning algorithm (a stochastic lattice gradient descent over Boolean functions).
  • results: The proposed unrestricted sequential DMNN performs well on binary image transformation and, unlike the canonical variant, can be applied when the practitioner lacks strong domain knowledge.
    Abstract With the advent of deep learning, there have been attempts to insert mathematical morphology (MM) operators into convolutional neural networks (CNN), and the most successful endeavor to date has been the morphological neural networks (MNN). Although MNN have performed better than CNN in solving some problems, they inherit their black-box nature. Furthermore, in the case of binary images, they are approximations, which lose the Boolean lattice structure of MM operators and, thus, it is not possible to represent a specific class of W-operators with desired properties. In a recent work, we proposed the Discrete Morphological Neural Networks (DMNN) for binary image transformation to represent specific classes of W-operators and estimate them via machine learning. We also proposed a stochastic lattice gradient descent algorithm (SLGDA) to learn the parameters of Canonical Discrete Morphological Neural Networks (CDMNN), whose architecture is composed only of operators that can be decomposed as the supremum, infimum, and complement of erosions and dilations. In this paper, we propose an algorithm to learn unrestricted sequential DMNN (USDMNN), whose architecture is given by the composition of general W-operators. We consider the representation of a W-operator by its characteristic Boolean function, and then learn it via a SLGDA in the Boolean lattice of functions. Although both the CDMNN and USDMNN have the Boolean lattice structure, USDMNN are not as dependent on prior information about the problem at hand, and may be more suitable in instances in which the practitioner does not have strong domain knowledge. We illustrate the algorithm in a practical example.
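
The lattice descent idea can be pictured with a toy, hypothetical sketch (not the authors' implementation): a W-operator on a small window is stored as its characteristic Boolean function, i.e. a truth table over the 2^n window patterns, and a stochastic lattice descent flips one truth-table entry at a time whenever the flip reduces empirical error on training pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 9                        # 3x3 window -> 2**9 possible binary patterns
num_samples = 5000

# Hypothetical training data: observed window patterns and the desired binary output.
X = rng.integers(0, 2, size=(num_samples, n))
target_f = rng.integers(0, 2, size=2 ** n)        # unknown "true" W-operator (for simulation only)
pattern_ids = X.dot(1 << np.arange(n))            # encode each pattern as an integer index
y = target_f[pattern_ids]

# Characteristic Boolean function of the learned W-operator (a point in the Boolean lattice).
f = np.zeros(2 ** n, dtype=int)

def empirical_error(f):
    return np.mean(f[pattern_ids] != y)

# Stochastic lattice descent: at each step, test a random batch of candidate bit flips
# (neighbours of f in the Boolean lattice of functions) and keep the best improving one.
for step in range(200):
    candidates = rng.integers(0, 2 ** n, size=64)
    best_err, best_bit = empirical_error(f), None
    for b in candidates:
        f[b] ^= 1
        err = empirical_error(f)
        if err < best_err:
            best_err, best_bit = err, b
        f[b] ^= 1
    if best_bit is None:
        break
    f[best_bit] ^= 1

print("final empirical error:", empirical_error(f))
```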

Universal Humanoid Motion Representations for Physics-Based Control

  • paper_url: http://arxiv.org/abs/2310.04582
  • repo_url: None
  • paper_authors: Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu
  • for: The paper presents a universal motion representation for physics-based humanoid control that covers a comprehensive range of motor skills across control tasks.
  • methods: The representation is distilled from a motion imitator trained on a large, unstructured human motion dataset, using an encoder-decoder structure with a variational information bottleneck; a prior conditioned on proprioception (the humanoid's own pose and velocities) is learned jointly to improve model expressiveness and sampling efficiency.
  • results: Sampling from the prior yields long, stable, and diverse human motions; hierarchical RL policies built on the latent space solve generative tasks (e.g., strike, terrain traversal) and VR-controller motion tracking with natural and realistic human behavior.
    Abstract We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high-dimensionality of humanoid control as well as the inherent difficulties in reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers its applicability in complex tasks. Our work closes this gap, significantly increasing the coverage of motion representation space. To achieve this, we first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. Sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using natural and realistic human behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks (e.g. strike, terrain traversal) and motion tracking using VR controllers.
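
A minimal sketch of the encoder-decoder-with-bottleneck idea, with hypothetical dimensions and module names (not the authors' model): the encoder compresses a motion/state pair into a latent skill, a proprioception-conditioned prior regularizes that latent, and the decoder maps the latent plus proprioception to an action.

```python
import torch
import torch.nn as nn

class SkillVIB(nn.Module):
    """Toy variational-information-bottleneck skill space (illustrative only)."""
    def __init__(self, motion_dim=69, proprio_dim=69, latent_dim=32, action_dim=69):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim + proprio_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        self.prior   = nn.Sequential(nn.Linear(proprio_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim + proprio_dim, 256), nn.ReLU(),
                                     nn.Linear(256, action_dim))

    def forward(self, motion, proprio):
        mu_q, logvar_q = self.encoder(torch.cat([motion, proprio], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(proprio).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterization
        action = self.decoder(torch.cat([z, proprio], -1))
        # KL(q || p): the information bottleneck keeps the posterior close to the learned prior.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return action, kl

model = SkillVIB()
action, kl = model(torch.randn(4, 69), torch.randn(4, 69))
print(action.shape, float(kl))
```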

VTON-IT: Virtual Try-On using Image Translation

  • paper_url: http://arxiv.org/abs/2310.04558
  • repo_url: https://github.com/shuntos/viton-it
  • paper_authors: Santosh Adhikari, Bishnu Bhusal, Prashant Ghimire, Anil Shrestha
  • for: This work presents a GAN-based Virtual Try-On application that lets users virtually try on different garments online.
  • methods: The method combines semantic segmentation with a GAN-based image translation network to produce photo-realistic try-on images.
  • results: The approach generates high-resolution natural images that preserve the detailed textures of the human body image.
    Abstract Virtual Try-On (trying clothes virtually) is a promising application of the Generative Adversarial Network (GAN). However, it is an arduous task to transfer the desired clothing item onto the corresponding regions of a human body because of varying body size, pose, and occlusions like hair and overlapped clothes. In this paper, we try to produce photo-realistic translated images through semantic segmentation and a generative adversarial architecture-based image translation network. We present a novel image-based Virtual Try-On application VTON-IT that takes an RGB image, segments desired body part, and overlays target cloth over the segmented body region. Most state-of-the-art GAN-based Virtual Try-On applications produce unaligned pixelated synthesis images on real-life test images. However, our approach generates high-resolution natural images with detailed textures on such variant images.
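
The segment-then-overlay step can be pictured with a naive compositing sketch (hypothetical arrays; the GAN refinement is not shown): the target cloth, warped to the image frame, replaces the pixels inside the segmented body-part mask.

```python
import numpy as np

# Hypothetical inputs: an RGB photo, a warped target-cloth image of the same size,
# and a binary mask of the segmented body part (e.g., from a semantic segmentation net).
person = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
cloth  = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
mask   = np.zeros((256, 192), dtype=bool)
mask[60:180, 40:150] = True   # stand-in for the predicted torso region

# Naive overlay: copy cloth pixels inside the segmented region, keep the person elsewhere.
# A GAN-based image translation network would then refine seams, shading, and texture.
tryon = person.copy()
tryon[mask] = cloth[mask]
print(tryon.shape, tryon.dtype)
```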

MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2310.04551
  • repo_url: None
  • paper_authors: Muhammad Osama Khan, Junbang Liang, Chun-Kai Wang, Shan Yang, Yu Lou
  • for: The paper aims to improve the performance of monocular depth estimation models by proposing a comprehensive framework called MeSa, which leverages the complementary strengths of masked, geometric, and supervised pre-training.
  • methods: The paper uses a combination of pre-training techniques, including masked pre-training, geometric pre-training, and supervised pre-training, to improve the representations of the later layers of the model; the authors use a layer-wise analysis technique called CKA to evaluate the effectiveness of their pre-training strategy.
  • results: The paper demonstrates performance improvements in both the in-distribution and out-of-distribution settings on the NYUv2 and IBims-1 datasets compared to the SOTA SSL method; the approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
    Abstract Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
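
The layer-wise CKA analysis mentioned above can be reproduced with the standard linear CKA formula (Kornblith et al., 2019); the sketch below compares two sets of layer activations with hypothetical shapes.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (num_examples, features)."""
    X = X - X.mean(axis=0, keepdims=True)   # center features over the examples
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Hypothetical activations of the same layer before and after fine-tuning.
rng = np.random.default_rng(0)
pre_ft  = rng.standard_normal((512, 768))
post_ft = pre_ft + 0.5 * rng.standard_normal((512, 768))
print("CKA(pre, post) =", round(float(linear_cka(pre_ft, post_ft)), 3))
print("CKA(pre, pre)  =", round(float(linear_cka(pre_ft, pre_ft)), 3))   # == 1.0
```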

Iris Liveness Detection Competition (LivDet-Iris) – The 2023 Edition

  • paper_url: http://arxiv.org/abs/2310.04541
  • repo_url: None
  • paper_authors: Patrick Tinsley, Sandip Purnapatra, Mahsa Mitcheff, Aidan Boyd, Colton Crum, Kevin Bowyer, Patrick Flynn, Stephanie Schuckers, Adam Czajka, Meiling Fang, Naser Damer, Xingyu Liu, Caiyong Wang, Xianyun Sun, Zhaohua Chang, Xinyue Li, Guangzhe Zhao, Juan Tapia, Christoph Busch, Carlos Aravena, Daniel Schulz
  • for: This report describes the results of the 2023 edition of the ''LivDet'' series of iris presentation attack detection (PAD) competitions.
  • methods: New elements in this edition include GAN-generated iris images as a category of presentation attack instruments (PAI), and an evaluation of human accuracy at detecting PAI as a reference benchmark.
  • results: Clarkson University and the University of Notre Dame contributed the competition image datasets, covering seven PAI categories, as well as baseline PAD algorithms. Fraunhofer IGD, Beijing University of Civil Engineering and Architecture, and Hochschule Darmstadt submitted results for a total of eight PAD algorithms. Accuracy is analyzed by PAI type and compared to human accuracy. Overall, the Fraunhofer IGD algorithm (an attention-based pixel-wise binary supervision network) achieved the best weighted accuracy (average classification error rate of 37.31%), while the Beijing University of Civil Engineering and Architecture algorithm won when each PAI was weighted equally (average classification rate of 22.15%). These results suggest that iris PAD remains a challenging problem.
    Abstract This paper describes the results of the 2023 edition of the ''LivDet'' series of iris presentation attack detection (PAD) competitions. New elements in this fifth competition include (1) GAN-generated iris images as a category of presentation attack instruments (PAI), and (2) an evaluation of human accuracy at detecting PAI as a reference benchmark. Clarkson University and the University of Notre Dame contributed image datasets for the competition, composed of samples representing seven different PAI categories, as well as baseline PAD algorithms. Fraunhofer IGD, Beijing University of Civil Engineering and Architecture, and Hochschule Darmstadt contributed results for a total of eight PAD algorithms to the competition. Accuracy results are analyzed by different PAI types, and compared to human accuracy. Overall, the Fraunhofer IGD algorithm, using an attention-based pixel-wise binary supervision network, showed the best-weighted accuracy results (average classification error rate of 37.31%), while the Beijing University of Civil Engineering and Architecture's algorithm won when equal weights for each PAI were given (average classification rate of 22.15%). These results suggest that iris PAD is still a challenging problem.

The Unreasonable Effectiveness of Linear Prediction as a Perceptual Metric

  • paper_url: http://arxiv.org/abs/2310.05986
  • repo_url: https://github.com/dsevero/linear-autoregressive-similarity-index
  • paper_authors: Daniel Severo, Lucas Theis, Johannes Ballé
  • for: This paper constructs perceptual embeddings of the visual system at inference time, without training data or deep neural network features.
  • methods: The embeddings are solutions to a weighted least squares (WLS) problem defined at the pixel level and solved at inference time, capturing both global and local image characteristics.
  • results: Experiments show the resulting metric is competitive with learned deep-feature methods such as LPIPS and PIM, at a computational cost similar to hand-crafted methods such as MS-SSIM.
    Abstract We show how perceptual embeddings of the visual system can be constructed at inference-time with no training data or deep neural network features. Our perceptual embeddings are solutions to a weighted least squares (WLS) problem, defined at the pixel-level, and solved at inference-time, that can capture global and local image characteristics. The distance in embedding space is used to define a perceptual similarity metric which we call LASI: Linear Autoregressive Similarity Index. Experiments on full-reference image quality assessment datasets show LASI performs competitively with learned deep feature based methods like LPIPS (Zhang et al., 2018) and PIM (Bhardwaj et al., 2020), at a similar computational cost to hand-crafted methods such as MS-SSIM (Wang et al., 2003). We found that increasing the dimensionality of the embedding space consistently reduces the WLS loss while increasing performance on perceptual tasks, at the cost of increasing the computational complexity. LASI is fully differentiable, scales cubically with the number of embedding dimensions, and can be parallelized at the pixel-level. A Maximum Differentiation (MAD) competition (Wang & Simoncelli, 2008) between LASI and LPIPS shows that both methods are capable of finding failure points for the other, suggesting these metrics can be combined.
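
One toy reading of the per-pixel least squares construction (the neighborhood choice, the unweighted solve, and all names below are assumptions, not the paper's exact formulation): each pixel is regressed on its causal neighbors, the fitted coefficients act as the image's embedding, and embedding distance plays the role of the similarity score.

```python
import numpy as np

def causal_features(img, offsets=((0, -1), (-1, 0), (-1, -1), (-1, 1))):
    """Stack each pixel's causal neighbours (left, up, up-left, up-right) as regressors.
    np.roll wraps at the borders, which is acceptable for this toy illustration."""
    cols = []
    for dy, dx in offsets:
        shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        cols.append(shifted.ravel())
    A = np.stack(cols, axis=1)            # (h*w, num_neighbours)
    b = img.ravel()                        # pixel values to predict
    return A, b

def autoregressive_embedding(img):
    """Least-squares autoregressive coefficients of an image (a crude stand-in for LASI)."""
    A, b = causal_features(img.astype(np.float64))
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
ref = rng.random((64, 64))
distorted = np.clip(ref + 0.1 * rng.standard_normal((64, 64)), 0, 1)

d = np.linalg.norm(autoregressive_embedding(ref) - autoregressive_embedding(distorted))
print("embedding-space distance:", round(float(d), 4))
```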

URLOST: Unsupervised Representation Learning without Stationarity or Topology

  • paper_url: http://arxiv.org/abs/2310.04496
  • repo_url: None
  • paper_authors: Zeyu Yun, Juexiao Zhang, Bruno Olshausen, Yann LeCun, Yubei Chen
  • for: The paper develops an unsupervised learning method that learns meaningful representations from high-dimensional data without relying on stationarity or topology.
  • methods: The model combines a learnable self-organizing layer, density-adjusted spectral clustering, and masked autoencoders.
  • results: Compared with existing unsupervised learning methods, the model learns meaningful representations across diverse data modalities and performs strongly on multiple datasets.
    Abstract Unsupervised representation learning has seen tremendous progress but is constrained by its reliance on data modality-specific stationarity and topology, a limitation not found in biological intelligence systems. For instance, human vision processes visual signals derived from irregular and non-stationary sampling lattices yet accurately perceives the geometry of the world. We introduce a novel framework that learns from high-dimensional data lacking stationarity and topology. Our model combines a learnable self-organizing layer, density adjusted spectral clustering, and masked autoencoders. We evaluate its effectiveness on simulated biological vision data, neural recordings from the primary visual cortex, and gene expression datasets. Compared to state-of-the-art unsupervised learning methods like SimCLR and MAE, our model excels at learning meaningful representations across diverse modalities without depending on stationarity or topology. It also outperforms other methods not dependent on these factors, setting a new benchmark in the field. This work represents a step toward unsupervised learning methods that can generalize across diverse high-dimensional data modalities.

Alice Benchmarks: Connecting Real World Object Re-Identification with the Synthetic

  • paper_url: http://arxiv.org/abs/2310.04416
  • repo_url: None
  • paper_authors: Xiaoxiao Sun, Yue Yao, Shengjin Wang, Hongdong Li, Liang Zheng
  • for: This work provides large-scale datasets and evaluation protocols to facilitate research on new approaches for object re-identification (re-ID) that learn from synthetic data.
  • methods: The study reuses the existing PersonX and VehicleX as synthetic source domains and collects two challenging real-world target datasets, AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc.
  • results: The resulting Alice benchmarks provide large-scale datasets and evaluation protocols for synthetic-to-real re-ID, together with an analysis of commonly used domain adaptation methods; an online server will let the community evaluate and compare methods conveniently and fairly.
    Abstract For object re-identification (re-ID), learning from synthetic data has become a promising strategy to cheaply acquire large-scale annotated datasets and effective models, with few privacy concerns. Many interesting research problems arise from this strategy, e.g., how to reduce the domain gap between synthetic source and real-world target. To facilitate developing more new approaches in learning from synthetic data, we introduce the Alice benchmarks, large-scale datasets providing benchmarks as well as evaluation protocols to the research community. Within the Alice benchmarks, two object re-ID tasks are offered: person and vehicle re-ID. We collected and annotated two challenging real-world target datasets: AlicePerson and AliceVehicle, captured under various illuminations, image resolutions, etc. As an important feature of our real target, the clusterability of its training set is not manually guaranteed to make it closer to a real domain adaptation test scenario. Correspondingly, we reuse existing PersonX and VehicleX as synthetic source domains. The primary goal is to train models from synthetic data that can work effectively in the real world. In this paper, we detail the settings of Alice benchmarks, provide an analysis of existing commonly-used domain adaptation methods, and discuss some interesting future directions. An online server will be set up for the community to evaluate methods conveniently and fairly.

CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis

  • paper_url: http://arxiv.org/abs/2310.04414
  • repo_url: https://github.com/sxzrt/CIFAR-10-W
  • paper_authors: Xiaoxiao Sun, Xingjian Leng, Zijian Wang, Yang Yang, Zi Huang, Liang Zheng
  • for: This work studies how machine learning models perform in various unseen (out-of-distribution) environments.
  • methods: The paper introduces the CIFAR-10-Warehouse testbed, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways.
  • results: Extensive benchmarking and comparison experiments on CIFAR-10-W reveal new and interesting insights for domain generalization and model accuracy prediction.
    Abstract Analyzing model performance in various unseen environments is a critical research problem in the machine learning community. To study this problem, it is important to construct a testbed with out-of-distribution test sets that have broad coverage of environmental discrepancies. However, existing testbeds typically either have a small number of domains or are synthesized by image corruptions, hindering algorithm design that demonstrates real-world effectiveness. In this paper, we introduce CIFAR-10-Warehouse, consisting of 180 datasets collected by prompting image search engines and diffusion models in various ways. Generally sized between 300 and 8,000 images, the datasets contain natural images, cartoons, certain colors, or objects that do not naturally appear. With CIFAR-10-W, we aim to enhance the evaluation and deepen the understanding of two generalization tasks: domain generalization and model accuracy prediction in various out-of-distribution environments. We conduct extensive benchmarking and comparison experiments and show that CIFAR-10-W offers new and interesting insights inherent to these tasks. We also discuss other fields that would benefit from CIFAR-10-W.

FedConv: Enhancing Convolutional Neural Networks for Handling Data Heterogeneity in Federated Learning

  • paper_url: http://arxiv.org/abs/2310.04412
  • repo_url: https://github.com/ucsc-vlaa/fedconv
  • paper_authors: Peiran Xu, Zeyu Wang, Jieru Mei, Liangqiong Qu, Alan Yuille, Cihang Xie, Yuyin Zhou
  • for: This paper investigates data heterogeneity across devices in Federated Learning (FL) and how different architectural elements can improve FL performance.
  • methods: A series of systematic empirical studies examines the impact of architectural elements, such as activation functions and normalization layers, on performance in heterogeneous FL.
  • results: With strategic architectural modifications, pure CNNs can match or even exceed the robustness of ViTs when handling heterogeneous data clients in FL; the approach is compatible with existing FL techniques and delivers state-of-the-art results across a broad spectrum of FL benchmarks.
    Abstract Federated learning (FL) is an emerging paradigm in machine learning, where a shared model is collaboratively learned using data from multiple devices to mitigate the risk of data leakage. While recent studies posit that Vision Transformer (ViT) outperforms Convolutional Neural Networks (CNNs) in addressing data heterogeneity in FL, the specific architectural components that underpin this advantage have yet to be elucidated. In this paper, we systematically investigate the impact of different architectural elements, such as activation functions and normalization layers, on the performance within heterogeneous FL. Through rigorous empirical analyses, we are able to offer the first-of-its-kind general guidance on micro-architecture design principles for heterogeneous FL. Intriguingly, our findings indicate that with strategic architectural modifications, pure CNNs can achieve a level of robustness that either matches or even exceeds that of ViTs when handling heterogeneous data clients in FL. Additionally, our approach is compatible with existing FL techniques and delivers state-of-the-art solutions across a broad spectrum of FL benchmarks. The code is publicly available at https://github.com/UCSC-VLAA/FedConv
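
For readers unfamiliar with the FL setting being studied, a minimal FedAvg-style round is sketched below (generic federated averaging, not the paper's FedConv architecture or training recipe): each client trains a local copy on its own data, and the server averages the parameters weighted by client dataset size.

```python
import copy
import torch
import torch.nn as nn

def local_update(model, data, target, lr=0.01, steps=5):
    """One client's local training on its private data (toy objective)."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(data), target).backward()
        opt.step()
    return model.state_dict(), len(data)

def fedavg_round(global_model, client_batches):
    """One communication round: weighted average of client parameters."""
    states, sizes = zip(*[local_update(global_model, x, y) for x, y in client_batches])
    total = sum(sizes)
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model

# Hypothetical heterogeneous clients: different sample counts (and, in practice, label skews).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
clients = [(torch.randn(n, 32), torch.randint(0, 10, (n,))) for n in (20, 50, 80)]
model = fedavg_round(model, clients)
print("completed one FedAvg round")
```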

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

  • paper_url: http://arxiv.org/abs/2310.04378
  • repo_url: https://github.com/luosiallen/latent-consistency-model
  • paper_authors: Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
  • for: This work aims to speed up generation with latent diffusion models (LDMs); inspired by consistency models (Song et al.), it proposes Latent Consistency Models (LCMs) that enable swift, high-quality few-step inference on top of any pre-trained LDM, including Stable Diffusion.
  • methods: Viewing guided reverse diffusion as solving an augmented probability flow ODE (PF-ODE), LCMs are trained to directly predict the solution of this ODE in latent space, removing the need for many iterations and accelerating generation; a high-quality 768x768, 2-4-step LCM is distilled from a pre-trained classifier-free guided diffusion model in only 32 A100 GPU hours.
  • results: Evaluation on the LAION-5B-Aesthetics dataset shows that LCMs achieve state-of-the-art text-to-image generation with few-step inference, sampling far faster than the underlying LDMs.
    Abstract Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
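
A minimal sketch of multi-step consistency sampling, following the generic consistency-model recipe with a dummy stand-in for the learned consistency function (the real LCM operates in an LDM's latent space with a trained, distilled network; the schedule values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_fn(x, sigma):
    """Dummy stand-in for the learned consistency function f(x, sigma) -> clean estimate.
    An actual LCM would run the distilled latent-space network here."""
    return x / (1.0 + sigma)          # placeholder "denoiser"

sigma_max, latent_shape = 14.0, (4, 64, 64)
sigmas = [14.0, 2.0, 0.5, 0.0]        # a hypothetical 3-step schedule

x = sigma_max * rng.standard_normal(latent_shape)   # start from pure noise
for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
    x0 = consistency_fn(x, sigma)                    # jump straight to a clean estimate
    if sigma_next > 0:                               # re-noise to the next (smaller) level
        x = x0 + sigma_next * rng.standard_normal(latent_shape)
    else:
        x = x0
print("final latent stats:", round(float(x.mean()), 4), round(float(x.std()), 4))
```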

SwimXYZ: A large-scale dataset of synthetic swimming motions and videos

  • paper_url: http://arxiv.org/abs/2310.04360
  • repo_url: None
  • paper_authors: Fiche Guénolé, Sevestre Vincent, Gonzalez-Barral Camila, Leglaive Simon, Séguier Renaud
  • for: This work provides a synthetic dataset of swimming motions and videos to support human motion analysis in swimming, where traditional motion capture and computer-vision approaches struggle.
  • methods: The authors generate SwimXYZ, a synthetic dataset containing 3.4 million frames annotated with ground-truth 2D and 3D joints, plus 240 swimming motion sequences in the SMPL parameters format.
  • results: Use cases are presented for SwimXYZ in swimming stroke clustering and 2D pose estimation, and the dataset is made publicly available.
    Abstract Technologies play an increasingly important role in sports and become a real competitive advantage for the athletes who benefit from it. Among them, the use of motion capture is developing in various sports to optimize sporting gestures. Unfortunately, traditional motion capture systems are expensive and constraining. Recently developed computer vision-based approaches also struggle in certain sports, like swimming, due to the aquatic environment. One of the reasons for the gap in performance is the lack of labeled datasets with swimming videos. In an attempt to address this issue, we introduce SwimXYZ, a synthetic dataset of swimming motions and videos. SwimXYZ contains 3.4 million frames annotated with ground truth 2D and 3D joints, as well as 240 sequences of swimming motions in the SMPL parameters format. In addition to making this dataset publicly available, we present use cases for SwimXYZ in swimming stroke clustering and 2D pose estimation.

Distributed Deep Joint Source-Channel Coding with Decoder-Only Side Information

  • paper_url: http://arxiv.org/abs/2310.04311
  • repo_url: None
  • paper_authors: Selim F. Yilmaz, Ezgi Ozyilkan, Deniz Gunduz, Elza Erkip
  • for: This paper addresses low-latency image transmission over a noisy wireless channel when correlated side information is available only at the receiver (the Wyner-Ziv scenario).
  • methods: The authors use a data-driven joint source-channel coding (JSCC) approach, previously shown to outperform separation-based schemes in the practical finite-blocklength regime and to degrade gracefully with channel quality, and propose a novel neural network architecture that integrates the decoder-only side information at multiple stages of the receiver.
  • results: Results show the proposed method effectively integrates the side information, improving performance at all channel noise levels across the distortion criteria considered, especially at low channel SNRs and small bandwidth ratios; the source code is provided to enable further research and reproducibility.
    Abstract We consider low-latency image transmission over a noisy wireless channel when correlated side information is present only at the receiver side (the Wyner-Ziv scenario). In particular, we are interested in developing practical schemes using a data-driven joint source-channel coding (JSCC) approach, which has been previously shown to outperform conventional separation-based approaches in the practical finite blocklength regimes, and to provide graceful degradation with channel quality. We propose a novel neural network architecture that incorporates the decoder-only side information at multiple stages at the receiver side. Our results demonstrate that the proposed method succeeds in integrating the side information, yielding improved performance at all channel noise levels in terms of the various distortion criteria considered here, especially at low channel signal-to-noise ratios (SNRs) and small bandwidth ratios (BRs). We also provide the source code of the proposed method to enable further research and reproducibility of the results.

Convergent ADMM Plug and Play PET Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.04299
  • repo_url: None
  • paper_authors: Florent Sureau, Mahdi Latreche, Marion Savanier, Claude Comtat
  • for: This paper studies hybrid PET reconstruction that couples a model-based variational reconstruction with a separately learned deep neural network (DNN) operator.
  • methods: The method uses an ADMM Plug-and-Play framework and enforces an additional constraint on the network parameters during learning to guarantee fixed-point convergence.
  • results: Experiments on a realistic [18F]-FDG synthetic brain exam show that with the constraint enforced the ADMM scheme converges to a meaningful fixed point, whereas without it the algorithm was observed not to converge.
    Abstract In this work, we investigate hybrid PET reconstruction algorithms based on coupling a model-based variational reconstruction and the application of a separately learnt Deep Neural Network operator (DNN) in an ADMM Plug and Play framework. Following recent results in optimization, fixed point convergence of the scheme can be achieved by enforcing an additional constraint on network parameters during learning. We propose such an ADMM algorithm and show in a realistic [18F]-FDG synthetic brain exam that the proposed scheme indeed leads experimentally to convergence to a meaningful fixed point. When the proposed constraint is not enforced during learning of the DNN, the proposed ADMM algorithm was observed experimentally not to converge.
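
A generic Plug-and-Play ADMM iteration, sketched in the abstract's spirit with a toy quadratic data term and a placeholder denoiser standing in for the learned DNN operator (not the authors' PET system model or constraint):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((128, n)) / np.sqrt(n)      # toy forward operator (stand-in for a PET projector)
x_true = np.maximum(rng.standard_normal(n), 0)
y = A @ x_true + 0.05 * rng.standard_normal(128)     # noisy measurements

def denoiser(v):
    """Placeholder for the learned DNN operator; here a simple soft-shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - 0.02, 0)

rho = 1.0
x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
AtA, Aty = A.T @ A, A.T @ y
lhs = AtA + rho * np.eye(n)                          # normal equations for the x-update

for it in range(50):
    x = np.linalg.solve(lhs, Aty + rho * (z - u))    # data-fidelity proximal step
    z = denoiser(x + u)                              # "prior" step handled by the (learned) denoiser
    u = u + x - z                                    # dual update
print("relative error:", round(float(np.linalg.norm(z - x_true) / np.linalg.norm(x_true)), 3))
```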

Graph learning in robotics: a survey

  • paper_url: http://arxiv.org/abs/2310.04294
  • repo_url: None
  • paper_authors: Francesca Pistilli, Giuseppe Averta
  • for: This survey reviews graph neural architectures from a robotics perspective, with the goal of unlocking their potential for robotic applications.
  • methods: The paper covers the fundamentals of graph-based models, including their architecture, training procedures, and applications, and discusses recent advances and challenges arising in applied settings, such as the integration of perception, decision-making, and control.
  • results: An extensive review covers robotic applications that benefit from learning on graph structures, such as body and contact modelling, robotic manipulation, action recognition, and fleet motion planning, giving readers a thorough understanding of the capabilities and limitations of graph learning in robotics and highlighting avenues for future research.
    Abstract Deep neural networks for graphs have emerged as a powerful tool for learning on complex non-euclidean data, which is becoming increasingly common for a variety of different applications. Yet, although their potential has been widely recognised in the machine learning community, graph learning is largely unexplored for downstream tasks such as robotics applications. To fully unlock their potential, hence, we propose a review of graph neural architectures from a robotics perspective. The paper covers the fundamentals of graph-based models, including their architecture, training procedures, and applications. It also discusses recent advancements and challenges that arise in applied settings, related for example to the integration of perception, decision-making, and control. Finally, the paper provides an extensive review of various robotic applications that benefit from learning on graph structures, such as bodies and contacts modelling, robotic manipulation, action recognition, fleet motion planning, and many more. This survey aims to provide readers with a thorough understanding of the capabilities and limitations of graph neural architectures in robotics, and to highlight potential avenues for future research.
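
As background for the graph-based models the survey covers, a minimal message-passing layer is sketched below (a generic mean-aggregation GNN layer in plain PyTorch, not any specific architecture from the survey):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing with mean aggregation over neighbours."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.message = nn.Linear(2 * in_dim, out_dim)   # message from (sender, receiver) features
        self.update  = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, node_feats, edge_index):
        src, dst = edge_index                            # edges as (source, destination) indices
        msgs = torch.relu(self.message(torch.cat([node_feats[src], node_feats[dst]], dim=-1)))
        agg = torch.zeros(node_feats.size(0), msgs.size(-1))
        agg.index_add_(0, dst, msgs)                     # sum messages arriving at each node
        deg = torch.zeros(node_feats.size(0)).index_add_(0, dst, torch.ones_like(dst, dtype=torch.float))
        agg = agg / deg.clamp(min=1).unsqueeze(-1)       # mean aggregation
        return torch.relu(self.update(torch.cat([node_feats, agg], dim=-1)))

# Toy graph: 5 nodes (e.g., robot links or contact points), 6 directed edges.
x = torch.randn(5, 8)
edges = torch.tensor([[0, 1, 1, 2, 3, 4],
                      [1, 0, 2, 3, 4, 0]])
layer = MessagePassingLayer(8, 16)
print(layer(x, edges).shape)    # torch.Size([5, 16])
```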

Compositional Servoing by Recombining Demonstrations

  • paper_url: http://arxiv.org/abs/2310.04271
  • repo_url: None
  • paper_authors: Max Argus, Abhijeet Nayak, Martin Büchner, Silvio Galesso, Abhinav Valada, Thomas Brox
  • for: This work improves the task-transfer and multitask capabilities of visual servoing while keeping it robust and high-precision.
  • methods: The visual servoing task is formulated as graph traversal; demonstration graphs are built by splitting existing demonstrations and recombining them, and a similarity function selects the best demonstration for a specific task so that a shortest path through the graph can be computed.
  • results: Extensive simulation and real-world experiments show that recombining demonstrations leads to higher task-respective success, with improved robustness and efficiency in high-precision scenarios.
    Abstract Learning-based manipulation policies from image inputs often show weak task transfer capabilities. In contrast, visual servoing methods allow efficient task transfer in high-precision scenarios while requiring only a few demonstrations. In this work, we present a framework that formulates the visual servoing task as graph traversal. Our method not only extends the robustness of visual servoing, but also enables multitask capability based on a few task-specific demonstrations. We construct demonstration graphs by splitting existing demonstrations and recombining them. In order to traverse the demonstration graph in the inference case, we utilize a similarity function that helps select the best demonstration for a specific task. This enables us to compute the shortest path through the graph. Ultimately, we show that recombining demonstrations leads to higher task-respective success. We present extensive simulation and real-world experimental results that demonstrate the efficacy of our approach.
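
The graph-traversal idea can be sketched with a tiny hypothetical demonstration graph: recombined demonstration segments are nodes, edge costs come from a similarity function between one segment's end state and the next segment's start state, and inference picks the cheapest chain. The graph, states, and similarity function below are invented for illustration.

```python
import heapq
import numpy as np

# Hypothetical demonstration segments, each summarized by (start_state, end_state) embeddings.
rng = np.random.default_rng(0)
segments = {name: (rng.random(4), rng.random(4)) for name in "ABCDE"}

def transition_cost(seg_from, seg_to):
    """Cost of chaining two segments: distance between end of one and start of the next."""
    return float(np.linalg.norm(segments[seg_from][1] - segments[seg_to][0]))

def shortest_chain(start, goal):
    """Dijkstra over the fully connected demonstration graph."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        for nxt in segments:
            if nxt == node:
                continue
            nd = d + transition_cost(node, nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

print(shortest_chain("A", "E"))
```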

Collaborative Camouflaged Object Detection: A Large-Scale Dataset and Benchmark

  • paper_url: http://arxiv.org/abs/2310.04253
  • repo_url: https://github.com/zc199823/bbnet--cocod
  • paper_authors: Cong Zhang, Hongbo Bi, Tian-Zhu Xiang, Ranwan Wu, Jinghui Tong, Xiufang Wang
  • for: This paper studies a new task, collaborative camouflaged object detection (CoCOD), which aims to simultaneously detect camouflaged objects with the same properties from a group of relevant images.
  • methods: The authors construct the first large-scale dataset for the task, CoCOD8K, and propose a baseline model, the bilateral-branch network (BBNet), which explores and aggregates co-camouflaged cues within a single image and across images in a group for accurate detection.
  • results: Extensive experiments on CoCOD8K benchmark 18 state-of-the-art models and show that the proposed dataset and model bring significantly superior performance on CoCOD.
    Abstract In this paper, we provide a comprehensive study on a new task called collaborative camouflaged object detection (CoCOD), which aims to simultaneously detect camouflaged objects with the same properties from a group of relevant images. To this end, we meticulously construct the first large-scale dataset, termed CoCOD8K, which consists of 8,528 high-quality and elaborately selected images with object mask annotations, covering 5 superclasses and 70 subclasses. The dataset spans a wide range of natural and artificial camouflage scenes with diverse object appearances and backgrounds, making it a very challenging dataset for CoCOD. Besides, we propose the first baseline model for CoCOD, named bilateral-branch network (BBNet), which explores and aggregates co-camouflaged cues within a single image and between images within a group, respectively, for accurate camouflaged object detection in given images. This is implemented by an inter-image collaborative feature exploration (CFE) module, an intra-image object feature search (OFS) module, and a local-global refinement (LGR) module. We benchmark 18 state-of-the-art models, including 12 COD algorithms and 6 CoSOD algorithms, on the proposed CoCOD8K dataset under 5 widely used evaluation metrics. Extensive experiments demonstrate the effectiveness of the proposed method and the significantly superior performance compared to other competitors. We hope that our proposed dataset and model will boost growth in the COD community. The dataset, model, and results will be available at: https://github.com/zc199823/BBNet--CoCOD.

Semantic segmentation of longitudinal thermal images for identification of hot and cool spots in urban areas

  • paper_url: http://arxiv.org/abs/2310.04247
  • repo_url: None
  • paper_authors: Vasantha Ramani, Pandarasamy Arjunan, Kameshwar Poolla, Clayton Miller
  • for: This paper aims to analyze thermal images collected at the neighborhood scale to identify hot and cool spots in urban areas, with the goal of helping urban planners develop strategies to mitigate the urban heat island (UHI) effect, improve building energy efficiency, and maximize outdoor thermal comfort.
  • methods: State-of-the-art deep learning models are trained on a subset of the thermal image dataset to segment various urban features such as buildings, vegetation, sky, and roads; the performance of different models, including U-Net, DeepLabV3, DeeplabV3+, FPN, and PSPnet, is compared.
  • results: The U-Net segmentation model with a `resnet34' CNN backbone achieves the highest mIoU score of 0.99 on the test dataset; the masks generated using the segmentation models accurately extract the temperature from thermal images, closely matching the temperature extracted using ground truth masks, and are used to identify hot and cool spots in the urban features at various instances of time.
    Abstract This work presents the analysis of semantically segmented, longitudinally, and spatially rich thermal images collected at the neighborhood scale to identify hot and cool spots in urban areas. An infrared observatory was operated over a few months to collect thermal images of different types of buildings on the educational campus of the National University of Singapore. A subset of the thermal image dataset was used to train state-of-the-art deep learning models to segment various urban features such as buildings, vegetation, sky, and roads. It was observed that the U-Net segmentation model with `resnet34' CNN backbone has the highest mIoU score of 0.99 on the test dataset, compared to other models such as DeepLabV3, DeeplabV3+, FPN, and PSPnet. The masks generated using the segmentation models were then used to extract the temperature from thermal images and correct for differences in the emissivity of various urban features. Further, various statistical measures of the temperature extracted using the predicted segmentation masks are shown to closely match the temperature extracted using the ground truth masks. Finally, the masks were used to identify hot and cool spots in the urban feature at various instances of time. This forms one of the very few studies demonstrating the automated analysis of thermal images, which can be of potential use to urban planners for devising mitigation strategies for reducing the urban heat island (UHI) effect, improving building energy efficiency, and maximizing outdoor thermal comfort.
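
The mask-based temperature extraction step can be illustrated with a small sketch (hypothetical arrays; the emissivity correction is reduced to a per-class scale factor for brevity): per-class statistics are computed over the thermal image using the predicted segmentation mask, and hot spots are flagged as pixels above a class-wise percentile.

```python
import numpy as np

classes = {0: "building", 1: "vegetation", 2: "sky", 3: "road"}
# Assumed per-class emissivity values used to crudely correct the raw radiometric temperature.
emissivity = {0: 0.93, 1: 0.96, 2: 1.00, 3: 0.95}

rng = np.random.default_rng(0)
thermal = 20 + 15 * rng.random((240, 320))              # radiometric temperature map (degC)
seg_mask = rng.integers(0, 4, size=(240, 320))          # stand-in for a U-Net prediction

for cls_id, name in classes.items():
    temps = thermal[seg_mask == cls_id] / emissivity[cls_id]   # crude emissivity correction
    print(f"{name:10s} mean={temps.mean():5.1f} degC  p95={np.percentile(temps, 95):5.1f} degC")

# "Hot spots": pixels in a class whose temperature exceeds that class's 95th percentile.
building = thermal[seg_mask == 0]
hot_spot_fraction = float((building > np.percentile(building, 95)).mean())
print("building hot-spot fraction:", round(hot_spot_fraction, 3))
```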

Enhancing the Authenticity of Rendered Portraits with Identity-Consistent Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.04194
  • repo_url: None
  • paper_authors: Luyuan Wang, Yiqian Wu, Yongliang Yang, Chen Liu, Xiaogang Jin
  • for: This paper aims to improve the authenticity of rendered virtual portraits in computer graphics and mitigate the ''uncanny valley'' effect.
  • methods: Transfer learning is employed to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits.
  • results: By fine-tuning a StyleGAN2 generator on the newly collected Daz-Rendered-Faces-HQ (DRFHQ) dataset with a carefully crafted framework, the method improves the realism of rendered portraits while preserving facial identity, as confirmed by qualitative and quantitative evaluations.
    Abstract Despite rapid advances in computer graphics, creating high-quality photo-realistic virtual portraits is prohibitively expensive. Furthermore, the well-known ''uncanny valley'' effect in rendered portraits has a significant impact on the user experience, especially when the depiction closely resembles a human likeness, where any minor artifacts can evoke feelings of eeriness and repulsiveness. In this paper, we present a novel photo-realistic portrait generation framework that can effectively mitigate the ''uncanny valley'' effect and improve the overall authenticity of rendered portraits. Our key idea is to employ transfer learning to learn an identity-consistent mapping from the latent space of rendered portraits to that of real portraits. During the inference stage, the input portrait of an avatar can be directly transferred to a realistic portrait by changing its appearance style while maintaining the facial identity. To this end, we collect a new dataset, Daz-Rendered-Faces-HQ (DRFHQ), that is specifically designed for rendering-style portraits. We leverage this dataset to fine-tune the StyleGAN2 generator, using our carefully crafted framework, which helps to preserve the geometric and color features relevant to facial identity. We evaluate our framework using portraits with diverse gender, age, and race variations. Qualitative and quantitative evaluations and ablation studies show the advantages of our method compared to state-of-the-art approaches.

Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases

  • paper_url: http://arxiv.org/abs/2310.04189
  • repo_url: None
  • paper_authors: Xinpeng Liu, Yong-Lu Li, Ailing Zeng, Zizheng Zhou, Yang You, Cewu Lu
  • for: The paper aims to establish a reliable mapping between human motion and action semantics, a challenging many-to-many problem.
  • methods: Kinematic Phrases (KP) are proposed as a mediator that captures objective kinematic facts of human motion with proper abstraction, interpretability, and generality; KP allows unifying a motion knowledge base and building a motion understanding system, and can be converted automatically between motions and text descriptions without subjective bias, which also gives rise to Kinematic Prompt Generation (KPG), a novel automatic motion generation benchmark.
  • results: In extensive experiments, the approach shows superiority over other methods.
    Abstract The goal of motion understanding is to establish a reliable mapping between motion and action semantics, while it is a challenging many-to-many problem. An abstract action semantic (i.e., walk forwards) could be conveyed by perceptually diverse motions (walk with arms up or swinging), while a motion could carry different semantics w.r.t. its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Also, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP) that take the objective kinematic facts of human motion with proper abstraction, interpretability, and generality characteristics. Based on KP as a mediator, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions and to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel automatic motion generation benchmark. In extensive experiments, our approach shows superiority over other methods. Our code and data would be made publicly available at https://foruck.github.io/KP.

Whole Slide Multiple Instance Learning for Predicting Axillary Lymph Node Metastasis

  • paper_url: http://arxiv.org/abs/2310.04187
  • repo_url: https://github.com/glejdis/whole-slide-mil-for-predicting-axillary-lymph-node-metastasis
  • paper_authors: Glejdis Shkëmbi, Johanna P. Müller, Zhe Li, Katharina Breininger, Peter Schüffler, Bernhard Kainz
  • for: This work develops a deep learning (DL) classification pipeline for quantifying clinical information from digital core-needle biopsy (CNB) images, with one step fewer than existing methods.
  • methods: A publicly available dataset of 1058 patients is used to evaluate different baseline state-of-the-art (SOTA) DL models for classifying axillary lymph node metastatic status from CNB images, together with an extensive ablation study of various data augmentation techniques.
  • results: The study shows how different data augmentation techniques affect model performance, and assesses the manual tumor segmentation and annotation step performed by pathologists.
    Abstract Breast cancer is a major concern for women's health globally, with axillary lymph node (ALN) metastasis identification being critical for prognosis evaluation and treatment guidance. This paper presents a deep learning (DL) classification pipeline for quantifying clinical information from digital core-needle biopsy (CNB) images, with one step less than existing methods. A publicly available dataset of 1058 patients was used to evaluate the performance of different baseline state-of-the-art (SOTA) DL models in classifying ALN metastatic status based on CNB images. An extensive ablation study of various data augmentation techniques was also conducted. Finally, the manual tumor segmentation and annotation step performed by the pathologists was assessed.

DiffPrompter: Differentiable Implicit Visual Prompts for Semantic-Segmentation in Adverse Conditions

  • paper_url: http://arxiv.org/abs/2310.04181
  • repo_url: None
  • paper_authors: Sanket Kalwar, Mihir Ungarala, Shruti Jain, Aaron Monis, Krishna Reddy Konda, Sourav Garg, K Madhava Krishna
  • for: This paper aims to improve semantic segmentation for autonomous driving systems under adverse weather conditions.
  • methods: The paper introduces DiffPrompter, a novel differentiable visual and latent prompting mechanism that expands the learning capabilities of existing adaptors in foundation models; the proposed $\nabla$HFC image processing block performs particularly well in adverse weather conditions where conventional methods often fall short.
  • results: Jointly training visual and latent prompts significantly enhances performance in out-of-distribution scenarios; the differentiable visual prompts, generated with parallel and series architectures, improve object segmentation in adverse conditions, and a comprehensive series of experiments provides empirical evidence for the approach's efficacy.
    Abstract Semantic segmentation in adverse weather scenarios is a critical task for autonomous driving systems. While foundation models have shown promise, the need for specialized adaptors becomes evident for handling more challenging scenarios. We introduce DiffPrompter, a novel differentiable visual and latent prompting mechanism aimed at expanding the learning capabilities of existing adaptors in foundation models. Our proposed $\nabla$HFC image processing block excels particularly in adverse weather conditions, where conventional methods often fall short. Furthermore, we investigate the advantages of jointly training visual and latent prompts, demonstrating that this combined approach significantly enhances performance in out-of-distribution scenarios. Our differentiable visual prompts leverage parallel and series architectures to generate prompts, effectively improving object segmentation tasks in adverse conditions. Through a comprehensive series of experiments and evaluations, we provide empirical evidence to support the efficacy of our approach. Project page at https://diffprompter.github.io.

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.04180
  • repo_url: https://github.com/i2-multimedia-lab/dsat
  • paper_authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang
  • for: The paper proposes a Transformer-based blind super-resolution model that adapts to inputs degraded by unknown noise.
  • methods: The model integrates CNN and Transformer components: a CNN modulated by degradation information first extracts local features, and a degradation-aware Transformer then extracts global semantic features; contrastive learning is incorporated to learn degradation representations of inputs with unknown noise.
  • results: Tested on several popular large-scale benchmark datasets, the method achieves state-of-the-art performance, e.g., a PSNR of 32.43 dB on Urban100 at ×2 scale (0.94 dB higher than DASR) and 26.62 dB at ×4 scale (0.26 dB over KDSR). Source code is available at https://github.com/I2-Multimedia-Lab/DSAT/tree/main.
    Abstract Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main.

Entropic Score metric: Decoupling Topology and Size in Training-free NAS

  • paper_url: http://arxiv.org/abs/2310.04179
  • repo_url: None
  • paper_authors: Niccolò Cavagnero, Luca Robbiano, Francesca Pistilli, Barbara Caputo, Giuseppe Averta
  • for: This work addresses the design of high-performance neural networks for edge applications, particularly in the resource-constrained scenarios typical of mobile-sized models.
  • methods: A novel training-free metric, the Entropic Score, estimates model expressivity through the aggregated element-wise entropy of a network's activations, and a cyclic search algorithm searches model size and topology separately yet synergistically.
  • results: Combining the Entropic Score for topology search with LogSynflow for size search allows completely designing high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, yielding the fastest and most accurate NAS method for ImageNet classification.
    Abstract Neural Networks design is a complex and often daunting task, particularly for resource-constrained scenarios typical of mobile-sized models. Neural Architecture Search is a promising approach to automate this process, but existing competitive methods require large training time and computational resources to generate accurate models. To overcome these limits, this paper contributes with: i) a novel training-free metric, named Entropic Score, to estimate model expressivity through the aggregated element-wise entropy of its activations; ii) a cyclic search algorithm to separately yet synergistically search model size and topology. Entropic Score shows remarkable ability in searching for the topology of the network, and a proper combination with LogSynflow, to search for model size, yields superior capability to completely design high-performance Hybrid Transformers for edge applications in less than 1 GPU hour, resulting in the fastest and most accurate NAS method for ImageNet classification.
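
One plausible reading of an activation-entropy score, sketched with forward hooks on a small untrained CNN (the exact definition and aggregation are the paper's; treat this as an assumption-laden illustration): each post-ReLU element is treated as a Bernoulli variable over a random batch and its entropy is summed across the network.

```python
import torch
import torch.nn as nn

def entropic_score(model, input_shape=(3, 32, 32), batch=64):
    """Aggregate element-wise Bernoulli entropy of post-ReLU activations (illustrative)."""
    acts = []
    hooks = [m.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
             for m in model.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(torch.randn(batch, *input_shape))
    for h in hooks:
        h.remove()
    score = 0.0
    for a in acts:
        p = (a > 0).float().mean(dim=0).clamp(1e-6, 1 - 1e-6)   # firing rate per element
        score += float((-p * p.log2() - (1 - p) * (1 - p).log2()).sum())
    return score

candidate = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
print("entropic score (toy):", round(entropic_score(candidate), 1))
```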

Improving Neural Radiance Field using Near-Surface Sampling with Point Cloud Generation

  • paper_url: http://arxiv.org/abs/2310.04152
  • repo_url: None
  • paper_authors: Hye Bin Yoo, Hyun Min Han, Sung Soo Hwang, Il Yong Chun
  • for: To improve the rendering quality of neural radiance fields (NeRF) and reduce training time.
  • methods: A near-surface sampling scheme estimates the surface of a 3D object from the depth images of the training views and samples points only around that surface; a 3D point cloud generation method and a simple refinement of the projected depth provide depth information for novel views.
  • results: Experiments show the proposed near-surface sampling framework significantly improves rendering quality compared to the original NeRF and a state-of-the-art depth-based NeRF method, and significantly accelerates NeRF training.
    Abstract Neural radiance field (NeRF) is an emerging view synthesis method that samples points in a three-dimensional (3D) space and estimates their existence and color probabilities. The disadvantage of NeRF is that it requires a long training time since it samples many 3D points. In addition, if one samples points from occluded regions or in the space where an object is unlikely to exist, the rendering quality of NeRF can be degraded. These issues can be solved by estimating the geometry of 3D scene. This paper proposes a near-surface sampling framework to improve the rendering quality of NeRF. To this end, the proposed method estimates the surface of a 3D object using depth images of the training set and sampling is performed around there only. To obtain depth information on a novel view, the paper proposes a 3D point cloud generation method and a simple refining method for projected depth from a point cloud. Experimental results show that the proposed near-surface sampling NeRF framework can significantly improve the rendering quality, compared to the original NeRF and a state-of-the-art depth-based NeRF method. In addition, one can significantly accelerate the training time of a NeRF model with the proposed near-surface sampling framework.
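
The sampling change itself is simple to picture; the sketch below contrasts uniform ray sampling with sampling concentrated in a band around an estimated surface depth (the ray, depth, and band width are made-up values, not the paper's parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
near, far = 0.5, 6.0
num_samples = 32

# Uniform (original NeRF-style) sampling of distances t along a ray.
t_uniform = np.sort(rng.uniform(near, far, num_samples))

# Near-surface sampling: concentrate samples in a band around the estimated surface depth,
# obtained e.g. from a point cloud projected into the novel view and lightly refined.
surface_depth, band = 3.2, 0.25
t_near_surface = np.sort(rng.uniform(surface_depth - band, surface_depth + band, num_samples))

ray_origin = np.array([0.0, 0.0, 0.0])
ray_dir = np.array([0.0, 0.0, 1.0])
points = ray_origin + t_near_surface[:, None] * ray_dir   # 3D sample points fed to the NeRF MLP

print("uniform sample spread:     ", round(float(np.ptp(t_uniform)), 2))
print("near-surface sample spread:", round(float(np.ptp(t_near_surface)), 2))
print("points shape:", points.shape)
```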

TiC: Exploring Vision Transformer in Convolution

  • paper_url: http://arxiv.org/abs/2310.04134
  • repo_url: https://github.com/zs670980918/msa-conv
  • paper_authors: Song Zhang, Qingzhong Wang, Jiang Bian, Haoyi Xiong
  • for: 提高transformer模型在不同尺度图像处理中的灵活性和计算效率,即使不需要重新训练或resize图像。
  • methods: 提出了Multi-Head Self-Attention Convolution(MSA-Conv),将自注意力融入广义卷积(包括标准卷积、空洞卷积和逐深度卷积)中。
  • results: 在此基础上提出了Vision Transformer in Convolution(TiC),并设计了两种容量提升策略:Multi-Directional Cyclic Shifted Mechanism和Inter-Pooling Mechanism。实验验证了TiC的整体有效性,消融实验分别证明了MSA-Conv和两种容量提升策略带来的性能提升。
    Abstract While models derived from Vision Transformers (ViTs) have been phonemically surging, pre-trained models cannot seamlessly adapt to arbitrary resolution images without altering the architecture and configuration, such as sampling the positional encoding, limiting their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv) that incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones. Enabling transformers to handle images of varying sizes without retraining or rescaling, the use of MSA-Conv further reduces computational costs compared to global attention in ViT, which grows costly as image size increases. Later, we present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, where two capacity enhancing strategies, namely Multi-Directional Cyclic Shifted Mechanism and Inter-Pooling Mechanism, have been proposed, through establishing long-distance connections between tokens and enlarging the effective receptive field. Extensive experiments have been carried out to validate the overall effectiveness of TiC. Additionally, ablation studies confirm the performance improvement made by MSA-Conv and the two capacity enhancing strategies separately. Note that our proposal aims at studying an alternative to the global attention used in ViT, while MSA-Conv meets our goal by making TiC comparable to state-of-the-art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.
    摘要 尽管源自 Vision Transformer(ViT)的模型迅速兴起,但预训练模型若不调整架构和配置(例如对位置编码重新采样),便无法无缝适应任意分辨率的图像,限制了其在各类视觉任务中的灵活性。例如,基于 ViT-Huge 的 Segment Anything Model(SAM)要求所有输入图像缩放到 1024×1024。为克服这一限制,本文提出 Multi-Head Self-Attention Convolution(MSA-Conv),将自注意力融入标准、空洞和逐深度等广义卷积中,使 Transformer 无需重新训练或缩放即可处理不同尺寸的图像;与 ViT 中随图像增大而开销剧增的全局注意力相比,MSA-Conv 还能进一步降低计算成本。在此基础上,本文提出 Vision Transformer in Convolution(TiC)作为使用 MSA-Conv 进行图像分类的概念验证,并设计了两种容量提升策略:Multi-Directional Cyclic Shifted Mechanism 与 Inter-Pooling Mechanism,通过建立远距离的 token 连接并扩大有效感受野来提升性能。大量实验验证了 TiC 的整体有效性,消融实验也分别证实了 MSA-Conv 与两种容量提升策略带来的提升。需要说明的是,本文旨在研究 ViT 中全局注意力的替代方案,而 MSA-Conv 使 TiC 在 ImageNet-1K 上达到与最新方法相当的水平。代码将发布于 https://github.com/zs670980918/MSA-Conv。
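
To make the "self-attention inside a convolution window" idea concrete, here is a toy single-head layer that computes attention over each pixel's local neighbourhood; it is an illustrative sketch only, and the actual MSA-Conv is multi-head and also supports dilated and depthwise variants.

```python
import torch
import torch.nn as nn

class LocalSelfAttentionConv(nn.Module):
    """Toy single-head attention restricted to a k x k convolution window
    (sketch of the idea behind MSA-Conv, not the paper's implementation)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.qkv = nn.Conv2d(dim, 3 * dim, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Gather each pixel's k*k neighbourhood for keys and values.
        k = self.unfold(k).view(b, c, self.k * self.k, h * w)
        v = self.unfold(v).view(b, c, self.k * self.k, h * w)
        q = q.view(b, c, 1, h * w)
        attn = (q * k).sum(1, keepdim=True) / c ** 0.5   # (b, 1, k*k, h*w)
        attn = attn.softmax(dim=2)                       # normalize over the window
        out = (attn * v).sum(2)                          # (b, c, h*w)
        return out.view(b, c, h, w)

y = LocalSelfAttentionConv(dim=16)(torch.randn(2, 16, 32, 32))  # same spatial size as input
```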

VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification

  • paper_url: http://arxiv.org/abs/2310.04122
  • repo_url: None
  • paper_authors: Han Huang, Yan Huang, Liang Wang
  • for: VI-ReID task with single-modality labeled data
  • methods: unpaired image-to-image translation techniques, diffusion model (VI-Diff)
  • results: outperforms existing diffusion and GAN models; a promising solution for the VI-ReID task with single-modality labeled data.
    Abstract Visible-Infrared person re-identification (VI-ReID) in real-world scenarios poses a significant challenge due to the high cost of cross-modality data annotation. Different sensing cameras, such as RGB/IR cameras for good/poor lighting conditions, make it costly and error-prone to identify the same person across modalities. To overcome this, we explore the use of single-modality labeled data for the VI-ReID task, which is more cost-effective and practical. By labeling pedestrians in only one modality (e.g., visible images) and retrieving in another modality (e.g., infrared images), we aim to create a training set containing both originally labeled and modality-translated data using unpaired image-to-image translation techniques. In this paper, we propose VI-Diff, a diffusion model that effectively addresses the task of Visible-Infrared person image translation. Through comprehensive experiments, we demonstrate that VI-Diff outperforms existing diffusion and GAN models, making it a promising solution for VI-ReID with single-modality labeled data. Our approach can be a promising solution to the VI-ReID task with single-modality labeled data and serves as a good starting point for future study. Code will be available.
    摘要 在真实场景中,可见光-红外行人重识别(VI-ReID)因跨模态数据标注成本高昂而极具挑战性。不同的感知相机(例如分别适用于良好与较差光照条件的RGB/IR相机)使得跨模态识别同一行人既昂贵又容易出错。为解决这一问题,我们研究仅使用单模态标注数据完成VI-ReID任务,这种做法更经济、更实用:只在一个模态(例如可见光图像)中标注行人,而在另一个模态(例如红外图像)中进行检索,并借助非配对的图像到图像翻译技术,构建同时包含原始标注数据和模态翻译数据的训练集。本文提出VI-Diff,一种能够有效完成可见光-红外行人图像翻译的扩散模型。大量实验表明,VI-Diff优于现有的扩散模型和GAN模型,是仅有单模态标注数据时VI-ReID任务的一个有前景的解决方案,也可作为后续研究的良好起点。代码将会公开。

Aorta Segmentation from 3D CT in MICCAI SEG.A. 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.04114
  • repo_url: https://github.com/Project-MONAI/MONAI
  • paper_authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu
  • for: 提出一种自动化的主动脉分割方法,帮助主动脉疾病的早期发现和监测。
  • methods: 使用MONAI中提供的自动化分割方法Auto3DSeg。
  • results: 在3D CT图像上对主动脉进行分割,取得平均Dice分数0.920和95百分位Hausdorff距离(HD95)6.013,在SEG.A. 2023挑战中排名第一并获胜。
    Abstract Aorta provides the main blood supply of the body. Screening of aorta with imaging helps for early aortic disease detection and monitoring. In this work, we describe our solution to the Segmentation of the Aorta (SEG.A.231) from 3D CT challenge. We use automated segmentation method Auto3DSeg available in MONAI. Our solution achieves an average Dice score of 0.920 and 95th percentile of the Hausdorff Distance (HD95) of 6.013, which ranks first and wins the SEG.A. 2023 challenge.
    摘要 主动脉承担人体主要的供血功能,通过影像筛查主动脉有助于主动脉疾病的早期发现与监测。本文介绍我们针对三维CT主动脉分割挑战(SEG.A. 2023)的解决方案:采用MONAI中提供的自动分割方法Auto3DSeg。该方案取得了0.920的平均Dice分数和6.013的95百分位Hausdorff距离(HD95),在SEG.A. 2023挑战中排名第一并获得冠军。
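
For reference, the Dice score reported above can be computed for a pair of binary masks as in the sketch below (standard definition; the challenge's exact evaluation code may handle empty masks or multi-class labels differently).

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice coefficient between two binary masks (sketch of the reported metric)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```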

Dense Random Texture Detection using Beta Distribution Statistics

  • paper_url: http://arxiv.org/abs/2310.04111
  • repo_url: None
  • paper_authors: Soeren Molander
  • for: 利用在图像边缘上采样的全连接点检测稠密随机纹理
  • methods: 在图像边缘上随机采样点并将其全连接,计算相连点对之间的标准L2距离;对每个点检查其是否落在图像边缘上,若是则在距离上加1,否则加0。由此为全连接边图计算出取值范围为[1.0..2.0]的edge excess index,其中1.0表示没有边。
  • results: 该方法应用于实时SLAM-based moving object detection中,点受限于跟踪框(ROIs)。
    Abstract This note describes a method for detecting dense random texture using fully connected points sampled on image edges. An edge image is randomly sampled with points, the standard L2 distance is calculated between all connected points in a neighbourhood. For each point, a check is made if the point intersects with an image edge. If this is the case, a unity value is added to the distance, otherwise zero. From this an edge excess index is calculated for the fully connected edge graph in the range [1.0..2.0], where 1.0 indicate no edges. The ratio can be interpreted as a sampled Bernoulli process with unknown probability. The Bayesian posterior estimate of the probability can be associated with its conjugate prior which is a Beta($\alpha$, $\beta$) distribution, with hyper parameters $\alpha$ and $\beta$ related to the number of edge crossings. Low values of $\beta$ indicate a texture rich area, higher values less rich. The method has been applied to real-time SLAM-based moving object detection, where points are confined to tracked boxes (rois).
    摘要 本短文介绍了一种利用在图像边缘上采样的全连接点来检测稠密随机纹理的方法。首先在边缘图像上随机采样点,并计算邻域内所有相连点之间的标准L2距离;对每个点检查其是否与图像边缘相交,若相交则在距离上加1,否则加0。据此为全连接边图计算出取值范围为[1.0..2.0]的edge excess index,其中1.0表示没有边。该比值可以解释为一个概率未知的伯努利过程的采样,其概率的贝叶斯后验估计对应于共轭先验Beta(α, β)分布,超参数α和β与边相交次数相关。β值较低表示纹理丰富的区域,β值较高则表示纹理较少。该方法已应用于基于SLAM的实时运动目标检测,其中采样点被限制在跟踪框(ROI)内。
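
The sketch below illustrates, under simplifying assumptions, how an edge-excess statistic and its Beta posterior could be formed from the sampled points; the uniform Beta(1, 1) prior and the per-point (rather than per-pair) counting are shortcuts taken for brevity and differ from the note's exact construction.

```python
import numpy as np

def edge_excess_beta(points, edge_mask):
    """Sketch: count how many sampled points land on an edge pixel, form the
    edge-excess index in [1.0, 2.0], and the Beta posterior over the
    underlying Bernoulli probability (assumed Beta(1,1) prior)."""
    hits = sum(1 for (r, c) in points if edge_mask[r, c])   # edge intersections
    n = max(len(points), 1)
    index = 1.0 + hits / n                                  # 1.0 means no edges hit
    alpha, beta = 1 + hits, 1 + (n - hits)                  # Beta posterior parameters
    return index, alpha, beta

# Example on a toy 8x8 edge mask with four sampled points.
mask = np.zeros((8, 8), dtype=bool); mask[3, :] = True
print(edge_excess_beta([(0, 0), (3, 2), (3, 5), (7, 7)], mask))
```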

Automated 3D Segmentation of Kidneys and Tumors in MICCAI KiTS 2023 Challenge

  • paper_url: http://arxiv.org/abs/2310.04110
  • repo_url: https://github.com/Project-MONAI/MONAI
  • paper_authors: Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu
  • for: 参加2023年肾脏与肾肿瘤分割挑战(KiTS 2023),比较基于3D CT的分割解决方案。
  • methods: 使用MONAI中的Auto3DSeg自动分割工具进行肾脏与肾肿瘤分割。
  • results: 所提方案在KiTS 2023挑战中取得平均Dice值0.835和表面Dice值0.723,排名第一并赢得该挑战。
    Abstract Kidney and Kidney Tumor Segmentation Challenge (KiTS) 2023 offers a platform for researchers to compare their solutions to segmentation from 3D CT. In this work, we describe our submission to the challenge using automated segmentation of Auto3DSeg available in MONAI. Our solution achieves the average dice of 0.835 and surface dice of 0.723, which ranks first and wins the KiTS 2023 challenge.
    摘要 肾脏与肾肿瘤分割挑战(KiTS 2023)为研究者提供了一个比较各自3D CT分割方案的平台。本文介绍我们基于MONAI中Auto3DSeg自动分割方法的参赛方案。该方案取得了0.835的平均Dice和0.723的表面Dice,排名第一并赢得KiTS 2023挑战。
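
Both this entry and the SEG.A. one above rely on MONAI's Auto3DSeg; a hedged usage sketch is given below. The `task.yaml` contents and the `AutoRunner` arguments reflect typical Auto3DSeg usage as I understand it and should be checked against the installed MONAI version.

```python
# Hypothetical Auto3DSeg run (argument names may differ across MONAI versions).
from monai.apps.auto3dseg import AutoRunner

# task.yaml is assumed to describe the dataset, e.g.:
#   modality: CT
#   datalist: ./datalist.json
#   dataroot: ./kits23_data
runner = AutoRunner(input="./task.yaml", work_dir="./auto3dseg_work")
runner.run()  # analyzes the data, trains candidate networks, and ensembles them
```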

ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer

  • paper_url: http://arxiv.org/abs/2310.04099
  • repo_url: None
  • paper_authors: Yifan Xu, Pourya Shamsolmoali, Jie Yang
  • for: 这篇论文的目的是提出一个新的方法来解决视觉地点识别(VPR)中的缺失和重复信息问题。
  • methods: 提出一个新的架构,即基于聚类的加权Transformer网络(CWTNet),并引入优化的VLAD层(OptLAD),以减少模型参数量并提高效率。
  • results: 实验结果显示,这篇论文的模型在四个VPR数据集上表现较好,并且比较简单。
    Abstract Visual place recognition (VPR) is a highly challenging task that has a wide range of applications, including robot navigation and self-driving vehicles. VPR is particularly difficult due to the presence of duplicate regions and the lack of attention to small objects in complex scenes, resulting in recognition deviations. In this paper, we present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects. Different from existing methods that rely on Convolutional Neural Networks (CNNs) for feature map generation, ClusVPR introduces a unique paradigm called Clustering-based Weighted Transformer Network (CWTNet). CWTNet leverages the power of clustering-based weighted feature maps and integrates global dependencies to effectively address visual deviations encountered in large-scale VPR problems. We also introduce the optimized-VLAD (OptLAD) layer that significantly reduces the number of parameters and enhances model efficiency. This layer is specifically designed to aggregate the information obtained from scale-wise image patches. Additionally, our pyramid self-supervised strategy focuses on extracting representative and diverse information from scale-wise image patches instead of entire images, which is crucial for capturing representative and diverse information in VPR. Extensive experiments on four VPR datasets show our model's superior performance compared to existing models while being less complex.
    摘要 视觉地点识别(VPR)是一项极具挑战性的任务,在机器人导航和自动驾驶等领域有广泛应用。由于复杂场景中存在重复区域且对小物体关注不足,VPR容易出现识别偏差。本文提出ClusVPR,一种针对重复区域冗余信息和小物体表征这两个具体问题的新方法。与依赖卷积神经网络(CNN)生成特征图的现有方法不同,ClusVPR引入了一种新的范式:基于聚类的加权Transformer网络(CWTNet)。CWTNet利用基于聚类的加权特征图并融合全局依赖关系,能够有效应对大规模VPR问题中的视觉偏差。我们还提出了优化的VLAD层(OptLAD),显著减少参数量并提升模型效率;该层专门用于聚合从多尺度图像块中获得的信息。此外,我们的金字塔自监督策略侧重于从多尺度图像块而非整幅图像中提取有代表性且多样的信息,这对VPR至关重要。在四个VPR数据集上的大量实验表明,我们的模型在复杂度更低的情况下性能优于现有模型。

End-to-End Chess Recognition

  • paper_url: http://arxiv.org/abs/2310.04086
  • repo_url: None
  • paper_authors: Athanasios Masouris, Jan van Gemert
  • for: 识别棋盘配置,即从棋盘图像中识别棋子的配置。
  • methods: 我们采用深度学习模型,并提出了两种新的方法来直接从整个图像中预测棋盘配置,从而避免了传统的顺序处理方法中的错误积累和中间注解的需求。
  • results: 使用新构建的 Chess Recognition Dataset (ChessReD) 进行训练和评估,在这一新基准数据集上,我们方法的棋盘识别准确率达到 15.26%(约为现有最佳方法的7倍)。
    Abstract Chess recognition refers to the task of identifying the chess pieces configuration from a chessboard image. Contrary to the predominant approach that aims to solve this task through the pipeline of chessboard detection, square localization, and piece classification, we rely on the power of deep learning models and introduce two novel methodologies to circumvent this pipeline and directly predict the chessboard configuration from the entire image. In doing so, we avoid the inherent error accumulation of the sequential approaches and the need for intermediate annotations. Furthermore, we introduce a new dataset, Chess Recognition Dataset (ChessReD), specifically designed for chess recognition that consists of 10,800 images and their corresponding annotations. In contrast to existing synthetic datasets with limited angles, this dataset comprises a diverse collection of real images of chess formations captured from various angles using smartphone cameras; a sensor choice made to ensure real-world applicability. We use this dataset to both train our model and evaluate and compare its performance to that of the current state-of-the-art. Our approach in chess recognition on this new benchmark dataset outperforms related approaches, achieving a board recognition accuracy of 15.26% ($\approx$7x better than the current state-of-the-art).
    摘要 棋局识别指的是从棋盘图像中识别棋子布局的任务。与主流的"棋盘检测—方格定位—棋子分类"流水线方法不同,我们借助深度学习模型,提出两种新方法,绕过这一流水线,直接从整幅图像预测棋盘布局,从而避免了顺序方法固有的误差累积以及对中间标注的需求。此外,我们构建了一个专为棋局识别设计的新数据集 Chess Recognition Dataset(ChessReD),包含10,800张图像及其对应标注。与现有拍摄角度有限的合成数据集不同,该数据集由智能手机相机从多种角度拍摄的真实棋局图像组成,这一传感器选择保证了实际应用价值。我们使用该数据集训练模型,并与当前最佳方法进行评估和比较。在这一新基准数据集上,我们的棋局识别方法优于相关方法,棋盘识别准确率达到15.26%(约为现有最佳方法的7倍)。
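
To make the "predict the whole configuration directly" idea concrete, the sketch below shows one plausible output head: a backbone feature vector mapped to 64 squares x 13 classes (12 piece types plus empty). The backbone choice and head shape are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torchvision

class DirectChessRecognizer(nn.Module):
    """Predict a per-square piece class for all 64 squares from the full image
    (illustrative sketch; not the paper's actual model)."""
    def __init__(self, n_classes=13):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()                       # 512-d global feature
        self.backbone = backbone
        self.head = nn.Linear(512, 64 * n_classes)
        self.n_classes = n_classes

    def forward(self, img):                               # img: (B, 3, H, W)
        feat = self.backbone(img)
        return self.head(feat).view(-1, 64, self.n_classes)  # per-square logits

logits = DirectChessRecognizer()(torch.randn(1, 3, 224, 224))  # shape (1, 64, 13)
```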

  • paper_url: http://arxiv.org/abs/2310.04043
  • repo_url: https://github.com/zhanghaiwei1234/single-eye-emotion-recognition
  • paper_authors: Haiwei Zhang, Jiqing Zhang, Bo Dong, Pieter Peers, Wenwei Wu, Xiaopeng Wei, Felix Heide, Xin Yang
  • for: 一种可穿戴的单眼情绪识别设备,能够在光照条件变化下从部分观测中实时识别情绪。
  • methods: 采用仿生的事件相机装置和一种新设计的轻量级脉冲神经网络SEEN(Spiking Eye Emotion Network)。
  • results: 在专门采集的单眼事件情绪数据集(SEE)上进行了广泛验证,证明了该方法的有效性。
    Abstract We introduce a wearable single-eye emotion recognition device and a real-time approach to recognizing emotions from partial observations of an emotion that is robust to changes in lighting conditions. At the heart of our method is a bio-inspired event-based camera setup and a newly designed lightweight Spiking Eye Emotion Network (SEEN). Compared to conventional cameras, event-based cameras offer a higher dynamic range (up to 140 dB vs. 80 dB) and a higher temporal resolution. Thus, the captured events can encode rich temporal cues under challenging lighting conditions. However, these events lack texture information, posing problems in decoding temporal information effectively. SEEN tackles this issue from two different perspectives. First, we adopt convolutional spiking layers to take advantage of the spiking neural network's ability to decode pertinent temporal information. Second, SEEN learns to extract essential spatial cues from corresponding intensity frames and leverages a novel weight-copy scheme to convey spatial attention to the convolutional spiking layers during training and inference. We extensively validate and demonstrate the effectiveness of our approach on a specially collected Single-eye Event-based Emotion (SEE) dataset. To the best of our knowledge, our method is the first eye-based emotion recognition method that leverages event-based cameras and spiking neural network.
    摘要 我们提出一种可穿戴的单眼情绪识别设备,以及一种能够从部分观测中实时识别情绪、且对光照条件变化鲁棒的方法。该方法的核心是一套仿生的事件相机装置和一种新设计的轻量级脉冲神经网络SEEN(Spiking Eye Emotion Network)。与传统相机相比,事件相机具有更高的动态范围(约140 dB,相比80 dB)和更高的时间分辨率,因而捕获的事件能够在苛刻的光照条件下编码丰富的时间线索;但这些事件缺乏纹理信息,给时间信息的有效解码带来困难。SEEN从两个角度解决这一问题:其一,采用卷积脉冲层,利用脉冲神经网络解码相关时间信息的能力;其二,SEEN学习从对应的强度帧中提取关键空间线索,并借助一种新的权重复制方案,在训练和推理过程中将空间注意力传递给卷积脉冲层。我们在专门采集的单眼事件情绪数据集(SEE)上对该方法进行了广泛验证,证明了其有效性。据我们所知,这是首个利用事件相机和脉冲神经网络的基于眼部的情绪识别方法。
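
A toy leaky integrate-and-fire update, of the kind a convolutional spiking layer builds on, is sketched below; the threshold, decay factor, and hard-reset scheme are generic assumptions and not SEEN's specific parameterization.

```python
import torch

def lif_step(input_current, membrane, decay=0.5, threshold=1.0):
    """One leaky integrate-and-fire step: integrate the input, emit binary
    spikes where the membrane crosses threshold, then reset those units."""
    membrane = decay * membrane + input_current
    spikes = (membrane >= threshold).float()
    membrane = membrane * (1.0 - spikes)     # hard reset after spiking
    return spikes, membrane

spikes, mem = lif_step(torch.randn(4, 16, 32, 32), torch.zeros(4, 16, 32, 32))
```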

Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

  • paper_url: http://arxiv.org/abs/2310.03986
  • repo_url: None
  • paper_authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif
  • for: 提高缺失模态情况下多模态学习在下游任务上的整体表现
  • methods: 利用低秩适应并调制中间特征来补偿缺失的模态
  • results: 提升多模态学习对缺失模态的鲁棒性,在部分情况下超越针对可用模态组合单独训练的专用网络
    Abstract Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in some correlated modalities. However, we observe that the performance of several existing multimodal networks significantly deteriorates if one or multiple modalities are absent at test time. To enable robustness to missing modalities, we propose simple and parameter-efficient adaptation procedures for pretrained multimodal networks. In particular, we exploit low-rank adaptation and modulation of intermediate features to compensate for the missing modalities. We demonstrate that such adaptation can partially bridge performance drop due to missing modalities and outperform independent, dedicated networks trained for the available modality combinations in some cases. The proposed adaptation requires extremely small number of parameters (e.g., fewer than 0.7% of the total parameters in most experiments). We conduct a series of experiments to highlight the robustness of our proposed method using diverse datasets for RGB-thermal and RGB-Depth semantic segmentation, multimodal material segmentation, and multimodal sentiment analysis tasks. Our proposed method demonstrates versatility across various tasks and datasets, and outperforms existing methods for robust multimodal learning with missing modalities.
    摘要 多模态学习旨在利用来自多个来源的数据来提升下游任务的整体性能。理想情况下,数据中的冗余可以使多模态系统在某些相关模态的观测缺失或损坏时依然保持鲁棒。然而我们观察到,若测试时缺失一个或多个模态,许多现有多模态网络的性能会显著下降。为实现对缺失模态的鲁棒性,我们为预训练的多模态网络提出了简单且参数高效的适应方法,具体而言,利用低秩适应和对中间特征的调制来补偿缺失的模态。我们证明这种适应能够部分弥补因缺失模态导致的性能下降,在某些情况下甚至优于针对可用模态组合单独训练的专用网络。所提出的适应所需参数极少(在多数实验中不足总参数的0.7%)。我们在RGB-热成像与RGB-深度语义分割、多模态材质分割以及多模态情感分析等多种数据集上进行了一系列实验,展示了所提方法的鲁棒性。该方法在多种任务和数据集上表现出通用性,并优于现有的面向缺失模态的鲁棒多模态学习方法。
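
A minimal sketch of the kind of parameter-efficient, low-rank adaptation the abstract describes is given below: a frozen linear layer augmented with a trainable rank-r update. The rank, placement, and frozen base weights are standard LoRA-style assumptions; the paper's exact adaptation and feature-modulation scheme may differ.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update
    (sketch of LoRA-style adaptation for missing-modality compensation)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # start as an identity-preserving update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

layer = LowRankAdaptedLinear(nn.Linear(256, 256), rank=4)
out = layer(torch.randn(8, 256))             # only ~0.8% extra trainable parameters here
```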

Towards Increasing the Robustness of Predictive Steering-Control Autonomous Navigation Systems Against Dash Cam Image Angle Perturbations Due to Pothole Encounters

  • paper_url: http://arxiv.org/abs/2310.03959
  • repo_url: None
  • paper_authors: Shivam Aarya
  • for: 通过补偿行车记录仪图像角度的扰动,减少由此引起的转向角预测误差,从而提升自动驾驶车辆的稳定性和安全性。
  • methods: 提出一种新的补偿模型,对摄像头图像角度扰动进行修正,以减少转向控制预测算法的误差。
  • results: 在公开数据集的扰动图像上评估,该模型可将估计转向角的误差降至2.3%,使自动转向控制对车轮压过坑洼时产生的图像角度扰动具有鲁棒性。
    Abstract Vehicle manufacturers are racing to create autonomous navigation and steering control algorithms for their vehicles. These software are made to handle various real-life scenarios such as obstacle avoidance and lane maneuvering. There is some ongoing research to incorporate pothole avoidance into these autonomous systems. However, there is very little research on the effect of hitting a pothole on the autonomous navigation software that uses cameras to make driving decisions. Perturbations in the camera angle when hitting a pothole can cause errors in the predicted steering angle. In this paper, we present a new model to compensate for such angle perturbations and reduce any errors in steering control prediction algorithms. We evaluate our model on perturbations of publicly available datasets and show our model can reduce the errors in the estimated steering angle from perturbed images to 2.3%, making autonomous steering control robust against the dash cam image angle perturbations induced when one wheel of a car goes over a pothole.
    摘要 汽车制造商正竞相为其车辆开发自动导航与转向控制算法,这些软件需要应对避障、变道等各种真实场景。目前已有一些将规避坑洼纳入自动驾驶系统的研究,但关于车辆压过坑洼对依赖摄像头做出驾驶决策的自动导航软件有何影响的研究很少。压过坑洼时摄像头角度的扰动会导致预测转向角出现误差。本文提出一种新模型来补偿此类角度扰动,从而减少转向控制预测算法的误差。我们在公开数据集的扰动图像上评估该模型,结果表明它可将受扰图像的估计转向角误差降至2.3%,使自动转向控制对车辆单轮压过坑洼时行车记录仪图像角度的扰动具有鲁棒性。

Understanding prompt engineering may not require rethinking generalization

  • paper_url: http://arxiv.org/abs/2310.03957
  • repo_url: None
  • paper_authors: Victor Akinwande, Yiding Jiang, Dylan Sam, J. Zico Kolter
  • for: 解释提示式视觉-语言模型在零样本(无需显式训练)情况下为何具有良好的泛化性能。
  • methods: 该论文使用的方法是通过设计提示来建立分类器,而不需要显式的训练过程。
  • results: 论文表明,基于提示的方法可以得到非常紧的泛化界,并可据此为广泛采用的提示工程实践(通过设计提示来获得良好测试性能)提供理论解释。
    Abstract Zero-shot learning in prompted vision-language models, the practice of crafting prompts to build classifiers without an explicit training process, has achieved impressive performance in many settings. This success presents a seemingly surprising observation: these methods suffer relatively little from overfitting, i.e., when a prompt is manually engineered to achieve low error on a given training set (thus rendering the method no longer actually zero-shot), the approach still performs well on held-out test data. In this paper, we show that we can explain such performance well via recourse to classical PAC-Bayes bounds. Specifically, we show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature: for instance, the generalization bound of an ImageNet classifier is often within a few percentage points of the true test error. We demonstrate empirically that this holds for existing handcrafted prompts and prompts generated through simple greedy search. Furthermore, the resulting bound is well-suited for model selection: the models with the best bound typically also have the best test performance. This work thus provides a possible justification for the widespread practice of prompt engineering, even if it seems that such methods could potentially overfit the training data.
    摘要 在提示式视觉-语言模型中的零样本学习,即通过人工设计提示来构建分类器而无需显式训练过程,已在许多场景中取得了出色表现。这一成功带来一个看似意外的观察:这些方法的过拟合程度相对较小,也就是说,即使提示是针对某个训练集人工设计以取得低错误率的(此时方法已不再是真正的零样本学习),它在保留的测试数据上仍然表现良好。在本文中,我们证明可以借助经典的PAC-Bayes界很好地解释这种表现。具体而言,提示的离散性,结合由语言模型给出的PAC-Bayes先验,可以得到按文献标准来看异常紧的泛化界:例如,ImageNet分类器的泛化界往往与真实测试误差只相差几个百分点。我们通过实验证明,这一结论对现有的人工设计提示以及通过简单贪心搜索生成的提示都成立。此外,所得到的界也很适合用于模型选择:界最优的模型通常也具有最好的测试性能。因此,即便提示工程看似可能过拟合训练数据,这项工作也为其广泛实践提供了一种可能的理论依据。
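
For readers who want the shape of the argument, a standard PAC-Bayes bound of the kind invoked above can be written as follows; the exact constants and the precise way the language-model prior enters are simplified here relative to the paper. With prior $P$ over the discrete prompt space, posterior $Q$ (often a point mass on the engineered prompt), and $n$ labeled examples, with probability at least $1-\delta$:

$$\mathbb{E}_{h\sim Q}\big[L(h)\big] \;\le\; \mathbb{E}_{h\sim Q}\big[\hat L(h)\big] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P)+\ln(n/\delta)}{2(n-1)}},$$

where $L$ and $\hat L$ are the true and empirical risks; for a point mass on prompt $p$, $\mathrm{KL}(Q\,\|\,P) = -\ln P(p)$, i.e. the negative log-probability the language-model prior assigns to that prompt, which is why likely (natural-sounding) prompts yield tight bounds.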

Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction

  • paper_url: http://arxiv.org/abs/2310.03956
  • repo_url: None
  • paper_authors: Sara Fridovich-Keil, Fabrizio Valdivia, Gordon Wetzstein, Benjamin Recht, Mahdi Soltanolkotabi
  • for: This paper aims to improve the accuracy and reduce artifacts in computed tomography (CT) reconstruction, particularly in the presence of high-density materials such as metal.
  • methods: The authors propose a direct nonlinear CT reconstruction technique that bypasses the conventional preprocessing step of inverting the nonlinear measurement preprocessing. Instead, they use gradient descent to optimize the nonlinear forward model and reconstruct the underlying signal directly from the raw measurements.
  • results: The authors demonstrate the effectiveness of their proposed technique through experiments on synthetic and real 3D volumes using cone-beam CT. They show that their approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns, and achieves a near minimal number of random measurements with a geometric rate of convergence. Additionally, they prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal, by enforcing prior structural information about the signal through constraints on the optimization variables.
    Abstract In computed tomography (CT), the forward model consists of a linear Radon transform followed by an exponential nonlinearity based on the attenuation of light according to the Beer-Lambert Law. Conventional reconstruction often involves inverting this nonlinearity as a preprocessing step and then solving a convex inverse problem. However, this nonlinear measurement preprocessing required to use the Radon transform is poorly conditioned in the vicinity of high-density materials, such as metal. This preprocessing makes CT reconstruction methods numerically sensitive and susceptible to artifacts near high-density regions. In this paper, we study a technique where the signal is directly reconstructed from raw measurements through the nonlinear forward model. Though this optimization is nonconvex, we show that gradient descent provably converges to the global optimum at a geometric rate, perfectly reconstructing the underlying signal with a near minimal number of random measurements. We also prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal. This is achieved by enforcing prior structural information about the signal through constraints on the optimization variables. We illustrate the benefits of direct nonlinear CT reconstruction with cone-beam CT experiments on synthetic and real 3D volumes. We show that this approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns.
    摘要 在计算机断层成像(CT)中,前向模型由线性的拉东变换和基于比尔-朗伯定律的光衰减指数非线性组成。传统重建通常先将这一非线性求逆作为预处理步骤,再求解一个凸逆问题。然而,使用拉东变换所需的这种非线性测量预处理在金属等高密度材料附近条件很差,使CT重建方法在数值上敏感,并容易在高密度区域附近产生伪影。本文研究一种直接通过非线性前向模型从原始测量重建信号的技术。尽管该优化问题是非凸的,我们证明梯度下降能够以几何速率可证明地收敛到全局最优,在接近最少数量的随机测量下完美重建底层信号。我们还在欠定设置(测量数显著小于信号维度)下证明了类似结果,这是通过对优化变量施加约束、引入信号的先验结构信息实现的。我们在合成与真实的三维体数据上进行了锥束CT实验,展示了直接非线性CT重建的优势:与商业重建方法相比,该方法在带有金属牙冠的人类头骨重建中减少了金属伪影。
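
A toy version of the direct nonlinear reconstruction described above is sketched below: gradient descent on a least-squares loss through the Beer-Lambert forward model $y = \exp(-Ax)$. The random "projection" matrix, step size, and iteration count are placeholders for illustration, not the paper's setup.

```python
import numpy as np

def reconstruct_nonlinear(A, y, steps=500, lr=0.2):
    """Gradient descent on f(x) = 0.5 * || exp(-A x) - y ||^2, i.e. fitting the
    raw (non-log-transformed) measurements directly (sketch)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        e = np.exp(-A @ x)
        r = e - y                    # residual in measurement space
        grad = -A.T @ (r * e)        # chain rule through the exponential
        x -= lr * grad
    return x

# Tiny synthetic example; a random nonnegative matrix stands in for the Radon transform.
rng = np.random.default_rng(0)
A = np.abs(rng.normal(size=(200, 50))) * 0.05
x_true = np.abs(rng.normal(size=50))
x_hat = reconstruct_nonlinear(A, np.exp(-A @ x_true))
```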

ILSH: The Imperial Light-Stage Head Dataset for Human Head View Synthesis

  • paper_url: http://arxiv.org/abs/2310.03952
  • repo_url: None
  • paper_authors: Jiali Zheng, Youngkyoon Jang, Athanasios Papaioannou, Christos Kampouris, Rolandos Alexandros Potamias, Foivos Paraperas Papantoniou, Efstathios Galanakis, Ales Leonardis, Stefanos Zafeiriou
  • for: 本研究 introduce Imperial Light-Stage Head (ILSH) dataset, a novel light-stage-captured human head dataset to support view synthesis academic challenges for human heads.
  • methods: 本研究使用 specifically designed light-stage to capture high-resolution (4K) human head images, and addresses challenges (preprocessing, ethical issues) in collecting high-quality data.
  • results: 研究 obtained 1,248 close-up head images, border masks, and camera pose pairs from 52 subjects captured using 24 cameras with all 82 lighting sources turned on.
    Abstract This paper introduces the Imperial Light-Stage Head (ILSH) dataset, a novel light-stage-captured human head dataset designed to support view synthesis academic challenges for human heads. The ILSH dataset is intended to facilitate diverse approaches, such as scene-specific or generic neural rendering, multiple-view geometry, 3D vision, and computer graphics, to further advance the development of photo-realistic human avatars. This paper details the setup of a light-stage specifically designed to capture high-resolution (4K) human head images and describes the process of addressing challenges (preprocessing, ethical issues) in collecting high-quality data. In addition to the data collection, we address the split of the dataset into train, validation, and test sets. Our goal is to design and support a fair view synthesis challenge task for this novel dataset, such that a similar level of performance can be maintained and expected when using the test set, as when using the validation set. The ILSH dataset consists of 52 subjects captured using 24 cameras with all 82 lighting sources turned on, resulting in a total of 1,248 close-up head images, border masks, and camera pose pairs.
    摘要 本文介绍帝国光台头部(ILSH)数据集,这是一个使用光台采集的新型人头数据集,旨在支持面向人头的视图合成学术挑战。ILSH数据集意在支持场景特定或通用神经渲染、多视图几何、三维视觉和计算机图形学等多种方法,进一步推动照片级真实感人类虚拟形象的发展。本文详细介绍了专为采集高分辨率(4K)人头图像而设计的光台装置,并描述了在采集高质量数据过程中应对各种挑战(预处理、伦理问题)的流程。除数据采集外,我们还说明了数据集划分为训练集、验证集和测试集的方式;我们的目标是为这一新数据集设计并支持一个公平的视图合成挑战任务,使得使用测试集时能够保持并预期与使用验证集时相近的性能水平。ILSH数据集包含52名受试者,在全部82个光源开启的情况下由24台相机拍摄,共得到1,248组近景头部图像、边界掩码和相机位姿。