cs.CV - 2023-09-29

MARL: Multi-scale Archetype Representation Learning for Urban Building Energy Modeling

  • paper_url: http://arxiv.org/abs/2310.00180
  • repo_url: https://github.com/zixunhuang1997/marl-buildingenergyestimation
  • paper_authors: Xinwei Zhuang, Zixun Huang, Wentao Zeng, Luisa Caldas
  • For: This study aims to improve the accuracy and reliability of urban building energy estimation by extracting the geometric specificities of a local building stock and constructing representative building archetypes.
  • Methods: The study uses representation learning, specifically a VQ-AE that encodes local building footprints and purifies geometric information into latent vectors constrained by multiple architectural downstream tasks.
  • Results: Geometric feature embeddings extracted by MARL improve the accuracy and reliability of urban building energy estimation, and archetypes can be generated automatically across multi-scale regions. Code, dataset, and trained models are publicly available: https://github.com/ZixunHuang1997/MARL-BuildingEnergyEstimation
    Abstract Building archetypes, representative models of building stock, are crucial for precise energy simulations in Urban Building Energy Modeling. The current widely adopted building archetypes are developed on a nationwide scale, potentially neglecting the impact of local buildings' geometric specificities. We present Multi-scale Archetype Representation Learning (MARL), an approach that leverages representation learning to extract geometric features from a specific building stock. Built upon VQ-AE, MARL encodes building footprints and purifies geometric information into latent vectors constrained by multiple architectural downstream tasks. These tailored representations are proven valuable for further clustering and building energy modeling. The advantages of our algorithm are its adaptability with respect to the different building footprint sizes, the ability for automatic generation across multi-scale regions, and the preservation of geometric features across neighborhoods and local ecologies. In our study spanning five regions in LA County, we show MARL surpasses both conventional and VQ-AE extracted archetypes in performance. Results demonstrate that geometric feature embeddings significantly improve the accuracy and reliability of energy consumption estimates. Code, dataset and trained models are publicly available: https://github.com/ZixunHuang1997/MARL-BuildingEnergyEstimation
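
The abstract describes MARL as built on a VQ-AE that compresses building footprints into quantized latent vectors. The paper's exact architecture is not given here, so the snippet below is only a minimal sketch of the vector-quantization step such an encoder typically relies on (nearest-codebook lookup with a straight-through gradient); the codebook size, dimensions, and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps encoder outputs to their nearest codebook entries."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learnable code vectors

    def forward(self, z):                             # z: (batch, dim) encoder features
        d = torch.cdist(z, self.codebook.weight)      # distance to every code
        idx = d.argmin(dim=1)                         # nearest-code assignment per sample
        z_q = self.codebook(idx)                      # quantized latent vectors
        z_q = z + (z_q - z).detach()                  # straight-through gradient to the encoder
        return z_q, idx

# usage on a batch of hypothetical footprint embeddings
vq = VectorQuantizer()
z = torch.randn(8, 64)
z_q, codes = vq(z)
print(z_q.shape, codes.shape)   # torch.Size([8, 64]) torch.Size([8])
```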

SCoRe: Submodular Combinatorial Representation Learning for Real-World Class-Imbalanced Settings

  • paper_url: http://arxiv.org/abs/2310.00165
  • repo_url: None
  • paper_authors: Anay Majee, Suraj Kothawade, Krishnateja Killiamsetty, Rishabh Iyer
  • For: The paper focuses on the challenge of representation learning in real-world class-imbalanced settings, particularly in deep learning.
  • Methods: The authors propose a new framework called SCoRe (Submodular Combinatorial Representation Learning) that leverages set-based combinatorial functions to model diversity and cooperation among feature clusters. They also introduce a family of Submodular Combinatorial Loss functions to overcome the pitfalls of class imbalance in contrastive learning.
  • Results: The proposed objectives outperform state-of-the-art metric learners by up to 7.6% for imbalanced classification tasks and up to 19.4% for object detection tasks on several benchmark datasets.
    Abstract Representation Learning in real-world class-imbalanced settings has emerged as a challenging task in the evolution of deep learning. Lack of diversity in visual and structural features for rare classes restricts modern neural networks to learn discriminative feature clusters. This manifests in the form of large inter-class bias between rare object classes and elevated intra-class variance among abundant classes in the dataset. Although deep metric learning approaches have shown promise in this domain, significant improvements need to be made to overcome the challenges associated with class-imbalance in mission critical tasks like autonomous navigation and medical diagnostics. Set-based combinatorial functions like Submodular Information Measures exhibit properties that allow them to simultaneously model diversity and cooperation among feature clusters. In this paper, we introduce the SCoRe (Submodular Combinatorial Representation Learning) framework and propose a family of Submodular Combinatorial Loss functions to overcome these pitfalls in contrastive learning. We also show that existing contrastive learning approaches are either submodular or can be re-formulated to create their submodular counterparts. We conduct experiments on the newly introduced family of combinatorial objectives on two image classification benchmarks - pathologically imbalanced CIFAR-10, subsets of MedMNIST and a real-world road object detection benchmark - India Driving Dataset (IDD). Our experiments clearly show that the newly introduced objectives like Facility Location, Graph-Cut and Log Determinant outperform state-of-the-art metric learners by up to 7.6% for the imbalanced classification tasks and up to 19.4% for object detection tasks.
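
The abstract names Facility Location, Graph-Cut, and Log Determinant as the combinatorial objectives but does not spell them out, so the sketch below only illustrates the standard facility-location set function on a pairwise cosine-similarity matrix; treating its value as a coverage score over a feature cluster is an assumption for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def facility_location(features, selected_idx):
    """Facility-location value: how well the selected subset covers all points."""
    f = F.normalize(features, dim=1)
    sim = f @ f.t()                                   # pairwise cosine similarity
    # each point is represented by its most similar selected element
    coverage = sim[:, selected_idx].max(dim=1).values
    return coverage.sum()

# toy example: the value grows, with diminishing returns, as the subset grows
x = torch.randn(100, 16)
print(facility_location(x, [0]).item())
print(facility_location(x, [0, 10, 20]).item())
```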

PRIME: Prioritizing Interpretability in Failure Mode Extraction

  • paper_url: http://arxiv.org/abs/2310.00164
  • repo_url: None
  • paper_authors: Keivan Rezaei, Mehrdad Saberi, Mazda Moayeri, Soheil Feizi
  • for: This work aims to provide human-understandable descriptions of failure modes that arise in trained image classification models.
  • methods: The authors propose a new approach that prioritizes interpretability: they first obtain human-understandable concepts (tags) for images in the dataset and then analyze the model's behavior based on combinations of these tags. The method also ensures that the tags describing a failure mode form a minimal set, avoiding redundant and noisy descriptions.
  • results: Experiments on several datasets show that the method successfully identifies failure modes and generates high-quality text descriptions of them, highlighting the importance of interpretability in understanding model failures.
    Abstract In this work, we study the challenge of providing human-understandable descriptions for failure modes in trained image classification models. Existing works address this problem by first identifying clusters (or directions) of incorrectly classified samples in a latent space and then aiming to provide human-understandable text descriptions for them. We observe that in some cases, describing text does not match well with identified failure modes, partially owing to the fact that shared interpretable attributes of failure modes may not be captured using clustering in the feature space. To improve on these shortcomings, we propose a novel approach that prioritizes interpretability in this problem: we start by obtaining human-understandable concepts (tags) of images in the dataset and then analyze the model's behavior based on the presence or absence of combinations of these tags. Our method also ensures that the tags describing a failure mode form a minimal set, avoiding redundant and noisy descriptions. Through several experiments on different datasets, we show that our method successfully identifies failure modes and generates high-quality text descriptions associated with them. These results highlight the importance of prioritizing interpretability in understanding model failures.

Prior Mismatch and Adaptation in PnP-ADMM with a Nonconvex Convergence Analysis

  • paper_url: http://arxiv.org/abs/2310.00133
  • repo_url: None
  • paper_authors: Shirin Shoushtari, Jiaming Liu, Edward P. Chandler, M. Salman Asif, Ulugbek S. Kamilov
  • for: The paper addresses imaging inverse problems by integrating physical measurement models with image priors specified through image denoisers.
  • methods: The paper studies the Alternating Direction Method of Multipliers (ADMM) variant of Plug-and-Play (PnP) priors and analyzes, both theoretically and numerically, how mismatch between the training and testing data distributions affects the method.
  • results: PnP-ADMM is shown to be relatively robust to prior distribution mismatch on image super-resolution, although mismatched denoisers still degrade performance; a simple domain adaptation strategy using only a few training samples from the desired distribution significantly closes this gap.
    Abstract Plug-and-Play (PnP) priors is a widely-used family of methods for solving imaging inverse problems by integrating physical measurement models with image priors specified using image denoisers. PnP methods have been shown to achieve state-of-the-art performance when the prior is obtained using powerful deep denoisers. Despite extensive work on PnP, the topic of distribution mismatch between the training and testing data has often been overlooked in the PnP literature. This paper presents a set of new theoretical and numerical results on the topic of prior distribution mismatch and domain adaptation for alternating direction method of multipliers (ADMM) variant of PnP. Our theoretical result provides an explicit error bound for PnP-ADMM due to the mismatch between the desired denoiser and the one used for inference. Our analysis contributes to the work in the area by considering the mismatch under nonconvex data-fidelity terms and expansive denoisers. Our first set of numerical results quantifies the impact of the prior distribution mismatch on the performance of PnP-ADMM on the problem of image super-resolution. Our second set of numerical results considers a simple and effective domain adaption strategy that closes the performance gap due to the use of mismatched denoisers. Our results suggest the relative robustness of PnP-ADMM to prior distribution mismatch, while also showing that the performance gap can be significantly reduced with few training samples from the desired distribution.
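
The abstract refers to the standard PnP-ADMM scheme, in which the proximal step for the image prior is replaced by an off-the-shelf denoiser. The sketch below shows that generic iteration for a linear measurement model y ≈ Ax; the quadratic data-fidelity solve and the soft-threshold "denoiser" are illustrative assumptions (the paper also covers nonconvex fidelity terms and expansive denoisers), not the authors' implementation.

```python
import numpy as np

def pnp_admm(y, A, denoiser, rho=1.0, iters=50):
    """Generic PnP-ADMM for y ≈ A @ x with a plugged-in denoiser as the prior."""
    n = A.shape[1]
    x, v, u = np.zeros(n), np.zeros(n), np.zeros(n)        # estimate, auxiliary, scaled dual
    lhs = A.T @ A + rho * np.eye(n)                         # normal equations for the data step
    for _ in range(iters):
        x = np.linalg.solve(lhs, A.T @ y + rho * (v - u))   # data-fidelity proximal step
        v = denoiser(x + u)                                 # prior step = denoising
        u = u + x - v                                       # dual update
    return x

# toy run with a hypothetical soft-thresholding "denoiser"
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = rng.standard_normal(20)
y = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = pnp_admm(y, A, denoiser=lambda z: np.sign(z) * np.maximum(np.abs(z) - 0.05, 0))
print("reconstruction error:", np.linalg.norm(x_hat - x_true))
```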

Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition

  • paper_url: http://arxiv.org/abs/2310.00132
  • repo_url: None
  • paper_authors: Xiang Li, Jinglu Wang, Xiaohao Xu, Xiulian Peng, Rita Singh, Yan Lu, Bhiksha Raj
  • for: The paper addresses audiovisual segmentation (AVS) in videos, improving audio-visual correspondence through quantization and decomposition of the audio semantic space.
  • methods: The paper proposes a semantic decomposition method based on product quantization that decomposes multi-source audio semantics into several independent single-source semantics, and introduces a global-to-local quantization mechanism to handle the constant shift of audio semantics.
  • results: Experiments show that semantically quantized and decomposed audio representations significantly improve AVS performance, e.g., +21.2% mIoU on the most challenging AVS-Semantic benchmark.
    Abstract Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos based on their associated acoustic cues. With multiple sound sources involved, establishing robust correspondences between audio and visual contents poses unique challenges due to its (1) intricate entanglement across sound sources and (2) frequent shift among sound events. Assuming sound events occur independently, the multi-source semantic space (which encompasses all possible semantic categories) can be viewed as the Cartesian product of single-source sub-spaces. This motivates us to decompose the multi-source audio semantics into single-source semantics, allowing for more effective interaction with visual content. Specifically, we propose a semantic decomposition method based on product quantization, where the multi-source semantics can be decomposed and represented by several quantized single-source semantics. Furthermore, we introduce a global-to-local quantization mechanism that distills knowledge from stable global (clip-level) features into local (frame-level) ones to handle the constant shift of audio semantics. Extensive experiments demonstrate that semantically quantized and decomposed audio representation significantly improves AVS performance, e.g., +21.2% mIoU on the most challenging AVS-Semantic benchmark.

Fewshot learning on global multimodal embeddings for earth observation tasks

  • paper_url: http://arxiv.org/abs/2310.00119
  • repo_url: None
  • paper_authors: Matt Allen, Francisco Dorr, Joseph A. Gallego-Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán
  • for: This work pretrains a CLIP/ViT-based model on three different satellite imagery modalities across five AOIs covering roughly 10% of the Earth's total landmass.
  • methods: The model uses roughly 250M parameters; its embeddings are combined with classical machine learning methods for downstream earth observation tasks covering vegetation, built-up surface, croplands, and permanent water.
  • results: With only ~200-500 randomly selected labeled examples (around 4K-10K km²), performance reaches levels analogous to those achieved with the full labeled datasets (about 150K image chips or 3M km² per AOI) across all modalities, AOIs, and downstream tasks, suggesting the model has captured earth features useful in a wide variety of scenarios.
    Abstract In this work we pretrain a CLIP/ViT based model using three different modalities of satellite imagery across five AOIs covering over ~10\% of the earth total landmass, namely Sentinel 2 RGB optical imagery, Sentinel 1 SAR amplitude and Sentinel 1 SAR interferometric coherence. This model uses $\sim 250$ M parameters. Then, we use the embeddings produced for each modality with a classical machine learning method to attempt different downstream tasks for earth observation related to vegetation, built up surface, croplands and permanent water. We consistently show how we reduce the need for labeled data by 99\%, so that with ~200-500 randomly selected labeled examples (around 4K-10K km$^2$) we reach performance levels analogous to those achieved with the full labeled datasets (about 150K image chips or 3M km$^2$ in each AOI) on all modalities, AOIs and downstream tasks. This leads us to think that the model has captured significant earth features useful in a wide variety of scenarios. To enhance our model's usability in practice, its architecture allows inference in contexts with missing modalities and even missing channels within each modality. Additionally, we visually show that this embedding space, obtained with no labels, is sensible to the different earth features represented by the labelled datasets we selected.
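
The downstream recipe in the abstract is to freeze the pretrained embeddings and fit a classical model on a few hundred labeled examples. The snippet below is a minimal sketch of that pattern with scikit-learn logistic regression on stand-in arrays; the array shapes, the synthetic labels, and the evaluation split are placeholders, not the paper's data or tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# stand-ins for precomputed multimodal embeddings and a binary label (e.g. built-up or not)
embeddings = rng.standard_normal((20_000, 256)).astype(np.float32)
labels = (embeddings[:, 0] > 0).astype(int)           # synthetic labels, demo only

# few-shot regime: a few hundred randomly selected labeled chips
few_idx = rng.choice(len(embeddings), size=500, replace=False)
clf = LogisticRegression(max_iter=1000).fit(embeddings[few_idx], labels[few_idx])

# evaluate on the remaining chips as a proxy for the full-dataset comparison
rest = np.setdiff1d(np.arange(len(embeddings)), few_idx)
print("accuracy:", accuracy_score(labels[rest], clf.predict(embeddings[rest])))
```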

Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study

  • paper_url: http://arxiv.org/abs/2310.00108
  • repo_url: https://github.com/ruoxi-jia-group/clip-mia
  • paper_authors: Myeongseob Ko, Ming Jin, Chenguang Wang, Ruoxi Jia
  • for: This paper aims to develop practical membership inference attacks (MIAs) against large-scale multi-modal models, specifically targeting CLIP models.
  • methods: The proposed baseline strategy thresholds the cosine similarity between text and image features of a target point, and the enhanced attack method aggregates cosine similarity across transformations of the target. Additionally, a new weakly supervised attack method leverages ground-truth non-members to further enhance the attack.
  • results: The simple baseline achieves over 75% membership identification accuracy, and the enhanced attacks outperform the baseline across multiple models and datasets, with the weakly supervised attack demonstrating an average-case performance improvement of 17% and being at least 7x more effective at low false-positive rates.
    Abstract Membership inference attacks (MIAs) aim to infer whether a data point has been used to train a machine learning model. These attacks can be employed to identify potential privacy vulnerabilities and detect unauthorized use of personal data. While MIAs have been traditionally studied for simple classification models, recent advancements in multi-modal pre-training, such as CLIP, have demonstrated remarkable zero-shot performance across a range of computer vision tasks. However, the sheer scale of data and models presents significant computational challenges for performing the attacks. This paper takes a first step towards developing practical MIAs against large-scale multi-modal models. We introduce a simple baseline strategy by thresholding the cosine similarity between text and image features of a target point and propose further enhancing the baseline by aggregating cosine similarity across transformations of the target. We also present a new weakly supervised attack method that leverages ground-truth non-members (e.g., obtained by using the publication date of a target model and the timestamps of the open data) to further enhance the attack. Our evaluation shows that CLIP models are susceptible to our attack strategies, with our simple baseline achieving over $75\%$ membership identification accuracy. Furthermore, our enhanced attacks outperform the baseline across multiple models and datasets, with the weakly supervised attack demonstrating an average-case performance improvement of $17\%$ and being at least $7$X more effective at low false-positive rates. These findings highlight the importance of protecting the privacy of multi-modal foundational models, which were previously assumed to be less susceptible to MIAs due to less overfitting. Our code is available at https://github.com/ruoxi-jia-group/CLIP-MIA.
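
The baseline attack in the abstract simply thresholds the cosine similarity between a candidate image and its paired caption under the target CLIP model. The sketch below illustrates that decision rule with the open_clip library; the checkpoint, caption, synthetic test image, and threshold value are assumptions for illustration, not the paper's configuration.

```python
import numpy as np
import torch
import open_clip
from PIL import Image

# target model; in the paper this would be the CLIP model under attack
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def membership_score(pil_image, caption):
    """Cosine similarity between an image and its paired caption under the target model."""
    image = preprocess(pil_image).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f * txt_f).sum().item()

# stand-in for a candidate (image, caption) pair whose training membership is being tested
candidate = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
score = membership_score(candidate, "a photo of a golden retriever on a beach")

THRESHOLD = 0.25   # hypothetical cut-off; in practice calibrated, e.g. on known non-members
print("member" if score > THRESHOLD else "non-member", round(score, 3))
```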

Denoising and Selecting Pseudo-Heatmaps for Semi-Supervised Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.00099
  • repo_url: None
  • paper_authors: Zhuoran Yu, Manchen Wang, Yanbei Chen, Paolo Favaro, Davide Modolo
  • for: Proposes a new semi-supervised learning design for human pose estimation that revisits the popular dual-student framework and enhances it in two ways.
  • methods: First, a denoising scheme generates reliable pseudo-heatmaps as targets for learning from unlabeled data, using multi-view augmentations and a threshold-and-refine procedure. Second, learning targets are selected from these pseudo-heatmaps guided by the estimated cross-student uncertainty.
  • results: Across multiple evaluation setups on the COCO benchmark, the model outperforms previous state-of-the-art semi-supervised pose estimators, especially in the extreme low-data regime; with only 0.5K labeled images it surpasses the best competitor by 7.22 mAP (+25% absolute improvement). The model can also learn from unlabeled data in the wild to further boost its generalization and performance.
    Abstract We propose a new semi-supervised learning design for human pose estimation that revisits the popular dual-student framework and enhances it two ways. First, we introduce a denoising scheme to generate reliable pseudo-heatmaps as targets for learning from unlabeled data. This uses multi-view augmentations and a threshold-and-refine procedure to produce a pool of pseudo-heatmaps. Second, we select the learning targets from these pseudo-heatmaps guided by the estimated cross-student uncertainty. We evaluate our proposed method on multiple evaluation setups on the COCO benchmark. Our results show that our model outperforms previous state-of-the-art semi-supervised pose estimators, especially in extreme low-data regime. For example with only 0.5K labeled images our method is capable of surpassing the best competitor by 7.22 mAP (+25% absolute improvement). We also demonstrate that our model can learn effectively from unlabeled data in the wild to further boost its generalization and performance.

Towards Few-Call Model Stealing via Active Self-Paced Knowledge Distillation and Diffusion-Based Image Generation

  • paper_url: http://arxiv.org/abs/2310.00096
  • repo_url: None
  • paper_authors: Vlad Hondru, Radu Tudor Ionescu
  • for: This work explores a new use case: copying black-box image classification models using diffusion models, without access to the original training data, model architecture, or weights.
  • methods: The authors propose a framework that uses a synthetic proxy data set generated by a diffusion model as training data, under a constraint on the number of API calls. A limited number of samples is passed through the black-box model to collect labels, and the knowledge of the black-box teacher (attacked model) is then distilled into a student model (the copy) using an active self-paced knowledge distillation scheme.
  • results: Experiments on two data sets show that the framework outperforms two state-of-the-art methods in the few-call model extraction scenario.
    Abstract Diffusion models showcased strong capabilities in image synthesis, being used in many computer vision tasks with great success. To this end, we propose to explore a new use case, namely to copy black-box classification models without having access to the original training data, the architecture, and the weights of the model, \ie~the model is only exposed through an inference API. More specifically, we can only observe the (soft or hard) labels for some image samples passed as input to the model. Furthermore, we consider an additional constraint limiting the number of model calls, mostly focusing our research on few-call model stealing. In order to solve the model extraction task given the applied restrictions, we propose the following framework. As training data, we create a synthetic data set (called proxy data set) by leveraging the ability of diffusion models to generate realistic and diverse images. Given a maximum number of allowed API calls, we pass the respective number of samples through the black-box model to collect labels. Finally, we distill the knowledge of the black-box teacher (attacked model) into a student model (copy of the attacked model), harnessing both labeled and unlabeled data generated by the diffusion model. We employ a novel active self-paced learning framework to make the most of the proxy data during distillation. Our empirical results on two data sets confirm the superiority of our framework over two state-of-the-art methods in the few-call model extraction scenario.
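
Distilling the black-box teacher into the student rests on a standard soft-label objective: the student is trained to match the probability vectors returned by the API on proxy images. The sketch below shows that loss as a temperature-scaled KL divergence; the temperature and the random stand-in tensors are assumptions, and the paper's active self-paced selection over the proxy set is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, temperature=2.0):
    """KL divergence between softened student predictions and the teacher's soft labels.

    teacher_probs are the probability vectors returned by the black-box API, so only
    the student side is rescaled by the temperature in this sketch.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, teacher_probs, reduction="batchmean") * temperature ** 2

# toy usage with random tensors standing in for a batch of proxy images
student_logits = torch.randn(16, 10, requires_grad=True)
teacher_probs = F.softmax(torch.randn(16, 10), dim=1)   # soft labels from the API
loss = distillation_loss(student_logits, teacher_probs)
loss.backward()
print(loss.item())
```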

DataDAM: Efficient Dataset Distillation with Attention Matching

  • paper_url: http://arxiv.org/abs/2310.00093
  • repo_url: https://github.com/datadistillation/datadam
  • paper_authors: Ahmad Sajedi, Samir Khaki, Ehsan Amjadian, Lucy Z. Liu, Yuri A. Lawryshyn, Konstantinos N. Plataniotis
  • for: Reducing deep learning training costs while maintaining strong generalization across diverse datasets.
  • methods: Efficient Dataset Distillation with Attention Matching (DataDAM), which learns high-quality synthetic images by matching the spatial attention maps of real and synthetic data generated by different layers within a family of randomly initialized neural networks.
  • results: Achieves state-of-the-art performance compared to prior methods on CIFAR10/100, TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most settings, with improvements of up to 6.5% on CIFAR100 and 4.1% on ImageNet-1K.
    Abstract Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets. Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset and ultimately achieves test accuracy equivalent to a model trained on the whole dataset. Unfortunately, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data, and they incur significant computational costs. Despite promising results, there still exists a significant performance gap between models trained on condensed synthetic sets and those trained on the whole dataset. In this paper, we address these challenges using efficient Dataset Distillation with Attention Matching (DataDAM), achieving state-of-the-art performance while reducing training costs. Specifically, we learn synthetic images by matching the spatial attention maps of real and synthetic data generated by different layers within a family of randomly initialized neural networks. Our method outperforms the prior methods on several datasets, including CIFAR10/100, TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most of the settings, and achieves improvements of up to 6.5% and 4.1% on CIFAR100 and ImageNet-1K, respectively. We also show that our high-quality distilled images have practical benefits for downstream applications, such as continual learning and neural architecture search.
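
DataDAM's training signal is a distance between spatial attention maps of real and synthetic batches taken from intermediate layers of randomly initialized networks. The sketch below shows one common way to form such a map (channel-wise power pooling followed by normalization) and an MSE-style matching loss; the pooling exponent, the batch-averaged comparison, and plain MSE are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_attention(feat, power=2):
    """Collapse a (B, C, H, W) feature map into a normalized (B, H*W) attention map."""
    attn = feat.abs().pow(power).sum(dim=1)           # aggregate over channels
    return F.normalize(attn.flatten(1), dim=1)

def attention_matching_loss(real_feats, syn_feats):
    """Match attention maps of real and synthetic batches, layer by layer."""
    loss = 0.0
    for fr, fs in zip(real_feats, syn_feats):
        # compare batch-averaged attention so batch sizes may differ
        loss = loss + F.mse_loss(spatial_attention(fr).mean(0),
                                 spatial_attention(fs).mean(0))
    return loss

# toy usage: two "layers" of features from real images and learnable synthetic images
real = [torch.randn(64, 32, 16, 16), torch.randn(64, 64, 8, 8)]
syn = [torch.randn(10, 32, 16, 16, requires_grad=True),
       torch.randn(10, 64, 8, 8, requires_grad=True)]
loss = attention_matching_loss(real, syn)
loss.backward()
print(loss.item())
```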

Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks

  • paper_url: http://arxiv.org/abs/2310.00076
  • repo_url: None
  • paper_authors: Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei, Aounon Kumar, Atoosa Chegini, Wenxiao Wang, Soheil Feizi
  • for: This work examines techniques for distinguishing real images from AI-generated ones, to prevent fake material from being passed off as authentic and vice versa.
  • methods: The study analyzes the robustness of various AI-image detectors, including watermarking methods and classifier-based deepfake detectors.
  • results: For low-perturbation watermarking methods, there is a fundamental trade-off between the evasion error rate and the spoofing error rate under a diffusion purification attack, which effectively removes watermarks with minimal image changes. For high-perturbation watermarking, a model substitution adversarial attack can successfully remove watermarks. Watermarking methods are also vulnerable to spoofing attacks in which real images are falsely flagged as watermarked, damaging the developers' reputation.
    Abstract In light of recent advancements in generative AI models, it has become essential to distinguish genuine content from AI-generated one to prevent the malicious usage of fake materials as authentic ones and vice versa. Various techniques have been introduced for identifying AI-generated images, with watermarking emerging as a promising approach. In this paper, we analyze the robustness of various AI-image detectors including watermarking and classifier-based deepfake detectors. For watermarking methods that introduce subtle image perturbations (i.e., low perturbation budget methods), we reveal a fundamental trade-off between the evasion error rate (i.e., the fraction of watermarked images detected as non-watermarked ones) and the spoofing error rate (i.e., the fraction of non-watermarked images detected as watermarked ones) upon an application of a diffusion purification attack. In this regime, we also empirically show that diffusion purification effectively removes watermarks with minimal changes to images. For high perturbation watermarking methods where notable changes are applied to images, the diffusion purification attack is not effective. In this case, we develop a model substitution adversarial attack that can successfully remove watermarks. Moreover, we show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images (potentially obscene) identified as watermarked ones, damaging the reputation of the developers. In particular, by just having black-box access to the watermarking method, we show that one can generate a watermarked noise image which can be added to the real images to have them falsely flagged as watermarked ones. Finally, we extend our theory to characterize a fundamental trade-off between the robustness and reliability of classifier-based deep fake detectors and demonstrate it through experiments.

Multi-task View Synthesis with Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.17450
  • repo_url: https://github.com/zsh2000/muvienerf
  • paper_authors: Shuhong Zheng, Zhipeng Bao, Martial Hebert, Yu-Xiong Wang
  • for: This work studies synthesizing multiple scene properties from novel views in multi-task visual learning, and introduces the multi-task view synthesis (MTVS) problem setting.
  • methods: The proposed MuvieNeRF framework integrates multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. It comprises two key modules, Cross-Task Attention (CTA) and Cross-View Attention (CVA), which enable the efficient use of information across multiple views and tasks.
  • results: Extensive evaluation on both synthetic and realistic benchmarks shows that MuvieNeRF simultaneously synthesizes different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings, and exhibits universal applicability across a range of NeRF backbones. Code is available at https://github.com/zsh2000/MuvieNeRF.
    Abstract Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capability for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MTVS), which reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties, including RGB. To tackle the MTVS problem, we propose MuvieNeRF, a framework that incorporates both multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. MuvieNeRF integrates two key modules, the Cross-Task Attention (CTA) and Cross-View Attention (CVA) modules, enabling the efficient use of information across multiple views and tasks. Extensive evaluation on both synthetic and realistic benchmarks demonstrates that MuvieNeRF is capable of simultaneously synthesizing different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings. Notably, we show that MuvieNeRF exhibits universal applicability across a range of NeRF backbones. Our code is available at https://github.com/zsh2000/MuvieNeRF.

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

  • paper_url: http://arxiv.org/abs/2309.17448
  • repo_url: https://github.com/caizhongang/SMPLer-X
  • paper_authors: Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu
  • For: This work aims to advance expressive human pose and shape estimation (EHPS) by scaling up both data and model size, towards a generalist foundation model that handles diverse scenarios.
  • Methods: The study systematically investigates 32 EHPS datasets and optimizes the training scheme to select the most effective data mix, uses vision transformers to study the scaling law of model sizes in EHPS, and finetunes the foundation model into specialist models for further performance gains.
  • Results: With big data and a large model, SMPLer-X exhibits strong performance across diverse benchmarks and excellent transferability to unseen environments, delivering state-of-the-art results on seven benchmarks, including AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning).
    Abstract Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turn SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/

FACTS: First Amplify Correlations and Then Slice to Discover Bias

  • paper_url: http://arxiv.org/abs/2309.17430
  • repo_url: https://github.com/yvsriram/facts
  • paper_authors: Sriram Yenamandra, Pratik Ramesh, Viraj Prabhu, Judy Hoffman
  • for: This work aims to identify spurious correlations in computer vision datasets in order to inform downstream bias mitigation strategies.
  • methods: The proposed method, First Amplify Correlations and Then Slice to Discover Bias (FACTS), first amplifies correlations by fitting a simple bias-aligned hypothesis via strongly regularized empirical risk minimization, then performs correlation-aware slicing via mixture modeling in the bias-aligned feature space to identify underperforming data slices.
  • results: The method considerably improves correlation bias identification compared to prior work, with gains of up to 35% precision@10 across a range of diverse evaluation settings.
    Abstract Computer vision datasets frequently contain spurious correlations between task-relevant labels and (easy to learn) latent task-irrelevant attributes (e.g. context). Models trained on such datasets learn "shortcuts" and underperform on bias-conflicting slices of data where the correlation does not hold. In this work, we study the problem of identifying such slices to inform downstream bias mitigation strategies. We propose First Amplify Correlations and Then Slice to Discover Bias (FACTS), wherein we first amplify correlations to fit a simple bias-aligned hypothesis via strongly regularized empirical risk minimization. Next, we perform correlation-aware slicing via mixture modeling in bias-aligned feature space to discover underperforming data slices that capture distinct correlations. Despite its simplicity, our method considerably improves over prior work (by as much as 35% precision@10) in correlation bias identification across a range of diverse evaluation settings. Our code is available at: https://github.com/yvsriram/FACTS.
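
The slicing stage described in the abstract clusters bias-aligned features with a mixture model and then surfaces the clusters on which the model underperforms. The sketch below illustrates that step with a scikit-learn Gaussian mixture on stand-in feature and correctness arrays; the number of components and the per-slice error ranking are illustrative choices, and the correlation-amplification training stage is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# stand-ins: bias-aligned features from the amplified model, and per-sample correctness
features = rng.standard_normal((5000, 32)).astype(np.float32)
correct = rng.random(5000) > 0.2                 # True where the target model is right

# correlation-aware slicing: cluster the feature space into candidate slices
gmm = GaussianMixture(n_components=8, random_state=0).fit(features)
slice_id = gmm.predict(features)

# rank slices by error rate to surface underperforming (bias-conflicting) data
for s in sorted(range(8), key=lambda s: correct[slice_id == s].mean()):
    mask = slice_id == s
    print(f"slice {s}: {mask.sum():4d} samples, error rate {1 - correct[mask].mean():.2f}")
```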

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

  • paper_url: http://arxiv.org/abs/2309.17400
  • repo_url: None
  • paper_authors: Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet
  • for: This work presents a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models.
  • methods: The authors first show that the reward gradient can be backpropagated through the full sampling procedure, achieving strong performance across a variety of rewards. They then propose more efficient variants: DRaFT-K, which truncates backpropagation to the last K sampling steps, and DRaFT-LV, which obtains lower-variance gradient estimates for the case K=1.
  • results: Experiments show that DRaFT works well for a variety of reward functions and can substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. The authors also draw connections to prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
    Abstract We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
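
The core mechanism is to compute a differentiable reward on the final sample and backpropagate it through only the last K denoising steps, keeping earlier steps outside the autograd graph. The sketch below shows that control flow with toy `denoise_step` and `reward_fn` stand-ins; both callables, the step counts, and the optimizer settings are assumptions meant to illustrate DRaFT-K, not the paper's training loop.

```python
import torch

def denoise_step(x, t, theta):
    """Stand-in for one diffusion sampling step parameterized by theta."""
    return x - 0.1 * torch.tanh(x - theta)

def reward_fn(x):
    """Stand-in differentiable reward (e.g. an aesthetic score); maximal near x = 2."""
    return -((x - 2.0) ** 2).mean()

theta = torch.randn(1, requires_grad=True)        # the "diffusion model" parameters
opt = torch.optim.Adam([theta], lr=5e-2)
T, K = 50, 5                                      # total sampling steps, steps kept in the graph

for _ in range(200):
    x = torch.randn(64)
    with torch.no_grad():                         # early steps untracked (DRaFT-K truncation)
        for t in range(T, K, -1):
            x = denoise_step(x, t, theta)
    for t in range(K, 0, -1):                     # last K steps tracked, so the reward reaches theta
        x = denoise_step(x, t, theta)
    loss = -reward_fn(x)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("theta after reward fine-tuning:", theta.item())
```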

IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images

  • paper_url: http://arxiv.org/abs/2309.17399
  • repo_url: None
  • paper_authors: Jiancheng Huang, Donghao Zhou, Shifeng Chen
  • for: This work aims to improve the security of face recognition systems through single-shot face anti-spoofing (FAS), which requires only static images as input.
  • methods: The study proposes a new method, the Interpretable FAS Transformer (IFAST), which needs only weak supervision to produce interpretable predictions, including pixel-wise disparity maps.
  • results: Extensive experiments show that IFAST achieves state-of-the-art results on a large binocular NIR image dataset (BNI-FAS) containing more than 300,000 real face and plane attack images, demonstrating that single-shot FAS based on binocular NIR images is feasible.
    Abstract Single-shot face anti-spoofing (FAS) is a key technique for securing face recognition systems, and it requires only static images as input. However, single-shot FAS remains a challenging and under-explored problem due to two main reasons: 1) on the data side, learning FAS from RGB images is largely context-dependent, and single-shot images without additional annotations contain limited semantic information. 2) on the model side, existing single-shot FAS models are infeasible to provide proper evidence for their decisions, and FAS methods based on depth estimation require expensive per-pixel annotations. To address these issues, a large binocular NIR image dataset (BNI-FAS) is constructed and published, which contains more than 300,000 real face and plane attack images, and an Interpretable FAS Transformer (IFAST) is proposed that requires only weak supervision to produce interpretable predictions. Our IFAST can produce pixel-wise disparity maps by the proposed disparity estimation Transformer with Dynamic Matching Attention (DMA) block. Besides, a well-designed confidence map generator is adopted to cooperate with the proposed dual-teacher distillation module to obtain the final discriminant results. The comprehensive experiments show that our IFAST can achieve state-of-the-art results on BNI-FAS, proving the effectiveness of the single-shot FAS based on binocular NIR images.

Forward Flow for Novel View Synthesis of Dynamic Scenes

  • paper_url: http://arxiv.org/abs/2309.17390
  • repo_url: None
  • paper_authors: Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang
  • for: This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes. Existing methods often adopt a static NeRF to represent the canonical space and render dynamic images at other time steps by mapping sampled 3D points back to the canonical space with a learned backward flow field; however, this backward flow field is non-smooth and discontinuous, which is difficult to fit with commonly used smooth motion models.
  • methods: The authors instead estimate a forward flow field and directly warp the canonical radiance field to other time steps. This forward flow field is smooth and continuous within the object region, which benefits motion model learning. The canonical radiance field is represented with voxel grids to enable efficient forward warping, and a differentiable warping process, including an average splatting operation and an inpaint network, resolves the many-to-one and one-to-many mapping issues.
  • results: The method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of forward flow motion modeling.
    Abstract This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which is difficult to be fitted by commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such forward flow field is smooth and continuous within the object region, which benefits the motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. Project page: https://npucvr.github.io/ForwardFlowDNeRF

Prompt-based test-time real image dehazing: a novel pipeline

  • paper_url: http://arxiv.org/abs/2309.17389
  • repo_url: https://github.com/cecret3350/PTTD-Dehazing
  • paper_authors: Zixuan Chen, Zewei He, Ziqian Lu, Zhe-Ming Lu
  • for: Improving the generalization of dehazing models to real-world hazy images.
  • methods: A prompt generation module (PGM) produces a visual prompt that supplies statistical perturbations for the mean and standard deviation of encoding features, and a feature adaptation module (FAM) adjusts these statistics in existing pre-trained dehazing models at test time, without retraining.
  • results: On real-world hazy images, PTTD is flexible, model-agnostic, and achieves superior performance against state-of-the-art dehazing methods.
    Abstract Existing methods attempt to improve models' generalization ability on real-world hazy images by exploring well-designed training schemes (e.g., cycleGAN, prior loss). However, most of them need very complicated training procedures to achieve satisfactory results. In this work, we present a totally novel testing pipeline called Prompt-based Test-Time Dehazing (PTTD) to help generate visually pleasing results of real-captured hazy images during the inference phase. We experimentally find that given a dehazing model trained on synthetic data, by fine-tuning the statistics (i.e., mean and standard deviation) of encoding features, PTTD is able to narrow the domain gap, boosting the performance of real image dehazing. Accordingly, we first apply a prompt generation module (PGM) to generate a visual prompt, which is the source of appropriate statistical perturbations for mean and standard deviation. And then, we employ the feature adaptation module (FAM) into the existing dehazing models for adjusting the original statistics with the guidance of the generated prompt. Note that, PTTD is model-agnostic and can be equipped with various state-of-the-art dehazing models trained on synthetic hazy-clean pairs. Extensive experimental results demonstrate that our PTTD is flexible meanwhile achieves superior performance against state-of-the-art dehazing methods in real-world scenarios. The source code of our PTTD will be made available at https://github.com/cecret3350/PTTD-Dehazing.
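
The feature adaptation module described in the abstract perturbs the mean and standard deviation of encoder features toward statistics derived from the generated visual prompt, in the spirit of AdaIN-style re-normalization. The sketch below shows that statistic-shifting operation on a feature tensor; the blending weight and the source of the prompt statistics are assumptions for illustration, not the paper's PGM/FAM design.

```python
import torch

def adapt_feature_statistics(feat, prompt_feat, alpha=0.5, eps=1e-5):
    """Shift the channel-wise mean/std of `feat` toward those of `prompt_feat`.

    feat, prompt_feat: (B, C, H, W) encoder features; alpha blends original and prompt
    statistics (alpha=0 keeps the original, alpha=1 fully adopts the prompt).
    """
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    mu_p = prompt_feat.mean(dim=(2, 3), keepdim=True)
    sigma_p = prompt_feat.std(dim=(2, 3), keepdim=True) + eps
    mu_new = (1 - alpha) * mu + alpha * mu_p
    sigma_new = (1 - alpha) * sigma + alpha * sigma_p
    return (feat - mu) / sigma * sigma_new + mu_new   # re-normalize with the blended statistics

# toy usage: features of a real hazy image and of a generated visual prompt
feat = torch.randn(1, 64, 32, 32)
prompt_feat = torch.randn(1, 64, 32, 32) * 2.0 + 1.0
adapted = adapt_feature_statistics(feat, prompt_feat)
print(adapted.mean().item(), adapted.std().item())
```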

Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings

  • paper_url: http://arxiv.org/abs/2309.17361
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: Reducing the memory footprint of deep neural networks (DNNs) so that models can be loaded on commodity devices such as mobile phones.
  • methods: The paper proposes jointly learnable codebooks and weight mappings (JLCM): similarly distributed neurons are grouped to assign scale factors or multiple codebooks without mapping overhead, and the codebooks and mappings are then learned jointly with a novel gradient update that enables a proximal search.
  • results: Enables a very efficient approximation of any DNN; a Llama 7B model can be compressed down to 2 GB and loaded on a 5-year-old smartphone.
    Abstract The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this led to an increase in the memory footprint, to a point where it can be challenging to simply load a model on commodity devices such as mobile phones. To address this limitation, quantization is a favored solution as it maps high precision tensors to a low precision, memory efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor, or use a memory-expensive mapping to multiple codebooks. Second, gradient descent optimization of the mapping favors jumps toward extreme values, hence not defining a proximal search. In this work, we propose to address these two limitations. First, we initially group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups, or map weights that fall in these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing estimation from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: as such, a Llama 7B can be compressed down to 2Go and loaded on 5-year-old smartphones.
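
Codebook-based compression of the kind the abstract builds on replaces each weight with the index of its nearest entry in a small codebook. The sketch below fits a plain k-means codebook to one weight matrix and estimates the memory saving; it illustrates the baseline that JLCM starts from, not the paper's neuron grouping, joint learning, or proximal gradient update.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512)).astype(np.float32)   # stand-in for one layer's weights

# fit a 16-entry codebook, i.e. 4-bit indices per weight
k = 16
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(weights.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()            # k float values
indices = km.labels_                              # one small index per weight

dequantized = codebook[indices].reshape(weights.shape)
err = np.abs(weights - dequantized).mean()

orig_bytes = weights.size * 4                     # fp32 storage
quant_bytes = weights.size * np.log2(k) / 8 + codebook.size * 4
print(f"mean abs error {err:.4f}, about {orig_bytes / quant_bytes:.1f}x smaller")
```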

Towards Free Data Selection with General-Purpose Models

  • paper_url: http://arxiv.org/abs/2309.17342
  • repo_url: https://github.com/yichen928/freesel
  • paper_authors: Yichen Xie, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan
  • for: Efficiently choosing the most informative samples to maximize the utility of limited annotation budgets.
  • methods: FreeSel selects data from various datasets with a single-pass inference of an existing general-purpose model, without additional training or supervision. Semantic patterns defined from the model's intermediate features capture subtle local information in each image, and all samples are selected in a single pass through distance-based sampling at the fine-grained semantic pattern level, bypassing the heavy batch selection process.
  • results: Achieves significant improvements across various computer vision tasks and is 530x faster than existing active learning methods.
    Abstract A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets. However, current approaches, represented by active learning methods, typically follow a cumbersome pipeline that iterates the time-consuming model training and batch data selection repeatedly. In this paper, we challenge this status quo by designing a distinct data selection pipeline that utilizes existing general-purpose models to select data from various datasets with a single-pass inference without the need for additional training or supervision. A novel free data selection (FreeSel) method is proposed following this new pipeline. Specifically, we define semantic patterns extracted from inter-mediate features of the general-purpose model to capture subtle local information in each image. We then enable the selection of all data samples in a single pass through distance-based sampling at the fine-grained semantic pattern level. FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency and being 530x faster than existing active learning methods. Extensive experiments verify the effectiveness of FreeSel on various computer vision tasks. Our code is available at https://github.com/yichen928/FreeSel.
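
Distance-based sampling over feature representations is commonly implemented as farthest-point (k-center greedy) selection: each new pick is the sample farthest from everything already chosen. The sketch below shows that routine on hypothetical per-image feature vectors; using image-level vectors instead of FreeSel's fine-grained semantic patterns is a simplification for illustration, not the paper's exact procedure.

```python
import numpy as np

def farthest_point_selection(features, budget, seed=0):
    """Greedy k-center selection: repeatedly pick the point farthest from the chosen set."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(features)))]
    # distance of every point to its nearest selected point so far
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(min_dist.argmax())              # the most "uncovered" sample
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# toy usage: select 100 diverse samples from 10,000 single-pass feature vectors
rng = np.random.default_rng(0)
feats = rng.standard_normal((10_000, 128)).astype(np.float32)
picks = farthest_point_selection(feats, budget=100)
print(len(picks), picks[:5])
```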

See Beyond Seeing: Robust 3D Object Detection from Point Clouds via Cross-Modal Hallucination

  • paper_url: http://arxiv.org/abs/2309.17336
  • repo_url: None
  • paper_authors: Jianning Deng, Gabriel Chan, Hantao Zhong, Chris Xiaoxuan Lu
  • for: This new framework targets robust 3D object detection from point clouds via cross-modal hallucination.
  • methods: The authors introduce multiple alignments at both the spatial and feature levels, enabling simultaneous backbone refinement and hallucination generation.
  • results: The method handles difficult detection cases better, even when only single-modal data is used as input during inference, and delivers strong performance and efficiency in both training and testing.
    Abstract This paper presents a novel framework for robust 3D object detection from point clouds via cross-modal hallucination. Our proposed approach is agnostic to either hallucination direction between LiDAR and 4D radar. We introduce multiple alignments on both spatial and feature levels to achieve simultaneous backbone refinement and hallucination generation. Specifically, spatial alignment is proposed to deal with the geometry discrepancy for better instance matching between LiDAR and radar. The feature alignment step further bridges the intrinsic attribute gap between the sensing modalities and stabilizes the training. The trained object detection models can deal with difficult detection cases better, even though only single-modal data is used as the input during the inference stage. Extensive experiments on the View-of-Delft (VoD) dataset show that our proposed method outperforms the state-of-the-art (SOTA) methods for both radar and LiDAR object detection while maintaining competitive efficiency in runtime.

Multi-Depth Branches Network for Efficient Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2309.17334
  • repo_url: https://github.com/thy960112/mdbn
  • paper_authors: Huiyuan Tian, Li Zhang, Shijian Li, Min Yao, Gang Pan
  • for: The paper targets super-resolution (SR), addressing the tendency of CNN-based SR models to focus on restoring high-frequency details while overlooking crucial low-frequency contour information.
  • methods: The proposed Multi-Depth Branches Network (MDBN) extends the ResNet architecture with an additional branch that captures vital structural characteristics of images. The multi-depth branches module (MDBM) stacks convolutional kernels of identical size at varying depths within distinct branches, so that the branches extract contour and detail information respectively.
  • results: Compared to GoogLeNet-like models, the basic multi-depth branches structure has fewer parameters and higher computational efficiency, and the model outperforms state-of-the-art lightweight SR methods with less inference time. Code is available at https://github.com/thy960112/MDBN.
    Abstract Significant progress has been made in the field of super-resolution (SR), yet many convolutional neural networks (CNNs) based SR models primarily focus on restoring high-frequency details, often overlooking crucial low-frequency contour information. Transformer-based SR methods, while incorporating global structural details, frequently come with an abundance of parameters, leading to high computational overhead. In this paper, we address these challenges by introducing a Multi-Depth Branches Network (MDBN). This framework extends the ResNet architecture by integrating an additional branch that captures vital structural characteristics of images. Our proposed multi-depth branches module (MDBM) involves the stacking of convolutional kernels of identical size at varying depths within distinct branches. By conducting a comprehensive analysis of the feature maps, we observe that branches with differing depths can extract contour and detail information respectively. By integrating these branches, the overall architecture can preserve essential low-frequency semantic structural information during the restoration of high-frequency visual elements, which is more closely with human visual cognition. Compared to GoogLeNet-like models, our basic multi-depth branches structure has fewer parameters, higher computational efficiency, and improved performance. Our model outperforms state-of-the-art (SOTA) lightweight SR methods with less inference time. Our code is available at https://github.com/thy960112/MDBN
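
The multi-depth branches idea is to run same-sized convolutions at different depths in parallel branches and fuse the results, so the shallow branch keeps low-frequency contours while the deeper one recovers detail. The sketch below is a minimal PyTorch module in that spirit; the channel counts, branch depths, and additive residual fusion are assumptions, not the published MDBN architecture.

```python
import torch
import torch.nn as nn

class MultiDepthBranches(nn.Module):
    """Two parallel branches of 3x3 convolutions with different depths, fused by addition."""
    def __init__(self, channels=64, shallow_depth=1, deep_depth=3):
        super().__init__()
        def branch(depth):
            layers = []
            for _ in range(depth):
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers[:-1])      # drop the trailing ReLU before fusion
        self.shallow = branch(shallow_depth)        # tends to preserve contour / low-frequency info
        self.deep = branch(deep_depth)              # tends to recover high-frequency detail

    def forward(self, x):
        return x + self.shallow(x) + self.deep(x)   # residual fusion of both branches

# toy usage on a feature map from an SR backbone
block = MultiDepthBranches()
feat = torch.randn(1, 64, 48, 48)
print(block(feat).shape)   # torch.Size([1, 64, 48, 48])
```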

Telling Stories for Common Sense Zero-Shot Action Recognition

  • paper_url: http://arxiv.org/abs/2309.17327
  • repo_url: https://github.com/kini5gowda/stories
  • paper_authors: Shreyank N Gowda, Laura Sevilla-Lara
  • for: 这篇论文旨在提高零例学习视频分类的效果,推动零例视频分析领域的进步。
  • methods: 该论文构建了一个新的数据集 Stories,其中包含从 WikiHow 文章中提取的多种动作类别的丰富文本描述。每个动作类别都有多句叙述,描述了必需的步骤、场景、物品和动词,这些文本数据帮助建模更加细致的动作间关系,为零样本迁移提供了基础。此外,该论文还提出了一种利用 Stories 数据集改进零样本分类特征生成的方法,无需在目标数据集上进行 fine-tuning,即可达到新的领先水平。
  • results: 该论文通过使用 Stories 数据集和所提出的方法,在多个 benchmark 上达到了新的领先水平,将 top-1 准确率最多提升 6.1%。
    Abstract Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classification. Without any target dataset fine-tuning, our method achieves new state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain. The data can be found here: https://github.com/kini5gowda/Stories .
    摘要 视频理解长期受到对大规模标注数据的依赖所限制,这促使了零样本学习的研究。语言模型的最新进展为零样本视频分析带来了新机会,但构建一个能刻画动作类别间关系的有效语义空间仍然是挑战。我们通过引入一个新的数据集 Stories 来解决这个问题:该数据集包含从 WikiHow 文章中提取的多种动作类别的丰富文本描述。对于每个类别,我们提取了多句叙述,描述该动作的必要步骤、场景、物品和动词。这些上下文数据使我们能够建模动作之间的细致关系,为零样本迁移开辟了道路。我们还提出了一种利用 Stories 改进零样本分类训练中特征生成的方法。无需在目标数据集上微调,我们的方法在多个 benchmark 上达到了新的 state-of-the-art,将 top-1 准确率最多提升 6.1%。我们认为 Stories 提供了一个有价值的资源,可以促进零样本动作识别领域的进步:这些文本叙述在已见类与未见类之间建立了联系,克服了长期阻碍该领域发展的标注数据瓶颈。数据可以在以下地址找到:https://github.com/kini5gowda/Stories 。
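
As a hedged illustration of how multi-sentence narratives can drive zero-shot recognition, the sketch below builds one prototype embedding per unseen class from its story sentences and matches a query feature by cosine similarity. The `embed()` function and the story sentences are toy stand-ins, not the paper's encoder or data.

```python
# Sketch of zero-shot matching with multi-sentence class narratives (Stories-style).
# embed() is a toy stand-in for a real text/video encoder; it is NOT the paper's model.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy deterministic-ish text embedding: hashed bag-of-words. Replace with a real encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Each unseen class is described by several narrative sentences (steps, scenes, objects, verbs).
stories = {
    "making coffee": ["boil water in a kettle", "grind the beans", "pour water over the filter"],
    "changing a tire": ["loosen the lug nuts", "lift the car with a jack", "mount the spare wheel"],
}
class_protos = {c: np.mean([embed(s) for s in sents], axis=0) for c, sents in stories.items()}

query = embed("person pouring hot water into a mug with a filter")  # stand-in for a video feature
scores = {c: float(query @ p / (np.linalg.norm(p) + 1e-8)) for c, p in class_protos.items()}
print(max(scores, key=scores.get), scores)
```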

Development of a Deep Learning Method to Identify Acute Ischemic Stroke Lesions on Brain CT

  • paper_url: http://arxiv.org/abs/2309.17320
  • repo_url: None
  • paper_authors: Alessandro Fontanella, Wenwen Li, Grant Mair, Antreas Antoniou, Eleanor Platt, Paul Armitage, Emanuele Trucco, Joanna Wardlaw, Amos Storkey
  • for: 这个研究的目的是开发一种基于深度学习的方法,在脑部 CT(Computed Tomography)影像上识别急性缺血性脑卒中(AIS)病灶,以辅助 AIS 病例的诊断。
  • methods: 这个研究设计了一种基于卷积神经网络(CNN)的深度学习算法,使用了来自第三次国际卒中试验(IST-3)的常规采集的脑部 CT 扫描数据,这些数据并非按照严格的研究协议采集。研究者还探讨了 AIS 病灶特征、背景脑部表现以及扫描时间对深度学习性能的影响。
  • results: 研究发现,该深度学习算法可以检测 AIS 病灶,并判断受影响的脑半球侧别。研究中的最佳方法达到了 72% 的准确率;较大的病灶(80% 的准确率)、多发病灶(两个病灶 87%,三个及以上 100%)以及随访扫描(76% 的准确率)的检测效果更好。然而,慢性脑部病变会降低准确率,尤其是非卒中病灶和陈旧性卒中病灶(错误率分别为 32% 和 31%)。
    Abstract Computed Tomography (CT) is commonly used to image acute ischemic stroke (AIS) patients, but its interpretation by radiologists is time-consuming and subject to inter-observer variability. Deep learning (DL) techniques can provide automated CT brain scan assessment, but usually require annotated images. Aiming to develop a DL method for AIS using labelled but not annotated CT brain scans from patients with AIS, we designed a convolutional neural network-based DL algorithm using routinely-collected CT brain scans from the Third International Stroke Trial (IST-3), which were not acquired using strict research protocols. The DL model aimed to detect AIS lesions and classify the side of the brain affected. We explored the impact of AIS lesion features, background brain appearances, and timing on DL performance. From 5772 unique CT scans of 2347 AIS patients (median age 82), 54% had visible AIS lesions according to expert labelling. Our best-performing DL method achieved 72% accuracy for lesion presence and side. Lesions that were larger (80% accuracy) or multiple (87% accuracy for two lesions, 100% for three or more), were better detected. Follow-up scans had 76% accuracy, while baseline scans 67% accuracy. Chronic brain conditions reduced accuracy, particularly non-stroke lesions and old stroke lesions (32% and 31% error rates respectively). DL methods can be designed for AIS lesion detection on CT using the vast quantities of routinely-collected CT brain scan data. Ultimately, this should lead to more robust and widely-applicable methods.
    摘要 计算机断层扫描(CT)常用于急性缺血性脑卒中(AIS)患者的影像检查,但其判读对放射科医生而言耗时,且存在阅片者间差异。深度学习(DL)技术可以提供自动化的脑部 CT 评估,但通常需要带标注(annotated)的图像。为了利用仅有标签而非逐像素标注的 AIS 患者脑部 CT 开发 DL 方法,我们设计了一种基于卷积神经网络的 DL 算法,使用第三次国际卒中试验(IST-3)中常规采集、未按严格研究协议获取的脑部 CT 扫描。该 DL 模型旨在检测 AIS 病灶并判断受影响的脑半球侧别。我们研究了 AIS 病灶特征、背景脑部表现和扫描时间对 DL 性能的影响。在 2347 名 AIS 患者(中位年龄 82 岁)的 5772 次 CT 扫描中,按照专家标签,54% 存在可见的 AIS 病灶。我们表现最好的 DL 方法在判断病灶存在与侧别上达到 72% 的准确率。较大的病灶(80%)或多发病灶(两个 87%、三个及以上 100%)更容易被检出;随访扫描准确率为 76%,基线扫描为 67%。慢性脑部病变会降低准确率,尤其是非卒中病灶和陈旧性卒中病灶(错误率分别为 32% 和 31%)。利用大量常规采集的脑部 CT 数据,可以设计用于 AIS 病灶检测的 DL 方法,最终有望得到更稳健、适用范围更广的方法。

Efficient Large Scale Medical Image Dataset Preparation for Machine Learning Applications

  • paper_url: http://arxiv.org/abs/2309.17285
  • repo_url: None
  • paper_authors: Stefan Denner, Jonas Scherer, Klaus Kades, Dimitrios Bounias, Philipp Schader, Lisa Kausch, Markus Bujotzek, Andreas Michael Bucher, Tobias Penzkofer, Klaus Maier-Hein
  • for: 该论文旨在提高医疗影像识别精度,通过机器学习算法提高诊断精度。
  • methods: 该论文提出了一种新的数据筛选工具,用于处理大规模医疗影像数据。该工具包括高级搜索、自动注释和高效标注功能,以便改进数据筛选。
  • results: 该论文通过使用该数据筛选工具,可以提高医疗影像数据的质量和可靠性,并且可以探索数据中的潜在偏见。此外,该工具还可以帮助研究人员 validate 图像和分割质量,以及检测数据中的偏见。
    Abstract In the rapidly evolving field of medical imaging, machine learning algorithms have become indispensable for enhancing diagnostic accuracy. However, the effectiveness of these algorithms is contingent upon the availability and organization of high-quality medical imaging datasets. Traditional Digital Imaging and Communications in Medicine (DICOM) data management systems are inadequate for handling the scale and complexity of data required to be facilitated in machine learning algorithms. This paper introduces an innovative data curation tool, developed as part of the Kaapana open-source toolkit, aimed at streamlining the organization, management, and processing of large-scale medical imaging datasets. The tool is specifically tailored to meet the needs of radiologists and machine learning researchers. It incorporates advanced search, auto-annotation and efficient tagging functionalities for improved data curation. Additionally, the tool facilitates quality control and review, enabling researchers to validate image and segmentation quality in large datasets. It also plays a critical role in uncovering potential biases in datasets by aggregating and visualizing metadata, which is essential for developing robust machine learning models. Furthermore, Kaapana is integrated within the Radiological Cooperative Network (RACOON), a pioneering initiative aimed at creating a comprehensive national infrastructure for the aggregation, transmission, and consolidation of radiological data across all university clinics throughout Germany. A supplementary video showcasing the tool's functionalities can be accessed at https://bit.ly/MICCAI-DEMI2023.
    摘要 在快速发展的医学影像领域,机器学习算法已成为提高诊断精度不可或缺的工具。然而,这些算法的有效性取决于高质量医学影像数据的可用性和组织化。传统的 DICOM(Digital Imaging and Communications in Medicine)数据管理系统无法应对机器学习算法所需数据的规模和复杂度。这篇论文介绍了一种创新的数据整理(data curation)工具,作为 Kaapana 开源工具包的一部分,用于简化大规模医学影像数据的组织、管理和处理。该工具专门面向放射科医生和机器学习研究人员的需求,包含高级搜索、自动标注和高效打标签等功能。此外,该工具支持质量控制与审核,帮助研究人员验证大规模数据集中图像和分割的质量,并通过聚合与可视化元数据来揭示数据集中可能存在的偏差,这对构建稳健的机器学习模型至关重要。Kaapana 还被集成到 Radiological Cooperative Network(RACOON)中,该项目旨在建立覆盖德国所有大学医院的放射数据汇聚、传输与整合的全国性基础设施。展示工具功能的补充视频可在 https://bit.ly/MICCAI-DEMI2023 获取。

Information Flow in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.17281
  • repo_url: https://github.com/yifanzhang-pro/m-mae
  • paper_authors: Zhiquan Tan, Jingqin Yang, Weiran Huang, Yang Yuan, Yifan Zhang
  • for: 本研究提供了一个基于矩阵信息理论、用于理解和改进自监督学习(SSL)方法的完整工具箱。
  • methods: 本研究利用矩阵互信息和矩阵联合熵的原理,对基于对比学习和特征去相关的方法给出统一分析。此外,我们提出了基于矩阵信息理论的矩阵变分掩码自编码器(M-MAE)方法,用于增强掩码图像建模。
  • results: 实验表明,与当前最优方法相比,M-MAE 更为有效,包括在 ImageNet 上对 ViT-Base 进行 linear probing 时提高 3.9%,对 ViT-Large 进行 fine-tuning 时提高 1%。
    Abstract In this paper, we provide a comprehensive toolbox for understanding and enhancing self-supervised learning (SSL) methods through the lens of matrix information theory. Specifically, by leveraging the principles of matrix mutual information and joint entropy, we offer a unified analysis for both contrastive and feature decorrelation based methods. Furthermore, we propose the matrix variational masked auto-encoder (M-MAE) method, grounded in matrix information theory, as an enhancement to masked image modeling. The empirical evaluations underscore the effectiveness of M-MAE compared with the state-of-the-art methods, including a 3.9% improvement in linear probing ViT-Base, and a 1% improvement in fine-tuning ViT-Large, both on ImageNet.
    摘要 在这篇论文中,我们通过矩阵信息理论的视角,提供了一个用于理解和增强自监督学习(SSL)方法的完整工具箱。具体来说,我们利用矩阵互信息和矩阵联合熵的原理,对基于对比学习和特征去相关的方法给出了统一分析。此外,我们还提出了基于矩阵信息理论的矩阵变分掩码自编码器(M-MAE)方法,作为对掩码图像建模的增强。实验证明,M-MAE 比现有最优方法更有效,包括在 ImageNet 上对 ViT-Base 进行 linear probing 时提高 3.9%,以及对 ViT-Large 进行 fine-tuning 时提高 1%。
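
The matrix-information viewpoint above can be illustrated with a small, hedged example: a von Neumann-style entropy of the normalized Gram matrix of a feature batch, which is low for collapsed representations and high for diverse ones. The exact estimators and losses used by M-MAE are defined in the paper; this sketch only shows the kind of quantity involved.

```python
# Hedged sketch: matrix-based entropy of a feature batch, one ingredient of matrix information
# theory. The estimators actually used by M-MAE are defined in the paper; this is illustrative.
import numpy as np

def matrix_entropy(Z: np.ndarray, eps: float = 1e-12) -> float:
    """Von Neumann-style entropy of the normalized Gram matrix of row features Z (n x d)."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + eps)   # L2-normalize each feature
    K = Zn @ Zn.T / Zn.shape[0]                                 # Gram matrix with trace ~= 1
    rho = K / np.trace(K)
    evals = np.clip(np.linalg.eigvalsh(rho), eps, None)
    return float(-(evals * np.log(evals)).sum())

rng = np.random.default_rng(0)
collapsed = np.tile(rng.normal(size=(1, 128)), (64, 1))   # all features identical -> low entropy
diverse = rng.normal(size=(64, 128))                      # spread-out features -> high entropy
print(matrix_entropy(collapsed), matrix_entropy(diverse))
```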

Unpaired Optical Coherence Tomography Angiography Image Super-Resolution via Frequency-Aware Inverse-Consistency GAN

  • paper_url: http://arxiv.org/abs/2309.17269
  • repo_url: None
  • paper_authors: Weiwen Zhang, Dawei Yang, Haoxuan Che, An Ran Ran, Carol Y. Cheung, Hao Chen
  • for: This paper aims to improve the resolution of optical coherence tomography angiography (OCTA) images without paired training data.
  • methods: The proposed method uses a generator and discriminator based on Generative Adversarial Networks (GANs), with a dual-path generator to emphasize high-frequency fine capillaries and a frequency-aware adversarial loss for the discriminator.
  • results: The proposed method outperforms other state-of-the-art unpaired methods both quantitatively and visually, with improved preservation of fine capillary details.
    Abstract For optical coherence tomography angiography (OCTA) images, a limited scanning rate leads to a trade-off between field-of-view (FOV) and imaging resolution. Although larger FOV images may reveal more parafoveal vascular lesions, their application is greatly hampered due to lower resolution. To increase the resolution, previous works only achieved satisfactory performance by using paired data for training, but real-world applications are limited by the challenge of collecting large-scale paired images. Thus, an unpaired approach is highly demanded. Generative Adversarial Network (GAN) has been commonly used in the unpaired setting, but it may struggle to accurately preserve fine-grained capillary details, which are critical biomarkers for OCTA. In this paper, our approach aspires to preserve these details by leveraging the frequency information, which represents details as high-frequencies ($\textbf{hf}$) and coarse-grained backgrounds as low-frequencies ($\textbf{lf}$). In general, we propose a GAN-based unpaired super-resolution method for OCTA images and exceptionally emphasize $\textbf{hf}$ fine capillaries through a dual-path generator. To facilitate a precise spectrum of the reconstructed image, we also propose a frequency-aware adversarial loss for the discriminator and introduce a frequency-aware focal consistency loss for end-to-end optimization. Experiments show that our method outperforms other state-of-the-art unpaired methods both quantitatively and visually.
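
Because the approach treats fine capillaries as high-frequency content and the background as low-frequency content, the hedged sketch below shows one simple way to obtain such an hf/lf split with an FFT low-pass mask. The paper's dual-path generator and frequency-aware losses are not reproduced here, and the cutoff radius is an arbitrary assumption.

```python
# Sketch of the hf/lf split that frequency-aware losses can be built on (illustrative only).
import numpy as np

def split_frequencies(img: np.ndarray, radius_frac: float = 0.1):
    """Return (low-frequency, high-frequency) components of a 2D image via an FFT disk mask."""
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    lowpass = (dist <= radius_frac * min(h, w)).astype(float)
    lf = np.fft.ifft2(np.fft.ifftshift(F * lowpass)).real   # coarse background
    hf = img - lf                                           # fine capillary-like detail
    return lf, hf

img = np.random.rand(128, 128)
lf, hf = split_frequencies(img)
print(np.allclose(lf + hf, img))  # True: the two bands sum back to the input
```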

Effect of structure-based training on 3D localization precision and quality

  • paper_url: http://arxiv.org/abs/2309.17265
  • repo_url: None
  • paper_authors: Armin Abdehkakha, Craig Snoeyink
  • for: 这个研究旨在提出基于结构的训练方法,用于在单分子位置显微镜(SMLM)和3D物体重建中使用深度学习算法。
  • methods: 该方法比传统的随机训练方法更有优势,使用LUENN包作为我们的人工智能管道。
  • results: 与随机训练方法相比,基于结构的训练方法在不同信噪比(SNR)下均显著提高了检测率和定位精度。此外,该方法还能有效消除棋盘格伪影(checkerboard artifacts),从而确保更准确的 3D 重建。
    Abstract This study introduces a structural-based training approach for CNN-based algorithms in single-molecule localization microscopy (SMLM) and 3D object reconstruction. We compare this approach with the traditional random-based training method, utilizing the LUENN package as our AI pipeline. The quantitative evaluation demonstrates significant improvements in detection rate and localization precision with the structural-based training approach, particularly in varying signal-to-noise ratios (SNRs). Moreover, the method effectively removes checkerboard artifacts, ensuring more accurate 3D reconstructions. Our findings highlight the potential of the structural-based training approach to advance super-resolution microscopy and deepen our understanding of complex biological systems at the nanoscale.

Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors

  • paper_url: http://arxiv.org/abs/2309.17261
  • repo_url: None
  • paper_authors: Yukang Lin, Haonan Han, Chaoqun Gong, Zunnan Xu, Yachao Zhang, Xiu Li
  • for: 高精度的3D对象重建从单个图像 guidance by pre-trained diffusion models
  • methods: 提出了一种case-aware two-stage方法,利用2D和3D扩散先验来实现高度一致的3D资产重建
  • results: 对多种物体进行了详细的测试,并表明了our方法可以具有高度一致的3D重建和强大的泛化能力
    Abstract Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods. See https://Consistent123.github.io for a more comprehensive exploration of our generated 3D assets.
    摘要 在预训练扩散模型引导下从单张图像重建 3D 对象的方法,已经展现出有前途的结果。然而,由于采用与具体实例无关(case-agnostic)的固定策略,它们对任意实例的泛化能力以及重建的 3D 一致性仍然不足。在这项工作中,我们提出了 Consistent123,一种实例感知(case-aware)的两阶段方法,利用 2D 和 3D 扩散先验从单张图像重建高度一致的 3D 资产。在第一阶段,Consistent123 仅利用 3D 结构先验进行充分的几何挖掘,并在此过程中嵌入基于 CLIP 的实例感知自适应检测机制。在第二阶段,引入 2D 纹理先验并使其逐渐占据主导的引导作用,细腻地雕琢 3D 模型的细节。Consistent123 更贴合引导需求的演化趋势,能自适应地为不同对象提供充分的 3D 几何初始化和合适的 2D 纹理细化。Consistent123 可以获得高度一致的 3D 重建,并在各种对象上展现出强大的泛化能力。定性和定量实验表明,我们的方法显著优于最先进的图像到 3D 方法。请访问 https://Consistent123.github.io 浏览更多生成的 3D 资产。

A Survey on Deep Learning Techniques for Action Anticipation

  • paper_url: http://arxiv.org/abs/2309.17257
  • repo_url: None
  • paper_authors: Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
  • for: 本研究的目的是实时预测人员的动作,以应对日常生活中的各种情况。
  • methods: 本文专注于使用深度学习方法进行动作预测,并按照它们的主要贡献分为多个字段,供读者快速了解。
  • results: 本研究总结了各种动作预测算法的最新进展,并评估了不同评估指标和数据集的影响。未来的发展方向亦有系统的讨论。
    Abstract The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances of action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with systematical discussions.
    摘要 预测人类未来可能的行为,对自动驾驶和人机交互等众多应用至关重要。因此,近年来涌现了大量行为预判方法,其中基于深度学习的方法尤为流行。在这项工作中,我们综述了行为预判算法的最新进展,并特别关注日常生活场景。我们按主要贡献对这些方法进行分类,并以表格形式加以总结,便于读者一目了然。此外,我们还介绍了行为预判常用的评价指标和数据集,并系统地讨论了未来的发展方向。

EGVD: Event-Guided Video Deraining

  • paper_url: http://arxiv.org/abs/2309.17239
  • repo_url: https://github.com/booker-max/egvd
  • paper_authors: Yueyi Zhang, Jin Wang, Wenming Weng, Xiaoyan Sun, Zhiwei Xiong
  • for: 这篇论文目的是为了解决 Complex Spatio-Temporal Distribution 的雨层推广 Video Deraining 问题。
  • methods: 本文使用 Event Camera 和 End-to-End Learning-based Network 来解决这个问题。 Specifically, the authors propose an event-aware motion detection module and a pyramidal adaptive selection module to reliably separate the background and rain layers.
  • results: compared with existing state-of-the-art methods, the proposed method demonstrates clear superiority on synthetic and self-collected real-world datasets. The code and dataset are available at \url{https://github.com/booker-max/EGVD}.
    Abstract With the rapid development of deep learning, video deraining has experienced significant progress. However, existing video deraining pipelines cannot achieve satisfying performance for scenes with rain layers of complex spatio-temporal distribution. In this paper, we approach video deraining by employing an event camera. As a neuromorphic sensor, the event camera suits scenes of non-uniform motion and dynamic light conditions. We propose an end-to-end learning-based network to unlock the potential of the event camera for video deraining. First, we devise an event-aware motion detection module to adaptively aggregate multi-frame motion contexts using event-aware masks. Second, we design a pyramidal adaptive selection module for reliably separating the background and rain layers by incorporating multi-modal contextualized priors. In addition, we build a real-world dataset consisting of rainy videos and temporally synchronized event streams. We compare our method with extensive state-of-the-art methods on synthetic and self-collected real-world datasets, demonstrating the clear superiority of our method. The code and dataset are available at \url{https://github.com/booker-max/EGVD}.
    摘要 随着深度学习的快速发展,视频去雨取得了显著进步。然而,现有的视频去雨流程在雨层时空分布复杂的场景中无法达到令人满意的性能。在这篇论文中,我们使用事件相机来处理视频去雨问题。作为一种神经形态(neuromorphic)传感器,事件相机适合非均匀运动和动态光照条件下的场景。我们提出了一种端到端的基于学习的网络,以释放事件相机在视频去雨中的潜力。首先,我们设计了一个事件感知的运动检测模块,利用事件感知掩码自适应地聚合多帧运动上下文。其次,我们设计了一个金字塔式自适应选择模块,结合多模态上下文先验,可靠地分离背景和雨层。此外,我们构建了一个包含雨天视频和时间同步事件流的真实世界数据集。我们与大量最先进方法在合成数据集和自采集的真实数据集上进行了比较,证明了我们方法的明显优越性。代码和数据集可以在 https://github.com/booker-max/EGVD 获取。

Glioma subtype classification from histopathological images using in-domain and out-of-domain transfer learning: An experimental study

  • paper_url: http://arxiv.org/abs/2309.17223
  • repo_url: None
  • paper_authors: Vladimir Despotovic, Sang-Yoon Kim, Ann-Christin Hau, Aliaksandra Kakoichankava, Gilbert Georg Klamminger, Felix Bruno Kleine Borgmann, Katrin B. M. Frauenknecht, Michel Mittelbronn, Petr V. Nazarov
  • for: 这个论文主要目的是对Computer-aided classification of adult-type diffuse gliomas进行了全面的比较和深度学习架构的研究。
  • methods: 这篇论文比较了多种迁移学习策略和深度学习架构,包括评估在 ImageNet 上预训练得到的域外(out-of-domain)表示的泛化能力,以及利用自监督学习和多任务学习在组织病理图像上进行域内预训练的方法。
  • results: 这篇论文的研究结果表明,使用这些方法可以提高医疗影像分类的性能,并且可以减少病理学家的标注工作。此外,这篇论文还提供了一个可视化工具,可以在整个扫描图像水平上生成热图,以便为病理学家提供有用的信息。
    Abstract We provide in this paper a comprehensive comparison of various transfer learning strategies and deep learning architectures for computer-aided classification of adult-type diffuse gliomas. We evaluate the generalizability of out-of-domain ImageNet representations for a target domain of histopathological images, and study the impact of in-domain adaptation using self-supervised and multi-task learning approaches for pretraining the models using the medium-to-large scale datasets of histopathological images. A semi-supervised learning approach is furthermore proposed, where the fine-tuned models are utilized to predict the labels of unannotated regions of the whole slide images (WSI). The models are subsequently retrained using the ground-truth labels and weak labels determined in the previous step, providing superior performance in comparison to standard in-domain transfer learning with balanced accuracy of 96.91% and F1-score 97.07%, and minimizing the pathologist's efforts for annotation. Finally, we provide a visualization tool working at WSI level which generates heatmaps that highlight tumor areas; thus, providing insights to pathologists concerning the most informative parts of the WSI.
    摘要 我们在这篇论文中对各种迁移学习策略和深度学习架构进行了全面比较,用于成人型弥漫性胶质瘤(adult-type diffuse gliomas)的计算机辅助分类。我们评估了 ImageNet 域外表示在组织病理图像目标域上的泛化能力,并研究了利用中大规模组织病理图像数据集、采用自监督和多任务学习进行域内预训练对模型的影响。此外,我们还提出了一种半监督学习方法:先用微调后的模型预测全切片图像(WSI)中未标注区域的标签,再结合真实标签和上一步得到的弱标签重新训练模型,从而取得优于标准域内迁移学习的性能(平衡准确率 96.91%,F1 分数 97.07%),并最大限度减少病理医生的标注工作量。最后,我们还提供了一个工作在 WSI 级别的可视化工具,它可以生成高亮肿瘤区域的热图,为病理医生指出 WSI 中信息量最大的部分。
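
The semi-supervised step described above (predict labels for unannotated WSI regions with a fine-tuned model, then retrain on ground-truth plus weak labels) follows the general pseudo-labelling recipe. Below is a hedged, generic sketch of that recipe; the confidence threshold, toy classifier and tile tensors are illustrative assumptions, not the authors' pipeline.

```python
# Generic pseudo-labelling sketch of the semi-supervised step described above (illustrative).
import torch

@torch.no_grad()
def make_pseudo_labels(model, unlabeled_tiles: torch.Tensor, threshold: float = 0.9):
    """Keep only confident predictions on unannotated WSI tiles as weak labels."""
    probs = torch.softmax(model(unlabeled_tiles), dim=1)
    conf, labels = probs.max(dim=1)
    keep = conf >= threshold
    return unlabeled_tiles[keep], labels[keep]

# Toy usage with a stand-in classifier over flattened 3x224x224 tiles.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 3))
tiles = torch.randn(16, 3, 224, 224)
weak_x, weak_y = make_pseudo_labels(model, tiles, threshold=0.4)
print(weak_x.shape, weak_y.shape)
# The model is then retrained on the union of ground-truth-labelled tiles and (weak_x, weak_y).
```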

When Epipolar Constraint Meets Non-local Operators in Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2309.17218
  • repo_url: https://github.com/tqtqliu/et-mvsnet
  • paper_authors: Tianqi Liu, Xinyi Ye, Weiyue Zhao, Zhiyu Pan, Min Shi, Zhiguo Cao
  • for: This paper proposes a new method for multi-view stereo (MVS) reconstruction, which is designed to improve the efficiency and accuracy of the feature matching process.
  • methods: The proposed method, called ET-MVSNet, uses a novel non-local feature augmentation strategy based on the epipolar geometry. This strategy reduces the 2D search space into the epipolar line in stereo matching, making it more efficient and accurate.
  • results: ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmarks with high efficiency. The proposed method improves the accuracy and speed of MVS reconstruction, making it a valuable contribution to the field.
    Abstract Learning-based multi-view stereo (MVS) method heavily relies on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS. Each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point only attends the corresponding pair of epipolar lines. Our idea takes inspiration from the classic epipolar geometry, which shows that one point with different depth hypotheses will be projected to the epipolar line on the other view. This constraint reduces the 2D search space into the epipolar line in stereo matching. Similarly, this suggests that the matching of MVS is to distinguish a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmark with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet.
    摘要 基于学习的多视图立体(MVS)方法严重依赖特征匹配,这需要具有判别力的特征表示。一个有效的解决方案是应用非局部特征聚合,例如 Transformer。这些技术虽然有用,却给 MVS 带来了沉重的计算负担:每个像素都要密集关注整幅图像。相比之下,我们提议将非局部特征增强约束在一对对极线之内:每个点只关注与之对应的对极线。我们的想法来源于经典的对极几何(epipolar geometry),它表明一个点在不同深度假设下会被投影到另一视图中的对极线上。这种约束将立体匹配的 2D 搜索空间缩小到对极线上。类似地,这也说明 MVS 的匹配本质上是区分位于同一条线上的一系列点。受这种点到线搜索的启发,我们设计了一种线到点的非局部增强策略。我们首先设计了一种优化的搜索算法,将 2D 特征图划分成对极线对;然后使用 Epipolar Transformer(ET)在对极线对之间进行非局部特征增强。我们将 ET 融入到一个基于学习的 MVS 基线模型中,称为 ET-MVSNet。ET-MVSNet 在 DTU 和 Tanks-and-Temples 两个 benchmark 上都取得了最先进的重建性能,同时保持了高效率。代码可以在 https://github.com/TQTQliu/ET-MVSNet 找到。
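
The point-to-line constraint that the Epipolar Transformer exploits can be sketched in a few lines of NumPy: given a fundamental matrix, candidate matches for a pixel in one view lie on a single epipolar line in the other view, so attention only needs samples along that line. The fundamental matrix below is made up for illustration, and the transformer itself is not shown.

```python
# Sketch of the point-to-line constraint behind ET-MVSNet (geometry only; no attention shown).
import numpy as np

def epipolar_line_samples(F: np.ndarray, x: np.ndarray, width: int, n: int = 32) -> np.ndarray:
    """Sample n pixel locations in view 2 along the epipolar line F @ x of pixel x from view 1."""
    a, b, c = F @ np.array([x[0], x[1], 1.0])        # line: a*u + b*v + c = 0 in view 2
    us = np.linspace(0, width - 1, n)
    vs = -(a * us + c) / (b + 1e-12)                 # assumes the line is not vertical
    return np.stack([us, vs], axis=1)

F = np.array([[0.0, -1e-4, 0.02],                    # made-up fundamental matrix for illustration
              [1e-4, 0.0, -0.03],
              [-0.02, 0.03, 1.0]])
pts = epipolar_line_samples(F, np.array([320.0, 240.0]), width=640)
print(pts.shape)   # (32, 2): matching for this pixel only needs these locations, not the image
```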

Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing

  • paper_url: http://arxiv.org/abs/2309.17211
  • repo_url: None
  • paper_authors: Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
  • for: 降低卷积神经网络(CNN)的计算成本,使其能够在资源受限的设备上高效推理。
  • methods: 提出一种无需参数、无需数据的即插即用模块 HASTE,利用局部敏感哈希(LSH)检测通道维度上的冗余并聚合相似通道,从而在不进行任何训练或微调的情况下减少网络的浮点运算量(FLOPs)。
  • results: 在 CIFAR-10 和 ImageNet 等常用视觉基准上即时压缩了网络。具体来说,在 CIFAR-10 上,只需将 ResNet34 中的卷积模块替换为 HASTE 模块,即可立即减少 46.72% 的 FLOPs,而精度仅下降 1.25%。
    Abstract To reduce the computational cost of convolutional neural networks (CNNs) for usage on resource-constrained devices, structured pruning approaches have shown promising results, drastically reducing floating-point operations (FLOPs) without substantial drops in accuracy. However, most recent methods require fine-tuning or specific training procedures to achieve a reasonable trade-off between retained accuracy and reduction in FLOPs. This introduces additional cost in the form of computational overhead and requires training data to be available. To this end, we propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module. It instantly reduces the network's test-time inference cost without requiring any training or fine-tuning. We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH) to detect redundancies in the channel dimension. Similar channels are aggregated to reduce the input and filter depth simultaneously, allowing for cheaper convolutions. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet. In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
    摘要 为了降低卷积神经网络(CNN)的计算成本,使其能在资源受限的设备上运行,结构化剪枝方法已经展现出可观的成果,可以在精度几乎不下降的情况下大幅减少浮点运算量(FLOPs)。然而,大多数最新方法需要微调或特定的训练流程,才能在保留精度和减少 FLOPs 之间取得合理的折中,这带来了额外的计算开销,并且要求训练数据可用。为此,我们提出了 HASTE(Hashing for Tractable Efficiency)模块,它无需参数、无需数据,可作为任何常规卷积模块的即插即用替换部件,无需任何训练或微调即可立即降低网络的测试时推理成本。我们利用局部敏感哈希(LSH)检测通道维度中的冗余,从而在几乎不损失精度的情况下大幅压缩潜在特征图:相似的通道被聚合,同时降低输入深度和滤波器深度,使卷积运算更加廉价。我们在 CIFAR-10 和 ImageNet 等流行的视觉 benchmark 上验证了我们的方法。特别是,在 CIFAR-10 上只需将 ResNet34 中的卷积模块替换为我们的 HASTE 模块,即可立即减少 46.72% 的 FLOPs,而精度仅下降 1.25%。
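
A hedged sketch of the LSH idea follows: each channel of a feature map is hashed with random sign projections, channels that share a hash code are treated as redundant and averaged, which shrinks the channel dimension without any training. The hash design and the way HASTE actually aggregates filters are specified in the paper; the numbers here are illustrative.

```python
# Sketch of LSH-based channel grouping in the spirit of HASTE (illustrative, not the real module).
import torch

def lsh_group_channels(feat: torch.Tensor, n_hyperplanes: int = 8, seed: int = 0):
    """feat: (C, H, W). Channels with identical sign-hash codes are averaged together."""
    C = feat.shape[0]
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(n_hyperplanes, feat[0].numel(), generator=g)
    flat = feat.reshape(C, -1)
    codes = (flat @ planes.T > 0).to(torch.int64)                # (C, n_hyperplanes) sign bits
    keys = (codes * (2 ** torch.arange(n_hyperplanes))).sum(dim=1)
    groups = {}
    for c, k in enumerate(keys.tolist()):
        groups.setdefault(k, []).append(c)
    merged = torch.stack([flat[idx].mean(dim=0) for idx in groups.values()])
    return merged.reshape(len(groups), *feat.shape[1:]), groups

feat = torch.randn(64, 16, 16)
feat[1] = feat[0] * 1.05          # a nearly duplicate channel lands in the same bucket
merged, groups = lsh_group_channels(feat)
print(feat.shape[0], "->", merged.shape[0], "channels")
```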

Robots That Can See: Leveraging Human Pose for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2309.17209
  • repo_url: https://github.com/google-research/human-scene-transformer
  • paper_authors: Tim Salzmann, Lewis Chiang, Markus Ryll, Dorsa Sadigh, Carolina Parada, Alex Bewley
  • for: 预测人类在动态环境中的运动轨迹,以便安全有效地让Robot进行导航。
  • methods: 使用Transformer架构,从人类位置、头orientation和3D关节点数据中提取输入特征,预测人类未来轨迹。
  • results: 所得模型刻画了未来人类轨迹预测中固有的不确定性,并在常见预测 benchmark 和一个由移动机器人采集的人类跟踪数据集上达到了最先进的表现。此外,研究还发现历史数据有限的新出现行人是预测误差的主要来源,并证明了 3D 骨骼关键点在这类挑战性场景中对降低预测误差的互补作用。
    Abstract Anticipating the motion of all humans in dynamic environments such as homes and offices is critical to enable safe and effective robot navigation. Such spaces remain challenging as humans do not follow strict rules of motion and there are often multiple occluded entry points such as corners and doors that create opportunities for sudden encounters. In this work, we present a Transformer based architecture to predict human future trajectories in human-centric environments from input features including human positions, head orientations, and 3D skeletal keypoints from onboard in-the-wild sensory information. The resulting model captures the inherent uncertainty for future human trajectory prediction and achieves state-of-the-art performance on common prediction benchmarks and a human tracking dataset captured from a mobile robot adapted for the prediction task. Furthermore, we identify new agents with limited historical data as a major contributor to error and demonstrate the complementary nature of 3D skeletal poses in reducing prediction error in such challenging scenarios.

Towards Complex-query Referring Image Segmentation: A Novel Benchmark

  • paper_url: http://arxiv.org/abs/2309.17205
  • repo_url: None
  • paper_authors: Wei Ji, Li Li, Hao Fei, Xiangyan Liu, Xun Yang, Juncheng Li, Roger Zimmermann
  • for: 本研究的目的是提高图像理解(RIS)的性能,特别是在面对复杂语言查询时。
  • methods: 本研究使用了现有的RefCOCO和Visual Genome datasets,并提出了一个新的RISbenchmark数据集,即RIS-CQ,以挑战现有的RIS方法。此外,提出了一种叫做 dual-modality graph alignment model(\textsc{DuMoGa})的 nichetargeting方法,以提高RIS-CQ的性能。
  • results: 实验结果表明,\textsc{DuMoGa}方法在RIS-CQ上表现出色,在不同的数据集和模型下都有显著的提高。
    Abstract Referring Image Understanding (RIS) has been extensively studied over the past decade, leading to the development of advanced algorithms. However, there has been a lack of research investigating how existing algorithms should be benchmarked with complex language queries, which include more informative descriptions of surrounding objects and backgrounds (\eg \textit{"the black car."} vs. \textit{"the black car is parking on the road and beside the bus."}). Given the significant improvement in the semantic understanding capability of large pre-trained models, it is crucial to take a step further in RIS by incorporating complex language that resembles real-world applications. To close this gap, building upon the existing RefCOCO and Visual Genome datasets, we propose a new RIS benchmark with complex queries, namely \textbf{RIS-CQ}. The RIS-CQ dataset is of high quality and large scale, which challenges the existing RIS with enriched, specific and informative queries, and enables a more realistic scenario of RIS research. Besides, we present a nichetargeting method to better task the RIS-CQ, called dual-modality graph alignment model (\textbf{\textsc{DuMoGa}), which outperforms a series of RIS methods.
    摘要 在过去十年中,指代图像分割(Referring Image Segmentation, RIS)得到了广泛研究,催生了许多先进算法。然而,鲜有研究探讨现有算法应如何在复杂语言查询下进行基准测试——这类查询包含对周围物体和背景更丰富的描述(例如 “黑色的车” 与 “黑色的车停在路上、在公交车旁边”)。鉴于大型预训练模型的语义理解能力已显著提升,在 RIS 中引入更贴近真实应用的复杂语言显得尤为重要。为弥补这一空白,我们在现有 RefCOCO 和 Visual Genome 数据集的基础上,提出了一个带有复杂查询的全新 RIS 基准,即 RIS-CQ。RIS-CQ 数据集规模大、质量高,以更丰富、具体且信息量更大的查询对现有 RIS 方法提出挑战,并使 RIS 研究更贴近真实场景。此外,我们还提出了一种针对 RIS-CQ 的方法,称为双模态图对齐模型(DuMoGa),其表现优于一系列 RIS 方法。

A Survey of Incremental Transfer Learning: Combining Peer-to-Peer Federated Learning and Domain Incremental Learning for Multicenter Collaboration

  • paper_url: http://arxiv.org/abs/2309.17192
  • repo_url: https://github.com/yixinghuang/itlsurvey
  • paper_authors: Yixing Huang, Christoph Bert, Ahmed Gomaa, Rainer Fietkau, Andreas Maier, Florian Putz
  • for: 这篇论文旨在解决因数据隐私限制而妨碍多中心协作深度学习模型的发展。
  • methods: 本文将点对点(peer-to-peer)联邦学习与领域增量学习相结合,以规避数据隐私限制,并利用持续学习技术来保持模型性能。
  • results: 本文将传统的领域/任务增量学习框架改造为增量迁移学习,并对多种基于正则化的持续学习方法在多中心协作中的有效性进行了全面调研。研究深入考察了数据异质性、分类器头设置、网络优化器、模型初始化、中心顺序和权重迁移类型等因素的影响。
    Abstract Due to data privacy constraints, data sharing among multiple clinical centers is restricted, which impedes the development of high performance deep learning models from multicenter collaboration. Naive weight transfer methods share intermediate model weights without raw data and hence can bypass data privacy restrictions. However, performance drops are typically observed when the model is transferred from one center to the next because of the forgetting problem. Incremental transfer learning, which combines peer-to-peer federated learning and domain incremental learning, can overcome the data privacy issue and meanwhile preserve model performance by using continual learning techniques. In this work, a conventional domain/task incremental learning framework is adapted for incremental transfer learning. A comprehensive survey on the efficacy of different regularization-based continual learning methods for multicenter collaboration is performed. The influences of data heterogeneity, classifier head setting, network optimizer, model initialization, center order, and weight transfer type have been investigated thoroughly. Our framework is publicly accessible to the research community for further development.
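
As one concrete instance of the regularization-based continual-learning methods such a survey compares, the hedged sketch below shows an EWC-style quadratic penalty that discourages drifting away from the parameters learned at the previous center. The stand-in Fisher diagonal and the weighting factor are illustrative assumptions, not settings from the paper.

```python
# Minimal sketch of one regularization-based continual-learning penalty (EWC-style).
import torch

def ewc_penalty(model, old_params: dict, fisher: dict, lam: float = 100.0) -> torch.Tensor:
    """Quadratic penalty keeping parameters close to the previous center's solution."""
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

model = torch.nn.Linear(10, 2)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}   # stand-in Fisher diagonal
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y) + ewc_penalty(model, old_params, fisher)
loss.backward()
print(float(loss))
```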

RTFS-Net: Recurrent time-frequency modelling for efficient audio-visual speech separation

  • paper_url: http://arxiv.org/abs/2309.17189
  • repo_url: None
  • paper_authors: Samuel Pegg, Kai Li, Xiaolin Hu
  • for: 面向高质量的音视频(audio-visual)语音分离,以提升语音识别等下游任务的性能。
  • methods: 时频域的Recurrent Time-Frequency Separation Network (RTFS-Net),组合时间和频率维度的独立模型
  • results: 比前一代SOTA模型更高的性能,仅使用10%的parameters和18%的MACs
    Abstract Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the previous SOTA method using only 10% of the parameters and 18% of the MACs. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
    摘要 音视频语音分离方法旨在融合不同模态的信息,以生成高质量的分离语音,从而提升语音识别等下游任务的性能。大多数现有的最先进(SOTA)模型都在时域上运行,然而它们对声学特征的建模过于简单,通常需要更大、计算成本更高的模型才能达到 SOTA 性能。在这篇论文中,我们提出了一种新的时频域音视频语音分离方法:循环时频分离网络(RTFS-Net),它在短时傅里叶变换(STFT)得到的复数时频单元上运行。我们沿时间和频率两个维度分别使用多层 RNN,独立地建模和捕捉音频信号;并引入一种独特的基于注意力的融合机制,高效整合音频与视觉信息;此外,我们还提出了一种利用声学特征固有频谱特性的新型掩码分离方法,以获得更清晰的分离效果。RTFS-Net 仅使用先前 SOTA 方法 10% 的参数量和 18% 的 MACs,就超越了它。这是时频域音视频语音分离方法首次超越所有当代时域方法。
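
To make the "independent modelling along time and frequency" idea concrete, here is a hedged PyTorch sketch that runs one LSTM over the time axis for every frequency bin and another over the frequency axis for every frame of an STFT. The layer sizes and the simple real/imaginary channel encoding are illustrative assumptions, not the RTFS-Net architecture.

```python
# Sketch of independent recurrent modelling along the time and frequency axes of an STFT.
import torch
import torch.nn as nn

class DualPathTF(nn.Module):
    def __init__(self, channels: int = 2, hidden: int = 32):
        super().__init__()
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.freq_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, channels)

    def forward(self, spec):                       # spec: (B, C, F, T), C = real/imag channels
        B, C, F, T = spec.shape
        x = spec.permute(0, 2, 3, 1).reshape(B * F, T, C)      # sequences over time, per frequency
        x, _ = self.time_rnn(x)
        x = x.reshape(B, F, T, -1).permute(0, 2, 1, 3).reshape(B * T, F, -1)  # sequences over freq
        x, _ = self.freq_rnn(x)
        x = self.proj(x).reshape(B, T, F, C).permute(0, 3, 2, 1)
        return x

wav = torch.randn(1, 16000)
stft = torch.stft(wav, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)          # (1, F, T)
spec = torch.stack([stft.real, stft.imag], dim=1)                              # (1, 2, F, T)
print(DualPathTF()(spec).shape)                                                # same shape as spec
```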

TBD Pedestrian Data Collection: Towards Rich, Portable, and Large-Scale Natural Pedestrian Data

  • paper_url: http://arxiv.org/abs/2309.17187
  • repo_url: None
  • paper_authors: Allan Wang, Daisuke Sato, Yasser Corzo, Sonya Simkin, Aaron Steinfeld
  • for: 这个论文主要是为了研究人员 navigating 和行人行为,特别是通过机器学习方法来模型人员之间的互动和人员与机器人之间的互动。
  • methods: 作者提出了一种可搬式的数据收集系统,并与一种半自动化标注管道相结合。在这个管道中,作者设计了一个用于检查自动跟踪结果的人类标注Web应用程序。
  • results: 作者的系统可以在多种环境中进行大规模数据收集,并且可以快速生成标注结果。与现有的行人数据收集方法相比,作者的系统具有三个特点:结合顶部下看和自我中心视角,在人工智能的社会合适的环境中观察人类行为,以及人类验证的标注。
    Abstract Social navigation and pedestrian behavior research has shifted towards machine learning-based methods and converged on the topic of modeling inter-pedestrian interactions and pedestrian-robot interactions. For this, large-scale datasets that contain rich information are needed. We describe a portable data collection system, coupled with a semi-autonomous labeling pipeline. As part of the pipeline, we designed a label correction web app that facilitates human verification of automated pedestrian tracking outcomes. Our system enables large-scale data collection in diverse environments and fast trajectory label production. Compared with existing pedestrian data collection methods, our system contains three components: a combination of top-down and ego-centric views, natural human behavior in the presence of a socially appropriate "robot", and human-verified labels grounded in the metric space. To the best of our knowledge, no prior data collection system has a combination of all three components. We further introduce our ever-expanding dataset from the ongoing data collection effort -- the TBD Pedestrian Dataset and show that our collected data is larger in scale, contains richer information when compared to prior datasets with human-verified labels, and supports new research opportunities.
    摘要 社交导航和行人行为研究已经转移到机器学习基于方法和融合到了对行人间交互和机器人间交互的模型研究。为此,需要大规模的数据集,具有丰富的信息。我们描述了一种可搬式数据收集系统,联合了半自动化标注管道。在管道中,我们设计了一个 labels 修正web应用程序,以便人类确认自动跟踪结果的准确性。我们的系统可以在多种环境下进行大规模数据收集,并快速生成标注结果。相比现有的行人数据收集方法,我们的系统具有三个组成部分:一种组合顶部视角和自我视角,人类在社交合适的“机器人”存在下展现自然的人类行为,以及人类验证的标签,围绕着度量空间定义。根据我们所知,没有任何先前的数据收集系统拥有这三个组成部分。我们进一步介绍了我们的持续扩展的数据集——TBD行人数据集,并证明我们收集到的数据规模更大,包含更多的信息,与先前的人类验证标签相比,支持新的研究机会。

TextField3D: Towards Enhancing Open-Vocabulary 3D Generation with Noisy Text Fields

  • paper_url: http://arxiv.org/abs/2309.17175
  • repo_url: None
  • paper_authors: Tianyu Huang, Yihan Zeng, Bowen Dong, Hang Xu, Songcen Xu, Rynson W. H. Lau, Wangmeng Zuo
  • for: 本研究旨在开拓文本3D生成领域中的开放词汇能力,提高生成的文本3D模型的多样性和可控性。
  • methods: 我们提出了一种基于Noisy Text Fields(NTFs)的 Conditional 3D生成模型,即TextField3D。在该模型中,而不是直接使用文本提示作为输入,我们在文本提示的各个字段中引入动态噪声,以扩展文本的各个维度的 latent space。此外,我们还提出了一种 NTFBind 模块,用于将视图不变的图像缓存码与 Noisy Fields 相对 align。
  • results: 我们的方法在多个方面超越了前期方法,包括大词汇量、文本一致性和响应时间低。我们的实验结果表明,TextField3D 可以实现开放词汇3D生成的可能性。
    Abstract Recent works learn 3D representation explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generations. Generators may easily fall into a stereotype concept for certain text prompts, thus losing open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, namely TextField3D. Specifically, rather than using the text prompts as input directly, we suggest to inject dynamic noise into the latent space of given text prompts, i.e., Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to the appropriate range of textual latent space that is expanded by NTFs. To this end, an NTFGen module is proposed to model general text latent code in noisy fields. Meanwhile, an NTFBind module is proposed to align view-invariant image latent code to noisy fields, further supporting image-conditional 3D generation. To guide the conditional generation in both geometry and texture, multi-modal discrimination is constructed with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D includes three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method achieves a potential open-vocabulary 3D generation capability.
    摘要 近期工作在文本-3D 指导下显式地学习 3D 表示。然而,有限的文本-3D 数据限制了生成的词汇规模和文本控制能力:生成器容易对某些文本提示陷入刻板概念,从而丧失开放词汇生成能力。为解决这个问题,我们提出了一种条件 3D 生成模型,即 TextField3D。具体来说,我们不直接把文本提示作为输入,而是在给定文本提示的潜在空间中注入动态噪声,即噪声文本场(Noisy Text Fields, NTFs)。这样,有限的 3D 数据可以被映射到由 NTFs 扩展后的适当文本潜在空间范围内。为此,我们提出了 NTFGen 模块,用于在噪声场中建模通用的文本潜在编码;同时提出了 NTFBind 模块,用于将视图不变的图像潜在编码与噪声场对齐,进一步支持以图像为条件的 3D 生成。为了在几何和纹理两方面引导条件生成,我们构建了由文本-3D 判别器和文本-2.5D 判别器组成的多模态判别机制。相比之前的方法,TextField3D 具有三大优点:1)大词汇量,2)文本一致性,3)低延迟。大量实验表明,我们的方法具备潜在的开放词汇 3D 生成能力。
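
A minimal sketch of the noisy-text-field idea is shown below: a prompt's latent code is expanded into a neighbourhood by adding noise whose scale is sampled per example. The toy embedding-bag encoder and the noise range are assumptions for illustration; the actual NTFGen/NTFBind modules are defined in the paper.

```python
# Sketch of a "noisy text field": expand a prompt's latent code with sampled noise.
# The encoder below is a toy stand-in, not the text encoder used in the paper.
import torch
import torch.nn as nn

class NoisyTextField(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 64, sigma_max: float = 0.5):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)     # toy text encoder: mean of token embeddings
        self.sigma_max = sigma_max

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        z = self.embed(token_ids)                              # (B, dim) clean text latent
        sigma = torch.rand(z.shape[0], 1) * self.sigma_max     # dynamic noise level per prompt
        return z + sigma * torch.randn_like(z)                 # one sample from the noisy field

tokens = torch.randint(0, 1000, (4, 8))          # 4 toy prompts of 8 token ids each
ntf = NoisyTextField()
print(ntf(tokens).shape)                          # torch.Size([4, 64])
```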

Domain-Adaptive Learning: Unsupervised Adaptation for Histology Images with Improved Loss Function Combination

  • paper_url: http://arxiv.org/abs/2309.17172
  • repo_url: None
  • paper_authors: Ravi Kant Gupta, Shounak Das, Amit Sethi
  • for: 本研究提出了一种新的无监督域适应(UDA)方法,用于针对 Hematoxylin & Eosin(H&E)染色的病理图像。现有的对抗式域适应方法可能无法有效地对齐不同域中与分类问题相关的多模态分布。
  • methods: 我们的方法提出了一个新的损失函数,并与精心挑选的现有损失函数结合使用,以应对病理图像的特殊挑战。我们利用病理图像的特有特征,如组织结构和细胞形态,来增强域适应性能。
  • results: 我们的方法在准确性、稳健性和泛化性方面表现出色,超越了针对病理图像的最先进技术。我们在 FHIST 数据集上进行了广泛的实验,结果显示,我们提出的方法——域适应学习(DAL)分别比基于 ViT 和基于 CNN 的 SoTA 方法高出 1.41% 和 6.56%。
    Abstract This paper presents a novel approach for unsupervised domain adaptation (UDA) targeting H&E stained histology images. Existing adversarial domain adaptation methods may not effectively align different domains of multimodal distributions associated with classification problems. The objective is to enhance domain alignment and reduce domain shifts between these domains by leveraging their unique characteristics. Our approach proposes a novel loss function along with carefully selected existing loss functions tailored to address the challenges specific to histology images. This loss combination not only makes the model accurate and robust but also faster in terms of training convergence. We specifically focus on leveraging histology-specific features, such as tissue structure and cell morphology, to enhance adaptation performance in the histology domain. The proposed method is extensively evaluated in accuracy, robustness, and generalization, surpassing state-of-the-art techniques for histology images. We conducted extensive experiments on the FHIST dataset and the results show that our proposed method - Domain Adaptive Learning (DAL) significantly surpasses the ViT-based and CNN-based SoTA methods by 1.41% and 6.56% respectively.

Retail-786k: a Large-Scale Dataset for Visual Entity Matching

  • paper_url: http://arxiv.org/abs/2309.17164
  • repo_url: None
  • paper_authors: Bianca Lamm, Janis Keuper
  • for: 本研究旨在提供一个大规模的视觉实体匹配数据集,用于解决现有的实体匹配问题。
  • methods: 本研究使用了生成的广告刊物,包括多年的欧洲零售商的广告刊物,共计约786万个手动标注的高分辨率产品图像,其中包含约18千个不同的固定产品,分为约3千个实体。
  • results: 根据Price comparison任务,每个实体形成一个相似类型的产品集,但是使用标准的图像基于分类和检索算法不能够解决这个问题。因此,需要开发新的方法,可以将示例基于的视觉等同类转移到新数据上。本研究的目的是为这种算法提供 benchmark。
    Abstract Entity Matching (EM) defines the task of learning to group objects by transferring semantic concepts from example groups (=entities) to unseen data. Despite the general availability of image data in the context of many EM-problems, most currently available EM-algorithms solely rely on (textual) meta data. In this paper, we introduce the first publicly available large-scale dataset for "visual entity matching", based on a production level use case in the retail domain. Using scanned advertisement leaflets, collected over several years from different European retailers, we provide a total of ~786k manually annotated, high resolution product images containing ~18k different individual retail products which are grouped into ~3k entities. The annotation of these product entities is based on a price comparison task, where each entity forms an equivalence class of comparable products. Following on a first baseline evaluation, we show that the proposed "visual entity matching" constitutes a novel learning problem which can not sufficiently be solved using standard image based classification and retrieval algorithms. Instead, novel approaches which allow to transfer example based visual equivalent classes to new data are needed to address the proposed problem. The aim of this paper is to provide a benchmark for such algorithms. Information about the dataset, evaluation code and download instructions are provided under https://www.retail-786k.org/.
    摘要 “Entity Matching(EM)定义为将知识传递到未见到的数据上,以将概念汇总到例子中。尽管在许多EM问题上有广泛的图像数据可用,现有大多数EM算法仅仅靠文本元数据。在这篇文章中,我们介绍了首次公开可用的大规模“视觉实体匹配”数据集,基于商业化的使用情况,从欧洲多家商家获取了多年的印刷广告单张,总计约786,000个手动标注、高分辨率产品图像,包含约18,000个不同的单独产品,分为约3,000个实体。这些产品实体的标注基于价格比较任务,每个实体都是一个可比较的产品集。我们透过首创基准评估发现,“视觉实体匹配”是一个新的学习问题,不能由标准图像分类和搜寻算法解决。相反,需要新的方法,以将例子基于的见识汇总到新数据。本文的目的是提供这个问题的参考基准。更多关于数据、评估代码和下载 instruction可以在https://www.retail-786k.org/取得。”

APNet: Urban-level Scene Segmentation of Aerial Images and Point Clouds

  • paper_url: http://arxiv.org/abs/2309.17162
  • repo_url: https://github.com/codename1995/APNet_ICCVW23
  • paper_authors: Weijie Wei, Martin R. Oswald, Fatemeh Karimi Nejadasl, Theo Gevers
  • for: 本研究关注城市场景点云的语义分割方法。
  • methods: 我们提出了一种名为 APNet 的网络架构,它分为两个分支:一个点云分支和一个航拍图像分支,其中航拍图像分支的输入是由点云生成的。为了利用每个分支的不同特性,我们使用一个几何感知融合模块(geometry-aware fusion module)来学习组合两个分支的结果。
  • results: 我们的实验表明,融合输出始终优于各单独网络分支,APNet 在 SensatUrban 数据集上取得了 65.2 mIoU 的最先进性能。
    Abstract In this paper, we focus on semantic segmentation method for point clouds of urban scenes. Our fundamental concept revolves around the collaborative utilization of diverse scene representations to benefit from different context information and network architectures. To this end, the proposed network architecture, called APNet, is split into two branches: a point cloud branch and an aerial image branch which input is generated from a point cloud. To leverage the different properties of each branch, we employ a geometry-aware fusion module that is learned to combine the results of each branch. Additional separate losses for each branch avoid that one branch dominates the results, ensure the best performance for each branch individually and explicitly define the input domain of the fusion network assuring it only performs data fusion. Our experiments demonstrate that the fusion output consistently outperforms the individual network branches and that APNet achieves state-of-the-art performance of 65.2 mIoU on the SensatUrban dataset. Upon acceptance, the source code will be made accessible.
    摘要 在这篇论文中,我们关注点云Scene的semantic segmentation方法。我们的基本概念是通过不同场景表示的多样化合作来利用不同的上下文信息和网络架构。为此,我们提出了一种名为APNet的网络架构,其分为两个分支:一个点云分支和一个空中图分支,后者是根据点云生成的。为了利用每个分支的不同特性,我们采用了一种geometry-aware合并模块,该模块通过学习将每个分支的结果结合在一起。此外,我们还使用了每个分支的分立损失,以避免一个分支占据结果,保证每个分支的最佳性能,并且明确定义了拟合网络的输入领域,确保它只进行数据融合。我们的实验表明,拟合输出 consistently 超过了每个网络分支的结果,而APNet在SensatUrban数据集上实现了65.2 mIoU的state-of-the-art性能。接受后,源代码将公开。

Redistributing the Precision and Content in 3D-LUT-based Inverse Tone-mapping for HDR/WCG Display

  • paper_url: http://arxiv.org/abs/2309.17160
  • repo_url: https://github.com/andreguo/itmlut
  • paper_authors: Cheng Guo, Leidong Fan, Qian Zhang, Hanyuan Liu, Kanglin Liu, Xiuhua Jiang
  • for: 这个论文旨在提出一种结合 AI 的逆色调映射(inverse tone-mapping, ITM)方法,将标准动态范围(SDR)视频转换为高动态范围/宽色域(HDR/WCG),用于媒体制作。
  • methods: 该方法使用了人工智能(AI)学习,将三个不同精度的Look-up表(LUT)组合在一起,以提高效率和质量。
  • results: 实验结果表明,该方法可以减少转换错误,并且可以在不同的显示设备上提供更好的视觉效果。
    Abstract ITM(inverse tone-mapping) converts SDR (standard dynamic range) footage to HDR/WCG (high dynamic range /wide color gamut) for media production. It happens not only when remastering legacy SDR footage in front-end content provider, but also adapting on-theair SDR service on user-end HDR display. The latter requires more efficiency, thus the pre-calculated LUT (look-up table) has become a popular solution. Yet, conventional fixed LUT lacks adaptability, so we learn from research community and combine it with AI. Meanwhile, higher-bit-depth HDR/WCG requires larger LUT than SDR, so we consult traditional ITM for an efficiency-performance trade-off: We use 3 smaller LUTs, each has a non-uniform packing (precision) respectively denser in dark, middle and bright luma range. In this case, their results will have less error only in their own range, so we use a contribution map to combine their best parts to final result. With the guidance of this map, the elements (content) of 3 LUTs will also be redistributed during training. We conduct ablation studies to verify method's effectiveness, and subjective and objective experiments to show its practicability. Code is available at: https://github.com/AndreGuo/ITMLUT.
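
A simplified 1-D illustration of the LUT design above: three look-up tables, each sampled more densely in the luma range it is responsible for, are blended by a contribution map. Real inverse tone-mapping uses 3D LUTs and a learned contribution map; the tone curve, grid sizes and Gaussian weights below are toy assumptions.

```python
# Toy 1-D illustration of non-uniformly packed LUTs blended by a contribution map.
import numpy as np

def make_lut(grid: np.ndarray) -> np.ndarray:
    return grid ** 2.2     # toy SDR->HDR tone curve sampled on a non-uniform grid

# Non-uniform grids: denser where each LUT is responsible (dark / middle / bright luma).
grids = {
    "dark":   np.concatenate([np.linspace(0.0, 0.3, 24), np.linspace(0.3, 1.0, 9)[1:]]),
    "middle": np.concatenate([np.linspace(0.0, 0.3, 6),
                              np.linspace(0.3, 0.7, 22)[1:], np.linspace(0.7, 1.0, 6)[1:]]),
    "bright": np.concatenate([np.linspace(0.0, 0.7, 9), np.linspace(0.7, 1.0, 24)[1:]]),
}
luts = {k: make_lut(g) for k, g in grids.items()}

def apply_itm(luma: np.ndarray) -> np.ndarray:
    outs = {k: np.interp(luma, grids[k], luts[k]) for k in grids}        # per-LUT lookup
    # Toy contribution map: soft weights peaking in each LUT's own range (learned in the paper).
    w = np.stack([np.exp(-((luma - c) / 0.25) ** 2) for c in (0.15, 0.5, 0.85)])
    w = w / w.sum(axis=0, keepdims=True)
    return w[0] * outs["dark"] + w[1] * outs["middle"] + w[2] * outs["bright"]

sdr = np.random.rand(1000)
print(apply_itm(sdr).shape)
```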

HAvatar: High-fidelity Head Avatar via Facial Model Conditioned Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2309.17128
  • repo_url: None
  • paper_authors: Xiaochen Zhao, Lizhen Wang, Jingxiang Sun, Hongwen Zhang, Jinli Suo, Yebin Liu
  • for: Addresses the problem of modeling an animatable 3D human head avatar under light-weight setups, which has not been well solved.
  • methods: Introduces a novel hybrid explicit-implicit 3D representation, Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF and the prior information from the parametric template.
  • results: Achieves state-of-the-art performance for 3D head avatar animation, with high-resolution, realistic, and view-consistent synthesis of dynamic head appearance.
    Abstract The problem of modeling an animatable 3D human head avatar under light-weight setups is of significant importance but has not been well solved. Existing 3D representations either perform well in the realism of portrait images synthesis or the accuracy of expression control, but not both. To address the problem, we introduce a novel hybrid explicit-implicit 3D representation, Facial Model Conditioned Neural Radiance Field, which integrates the expressiveness of NeRF and the prior information from the parametric template. At the core of our representation, a synthetic-renderings-based condition method is proposed to fuse the prior information from the parametric model into the implicit field without constraining its topological flexibility. Besides, based on the hybrid representation, we properly overcome the inconsistent shape issue presented in existing methods and improve the animation stability. Moreover, by adopting an overall GAN-based architecture using an image-to-image translation network, we achieve high-resolution, realistic and view-consistent synthesis of dynamic head appearance. Experiments demonstrate that our method can achieve state-of-the-art performance for 3D head avatar animation compared with previous methods.
    摘要 “在轻量级设备条件下建模可动画的 3D 人头化身是一个重要但尚未很好解决的问题。现有的 3D 表示方式要么在肖像图像合成的真实感方面表现出色,要么在表情控制的准确性方面表现出色,但难以兼得。为解决这个问题,我们提出了一种新的显式-隐式混合 3D 表示方法,即 Facial Model Conditioned Neural Radiance Field,它把 NeRF 的表达能力和参数化模板提供的先验信息结合在一起。在该表示的核心,我们提出了一种基于合成渲染的条件方法,在不限制隐式场拓扑灵活性的前提下,将参数模型的先验信息融入隐式场。另外,基于这种混合表示,我们妥善解决了现有方法中的形状不一致问题,提高了动画稳定性。此外,我们采用基于 GAN 的整体架构和图像到图像翻译网络,实现了高分辨率、逼真且视角一致的动态头部外观合成。实验表明,与之前的方法相比,我们的方法在 3D 人头化身动画上达到了最先进的表现。”

Reconstruction of Patient-Specific Confounders in AI-based Radiologic Image Interpretation using Generative Pretraining

  • paper_url: http://arxiv.org/abs/2309.17123
  • repo_url: https://github.com/peterhan91/diffchest
  • paper_authors: Tianyu Han, Laura Žigutytė, Luisa Huck, Marc Huppertz, Robert Siepmann, Yossi Gandelsman, Christian Blüthgen, Firas Khader, Christiane Kuhl, Sven Nebelung, Jakob Kather, Daniel Truhn
  • for: 这个研究旨在检测人工智能支持的自动诊断系统中的欺骗模式,以确保其可靠性,特别是在医疗领域。
  • methods: 我们提出了一种自我条件Diffusion模型,称为DiffChest,并将其训练在515,704个胸部X-RAY影像和194,956名病人的数据集上。DiffChest可以在每个病人水平解释分类结果,并可以显示出可能欺骗模型的变量因素。
  • results: 我们发现DiffChest可以实现高度的医疗读者一致性,具体而言,Fleiss的Kappa值在大多数影像找到的情况下都是0.8或更高。DiffChest可以正确地捕捉11.1%至100%的变量因素。此外,我们的预训 проце序可以将模型优化以捕捉输入影像中最重要的信息。DiffChest在11种胸部病情的诊断中表现出色,并在其他情况下至少具有足够的诊断精度。
    Abstract Detecting misleading patterns in automated diagnostic assistance systems, such as those powered by Artificial Intelligence, is critical to ensuring their reliability, particularly in healthcare. Current techniques for evaluating deep learning models cannot visualize confounding factors at a diagnostic level. Here, we propose a self-conditioned diffusion model termed DiffChest and train it on a dataset of 515,704 chest radiographs from 194,956 patients from multiple healthcare centers in the United States and Europe. DiffChest explains classifications on a patient-specific level and visualizes the confounding factors that may mislead the model. We found high inter-reader agreement when evaluating DiffChest's capability to identify treatment-related confounders, with Fleiss' Kappa values of 0.8 or higher across most imaging findings. Confounders were accurately captured with 11.1% to 100% prevalence rates. Furthermore, our pretraining process optimized the model to capture the most relevant information from the input radiographs. DiffChest achieved excellent diagnostic accuracy when diagnosing 11 chest conditions, such as pleural effusion and cardiac insufficiency, and at least sufficient diagnostic accuracy for the remaining conditions. Our findings highlight the potential of pretraining based on diffusion models in medical image classification, specifically in providing insights into confounding factors and model robustness.
    摘要 检测自动诊断助手系统(如人工智能驱动的系统)中的误导性模式是确保其可靠性的关键,特别在医疗领域。现有的深度学习模型评估技术无法在诊断水平可视化干扰因素。我们提议一种自我条件 diffusion 模型,称为 DiffChest,并在来自美国和欧洲多个医疗中心的 194,956 名病人的 515,704 张胸部X光图像上进行训练。DiffChest 可以在病人特定水平解释分类结果,并可视化可能误导模型的干扰因素。在评估 DiffChest 是否能够确定治疗相关干扰因素时,我们发现了很高的阅片者间一致性,大多数影像发现的 Fleiss κ 值达到 0.8 或更高。干扰因素被正确地捕捉,流行率介于 11.1% 到 100%。此外,我们的预训练过程将模型优化为从输入胸部X光图像中捕捉最重要的信息。DiffChest 在胸腔积液、心功能不全等 11 种胸部疾病的诊断中达到了出色的诊断精度,并在剩下的疾病诊断中至少达到了足够的诊断精度。我们的发现表明基于扩散模型的预训练在医疗影像分类中具有潜在优势,特别是在提供对干扰因素的洞察和提升模型稳健性方面。

Continual Action Assessment via Task-Consistent Score-Discriminative Feature Distribution Modeling

  • paper_url: http://arxiv.org/abs/2309.17105
  • repo_url: https://github.com/Lyman-Smoker/ConAQA
  • paper_authors: Yuan-Ming Li, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng
  • for: 本文旨在解决动作质量评估(action quality assessment, AQA)领域中的持续学习问题,即让统一模型能够依次学习新的 AQA 任务而不遗忘已学任务。
  • methods: 我们提出了一种 Feature-Score Correlation-Aware Rehearsal 技术,以及一种 Action General-Specific Graph 技术,以缓解遗忘现象。
  • results: 我们的方法在多个任务和多种动作类型下表现出色,能够有效缓解遗忘现象,并且比现有的持续学习方法更有效、更通用。
    Abstract Action Quality Assessment (AQA) is a task that tries to answer how well an action is carried out. While remarkable progress has been achieved, existing works on AQA assume that all the training data are visible for training in one time, but do not enable continual learning on assessing new technical actions. In this work, we address such a Continual Learning problem in AQA (Continual-AQA), which urges a unified model to learn AQA tasks sequentially without forgetting. Our idea for modeling Continual-AQA is to sequentially learn a task-consistent score-discriminative feature distribution, in which the latent features express a strong correlation with the score labels regardless of the task or action types. From this perspective, we aim to mitigate the forgetting in Continual-AQA from two aspects. Firstly, to fuse the features of new and previous data into a score-discriminative distribution, a novel Feature-Score Correlation-Aware Rehearsal is proposed to store and reuse data from previous tasks with limited memory size. Secondly, an Action General-Specific Graph is developed to learn and decouple the action-general and action-specific knowledge so that the task-consistent score-discriminative features can be better extracted across various tasks. Extensive experiments are conducted to evaluate the contributions of proposed components. The comparisons with the existing continual learning methods additionally verify the effectiveness and versatility of our approach.
    摘要

Prototype-guided Cross-modal Completion and Alignment for Incomplete Text-based Person Re-identification

  • paper_url: http://arxiv.org/abs/2309.17104
  • repo_url: None
  • paper_authors: Tiantian Gong, Guodong Du, Junsheng Wang, Yongkang Ding, Liyan Zhang
  • for: Addresses the practical issue of incomplete text-based person re-identification (ReID) in real-world applications, where person images and text descriptions are not completely matched and contain partially missing modality data.
  • methods: Proposes a novel Prototype-guided Cross-modal Completion and Alignment (PCCA) framework, which includes cross-modal nearest neighbor construction, relation graphs, and prototype-aware cross-modal alignment loss to handle incomplete data.
  • results: Consistently outperforms state-of-the-art text-image ReID approaches on several benchmarks with different missing ratios, demonstrating the effectiveness of the proposed method in handling incomplete data.
    Abstract Traditional text-based person re-identification (ReID) techniques heavily rely on fully matched multi-modal data, which is an ideal scenario. However, due to inevitable data missing and corruption during the collection and processing of cross-modal data, the incomplete data issue is usually met in real-world applications. Therefore, we consider a more practical task termed the incomplete text-based ReID task, where person images and text descriptions are not completely matched and contain partially missing modality data. To this end, we propose a novel Prototype-guided Cross-modal Completion and Alignment (PCCA) framework to handle the aforementioned issues for incomplete text-based ReID. Specifically, we cannot directly retrieve person images based on a text query on missing modality data. Therefore, we propose the cross-modal nearest neighbor construction strategy for missing data by computing the cross-modal similarity between existing images and texts, which provides key guidance for the completion of missing modal features. Furthermore, to efficiently complete the missing modal features, we construct the relation graphs with the aforementioned cross-modal nearest neighbor sets of missing modal data and the corresponding prototypes, which can further enhance the generated missing modal features. Additionally, for tighter fine-grained alignment between images and texts, we raise a prototype-aware cross-modal alignment loss that can effectively reduce the modality heterogeneity gap for better fine-grained alignment in common space. Extensive experimental results on several benchmarks with different missing ratios amply demonstrate that our method can consistently outperform state-of-the-art text-image ReID approaches.
    摘要 传统的基于文本的行人重识别(ReID)技术依赖完全匹配的多模态数据,这是理想的情况。然而,在跨模态数据的采集和处理过程中,数据缺失和损坏不可避免,实际应用中经常遇到数据不完整的问题。因此,我们考虑一个更实际的任务:不完整的基于文本的 ReID 任务,其中行人图像和文本描述不完全匹配,包含部分缺失的模态数据。为此,我们提出了一种新的原型引导的跨模态补全与对齐(PCCA)框架来解决上述问题。具体来说,当模态数据缺失时,我们无法直接根据文本查询检索行人图像。因此,我们提出了跨模态最近邻构建策略,通过计算已有图像与文本之间的跨模态相似度,为补全缺失模态特征提供关键指导。此外,为了有效补全缺失模态特征,我们利用缺失模态数据的跨模态最近邻集合和对应的原型构建关系图,进一步增强生成的缺失模态特征。另外,为了更紧密地对齐图像和文本,我们提出了原型感知的跨模态对齐损失,可以更好地减少模态差异,以便在共同空间中实现更紧密的细粒度对齐。在多个具有不同缺失比例的基准数据集上的广泛实验结果表明,我们的方法能够持续优于现有最先进的文本-图像 ReID 方法。
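To make the cross-modal nearest-neighbour completion idea above more concrete, here is a minimal PyTorch-style sketch of one plausible reading of that step: for a pair whose text modality is missing, the image embedding is compared against the texts of fully observed pairs, and the top-k neighbours' text features are averaged as a stand-in. All names and the weighting scheme are illustrative assumptions, not the authors' implementation, which additionally uses relation graphs and prototypes.

```python
import torch
import torch.nn.functional as F

def complete_missing_text_features(img_feats, txt_feats, txt_missing, k=5):
    """Fill missing text features via cross-modal nearest neighbours.

    img_feats:   (N, D) image embeddings (assumed always observed here)
    txt_feats:   (N, D) text embeddings; rows flagged in `txt_missing` are unusable
    txt_missing: (N,) bool mask, True where the text modality is missing
    """
    img_n = F.normalize(img_feats, dim=-1)
    txt_n = F.normalize(txt_feats, dim=-1)

    complete_idx = (~txt_missing).nonzero(as_tuple=True)[0]
    missing_idx = txt_missing.nonzero(as_tuple=True)[0]
    if missing_idx.numel() == 0:
        return txt_feats

    # Cross-modal similarity: images whose text is missing vs. texts of
    # fully observed pairs.
    sim = img_n[missing_idx] @ txt_n[complete_idx].T                 # (M, C)
    topk = sim.topk(k=min(k, complete_idx.numel()), dim=-1)

    # Similarity-weighted average of the neighbours' text features acts as
    # the completed feature for the missing modality.
    neighbours = txt_feats[complete_idx][topk.indices]               # (M, k, D)
    weights = topk.values.softmax(dim=-1).unsqueeze(-1)              # (M, k, 1)
    completed = (weights * neighbours).sum(dim=1)

    out = txt_feats.clone()
    out[missing_idx] = completed
    return out
```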

Guiding Instruction-based Image Editing via Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2309.17102
  • repo_url: https://github.com/tsujuifu/pytorch_mgie
  • paper_authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan
  • for: 这个论文主要用于提高图像修改的可控性和灵活性,通过自然的命令而不需要详细的描述或区域mask。
  • methods: 这个论文使用多modal大语言模型(MLLM)来提高图像修改的可控性和灵活性,通过LM来实现跨modal的理解和视觉相关的回应生成。
  • results: 这个论文的实验结果表明,使用MLLM可以大幅提高图像修改的自动度和人工评价,同时保持竞争性的推理效率。
    Abstract Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
    摘要 基于指令的图像编辑通过自然语言命令提高了图像修改的可控性和灵活性,而无需详细的描述或区域掩码。然而,人类的指令有时候太简短,现有方法难以捕捉和遵循。多模态大语言模型(MLLM)表现出跨模态理解和视觉感知回应生成的潜力。我们研究如何通过MLLM辅助编辑指令,并提出了MLLM引导的图像编辑(MGIE)。MGIE学习推导出更具表达力的指令并提供显式引导,编辑模型同时捕捉这种视觉想象并通过端到端训练进行修改。我们评估了Photoshop风格修改、全局照片优化和局部编辑等多个方面。广泛的实验结果表明,具表达力的指令是基于指令的图像编辑的关键因素,而我们的MGIE可以在自动指标和人类评估中带来明显的改进,同时保持竞争力强的推理效率。

Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval

  • paper_url: http://arxiv.org/abs/2309.17093
  • repo_url: https://github.com/leolee99/pau
  • paper_authors: Hao Li, Jingkuan Song, Lianli Gao, Xiaosu Zhu, Heng Tao Shen
  • For: The paper is written for improving the reliability of cross-modal retrieval methods by quantifying the uncertainty arisen from inherent data ambiguity.* Methods: The paper proposes a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework, which constructs learnable prototypes for each modality, uses Dempster-Shafer Theory and Subjective Logic Theory to build an evidential theoretical framework, and induces accurate uncertainty and reliable predictions for cross-modal retrieval.* Results: The paper demonstrates the effectiveness of the PAU model through extensive experiments on four major benchmark datasets, achieving accurate uncertainty and reliable predictions for cross-modal retrieval.
    Abstract Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity. Concretely, we first construct a set of various learnable prototypes for each modality to represent the entire semantics subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are utilized to build an evidential theoretical framework by associating evidence with Dirichlet Distribution parameters. The PAU model induces accurate uncertainty and reliable predictions for cross-modal retrieval. Extensive experiments are performed on four major benchmark datasets of MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrating the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
    摘要 跨模态检索方法建立视觉语言模态之间的相似关系,通过共同学习一个公共表示空间。然而,预测结果经常不可靠,这是由 Aleatoric 不确定性引起的,这种不确定性来自于低质量数据,如损坏图像、快速视频和不够详细的文本。在这篇论文中,我们提出了一种新的基于原型的 Aleatoric 不确定性量化(PAU)框架,以提供可靠的预测。具体来说,我们首先为每个模态构建多种可学习的原型来表示整个语义子空间。然后,我们使用 Dempster-Shafer 理论和主观逻辑理论来建立证据框架,将证据与 Dirichlet 分布参数相关联。PAU 模型可以准确地量化 Aleatoric 不确定性,并为跨模态检索提供可靠的预测。我们在四个主要的 benchmark 数据集上进行了广泛的实验,结果表明我们的方法的有效性。代码可以在 https://github.com/leolee99/PAU 中下载。
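As a concrete reference for the evidential machinery mentioned above, the snippet below shows the standard subjective-logic computation that maps per-class evidence to Dirichlet parameters and a vacuity-style uncertainty u = K/S. It is a generic sketch of that family of methods, not the exact PAU formulation; the evidence head and the softplus mapping are assumptions.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(logits):
    """Map per-class evidence to Dirichlet parameters and a vacuity-style
    uncertainty, in the spirit of Dempster-Shafer / subjective logic.

    logits: (N, K) raw scores; softplus turns them into non-negative evidence.
    """
    evidence = F.softplus(logits)            # e_k >= 0
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / strength             # per-class belief mass
    prob = alpha / strength                  # expected class probabilities
    k = logits.shape[-1]
    uncertainty = k / strength.squeeze(-1)   # large when total evidence is small
    return belief, prob, uncertainty

# Example: three samples, four classes; low-evidence rows get high uncertainty.
belief, prob, u = dirichlet_uncertainty(torch.randn(3, 4))
```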

SegRCDB: Semantic Segmentation via Formula-Driven Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.17083
  • repo_url: https://github.com/dahlian00/segrcdb
  • paper_authors: Risa Shinoda, Ryo Hayamizu, Kodai Nakashima, Nakamasa Inoue, Rio Yokota, Hirokatsu Kataoka
  • for: 这个论文旨在提高视觉模型的训练效果,使其可以使用有限多个标注图像进行训练。
  • methods: 该论文提出了一个新的数据集SegRCDB,该数据集基于一种公式驱动的监督学习方法,可以在无需实际图像或手动语义标注的情况下进行预训练。
  • results: 在相同训练图像数量下,使用SegRCDB预训练并在ADE-20k和Cityscapes上微调,得到的mIoU高于使用COCO-Stuff预训练。这些结果表明SegRCDB有很高的潜在价值,可以为语义分割的预训练和研究提供帮助。
    Abstract Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what matters in semantic segmentation pre-training has not been fully investigated. In this paper, we propose the Segmentation Radial Contour DataBase (SegRCDB), which for the first time applies formula-driven supervised learning for semantic segmentation. SegRCDB enables pre-training for semantic segmentation without real images or any manual semantic labels. SegRCDB is based on insights about what is important in pre-training for semantic segmentation and allows efficient pre-training. Pre-training with SegRCDB achieved higher mIoU than the pre-training with COCO-Stuff for fine-tuning on ADE-20k and Cityscapes with the same number of training images. SegRCDB has a high potential to contribute to semantic segmentation pre-training and investigation by enabling the creation of large datasets without manual annotation. The SegRCDB dataset will be released under a license that allows research and commercial use. Code is available at: https://github.com/dahlian00/SegRCDB
    摘要 预训练是一种强大的策略,可以帮助视觉模型在有限数量的标注图像上高效训练。在语义分割中,创建标注掩码需要大量的人力和时间,因此很难构建带有语义标签的大规模预训练数据集。此外,语义分割预训练中哪些因素真正重要也尚未得到充分研究。在这篇论文中,我们提出了Segmentation Radial Contour DataBase (SegRCDB),首次将公式驱动的监督学习应用于语义分割。SegRCDB可以在无需真实图像或任何手动语义标签的情况下进行语义分割预训练。SegRCDB基于对语义分割预训练中重要因素的洞察,支持高效的预训练。在相同训练图像数量下,使用SegRCDB预训练后在ADE-20k和Cityscapes上微调所得的mIoU高于使用COCO-Stuff预训练。SegRCDB能够在无需手动标注的情况下构建大规模数据集,因而对语义分割的预训练和研究具有很大潜力。SegRCDB数据集将以允许研究和商业使用的许可发布。代码可以在以下地址找到:https://github.com/dahlian00/SegRCDB
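To illustrate what formula-driven supervision can look like in spirit, the sketch below rasterises a closed radial contour whose boundary is a random sum of sinusoids, yielding a label mask without any real photo or manual annotation. The formula and parameters here are illustrative assumptions and are not the actual SegRCDB generation rules.

```python
import numpy as np

def radial_contour_mask(size=256, n_harmonics=4, seed=0):
    """Rasterise a closed region whose boundary radius is a random sum of
    sinusoids in polar coordinates, producing a binary label mask with no
    real photograph and no manual annotation involved."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:size, 0:size]
    cx = cy = size / 2.0
    theta = np.arctan2(ys - cy, xs - cx)
    r = np.hypot(xs - cx, ys - cy)

    radius = np.full(theta.shape, size * 0.25)
    for k in range(1, n_harmonics + 1):
        amp = rng.uniform(0.0, size * 0.05)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        radius += amp * np.sin(k * theta + phase)   # perturb the base circle

    return (r <= radius).astype(np.uint8)           # 1 inside the contour

mask = radial_contour_mask()  # (256, 256) synthetic segmentation label
```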

Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications

  • paper_url: http://arxiv.org/abs/2309.17076
  • repo_url: None
  • paper_authors: Vladislav Dordiuk, Maksim Dzhigil, Konstantin Ushenin
  • for: 这个研究旨在探讨在3D网格分割任务中如何使用镜像权重对称性来提高模型的准确率并减少参数数量。
  • methods: 该研究使用了卷积神经网络,并通过对权重施加镜像对称约束来提高模型的泛化能力。
  • results: 研究发现,通过对权重进行对称处理,可以提高模型的准确率,并且可以降低模型的可训练参数数量,甚至可以使用非常小的训练集来进行模型训练。
    Abstract 3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variations in organ positions. It allows us to expect a positive effect of rotation and inversion invariant layers in convolutional neural networks that perform biomedical segmentations. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as sampling from the signed distance function, and the neural network performs prediction for each mesh node. We show that weight symmetry gains from 1 to 3% of additional accuracy and allows decreasing the number of trainable parameters up to 8 times without suffering the performance loss if neural networks have at least three convolutional layers. This also works for very small training sets.
    摘要 三维网格分割是生物医学应用中非常重要的任务。人体具有左右对称性,器官位置存在一定变化。这使得我们可以预期,在执行生物医学分割的卷积神经网络中引入旋转和反演不变的层会带来正面效果。在这项研究中,我们研究了权重对称对执行3D网格分割的神经网络的影响。我们分析了病理血管结构(动脉瘤)和常规解剖结构(心室的心内膜和心外膜)的3D网格分割问题。局部几何特征通过对符号距离函数的采样进行编码,神经网络对每个网格节点进行预测。我们发现,当神经网络具有至少三个卷积层时,权重对称可以带来1%到3%的额外准确率,并且可以在不损失性能的情况下将可训练参数数量减少至多8倍。这一结论对非常小的训练集同样适用。
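One simple way to realise the kind of weight symmetry discussed above is to tie each convolution kernel to its mirror image along a chosen axis, as in the illustrative PyTorch layer below. The use of a voxel-style Conv3d and this specific tying scheme are assumptions for the sketch; the paper itself operates on mesh-node features sampled from a signed distance function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorSymmetricConv3d(nn.Module):
    """3D convolution whose kernel is tied to its mirror image along one
    spatial axis. With a symmetric kernel, mirroring the input along that
    axis mirrors the response instead of changing it, and the tying roughly
    halves the number of independent kernel values along that axis."""

    def __init__(self, in_ch, out_ch, kernel_size=3, mirror_dim=-1, **kw):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, **kw)
        self.mirror_dim = mirror_dim

    def forward(self, x):
        w = self.conv.weight
        w_sym = 0.5 * (w + torch.flip(w, dims=(self.mirror_dim,)))  # tie mirror pairs
        return F.conv3d(
            x, w_sym, self.conv.bias,
            stride=self.conv.stride, padding=self.conv.padding,
            dilation=self.conv.dilation, groups=self.conv.groups,
        )
```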

DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation

  • paper_url: http://arxiv.org/abs/2309.17074
  • repo_url: None
  • paper_authors: Shengkun Tang, Yaqing Wang, Caiwen Ding, Yi Liang, Yao Li, Dongkuan Xu
  • for: 提高扩散模型的生成效率,适用于实时应用场景。
  • methods: 提出了一种早期退出框架,采用逐层不确定性估计模块(UEM)来自适应地为不同层分配计算资源,以提高生成效率。
  • results: 对多个数据集进行了广泛的实验,证明了与全层模型之间良好的性能与效率平衡。而且,对基线模型也带来了更好的性能提升。代码和模型已经公开发布以便复现。
    Abstract Diffusion models achieve great success in generating diverse and high-fidelity images. The performance improvements come with low generation speed per image, which hinders the application diffusion models in real-time scenarios. While some certain predictions benefit from the full computation of the model in each sample iteration, not every iteration requires the same amount of computation, potentially leading to computation waste. In this work, we propose DeeDiff, an early exiting framework that adaptively allocates computation resources in each sampling step to improve the generation efficiency of diffusion models. Specifically, we introduce a timestep-aware uncertainty estimation module (UEM) for diffusion models which is attached to each intermediate layer to estimate the prediction uncertainty of each layer. The uncertainty is regarded as the signal to decide if the inference terminates. Moreover, we propose uncertainty-aware layer-wise loss to fill the performance gap between full models and early-exited models. With such loss strategy, our model is able to obtain comparable results as full-layer models. Extensive experiments of class-conditional, unconditional, and text-guided generation on several datasets show that our method achieves state-of-the-art performance and efficiency trade-off compared with existing early exiting methods on diffusion models. More importantly, our method even brings extra benefits to baseline models and obtains better performance on CIFAR-10 and Celeb-A datasets. Full code and model are released for reproduction.
    摘要 Diffusion models 取得了高品质和多样化的图像生成成功。然而,性能提升伴随着单张图像生成速度较低的问题,对于实时应用而言是一个障碍。一些预测可以从全部模型的计算中获得优化,但不是每个迭代都需要相同的计算量,这可能会导致计算浪费。在这个工作中,我们提出了DeeDiff,一个早期退出框架,可以在每个采样步骤中适当地分配计算资源,以提高扩散模型的生成效率。具体来说,我们将时间步感知的不确定性估计模组(UEM)添加到各个中间层,以估计各个层的预测不确定性。这个不确定性被视为终止推理的信号。此外,我们还提出了层别不确定性感知损失,以填补全层模型和早期退出模型之间的性能差距。这种损失策略使我们的模型能够和全层模型相比较得到相似的表现。我们在多个标准数据集上进行了广泛的实验,包括类别条件、无条件和文本引导生成。结果显示,与现有的扩散模型早期退出方法相比,我们的方法在效率和表现之间实现了最佳的权衡。此外,我们的方法甚至对基线模型带来更好的表现,在CIFAR-10和Celeb-A数据集上取得了更高的表现。我们的代码和模型都公开发布,以便重现。
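The sketch below illustrates the general shape of uncertainty-gated early exiting inside a single denoising step: each intermediate block is followed by a lightweight uncertainty head, and computation stops once the estimate falls below a threshold. The head design, the sigmoid/mean reduction, and the fixed threshold are assumptions made for illustration, not the paper's exact UEM or loss.

```python
import torch
import torch.nn as nn

class EarlyExitDenoiser(nn.Module):
    """One denoising step with uncertainty-gated early exiting: each block is
    followed by a lightweight exit head and an uncertainty head, and the loop
    stops as soon as the estimated uncertainty drops below a threshold."""

    def __init__(self, blocks, hidden_dim, out_dim, threshold=0.1):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.exit_heads = nn.ModuleList(nn.Linear(hidden_dim, out_dim) for _ in blocks)
        self.unc_heads = nn.ModuleList(nn.Linear(hidden_dim, 1) for _ in blocks)
        self.threshold = threshold

    def forward(self, h, t_emb):
        pred = None
        for block, exit_head, unc_head in zip(self.blocks, self.exit_heads, self.unc_heads):
            h = block(h + t_emb)                    # timestep-conditioned block
            pred = exit_head(h)                     # prediction at this depth
            u = torch.sigmoid(unc_head(h)).mean()   # scalar uncertainty estimate
            if u < self.threshold:                  # confident enough: exit early
                break
        return pred
```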

GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.17059
  • repo_url: None
  • paper_authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Zheyuan Zhou, Kerui Hu
  • for: 本研究旨在提供一种高效且有效的单目多帧深度估计方法,以支持自动驾驶系统对3D信息的感知。
  • methods: 我们提出了GSDC transformer,利用可变形注意力在细粒度上学习线索之间的关系,以实现细粒度的线索融合。此外,我们还使用稀疏注意力在粒度增加时降低计算复杂度。
  • results: 我们在KITTI dataset上实现了state-of-the-art表现,并且实现了高效的cue fusion速度。
    Abstract Depth estimation provides an alternative approach for perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues and specializing in either static or dynamic scenes. Recently, these cues fusion becomes an attractive topic, aiming to enable the combined cues to perform well in both types of scenes. However, adaptive cue fusion relies on attention mechanisms, where the quadratic complexity limits the granularity of cue representation. Additionally, explicit cue fusion depends on precise segmentation, which imposes a heavy burden on mask prediction. To address these issues, we propose the GSDC Transformer, an efficient and effective component for cue fusion in monocular multi-frame depth estimation. We utilize deformable attention to learn cue relationships at a fine scale, while sparse attention reduces computational requirements when granularity increases. To compensate for the precision drop in dynamic scenes, we represent scene attributes in the form of super tokens without relying on precise shapes. Within each super token attributed to dynamic scenes, we gather its relevant cues and learn local dense relationships to enhance cue fusion. Our method achieves state-of-the-art performance on the KITTI dataset with efficient fusion speed.
    摘要 深度估计为自动驾驶中感知3D信息提供了一种替代途径。无论是单帧还是多帧输入的单目深度估计,都已通过学习各类线索并分别专注于静态或动态场景而取得显著成功。最近,线索融合成为一个有吸引力的课题,目标是让融合后的线索在两类场景中都表现良好。然而,自适应线索融合依赖注意力机制,其二次复杂度限制了线索表示的粒度;显式线索融合则依赖精确的分割,给掩码预测带来沉重负担。为了解决这些问题,我们提出了GSDC Transformer,一种用于单目多帧深度估计中线索融合的高效且有效的组件。我们利用可变形注意力在细粒度上学习线索之间的关系,并用稀疏注意力在粒度增加时降低计算需求。为了弥补动态场景中的精度下降,我们以超级token的形式表示场景属性而不依赖精确形状;在归属于动态场景的每个超级token内,我们汇聚其相关线索并学习局部稠密关系,以增强线索融合。我们的方法在KITTI数据集上以高效的融合速度取得了最先进的性能。

Imagery Dataset for Condition Monitoring of Synthetic Fibre Ropes

  • paper_url: http://arxiv.org/abs/2309.17058
  • repo_url: None
  • paper_authors: Anju Rani, Daniel O. Arroyo, Petar Durdevic
  • for: automatized visual inspection of synthetic fiber ropes (SFRs) to detect defects and assess remaining useful life (RUL)
  • methods: computer vision applications, including object detection, classification, and segmentation
  • results: a comprehensive dataset of 6,942 raw images representing both normal and defective SFRs to support the development of robust defect detection algorithms
    Abstract Automatic visual inspection of synthetic fibre ropes (SFRs) is a challenging task in the field of offshore, wind turbine industries, etc. The presence of any defect in SFRs can compromise their structural integrity and pose significant safety risks. Due to the large size and weight of these ropes, it is often impractical to detach and inspect them frequently. Therefore, there is a critical need to develop efficient defect detection methods to assess their remaining useful life (RUL). To address this challenge, a comprehensive dataset has been generated, comprising a total of 6,942 raw images representing both normal and defective SFRs. The dataset encompasses a wide array of defect scenarios which may occur throughout their operational lifespan, including but not limited to placking defects, cut strands, chafings, compressions, core outs and normal. This dataset serves as a resource to support computer vision applications, including object detection, classification, and segmentation, aimed at detecting and analyzing defects in SFRs. The availability of this dataset will facilitate the development and evaluation of robust defect detection algorithms. The aim of generating this dataset is to assist in the development of automated defect detection systems that outperform traditional visual inspection methods, thereby paving the way for safer and more efficient utilization of SFRs across a wide range of applications.
    摘要 合成纤维绳(SFR)的自动化视觉检测在海上、风力发电等领域是一项复杂的任务。SFR 中任何缺陷都可能会损害其结构完整性,对人员和设备安全造成重要风险。由于这些绳子的大小和重量,常常无法定期拆卸检查它们。因此,迫切需要开发高效的缺陷检测方法,以评估它们的剩余使用寿命(RUL)。为解决这个挑战,我们已经生成了一个全面的数据集,包括总共 6,942 张原始图像,表示正常和缺陷 SFR 的场景。这个数据集包括了 SFR 在操作寿命中可能发生的各种缺陷情形,包括但不限于编结缺陷、断股、磨损、压缩、芯部外露和正常状态。这个数据集作为计算机视觉应用程序的资源,可以支持对 SFR 中的缺陷进行检测、分类和分割等。数据集的可用性将促进检测和分析 SFR 缺陷的鲁棒算法的开发和评估,其目标是帮助构建优于传统目视检查的自动缺陷检测系统,从而为 SFR 在各种应用中的更安全和更高效的使用做出重要贡献。

A 5-Point Minimal Solver for Event Camera Relative Motion Estimation

  • paper_url: http://arxiv.org/abs/2309.17054
  • repo_url: None
  • paper_authors: Ling Gao, Hang Su, Daniel Gehrig, Marco Cannici, Davide Scaramuzza, Laurent Kneip
  • for: Linear motion estimation using event-based cameras
  • methods: Derive correct non-linear parametrization of eventails (manifolds generated by lines in the space-time volume of events) and introduce a novel minimal 5-point solver that jointly estimates line parameters and linear camera velocity projections.
  • results: Generate more stable relative motion estimates than other methods and consistently achieve a 100% success rate in estimating linear velocity, outperforming existing closed-form solvers.
    Abstract Event-based cameras are ideal for line-based motion estimation, since they predominantly respond to edges in the scene. However, accurately determining the camera displacement based on events continues to be an open problem. This is because line feature extraction and dynamics estimation are tightly coupled when using event cameras, and no precise model is currently available for describing the complex structures generated by lines in the space-time volume of events. We solve this problem by deriving the correct non-linear parametrization of such manifolds, which we term eventails, and demonstrate its application to event-based linear motion estimation, with known rotation from an Inertial Measurement Unit. Using this parametrization, we introduce a novel minimal 5-point solver that jointly estimates line parameters and linear camera velocity projections, which can be fused into a single, averaged linear velocity when considering multiple lines. We demonstrate on both synthetic and real data that our solver generates more stable relative motion estimates than other methods while capturing more inliers than clustering based on spatio-temporal planes. In particular, our method consistently achieves a 100% success rate in estimating linear velocity where existing closed-form solvers only achieve between 23% and 70%. The proposed eventails contribute to a better understanding of spatio-temporal event-generated geometries and we thus believe it will become a core building block of future event-based motion estimation algorithms.
    摘要 事件相机非常适合基于线的运动估计,因为它们主要对场景中的边缘作出响应。然而在实践中,基于事件精确确定相机位移仍然是一个未解决的问题。这是因为在使用事件相机时,线特征提取与动态估计紧密耦合,而目前尚无精确模型能够描述线在事件时空体中生成的复杂结构。我们通过推导这类流形(我们称之为eventails)的正确非线性参数化来解决这一问题,并在已知惯性测量单元(IMU)提供旋转的情况下,展示其在基于事件的线性运动估计中的应用。利用该参数化,我们提出了一种新的最小5点求解器,能够联合估计线参数和线性相机速度的投影;在考虑多条线时,这些估计可以融合为单一的平均线速度。我们在合成数据和真实数据上证明,相比其他方法,我们的求解器能产生更稳定的相对运动估计,并且比基于时空平面的聚类方法捕获更多内点。特别地,我们的方法在线速度估计上始终达到100%的成功率,而现有闭式求解器仅能达到23%至70%。所提出的eventails有助于更好地理解事件生成的时空几何结构,因此我们相信它将成为未来基于事件的运动估计算法的核心构件。

On Uniform Scalar Quantization for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2309.17051
  • repo_url: None
  • paper_authors: Haotian Zhang, Li Li, Dong Liu
  • for: 本文旨在探讨在基于梯度的训练中引入不可微量化的学习图像压缩问题。
  • methods: 作者提出了一种基于随机均匀退火(带可调温度系数)的方法,并进行了系统的理论分析,包括训练-测试不匹配和梯度估计风险(偏差与方差)的分析。
  • results: 作者的方法在多种具有代表性的图像压缩网络上表现出了更高的性能,并且提供了两个小技巧,一是为估计的量化隐变量分布的方差参数设置合适的下界,二是使用零中心量化并配合部分停止梯度。
    Abstract Learned image compression possesses a unique challenge when incorporating non-differentiable quantization into the gradient-based training of the networks. Several quantization surrogates have been proposed to fulfill the training, but they were not systematically justified from a theoretical perspective. We fill this gap by contrasting uniform scalar quantization, the most widely used category with rounding being its simplest case, and its training surrogates. In principle, we find two factors crucial: one is the discrepancy between the surrogate and rounding, leading to train-test mismatch; the other is gradient estimation risk due to the surrogate, which consists of bias and variance of the gradient estimation. Our analyses and simulations imply that there is a tradeoff between the train-test mismatch and the gradient estimation risk, and the tradeoff varies across different network structures. Motivated by these analyses, we present a method based on stochastic uniform annealing, which has an adjustable temperature coefficient to control the tradeoff. Moreover, our analyses enlighten us as to two subtle tricks: one is to set an appropriate lower bound for the variance parameter of the estimated quantized latent distribution, which effectively reduces the train-test mismatch; the other is to use zero-center quantization with partial stop-gradient, which reduces the gradient estimation variance and thus stabilize the training. Our method with the tricks is verified to outperform the existing practices of quantization surrogates on a variety of representative image compression networks.
    摘要 学习图像压缩在基于梯度的网络训练中引入不可微量化时面临独特的挑战。已有多种量化代理被提出以完成训练,但它们尚未从理论角度得到系统性的论证。我们通过对比最广泛使用的均匀标量量化(其最简单形式即取整)及其训练代理来填补这一空白。在理论上,我们发现两个因素非常重要:一是量化代理与取整之间的差异,导致训练和测试之间的不匹配;另一个是因代理而导致的梯度估计风险,它包括梯度估计的偏差和方差。我们的分析和实验表明,训练-测试不匹配与梯度估计风险之间存在权衡,而这种权衡随着不同的网络结构而变化。受这些分析的启发,我们提出了一种基于随机均匀退火的方法,其中包括可调温度系数以控制这种权衡。此外,我们的分析还揭示了两个微妙的技巧:一是为估计的量化隐变量分布的方差参数设置合适的下界,以降低训练和测试之间的不匹配;另一个是使用零中心量化并配合部分停止梯度,以降低梯度估计方差,从而稳定训练。我们的方法,包括这两个技巧,在一系列具有代表性的图像压缩网络上被证明优于现有的量化代理做法。
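For readers less familiar with the two surrogate families discussed above, the snippet below contrasts the additive-uniform-noise surrogate (differentiable but train/test mismatched) with straight-through rounding, optionally zero-centred around a predicted mean. It is a generic sketch of common practice in learned compression; the paper's actual method anneals between such regimes with a temperature coefficient and a particular partial stop-gradient placement that is not reproduced here.

```python
import torch

def quantize_noise(y, training=True):
    """Training-time surrogate: additive uniform noise U(-0.5, 0.5).
    Differentiable, but its mismatch with hard rounding at test time is the
    train-test discrepancy analysed in the paper."""
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)

def quantize_ste(y, mean=None):
    """Straight-through rounding, optionally zero-centred around a predicted
    mean: forward pass rounds, backward pass treats rounding as identity."""
    if mean is not None:
        y = y - mean
    y_hat = y + (torch.round(y) - y).detach()
    if mean is not None:
        y_hat = y_hat + mean
    return y_hat
```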

UniQuadric: A SLAM Backend for Unknown Rigid Object 3D Tracking and Light-Weight Modeling

  • paper_url: http://arxiv.org/abs/2309.17036
  • repo_url: None
  • paper_authors: Linghao Yang, Yanmin Wu, Yu Deng, Rui Tian, Xinggang Hu, Tiefeng Ma
  • for: 这篇论文的目的是实现环境中未知静止物体的追踪和模型化,并且能够同时进行自我运动追踪和物体运动追踪。
  • methods: 这篇论文使用了一个新的SLAM后端,它结合了自我运动追踪、物体运动追踪和模型化,并且使用了一个新的像素级异步物体追踪器(AOT),让追踪器能够对于不同任务和提示下有效地追踪目标未知物体。
  • results: 这篇论文的结果显示,在实验和实际应用中,这个系统具有了前所未有的稳定性和精度,并且在追踪和模型化中具有了很高的效率和可靠性。
    Abstract Tracking and modeling unknown rigid objects in the environment play a crucial role in autonomous unmanned systems and virtual-real interactive applications. However, many existing Simultaneous Localization, Mapping and Moving Object Tracking (SLAMMOT) methods focus solely on estimating specific object poses and lack estimation of object scales and are unable to effectively track unknown objects. In this paper, we propose a novel SLAM backend that unifies ego-motion tracking, rigid object motion tracking, and modeling within a joint optimization framework. In the perception part, we designed a pixel-level asynchronous object tracker (AOT) based on the Segment Anything Model (SAM) and DeAOT, enabling the tracker to effectively track target unknown objects guided by various predefined tasks and prompts. In the modeling part, we present a novel object-centric quadric parameterization to unify both static and dynamic object initialization and optimization. Subsequently, in the part of object state estimation, we propose a tightly coupled optimization model for object pose and scale estimation, incorporating hybrids constraints into a novel dual sliding window optimization framework for joint estimation. To our knowledge, we are the first to tightly couple object pose tracking with light-weight modeling of dynamic and static objects using quadric. We conduct qualitative and quantitative experiments on simulation datasets and real-world datasets, demonstrating the state-of-the-art robustness and accuracy in motion estimation and modeling. Our system showcases the potential application of object perception in complex dynamic scenes.
    摘要 跟踪并建模环境中的未知刚体物体,在自主无人系统和虚实交互应用中发挥着关键作用。然而,许多现有的同时定位、建图与运动物体跟踪(SLAMMOT)方法仅关注特定物体位姿的估计,缺乏对物体尺度的估计,无法有效跟踪未知物体。在本文中,我们提出了一种新的SLAM后端,将自我运动跟踪、刚体物体运动跟踪与建模统一到一个联合优化框架中。在感知部分,我们基于Segment Anything Model(SAM)和DeAOT设计了一个像素级异步物体跟踪器(AOT),使跟踪器能够在各种预定义任务和提示的引导下有效跟踪目标未知物体。在建模部分,我们提出了一种新的以物体为中心的二次曲面参数化,统一静态与动态物体的初始化与优化。随后,在物体状态估计部分,我们提出了一种紧耦合的物体位姿与尺度估计优化模型,将混合约束纳入新的双滑动窗口优化框架进行联合估计。据我们所知,我们是第一个将物体位姿跟踪与基于二次曲面的动态和静态物体轻量建模紧耦合的工作。我们在仿真数据集和真实数据集上进行了定性和定量实验,展示了在运动估计与建模方面最先进的鲁棒性和精度。我们的系统展示了物体感知在复杂动态场景中的潜在应用。

Unveiling Document Structures with YOLOv5 Layout Detection

  • paper_url: http://arxiv.org/abs/2309.17033
  • repo_url: None
  • paper_authors: Herman Sugiharto, Yorissa Silviana, Yani Siti Nurpazrin
  • for: 本研究旨在快速识别文档布局并抽取非结构化数据。
  • methods: 本研究使用先进的计算机视觉模型 YOLOv5 进行文档布局识别和非结构化数据抽取。
  • results: YOLOv5 模型在文档布局识别任务中表现出众,精确率为 0.91,召回率为 0.971,F1 分数为 0.939,ROC曲线下面积(AUC-ROC)为 0.975。这个系统可以有效地提高非结构化数据抽取的效率。
    Abstract The current digital environment is characterized by the widespread presence of data, particularly unstructured data, which poses many issues in sectors including finance, healthcare, and education. Conventional techniques for data extraction encounter difficulties in dealing with the inherent variety and complexity of unstructured data, hence requiring the adoption of more efficient methodologies. This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data. The present study establishes a conceptual framework for delineating the notion of "objects" as they pertain to documents, incorporating various elements such as paragraphs, tables, photos, and other constituent parts. The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data, hence improving the effectiveness of data extraction. In the conducted examination, the YOLOv5 model exhibits notable effectiveness in the task of document layout identification, attaining a high accuracy rate along with a precision value of 0.91, a recall value of 0.971, an F1-score of 0.939, and an area under the receiver operating characteristic curve (AUC-ROC) of 0.975. The remarkable performance of this system optimizes the process of extracting textual and tabular data from document images. Its prospective applications are not limited to document analysis but can encompass unstructured data from diverse sources, such as audio data. This study lays the foundation for future investigations into the wider applicability of YOLOv5 in managing various types of unstructured data, offering potential for novel applications across multiple domains.
    摘要 当前的数字环境以数据(尤其是非结构化数据)的广泛存在为特征,这在金融、医疗和教育等领域带来诸多问题。传统的数据抽取技术难以应对非结构化数据固有的多样性和复杂性,因此需要采用更高效的方法。本研究探讨利用先进的计算机视觉模型 YOLOv5 快速识别文档布局并抽取非结构化数据。本研究建立了一个概念框架,用以界定文档中"对象"的概念,涵盖段落、表格、照片等组成部分。主要目标是构建一个能够有效识别文档布局并抽取非结构化数据的自动化系统,从而提高数据抽取的效率。在实验中,YOLOv5 模型在文档布局识别任务中表现出显著的有效性,达到了较高的准确率,精确率为 0.91,召回率为 0.971,F1 分数为 0.939,接收者操作特征曲线下面积(AUC-ROC)为 0.975。该系统的出色表现优化了从文档图像中抽取文本和表格数据的过程。其潜在应用不限于文档分析,还可涵盖音频等多种来源的非结构化数据。本研究为未来探索 YOLOv5 在管理各类非结构化数据方面的更广泛适用性奠定了基础,有望在多个领域催生新的应用。
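A minimal inference sketch of the kind of pipeline described above is shown below, loading a YOLOv5 checkpoint via torch.hub and iterating over detected layout regions. The checkpoint name layout_weights.pt and the input file are hypothetical placeholders; the class list and thresholds would come from the authors' training setup.

```python
import torch

# Load a YOLOv5 model through torch.hub; 'layout_weights.pt' stands for a
# checkpoint fine-tuned on document-layout classes (paragraph, table, photo, ...).
model = torch.hub.load("ultralytics/yolov5", "custom", path="layout_weights.pt")
model.conf = 0.25  # confidence threshold

results = model("scanned_page.png")        # run detection on a page image
detections = results.pandas().xyxy[0]      # columns: xmin, ymin, xmax, ymax, confidence, class, name

# Each detected region can then be cropped and handed to OCR or a table
# parser to pull out the unstructured content it contains.
for _, det in detections.iterrows():
    print(det["name"], det[["xmin", "ymin", "xmax", "ymax"]].tolist())
```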

HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World

  • paper_url: http://arxiv.org/abs/2309.17024
  • repo_url: None
  • paper_authors: Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, Marc Pollefeys
  • for: 这个研究的目的是开发一种可以与人类交互并协助完成物理世界任务的智能助手。
  • methods: 这个研究构建了一个大规模的第一人称人类交互数据集 HoloAssist,其中两个人协作完成物理操作任务。任务执行者佩戴混合现实头显完成任务,同时记录七路同步数据流;任务指导者实时观看执行者的第一人称视频并给出口头指导。
  • results: 通过为数据添加动作和对话标注,并观察不同参与者的丰富行为,我们提供了关键的洞察,包括人类指导者如何纠正错误、如何介入任务完成过程,以及如何将指令与环境相联系。HoloAssist 涵盖了由 350 对不同的指导者-执行者采集的 166 小时数据。此外,我们构建了错误检测、介入类型预测和手部预测等基准,并进行了详细分析。我们期望 HoloAssist 将成为构建能够在现实世界中与人类流畅协作的智能助手的重要资源。数据可以在 https://holoassist.github.io/ 下载。
    Abstract Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task instructor watches the performer's egocentric video in real time and guides them verbally. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/.
    摘要

Segment Anything Model is a Good Teacher for Local Feature Learning

  • paper_url: http://arxiv.org/abs/2309.16992
  • repo_url: https://github.com/vignywang/samfeat
  • paper_authors: Jingqian Wu, Rongtao Xu, Zach Wood-Doughty, Changwei Wang
  • for: 提高局部特征检测与描述的性能
  • methods: 使用 SAM 模型作为教师,通过 Pixel Semantic Relational Distillation (PSRD) 和 Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) 等技术进行局部特征的学习和描述
  • results: 在多个任务上达到了更高的性能,比如 HPatches 上的图像匹配和 Aachen Day-Night 上的长期视觉定位
    Abstract Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in "any scene" and "any downstream task". Data-driven local feature learning methods need to rely on pixel-level correspondence for training, which is challenging to acquire at scale, thus hindering further improvements in performance. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a fundamental model trained on 11 million images, as a teacher to guide local feature learning and thus inspire higher performance on limited datasets. To do so, first, we construct an auxiliary task of Pixel Semantic Relational Distillation (PSRD), which distillates feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat's performance on various tasks such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The release code is available at https://github.com/vignywang/SAMFeat.
    摘要 本文提出了一种名为SAMFeat的新方法,用于提高计算机视觉任务中的局部特征检测和描述。SAMFeat利用了一个名为SAM(Segment Anything Model)的基础模型,该模型在1100万张图像上进行训练,并且作为教师来引导局部特征学习,从而在有限数据集上取得更高的性能。为此,我们首先构建了一个辅助任务,即像素语义关系蒸馏(PSRD)任务,该任务将SAM编码器学到的与类别无关的语义信息所蕴含的特征关系蒸馏到局部特征学习网络中,利用语义判别改进局部特征描述。其次,我们开发了一种名为基于语义分组的弱监督对比学习(WSC)的技术,该技术利用SAM模型生成的语义分组作为弱监督信号,以优化局部特征描述子的度量空间。最后,我们设计了一种边缘注意力引导(EAG)技术,促使网络更加关注SAM引导的边缘区域,以提高局部特征检测和描述的准确率。SAMFeat在多个任务上,如HPatches图像匹配和Aachen Day-Night长期视觉定位,表现出了优于以往局部特征的性能。代码可以在https://github.com/vignywang/SAMFeat上下载。

Text-image Alignment for Diffusion-based Perception

  • paper_url: http://arxiv.org/abs/2310.00031
  • repo_url: None
  • paper_authors: Neehar Kondapaneni, Markus Marks, Manuel Knott, Rogério Guimarães, Pietro Perona
  • For: The paper is written for exploring the use of diffusion models for visual tasks and improving the perceptual performance of diffusion-based models.* Methods: The paper uses automatically generated captions to improve text-image alignment and enhance the cross-attention maps of the model, leading to better perceptual performance.* Results: The paper achieves state-of-the-art (SOTA) results in diffusion-based semantic segmentation on ADE20K and overall SOTA in depth estimation on NYUv2. The method also generalizes to the cross-domain setting and achieves SOTA results in object detection on Watercolor2K and segmentation on Dark Zurich-val and Nighttime Driving.
    Abstract Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However, the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically, it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps, leading to better perceptual performance. Our approach improves upon the current SOTA in diffusion-based semantic segmentation on ADE20K and the current overall SOTA in depth estimation on NYUv2. Furthermore, our method generalizes to the cross-domain setting; we use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our object detection model, trained on Pascal VOC, achieves SOTA results on Watercolor2K. Our segmentation method, trained on Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: https://www.vision.caltech.edu/tadp/
    摘要 扩散模型是一类具有出色文本到图像合成能力的生成模型,并促使了传统机器学习任务中的新一波创新方法。然而,如何利用这些生成模型的感知知识来解决视觉任务仍然是一个开放的问题。具体来说,在将扩散骨干应用于视觉任务时,如何使用提示接口尚不明确。我们发现,自动生成的图像描述可以改善文本-图像对齐,并使模型的交叉注意力图显著提高,从而提高模型的感知性能。我们的方法超过当前基于扩散的语义分割在ADE20K上的最先进水平,以及NYUv2深度估计上的总体最先进水平。此外,我们的方法可以推广到跨域设置:我们使用模型个性化和描述修改来使模型与目标域对齐,并在未对齐基线上获得改进。我们的目标检测模型,训练在Pascal VOC上,在Watercolor2K上达到了最先进结果。我们的分割方法,训练在Cityscapes上,在Dark Zurich-val和Nighttime Driving上达到了最先进结果。更多信息请参考我们的项目页面:https://www.vision.caltech.edu/tadp/

SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features

  • paper_url: http://arxiv.org/abs/2309.16987
  • repo_url: None
  • paper_authors: Song Wang, Zhu Wang, Can Li, Xiaojuan Qi, Hayden Kwok-Hay So
  • for: Event-based multi-object tracking (MOT) in real-world settings with complex background and camera motion.
  • methods: SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects, and a simultaneous object detector provides updated spatial information.
  • results: SpikeMOT achieves high tracking accuracy amidst challenging real-world scenarios, advancing the state-of-the-art in event-based multi-object tracking.
    Abstract In comparison to conventional RGB cameras, the superior temporal resolution of event cameras allows them to capture rich information between frames, making them prime candidates for object tracking. Yet in practice, despite their theoretical advantages, the body of work on event-based multi-object tracking (MOT) remains in its infancy, especially in real-world settings where events from complex background and camera motion can easily obscure the true target motion. In this work, an event-based multi-object tracker, called SpikeMOT, is presented to address these challenges. SpikeMOT leverages spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects. The resulting spike train representations are used to track the object movement at high frequency, while a simultaneous object detector provides updated spatial information of these objects at an equivalent frame rate. To evaluate the effectiveness of SpikeMOT, we introduce DSEC-MOT, the first large-scale event-based MOT benchmark incorporating fine-grained annotations for objects experiencing severe occlusions, frequent trajectory intersections, and long-term re-identification in real-world contexts. Extensive experiments employing DSEC-MOT and another event-based dataset, named FE240hz, demonstrate SpikeMOT's capability to achieve high tracking accuracy amidst challenging real-world scenarios, advancing the state-of-the-art in event-based multi-object tracking.
    摘要 与传统RGB相机相比,事件相机更高的时间分辨率使其能够捕捉帧间的丰富信息,使其成为目标跟踪的优良候选。然而,在实际应用中,尚未有充分的研究对基于事件的多目标跟踪(MOT)进行深入的探索,特别是在实际场景中,背景和相机运动产生的事件会轻易掩盖真实的目标运动。在这种情况下,一种基于事件的多目标跟踪器,称为SpikeMOT,被提出来解决这些挑战。SpikeMOT利用脉冲神经网络提取事件流中与目标相关的稀疏时空特征。这些脉冲序列表示被用于以高频率跟踪目标的运动,同时一个同步的目标检测器以相应的帧率提供这些目标更新的空间信息。为了评估SpikeMOT的效果,我们提出了DSEC-MOT,这是首个包含细粒度标注的基于事件的MOT基准,这些标注涵盖目标受到严重遮挡、频繁的轨迹交叉以及真实场景中的长期重识别。通过使用DSEC-MOT和另一个基于事件的数据集FE240hz,我们进行了广泛的实验,证明SpikeMOT在真实世界场景中实现高跟踪精度,从而推进了基于事件的多目标跟踪的最新水平。

Perceptual Tone Mapping Model for High Dynamic Range Imaging

  • paper_url: http://arxiv.org/abs/2309.16975
  • repo_url: None
  • paper_authors: Imran Mehmood, Xinye Shi, M. Usman Khan, Ming Ronnier Luo
  • for: 该论文旨在解决高Dynamic Range(HDR)图像到标准动态范围(SDR)显示器的显示问题,保持HDR图像的感知质量。
  • methods: 该论文使用了CIECAMA16感知颜色属性,包括亮度、颜色彩度和色调,来创建一种基于感知属性的滤镜操作(TMO)。
  • results: 对比和主观评估表明,该模型在对比、颜色彩度和整体图像质量方面表现出色,超过了现有的TMO。
    Abstract One of the key challenges in tone mapping is to preserve the perceptual quality of high dynamic range (HDR) images when mapping them to standard dynamic range (SDR) displays. Traditional tone mapping operators (TMOs) compress the luminance of HDR images without considering the surround and display conditions emanating into suboptimal results. Current research addresses this challenge by incorporating perceptual color appearance attributes. In this work, we propose a TMO (TMOz) that leverages CIECAM16 perceptual attributes, i.e., brightness, colorfulness, and hue. TMOz accounts for the effects of both the surround and the display conditions to achieve more optimal colorfulness reproduction. The perceptual brightness is compressed, and the perceptual color scales, i.e., colorfulness and hue are derived from HDR images by employing CIECAM16 color adaptation equations. A psychophysical experiment was conducted to automate the brightness compression parameter. The model employs fully automatic and adaptive approach, obviating the requirement for manual parameter selection. TMOz was evaluated in terms of contrast, colorfulness and overall image quality. The objective and subjective evaluation methods revealed that the proposed model outperformed the state-of-the-art TMOs.
    摘要 色调映射的一个关键挑战是在将高动态范围(HDR)图像映射到标准动态范围(SDR)显示器时保持其感知质量。传统的色调映射算子(TMO)在压缩HDR图像的亮度时没有考虑周围环境和显示条件,导致次优的结果。当前的研究通过引入感知色貌属性来解决这个挑战。在这项工作中,我们提出了一种基于CIECAM16感知属性(即亮度、色彩鲜艳度和色调)的TMO(TMOz)。TMOz考虑了周围环境和显示条件的影响,以实现更优的色彩鲜艳度再现。感知亮度被压缩,而感知色彩尺度(即色彩鲜艳度和色调)则通过CIECAM16色适应方程从HDR图像中导出。我们进行了一次心理物理实验,以自动确定亮度压缩参数。该模型采用完全自动且自适应的方法,不需要手动选择参数。我们从对比度、色彩鲜艳度和整体图像质量三方面对TMOz进行了评估。客观和主观评估方法表明,所提模型优于现有最先进的TMO。
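As a rough point of reference for how a global tone-mapping operator compresses brightness while trying to preserve colour, here is a generic luminance-compression sketch in NumPy. It is emphatically not the paper's TMOz, which works in CIECAM16 brightness, colourfulness and hue and accounts for surround and display conditions; the power curve and the colour-ratio trick below are simplifying assumptions.

```python
import numpy as np

def simple_tmo(hdr_rgb, alpha=0.6, saturation=1.0):
    """Compress a luminance channel with a power curve and re-attach
    per-channel colour ratios, so brightness is reduced while the colour
    appearance is kept as stable as possible.

    hdr_rgb: float array (H, W, 3) of linear HDR values > 0.
    """
    lum = 0.2126 * hdr_rgb[..., 0] + 0.7152 * hdr_rgb[..., 1] + 0.0722 * hdr_rgb[..., 2]
    lum = np.maximum(lum, 1e-6)

    lum_compressed = (lum / lum.max()) ** alpha           # brightness compression
    ratio = (hdr_rgb / lum[..., None]) ** saturation      # colour ratios
    return np.clip(ratio * lum_compressed[..., None], 0.0, 1.0)
```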

Synthetic Data Generation and Deep Learning for the Topological Analysis of 3D Data

  • paper_url: http://arxiv.org/abs/2309.16968
  • repo_url: None
  • paper_authors: Dylan Peek, Matt P. Skerritt, Stephan Chalup
  • for: 这个研究使用深度学习来估算由稀疏、无序点云场景表示的三维拓扑结构。
  • methods: 研究使用了新的标注数据集来训练神经网络并评估它们估算这些流形的 genus(亏格)的能力。这些数据使用了随机的同胚形变,以促使网络学习可见的拓扑特征。
  • results: 研究表明,深度学习模型可以提取这些特征,并且与基于持续同调(persistent homology)的现有拓扑数据分析工具相比,具有一些优点。此外,研究还使用了语义分割来提供更多的几何信息,并与拓扑标签相结合。
    Abstract This research uses deep learning to estimate the topology of manifolds represented by sparse, unordered point cloud scenes in 3D. A new labelled dataset was synthesised to train neural networks and evaluate their ability to estimate the genus of these manifolds. This data used random homeomorphic deformations to provoke the learning of visual topological features. We demonstrate that deep learning models could extract these features and discuss some advantages over existing topological data analysis tools that are based on persistent homology. Semantic segmentation was used to provide additional geometric information in conjunction with topological labels. Common point cloud multi-layer perceptron and transformer networks were both used to compare the viability of these methods. The experimental results of this pilot study support the hypothesis that, with the aid of sophisticated synthetic data generation, neural networks can perform segmentation-based topological data analysis. While our study focused on simulated data, the accuracy achieved suggests a potential for future applications using real data.
    摘要

nnSAM: Plug-and-play Segment Anything Model Improves nnUNet Performance

  • paper_url: http://arxiv.org/abs/2309.16967
  • repo_url: https://github.com/kent0n-li/medical-image-segmentation
  • paper_authors: Yunxiang Li, Bowen Jing, Zihan Li, Jing Wang, You Zhang
  • for: 这个研究的目的是提出一个能够整合基础模型和专业化神经网络的新方法,以提高医疗影像分割的精度和可靠性。
  • methods: 这个方法使用了Segment Anything Model (SAM)和nnUNet两种神经网络,协同整合它们以实现更高精度和更好的适应能力。
  • results: 实验结果显示,这个方法可以在不同的训练数据大小下进行少样本学习,并且可以在医疗影像分割中实现更高的精度和可靠性。
    Abstract The recent developments of foundation models in computer vision, especially the Segment Anything Model (SAM), allow scalable and domain-agnostic image segmentation to serve as a general-purpose segmentation tool. In parallel, the field of medical image segmentation has benefited significantly from specialized neural networks like the nnUNet, which is trained on domain-specific datasets and can automatically configure the network to tailor to specific segmentation challenges. To combine the advantages of foundation models and domain-specific models, we present nnSAM, which synergistically integrates the SAM model with the nnUNet model to achieve more accurate and robust medical image segmentation. The nnSAM model leverages the powerful and robust feature extraction capabilities of SAM, while harnessing the automatic configuration capabilities of nnUNet to promote dataset-tailored learning. Our comprehensive evaluation of nnSAM model on different sizes of training samples shows that it allows few-shot learning, which is highly relevant for medical image segmentation where high-quality, annotated data can be scarce and costly to obtain. By melding the strengths of both its predecessors, nnSAM positions itself as a potential new benchmark in medical image segmentation, offering a tool that combines broad applicability with specialized efficiency. The code is available at https://github.com/Kent0n-Li/Medical-Image-Segmentation.
    摘要 近年来,计算机视觉领域内的基础模型,特别是Segment Anything Model(SAM),实现了可扩展且与领域无关的图像分割,成为一种通用的图像分割工具。同时,医疗图像分割领域也显著受益于nnUNet等专门的神经网络,这类网络在特定领域数据集上训练,并能自动配置网络以适应特定的分割挑战。为了结合基础模型和领域专用模型的优点,我们提出了nnSAM模型,它将SAM模型与nnUNet模型协同集成,以实现更高精度和更加稳定的医疗图像分割。nnSAM模型利用SAM模型强大而稳健的特征提取能力,同时利用nnUNet模型的自动配置能力,促进针对特定数据集的学习。我们对不同训练样本规模下的nnSAM模型进行了全面的评估,发现它具有少样本学习能力,这对于高质量标注数据稀缺且获取成本高昂的医疗图像分割而言非常有价值。通过融合两者的优点,nnSAM模型有望成为医疗图像分割的新基准,提供一种兼具广泛适用性与专业效率的工具。代码可以在https://github.com/Kent0n-Li/Medical-Image-Segmentation中找到。

AdaPose: Towards Cross-Site Device-Free Human Pose Estimation with Commodity WiFi

  • paper_url: http://arxiv.org/abs/2309.16964
  • repo_url: None
  • paper_authors: Yunjiao Zhou, Jianfei Yang, He Huang, Lihua Xie
  • for: WiFi-based pose estimation 技术的发展,特别是在智能家居和元宇宙人物生成方面。
  • methods: 提出了一种适应领域的 pose estimation 算法,名为 AdaPose,用于弱监督的 WiFi CSI poses 估计。
  • results: 实验结果表明,AdaPose 能够有效地消除领域差异,从而推动 WiFi-based pose estimation 技术在智能城市中的普及应用。
    Abstract WiFi-based pose estimation is a technology with great potential for the development of smart homes and metaverse avatar generation. However, current WiFi-based pose estimation methods are predominantly evaluated under controlled laboratory conditions with sophisticated vision models to acquire accurately labeled data. Furthermore, WiFi CSI is highly sensitive to environmental variables, and direct application of a pre-trained model to a new environment may yield suboptimal results due to domain shift. In this paper, we proposes a domain adaptation algorithm, AdaPose, designed specifically for weakly-supervised WiFi-based pose estimation. The proposed method aims to identify consistent human poses that are highly resistant to environmental dynamics. To achieve this goal, we introduce a Mapping Consistency Loss that aligns the domain discrepancy of source and target domains based on inner consistency between input and output at the mapping level. We conduct extensive experiments on domain adaptation in two different scenes using our self-collected pose estimation dataset containing WiFi CSI frames. The results demonstrate the effectiveness and robustness of AdaPose in eliminating domain shift, thereby facilitating the widespread application of WiFi-based pose estimation in smart cities.
    摘要

COMNet: Co-Occurrent Matching for Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.16959
  • repo_url: None
  • paper_authors: Yukun Su, Jingliang Deng, Zonghan Li
  • for: 提高图像水平的弱监督semantic segmentation的质量
  • methods: 提出了一种名为Co-Occurrent Matching Network(COMNet)的新网络,通过在包含共同类别的图像对之间进行图像间匹配以增强相对应的区域,并在单个图像内部进行匹配以在目标区域间传播语义特征。
  • results: 在Pascal VOC 2012和MS-COCO数据集上,我们的网络可以有效地提高基线模型的性能,并达到新的最先进性能。
    Abstract Image-level weakly supervised semantic segmentation is a challenging task that has been deeply studied in recent years. Most of the common solutions exploit class activation map (CAM) to locate object regions. However, such response maps generated by the classification network usually focus on discriminative object parts. In this paper, we propose a novel Co-Occurrent Matching Network (COMNet), which can promote the quality of the CAMs and enforce the network to pay attention to the entire parts of objects. Specifically, we perform inter-matching on paired images that contain common classes to enhance the corresponded areas, and construct intra-matching on a single image to propagate the semantic features across the object regions. The experiments on the Pascal VOC 2012 and MS-COCO datasets show that our network can effectively boost the performance of the baseline model and achieve new state-of-the-art performance.
    摘要 图像级弱监督语义分割是一项具有挑战性的任务,近年来得到了广泛研究。大多数常见的解决方案利用类激活图(CAM)来定位对象区域。然而,由分类网络生成的响应图通常只会关注具有判别性的对象部分。在这篇论文中,我们提出了一种新的协同出现匹配网络(COMNet),可以提高CAM的质量并使网络关注对象的整个部分。具体来说,我们在含共同类别的图像对之间进行图像间匹配以增强相对应的区域,并在单个图像上进行图像内匹配以在对象区域中传播语义特征。实验表明,我们的网络可以有效提高基线模型的性能,并实现新的最先进性能。
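For context on the starting point that COMNet refines, the snippet below computes a standard class activation map (CAM) from the last convolutional features and the classifier weights; this is the kind of response map that tends to highlight only the most discriminative object parts. The tensor shapes and normalisation are conventional choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weight, class_idx, out_size):
    """Compute a CAM by weighting the last conv features with the classifier
    weights of one class, then normalising and upsampling to image size.

    feature_maps: (C, h, w) activations from the last convolutional layer
    fc_weight:    (num_classes, C) weights of the final linear classifier
    """
    w = fc_weight[class_idx]                                   # (C,)
    cam = torch.einsum("c,chw->hw", w, feature_maps)           # channel-weighted sum
    cam = F.relu(cam)
    cam = cam / (cam.max() + 1e-8)                             # normalise to [0, 1]
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return cam
```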

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

  • paper_url: http://arxiv.org/abs/2309.16956
  • repo_url: None
  • paper_authors: Runnan Chen, Xinge Zhu, Nenglun Chen, Dawei Wang, Wei Li, Yuexin Ma, Ruigang Yang, Tongliang Liu, Wenping Wang
  • for: 本研究目的是学习免需大量标注点云的3D场景识别,通过从计算机支持设计(CAD)模型和语言学习3D场景表示。
  • methods: 我们提出了Model2Scene模型,它首先通过混合数据增强后的CAD模型来模拟拥挤场景,然后使用深度凸包正则化(DCR)将点特征投影到统一的凸包空间以减少领域差距,最后通过对语言嵌入与CAD模型的点特征施加对比损失来预训练3D网络。
  • results: 我们的Model2Scene模型可以在无标签3D物体显著性检测、标注高效的3D场景感知和零样本3D语义分割等下游任务中提供优秀表现,特别是在ScanNet和S3DIS数据集上的无标签3D物体显著性检测中分别得到了46.08%和55.49%的平均精度(mAP)。
    Abstract Current successful methods of 3D scene perception rely on the large-scale annotated point cloud, which is tedious and expensive to acquire. In this paper, we propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages. The main challenges are the domain gaps between the CAD models and the real scene's objects, including model-to-scene (from a single model to the scene) and synthetic-to-real (from synthetic model to real scene's object). To handle the above challenges, Model2Scene first simulates a crowded scene by mixing data-augmented CAD models. Next, we propose a novel feature regularization operation, termed Deep Convex-hull Regularization (DCR), to project point features into a unified convex hull space, reducing the domain gap. Ultimately, we impose contrastive loss on language embedding and the point features of CAD models to pre-train the 3D network. Extensive experiments verify the learned 3D scene representation is beneficial for various downstream tasks, including label-free 3D object salient detection, label-efficient 3D scene perception and zero-shot 3D semantic segmentation. Notably, Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08\% and 55.49\% on the ScanNet and S3DIS datasets, respectively. The code will be publicly available.
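
As a rough illustration of the language-point pre-training step described above (the scene mixing and DCR steps are omitted), the snippet below sketches a contrastive loss that pulls each CAD object's pooled point feature toward the text embedding of its category. All names, dimensions, and the cross-entropy formulation are assumptions for illustration, not the paper's exact loss.

```python
# Hedged sketch of language-to-point contrastive alignment; not Model2Scene's exact loss.
import torch
import torch.nn.functional as F

def language_point_contrastive_loss(point_feats, text_embeds, labels, temperature=0.07):
    """point_feats: (N, D) pooled features of N CAD objects.
    text_embeds:    (K, D) embeddings of the K category names from a text encoder.
    labels:         (N,)   category index of each object."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = p @ t.t() / temperature        # similarity of every object to every category prompt
    return F.cross_entropy(logits, labels)  # pull each object toward its own category text

# Toy usage with random tensors standing in for the point and text encoders.
point_feats = torch.randn(8, 512, requires_grad=True)
loss = language_point_contrastive_loss(point_feats, torch.randn(20, 512), torch.randint(0, 20, (8,)))
loss.backward()
print(float(loss))
```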

CrossZoom: Simultaneously Motion Deblurring and Event Super-Resolving

  • paper_url: http://arxiv.org/abs/2309.16949
  • repo_url: https://github.com/bestrivenzc/CZ-Net
  • paper_authors: Chi Zhang, Xiang Zhang, Mingyuan Lin, Cheng Li, Chu He, Wen Yang, Gui-Song Xia, Lei Yu
  • for: This paper aims to improve the performance of frame-event based vision applications by bridging the resolution gap between traditional and neuromorphic event cameras.
  • methods: The proposed method, called CrossZoom, uses a novel unified neural network (CZ-Net) to jointly recover sharp latent sequences and high-resolution events from blurry inputs. The method leverages scale-variant properties and effectively fuses cross-modality information to achieve cross-enhancement (see the sketch below).
  • results: Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed method, which improves the temporal resolution of images and the spatial resolution of events, leading to better performance in frame-event based vision applications.
    Abstract Even though the collaboration between traditional and neuromorphic event cameras brings prosperity to frame-event based vision applications, the performance is still confined by the resolution gap crossing two modalities in both spatial and temporal domains. This paper is devoted to bridging the gap by increasing the temporal resolution for images, i.e., motion deblurring, and the spatial resolution for events, i.e., event super-resolving, respectively. To this end, we introduce CrossZoom, a novel unified neural Network (CZ-Net) to jointly recover sharp latent sequences within the exposure period of a blurry input and the corresponding High-Resolution (HR) events. Specifically, we present a multi-scale blur-event fusion architecture that leverages the scale-variant properties and effectively fuses cross-modality information to achieve cross-enhancement. Attention-based adaptive enhancement and cross-interaction prediction modules are devised to alleviate the distortions inherent in Low-Resolution (LR) events and enhance the final results through the prior blur-event complementary information. Furthermore, we propose a new dataset containing HR sharp-blurry images and the corresponding HR-LR event streams to facilitate future research. Extensive qualitative and quantitative experiments on synthetic and real-world datasets demonstrate the effectiveness and robustness of the proposed method. Codes and datasets are released at https://bestrivenzc.github.io/CZ-Net/.
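
The snippet below is a much-simplified sketch of the kind of multi-scale blur-event fusion the abstract describes: at each scale, an attention gate decides how much event information to inject into the image features before mixing them. Channel sizes, the gating scheme, and the module name are hypothetical; CZ-Net's actual attention and cross-interaction modules are more elaborate.

```python
# Hypothetical multi-scale blur-event fusion block; not the CZ-Net implementation.
import torch
import torch.nn as nn

class BlurEventFusion(nn.Module):
    """Fuse image-branch and event-branch features at several scales with a channel gate."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.Sigmoid()) for c in channels)
        self.mixers = nn.ModuleList(nn.Conv2d(2 * c, c, 3, padding=1) for c in channels)

    def forward(self, img_feats, evt_feats):
        fused = []
        for gate, mix, fi, fe in zip(self.gates, self.mixers, img_feats, evt_feats):
            attn = gate(torch.cat([fi, fe], dim=1))      # per-pixel gate for the event features
            fused.append(mix(torch.cat([fi, attn * fe], dim=1)))
        return fused

# Feature pyramids from the blurry image and the event stream at 1x, 1/2x and 1/4x resolution.
img_feats = [torch.randn(1, c, s, s) for c, s in ((32, 64), (64, 32), (128, 16))]
evt_feats = [torch.randn(1, c, s, s) for c, s in ((32, 64), (64, 32), (128, 16))]
print([f.shape for f in BlurEventFusion()(img_feats, evt_feats)])
```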

Incremental Rotation Averaging Revisited and More: A New Rotation Averaging Benchmark

  • paper_url: http://arxiv.org/abs/2309.16924
  • repo_url: None
  • paper_authors: Xiang Gao, Hainan Cui, Shuhan Shen
  • for: Improving the accuracy and robustness of incremental parameter estimation-based rotation averaging methods.
  • methods: Introduces IRAv4, whose key feature is a task-specific connected dominating set extracted from the view graph to serve as a more reliable and accurate reference for rotation global alignment (see the sketch below); a new COLMAP-based rotation averaging benchmark with a cross check against Bundler is also presented.
  • results: Comprehensive comparisons between IRAv4 and other mainstream rotation averaging methods on the new benchmark demonstrate the effectiveness of the proposed approach.
    Abstract In order to further advance the accuracy and robustness of incremental parameter estimation-based rotation averaging methods, this paper introduces a new member of the Incremental Rotation Averaging (IRA) family, termed IRAv4. As its most significant feature, IRAv4 extracts a task-specific connected dominating set to serve as a more reliable and accurate reference for rotation global alignment. In addition, to address the limitations of the existing rotation averaging benchmark, which relies on the slightly outdated Bundler camera calibration results as ground truth and focuses solely on rotation estimation accuracy, this paper presents a new COLMAP-based rotation averaging benchmark that incorporates a cross check between COLMAP and Bundler and employs the accuracy of both rotation and downstream location estimation as evaluation metrics, aiming to provide a more reliable and comprehensive evaluation tool for rotation averaging research. Comprehensive comparisons between the proposed IRAv4 and other mainstream rotation averaging methods on this new benchmark demonstrate the effectiveness of our proposed approach.
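
To make the "connected dominating set as reference" idea concrete, below is a generic greedy CDS construction over a toy view graph (each node is an image, each edge a measured relative rotation). This is a textbook-style heuristic under the assumption of a connected graph; IRAv4's task-specific selection rule is not reproduced here.

```python
# Generic greedy connected-dominating-set heuristic; not IRAv4's task-specific extraction.
def greedy_connected_dominating_set(adj):
    """adj: dict mapping each view id to the set of views it shares a relative rotation with.
    Assumes the view graph is connected."""
    cds, dominated = [], set()
    frontier = {max(adj, key=lambda v: len(adj[v]))}     # seed with the highest-degree view
    while dominated | set(cds) != set(adj):
        # Greedily add the frontier view that dominates the most uncovered views.
        best = max(frontier, key=lambda v: len(adj[v] - dominated - set(cds)))
        cds.append(best)
        dominated |= adj[best]
        frontier = (frontier | adj[best]) - set(cds)
    return cds

# Toy view graph: 5 images, edges where relative rotations were estimated.
view_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 4}, 3: {1, 4}, 4: {2, 3}}
print(greedy_connected_dominating_set(view_graph))       # e.g. [1, 2]
```

Global rotations would then be solved on this reference set first and the remaining views aligned against it, which is the role the connected dominating set plays in IRAv4 according to the abstract.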

YOLOR-Based Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2309.16921
  • repo_url: https://github.com/WongKinYiu/yolor
  • paper_authors: Hung-Shuo Chang, Chien-Yao Wang, Richard Robert Wang, Gene Chou, Hong-Yuan Mark Liao
  • for: Learning multiple tasks with a single model while jointly improving generalization and shared semantics across all of them.
  • methods: Builds on the You Only Learn One Representation (YOLOR) network architecture, which combines explicit knowledge from data observations with implicit knowledge from learned latents to improve a shared representation while keeping the number of training parameters low (see the sketch below).
  • results: The method jointly handles object detection, instance segmentation, semantic segmentation, and image captioning with competitive performance on all tasks, a low parameter count, and no pre-training.
    Abstract Multi-task learning (MTL) aims to learn multiple tasks using a single model and jointly improve all of them assuming generalization and shared semantics. Reducing conflicts between tasks during joint learning is difficult and generally requires careful network design and extremely large models. We propose building on You Only Learn One Representation (YOLOR), a network architecture specifically designed for multitasking. YOLOR leverages both explicit and implicit knowledge, from data observations and learned latents, respectively, to improve a shared representation while minimizing the number of training parameters. However, YOLOR and its follow-up, YOLOv7, only trained two tasks at once. In this paper, we jointly train object detection, instance segmentation, semantic segmentation, and image captioning. We analyze tradeoffs and attempt to maximize sharing of semantic information. Through our architecture and training strategies, we find that our method achieves competitive performance on all tasks while maintaining a low parameter count and without any pre-training. We will release code soon.
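
The snippet below sketches the general pattern summarized above: one shared backbone representation, a learned "implicit knowledge" vector per task, and lightweight task heads. Everything here (module names, sizes, the use of plain convolutions as heads) is a hypothetical stand-in, far simpler than the actual YOLOR/YOLOv7 networks.

```python
# Hypothetical shared-representation multi-task skeleton; not the YOLOR codebase.
import torch
import torch.nn as nn

class SharedMultiTaskNet(nn.Module):
    def __init__(self, tasks=("detect", "inst_seg", "sem_seg", "caption"), dim=256):
        super().__init__()
        self.backbone = nn.Sequential(                     # stand-in for the real backbone
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.SiLU())
        # One learned "implicit knowledge" vector per task, added to the shared features.
        self.implicit = nn.ParameterDict({t: nn.Parameter(torch.zeros(1, dim, 1, 1)) for t in tasks})
        self.heads = nn.ModuleDict({t: nn.Conv2d(dim, 32, 1) for t in tasks})   # toy task heads

    def forward(self, x):
        shared = self.backbone(x)                          # explicit knowledge from the observation
        return {t: head(shared + self.implicit[t]) for t, head in self.heads.items()}

outputs = SharedMultiTaskNet()(torch.randn(1, 3, 64, 64))
print({t: tuple(o.shape) for t, o in outputs.items()})
```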

Investigating Shift Equivalence of Convolutional Neural Networks in Industrial Defect Segmentation

  • paper_url: http://arxiv.org/abs/2309.16902
  • repo_url: https://github.com/xiaozhen228/caps
  • paper_authors: Zhen Qu, Xian Tao, Fei Shen, Zhengtao Zhang, Tao Li
  • for: Addressing the often-overlooked lack of output consistency (shift equivalence) of models in industrial defect segmentation.
  • methods: Proposes a new pair of down/upsampling layers, component attention polyphase sampling (CAPS), as a replacement for the conventional sampling layers in CNNs; an adaptive windowing module filters out image border pixels to cope with boundary variations, and a component attention module fuses all downsampled features to improve segmentation performance (see the sketch below).
  • results: Experiments on the micro surface defect (MSD) dataset and four real-world industrial defect datasets show higher output equivalence and segmentation performance than state-of-the-art methods.
    Abstract In industrial defect segmentation tasks, while pixel accuracy and Intersection over Union (IoU) are commonly employed metrics to assess segmentation performance, the output consistency (also referred to as equivalence) of the model is often overlooked. Even a small shift in the input image can yield significant fluctuations in the segmentation results. Existing methodologies primarily focus on data augmentation or anti-aliasing to enhance the network's robustness against translational transformations, but their shift equivalence performs poorly on the test set or is susceptible to nonlinear activation functions. Additionally, the variations in boundaries resulting from the translation of input images are consistently disregarded, thus imposing further limitations on the shift equivalence. In response to this particular challenge, a novel pair of down/upsampling layers called component attention polyphase sampling (CAPS) is proposed as a replacement for the conventional sampling layers in CNNs. To mitigate the effect of image boundary variations on the equivalence, an adaptive windowing module is designed in CAPS to adaptively filter out the border pixels of the image. Furthermore, a component attention module is proposed to fuse all downsampled features to improve the segmentation performance. The experimental results on the micro surface defect (MSD) dataset and four real-world industrial defect datasets demonstrate that the proposed method exhibits higher equivalence and segmentation performance compared to other state-of-the-art methods. Our code will be available at https://github.com/xiaozhen228/CAPS.
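
As a simplified illustration of the sampling idea described above, the snippet below replaces plain strided downsampling with an attention-weighted mix of all four polyphase components, which is less sensitive to one-pixel shifts of the input. It is a rough, assumed interpretation of the CAPS idea; the adaptive windowing module and the paper's exact attention design are not reproduced.

```python
# Hypothetical attention-weighted polyphase downsampling; not the authors' CAPS module.
import torch
import torch.nn as nn

class PolyphaseAttentionDown(nn.Module):
    """Stride-2 downsampling that mixes all four polyphase components of the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)      # scores each component from its pooled features

    def forward(self, x):                        # x: (B, C, H, W) with even H and W
        comps = torch.stack([x[:, :, i::2, j::2] for i in (0, 1) for j in (0, 1)], dim=1)
        pooled = comps.mean(dim=(-2, -1))                        # (B, 4, C)
        weights = torch.softmax(self.score(pooled), dim=1)       # (B, 4, 1)
        return (weights[..., None, None] * comps).sum(dim=1)     # (B, C, H/2, W/2)

down = PolyphaseAttentionDown(16)
x = torch.randn(1, 16, 32, 32)
# A one-pixel circular shift permutes the polyphase components; mixing all of them keeps the
# downsampled output more stable than plain strided sampling, which drops every other column.
print(down(x).shape, down(torch.roll(x, shifts=1, dims=-1)).shape)
```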