cs.CV - 2023-07-11

On the Vulnerability of DeepFake Detectors to Attacks Generated by Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.05397
  • repo_url: None
  • paper_authors: Marija Ivanovska, Vitomir Štruc
  • for: This study probes the vulnerability of malicious-Deepfake detection, to ensure detectors can catch image manipulations produced by the latest generation of generative models.
  • methods: Single-image Deepfake detectors are evaluated on the FaceForensics++ dataset, which contains Deepfakes generated with various face-swapping and face-reenactment techniques.
  • results: Reconstructing existing Deepfakes with a single denoising diffusion step significantly decreases the accuracy of all tested detectors, without introducing visually perceptible image changes.
    Abstract The detection of malicious Deepfakes is a constantly evolving problem, that requires continuous monitoring of detectors, to ensure they are able to detect image manipulations generated by the latest emerging models. In this paper, we present a preliminary study that investigates the vulnerability of single-image Deepfake detectors to attacks created by a representative of the newest generation of generative methods, i.e. Denoising Diffusion Models (DDMs). Our experiments are run on FaceForensics++, a commonly used benchmark dataset, consisting of Deepfakes generated with various techniques for face swapping and face reenactment. The analysis shows, that reconstructing existing Deepfakes with only one denoising diffusion step significantly decreases the accuracy of all tested detectors, without introducing visually perceptible image changes.
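As a rough illustration of the attack the abstract describes (lightly noising a Deepfake and reconstructing it with a single denoising diffusion step), the sketch below applies one forward-noising step and one reverse step with a generic noise-prediction network. The `denoiser` callable and the schedule coefficient `alpha_bar_t` are placeholders, not the authors' model or settings.

```python
import torch

def one_step_ddm_reconstruction(x, denoiser, alpha_bar_t=0.98):
    """Reconstruct an image with a single denoising diffusion step.

    x          : (B, 3, H, W) Deepfake image scaled to [-1, 1]
    denoiser   : a noise-prediction network eps_theta(x_t, t) (assumed pretrained)
    alpha_bar_t: cumulative noise-schedule coefficient for the chosen small timestep
    """
    t = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)  # a small/early timestep
    noise = torch.randn_like(x)
    a = alpha_bar_t ** 0.5
    b = (1 - alpha_bar_t) ** 0.5
    # Forward diffusion: perturb the input with one light noising step.
    x_t = a * x + b * noise
    # Reverse step: predict the noise and recover an estimate of the clean image.
    eps = denoiser(x_t, t)
    x0_hat = (x_t - b * eps) / a
    return x0_hat.clamp(-1, 1)
```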

Self-supervised adversarial masking for 3D point cloud representation learning

  • paper_url: http://arxiv.org/abs/2307.05325
  • repo_url: https://github.com/szacho/pointcam
  • paper_authors: Michał Szachniewicz, Wojciech Kozłowski, Michał Stypułkowski, Maciej Zięba
  • for: Self-supervised learning of deep representations for 3D point cloud data.
  • methods: Proposes a novel adversarial approach that improves self-supervised learning by learning a masking function for point clouds, instead of masking inputs randomly.
  • results: Evaluated on multiple downstream tasks, the learned masking function achieves state-of-the-art or competitive performance.
    Abstract Self-supervised methods have been proven effective for learning deep representations of 3D point cloud data. Although recent methods in this domain often rely on random masking of inputs, the results of this approach can be improved. We introduce PointCAM, a novel adversarial method for learning a masking function for point clouds. Our model utilizes a self-distillation framework with an online tokenizer for 3D point clouds. Compared to previous techniques that optimize patch-level and object-level objectives, we postulate applying an auxiliary network that learns how to select masks instead of choosing them randomly. Our results show that the learned masking function achieves state-of-the-art or competitive performance on various downstream tasks. The source code is available at https://github.com/szacho/pointcam.
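A minimal sketch of the core idea of replacing random masking with a learned masking function: a tiny auxiliary network scores patch features and the top-scoring patches are masked. The layer sizes, the mask ratio, and the `encoder_patches` step referenced in the comment are illustrative assumptions, not the PointCAM architecture.

```python
import torch
import torch.nn as nn

class MaskScorer(nn.Module):
    """Tiny auxiliary network that scores point-cloud patches for masking."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, patch_feats, mask_ratio=0.6):
        # patch_feats: (B, N, C) features of N patches per point cloud
        scores = self.mlp(patch_feats).squeeze(-1)            # (B, N)
        k = int(mask_ratio * patch_feats.shape[1])
        # Mask the k highest-scoring patches instead of choosing them at random.
        masked_idx = scores.topk(k, dim=1).indices            # (B, k)
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(1, masked_idx, True)
        return mask, scores

# Usage (hypothetical): feats = encoder_patches(points); mask, _ = MaskScorer()(feats)
```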

Class Instance Balanced Learning for Long-Tailed Classification

  • paper_url: http://arxiv.org/abs/2307.05322
  • repo_url: None
  • paper_authors: Marc-Antoine Lavoie, Steven Waslander
  • for: Improving the performance of deep neural networks on long-tailed image classification, where class frequencies in the training data are highly imbalanced.
  • methods: Proposes a novel class instance balanced loss (CIBL) that reweights the relative contributions of a cross-entropy loss and a contrastive loss according to the frequency of class instances in each training batch.
  • results: CIBL improves long-tailed classification performance and lets the user skew performance towards head or tail classes as desired; replacing the linear classifier head with a cosine classifier reaches similar performance in substantially fewer training epochs.
    Abstract The long-tailed image classification task remains important in the development of deep neural networks as it explicitly deals with large imbalances in the class frequencies of the training data. While uncommon in engineered datasets, this imbalance is almost always present in real-world data. Previous approaches have shown that combining cross-entropy and contrastive learning can improve performance on the long-tailed task, but they do not explore the tradeoff between head and tail classes. We propose a novel class instance balanced loss (CIBL), which reweights the relative contributions of a cross-entropy and a contrastive loss as a function of the frequency of class instances in the training batch. This balancing favours the contrastive loss for more common classes, leading to a learned classifier with a more balanced performance across all class frequencies. Furthermore, increasing the relative weight on the contrastive head shifts performance from common (head) to rare (tail) classes, allowing the user to skew the performance towards these classes if desired. We also show that changing the linear classifier head with a cosine classifier yields a network that can be trained to similar performance in substantially fewer epochs. We obtain competitive results on both CIFAR-100-LT and ImageNet-LT.
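The abstract's key idea, reweighting cross-entropy against a contrastive loss as a function of in-batch class-instance frequency, can be sketched as below. The specific weighting rule and the `contrastive_loss_fn` interface (assumed to return per-sample losses) are assumptions, not the paper's exact CIBL formula.

```python
import torch
import torch.nn.functional as F

def class_instance_balanced_loss(logits, features, labels, contrastive_loss_fn, gamma=1.0):
    """Blend CE and a contrastive loss per sample, weighted by in-batch class frequency.

    The contrastive weight grows with how common a sample's class is in the batch
    (head classes lean on the contrastive term, tail classes on CE); `gamma` skews
    the tradeoff. This weighting is illustrative, not the paper's exact rule.
    """
    counts = torch.bincount(labels, minlength=int(labels.max()) + 1).float()
    freq = counts[labels] / labels.numel()           # per-sample in-batch class frequency
    w_con = freq ** gamma
    w_con = w_con / w_con.max()                       # normalise weights to [0, 1]
    ce = F.cross_entropy(logits, labels, reduction="none")
    con = contrastive_loss_fn(features, labels)       # assumed to return per-sample losses
    return ((1 - w_con) * ce + w_con * con).mean()
```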

  • paper_url: http://arxiv.org/abs/2307.05288
  • repo_url: https://github.com/sharmasushil/navigating-uncertainty-trajectory-prediction
  • paper_authors: Sushil Sharma, Ganesh Sistu, Lucie Yahiaoui, Arindam Das, Mark Halton, Ciarán Eising
  • for: The paper is written for the task of short-term trajectory prediction for autonomous vehicles, with a focus on safe and efficient driving.
  • methods: The paper uses a synthetic dataset created using the CARLA simulator, which includes a variety of complex scenarios such as pedestrians crossing the road and vehicles overtaking. The authors also develop an end-to-end model using convolutional neural networks (CNN) and long short-term memory (LSTM) networks to predict short-term trajectories.
  • results: The paper reports that the proposed model can handle corner cases such as slowing down near zebra crossings and stopping when pedestrians cross the road without the need for explicit encoding of the surrounding environment. The authors also release their dataset and model to the research community for further research and development.
    Abstract Autonomous vehicles require accurate and reliable short-term trajectory predictions for safe and efficient driving. While most commercial automated vehicles currently use state machine-based algorithms for trajectory forecasting, recent efforts have focused on end-to-end data-driven systems. Often, the design of these models is limited by the availability of datasets, which are typically restricted to generic scenarios. To address this limitation, we have developed a synthetic dataset for short-term trajectory prediction tasks using the CARLA simulator. This dataset is extensive and incorporates what is considered complex scenarios - pedestrians crossing the road, vehicles overtaking - and comprises 6000 perspective view images with corresponding IMU and odometry information for each frame. Furthermore, an end-to-end short-term trajectory prediction model using convolutional neural networks (CNN) and long short-term memory (LSTM) networks has also been developed. This model can handle corner cases, such as slowing down near zebra crossings and stopping when pedestrians cross the road, without the need for explicit encoding of the surrounding environment. In an effort to accelerate this research and assist others, we are releasing our dataset and model to the research community. Our datasets are publicly available on https://github.com/sharmasushil/Navigating-Uncertainty-Trajectory-Prediction .
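A minimal sketch of an end-to-end CNN+LSTM waypoint predictor in the spirit of the model described above: a small CNN encodes each frame of a short clip, an LSTM aggregates the clip, and future (x, y) waypoints are regressed. The layer sizes, horizon, and decoding scheme are placeholders rather than the released architecture.

```python
import torch
import torch.nn as nn

class CNNLSTMTrajectoryPredictor(nn.Module):
    """Minimal CNN encoder + LSTM decoder for short-term waypoint prediction."""
    def __init__(self, horizon=10, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # (B*T, 64, 1, 1)
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                  # (x, y) offset per future step
        self.horizon = horizon

    def forward(self, frames):
        B, T, C, H, W = frames.shape
        feats = self.cnn(frames.view(B * T, C, H, W)).view(B, T, 64)
        _, (h, c) = self.lstm(feats)                      # encode the observed clip
        # Decode `horizon` future steps by re-feeding the last feature (simplest scheme).
        dec_in = feats[:, -1:, :].repeat(1, self.horizon, 1)
        out, _ = self.lstm(dec_in, (h, c))
        return self.head(out)                             # (B, horizon, 2)

# waypoints = CNNLSTMTrajectoryPredictor()(torch.randn(2, 8, 3, 128, 256))
```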

Unbiased Scene Graph Generation via Two-stage Causal Modeling

  • paper_url: http://arxiv.org/abs/2307.05276
  • repo_url: None
  • paper_authors: Shuzhou Sun, Shuaifeng Zhi, Qing Liao, Janne Heikkilä, Li Liu
  • for: Proposes a causal-inference-based debiasing procedure for the Scene Graph Generation (SGG) task, so that SGG models produce unbiased predictions.
  • methods: Uses two-stage causal modeling built on a structural causal model (SCM), with a Population Loss (P-Loss) for causal representation learning and an Adaptive Logit Adjustment (AL-Adjustment) for causal calibration learning.
  • results: On popular SGG backbones and benchmarks, the proposed two-stage causal modeling (TsCM) achieves state-of-the-art mean recall rate and a better tradeoff between head and tail relationships than other debiasing methods.
    Abstract Despite the impressive performance of recent unbiased Scene Graph Generation (SGG) methods, the current debiasing literature mainly focuses on the long-tailed distribution problem, whereas it overlooks another source of bias, i.e., semantic confusion, which makes the SGG model prone to yield false predictions for similar relationships. In this paper, we explore a debiasing procedure for the SGG task leveraging causal inference. Our central insight is that the Sparse Mechanism Shift (SMS) in causality allows independent intervention on multiple biases, thereby potentially preserving head category performance while pursuing the prediction of high-informative tail relationships. However, the noisy datasets lead to unobserved confounders for the SGG task, and thus the constructed causal models are always causal-insufficient to benefit from SMS. To remedy this, we propose Two-stage Causal Modeling (TsCM) for the SGG task, which takes the long-tailed distribution and semantic confusion as confounders to the Structural Causal Model (SCM) and then decouples the causal intervention into two stages. The first stage is causal representation learning, where we use a novel Population Loss (P-Loss) to intervene in the semantic confusion confounder. The second stage introduces the Adaptive Logit Adjustment (AL-Adjustment) to eliminate the long-tailed distribution confounder to complete causal calibration learning. These two stages are model agnostic and thus can be used in any SGG model that seeks unbiased predictions. Comprehensive experiments conducted on the popular SGG backbones and benchmarks show that our TsCM can achieve state-of-the-art performance in terms of mean recall rate. Furthermore, TsCM can maintain a higher recall rate than other debiasing methods, which indicates that our method can achieve a better tradeoff between head and tail relationships.
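The second TsCM stage removes the long-tailed-distribution confounder with an Adaptive Logit Adjustment. As a stand-in, the sketch below shows the classic, non-adaptive frequency-based logit adjustment that this idea builds on; the paper learns the adjustment rather than fixing it with a single `tau`.

```python
import torch

def logit_adjust(logits, class_counts, tau=1.0):
    """Subtract a log-prior from relation logits to counter long-tailed bias.

    logits       : (N, C) raw predicate/relation logits
    class_counts : (C,) training-set frequency of each relation class
    tau          : adjustment strength (tau=0 disables the correction)
    """
    prior = class_counts.float() / class_counts.sum()
    return logits - tau * torch.log(prior + 1e-12)
```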

APRF: Anti-Aliasing Projection Representation Field for Inverse Problem in Imaging

  • paper_url: http://arxiv.org/abs/2307.05270
  • repo_url: None
  • paper_authors: Zixuan Chen, Lingxiao Yang, Jianhuang Lai, Xiaohua Xie
  • for: Improving the accuracy of sparse-view CT (SVCT) reconstruction while reducing aliasing artifacts.
  • methods: A self-supervised Anti-Aliasing Projection Representation Field (APRF) builds a continuous representation between adjacent projection views through spatial constraints, improving reconstruction fidelity.
  • results: On CT images, APRF reconstructs SVCT more accurately than state-of-the-art methods, yielding finer details and fewer aliasing artifacts.
    Abstract Sparse-view Computed Tomography (SVCT) reconstruction is an ill-posed inverse problem in imaging that aims to acquire high-quality CT images based on sparsely-sampled measurements. Recent works use Implicit Neural Representations (INRs) to build the coordinate-based mapping between sinograms and CT images. However, these methods have not considered the correlation between adjacent projection views, resulting in aliasing artifacts on SV sinograms. To address this issue, we propose a self-supervised SVCT reconstruction method -- Anti-Aliasing Projection Representation Field (APRF), which can build the continuous representation between adjacent projection views via the spatial constraints. Specifically, APRF only needs SV sinograms for training, which first employs a line-segment sampling module to estimate the distribution of projection views in a local region, and then synthesizes the corresponding sinogram values using center-based line integral module. After training APRF on a single SV sinogram itself, it can synthesize the corresponding dense-view (DV) sinogram with consistent continuity. High-quality CT images can be obtained by applying re-projection techniques on the predicted DV sinograms. Extensive experiments on CT images demonstrate that APRF outperforms state-of-the-art methods, yielding more accurate details and fewer artifacts. Our code will be publicly available soon.

OpenAL: An Efficient Deep Active Learning Framework for Open-Set Pathology Image Classification

  • paper_url: http://arxiv.org/abs/2307.05254
  • repo_url: None
  • paper_authors: Linhao Qu, Yingfan Ma, Zhiwei Yang, Manning Wang, Zhijian Song
  • for: Addressing the failure of existing active learning methods in practical clinical tasks where the unlabeled sample pool also contains non-target classes irrelevant to the task.
  • methods: Proposes an open-set active learning (OpenAL) framework that effectively queries samples from an unlabeled pool containing both target-class and non-target-class samples.
  • results: On fine-grained pathology image classification, OpenAL significantly improves the query quality of target-class samples and outperforms state-of-the-art active learning methods.
    Abstract Active learning (AL) is an effective approach to select the most informative samples to label so as to reduce the annotation cost. Existing AL methods typically work under the closed-set assumption, i.e., all classes existing in the unlabeled sample pool need to be classified by the target model. However, in some practical clinical tasks, the unlabeled pool may contain not only the target classes that need to be fine-grainedly classified, but also non-target classes that are irrelevant to the clinical tasks. Existing AL methods cannot work well in this scenario because they tend to select a large number of non-target samples. In this paper, we formulate this scenario as an open-set AL problem and propose an efficient framework, OpenAL, to address the challenge of querying samples from an unlabeled pool with both target class and non-target class samples. Experiments on fine-grained classification of pathology images show that OpenAL can significantly improve the query quality of target class samples and achieve higher performance than current state-of-the-art AL methods. Code is available at https://github.com/miccaiif/OpenAL.
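A hedged sketch of what open-set querying can look like: first filter out unlabeled samples that appear to belong to non-target classes, then query the most uncertain of the remaining candidates. The `target_score` input and the entropy criterion are generic placeholders, not the OpenAL selection rule.

```python
import numpy as np

def openset_query(probs, target_score, budget=100, target_threshold=0.5):
    """Pick samples to annotate from a pool containing target and non-target classes.

    probs        : (N, C) softmax outputs of the current classifier over target classes
    target_score : (N,) score of how likely each sample belongs to *any* target class
                   (e.g. from a feature-distance or density model); higher = more target-like
    budget       : number of samples to query
    """
    # 1) Keep only samples that look like target classes.
    candidates = np.where(target_score >= target_threshold)[0]
    # 2) Among them, query the most uncertain ones (entropy of the class posterior).
    eps = 1e-12
    entropy = -(probs[candidates] * np.log(probs[candidates] + eps)).sum(axis=1)
    order = candidates[np.argsort(-entropy)]
    return order[:budget]
```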

Evidence-based Hand Hygiene. Can You Trust the Fluorescent-based Assessment Methods?

  • paper_url: http://arxiv.org/abs/2307.05650
  • repo_url: None
  • paper_authors: Száva Bánsághi, Viola Sári, Péter Szerémy, Ákos Lehotsky, Bence Takács, Brigitta K. Tóth, Tamás Haidegger
  • for: Investigating whether different experts judge the same UV pattern differently, and comparing their assessments against microbiology results for objective validation.
  • methods: Four different UV-box devices captured CCD images of participants' hands under UV light after incomplete disinfection with UV-labeled handrub; four independent infection control specialists then manually marked the inadequately disinfected areas on the printed images.
  • results: Expert evaluations of the same UV images were highly inconsistent (poor interrater reliability) and only weakly correlated with microbiology. In half of the 8 examined hands, the size of the properly disinfected area measured by microbiology and by human experts differed by more than 10%, indicating that the data quality of fluorescent-based hand hygiene assessment is insufficient for building a patient safety quality assurance system.
    Abstract Healthcare-Associated Infections present a major threat to patient safety globally. According to studies, more than 50% of HAI could be prevented by proper hand hygiene. Effectiveness of hand hygiene is regularly evaluated with the fluorescent method: performing hand hygiene with a handrub containing an ultra violet (UV) fluorescent marker. Typically, human experts evaluate the hands under UV-A light, and decide whether the applied handrub covered the whole hand surface. The aim of this study was to investigate how different experts judge the same UV-pattern, and compare that to microbiology for objective validation. Hands of volunteer participants were contaminated with high concentration of a Staphylococcus epidermidis suspension. Hands were incompletely disinfected with UV-labeled handrub. Four different UV-box type devices were used to take CCD pictures of the hands under UV light. Size of inadequately disinfected areas on the hands were determined in two different ways. First, based on microbiology; the areas where colonies were grown were measured. Second, four independent senior infection control specialists were asked to mark the missed areas on printed image, captured under UV light. 8 hands of healthy volunteers were examined. Expert evaluations were highly uncorrelated (regarding interrater reliability) and inconsistent. Microbiology results weakly correlated with the expert evaluations. In half of the cases, there were more than 10% difference in the size of properly disinfected area, as measured by microbiology versus human experts. Considering the result of the expert evaluations, variability was disconcertingly high. Evaluating the fluorescent method is challenging, even for highly experienced professionals. A patient safety quality assurance system cannot be built on these data quality.

DRMC: A Generalist Model with Dynamic Routing for Multi-Center PET Image Synthesis

  • paper_url: http://arxiv.org/abs/2307.05249
  • repo_url: None
  • paper_authors: Zhiwen Yang, Yang Zhou, Hui Zhang, Bingzheng Wei, Yubo Fan, Yan Xu
  • for: Multi-center positron emission tomography (PET) image synthesis, aiming to recover low-dose PET images from multiple different centers.
  • methods: A generalist model shares its architecture and parameters across centers to exploit their common knowledge. Because non-identical data distributions across centers can make gradient directions inconsistent or even opposite (the center interference issue), a novel dynamic routing strategy with cross-layer connections routes data from different centers to different experts.
  • results: The generalist model with dynamic routing (DRMC) exhibits excellent generalizability across centers for low-dose PET image recovery.
    Abstract Multi-center positron emission tomography (PET) image synthesis aims at recovering low-dose PET images from multiple different centers. The generalizability of existing methods can still be suboptimal for a multi-center study due to domain shifts, which result from non-identical data distribution among centers with different imaging systems/protocols. While some approaches address domain shifts by training specialized models for each center, they are parameter inefficient and do not well exploit the shared knowledge across centers. To address this, we develop a generalist model that shares architecture and parameters across centers to utilize the shared knowledge. However, the generalist model can suffer from the center interference issue, \textit{i.e.} the gradient directions of different centers can be inconsistent or even opposite owing to the non-identical data distribution. To mitigate such interference, we introduce a novel dynamic routing strategy with cross-layer connections that routes data from different centers to different experts. Experiments show that our generalist model with dynamic routing (DRMC) exhibits excellent generalizability across centers. Code and data are available at: https://github.com/Yaziwel/Multi-Center-PET-Image-Synthesis.
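A minimal mixture-of-experts sketch of dynamic routing: a gate produces per-sample routing weights over several expert branches whose outputs are blended. Note that DRMC routes by center with cross-layer connections, so the input-conditioned gate below is only an illustration of the general mechanism, not the paper's design.

```python
import torch
import torch.nn as nn

class DynamicRoutingLayer(nn.Module):
    """Route each sample's features to a soft mixture of expert branches."""
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU()) for _ in range(num_experts)]
        )
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_experts))

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=1)               # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * outs).sum(dim=1)

# y = DynamicRoutingLayer()(torch.randn(2, 64, 32, 32))
```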

  • paper_url: http://arxiv.org/abs/2307.05241
  • repo_url: https://github.com/gama-ufsc/brain-age
  • paper_authors: Bruno Machado Pacheco, Victor Hugo Rocha de Oliveira, Augusto Braga Fernandes Antunes, Saulo Domingos de Souza Pedro, Danilo Silva
  • for: Investigating deep learning models for brain age prediction as an indicator of overall brain health and successful aging, and as a potential disease biomarker.
  • methods: Instead of the common approach of pre-training on natural image classification, the deep learning models are pre-trained on brain-related tasks before being trained to predict chronological age.
  • results: Experiments on the ADNI dataset show improved brain age prediction performance; however, better-performing models do not necessarily yield more reliable brain age biomarkers when validated on patients with mild cognitive impairment and Alzheimer's disease.
    Abstract Brain age prediction using neuroimaging data has shown great potential as an indicator of overall brain health and successful aging, as well as a disease biomarker. Deep learning models have been established as reliable and efficient brain age estimators, being trained to predict the chronological age of healthy subjects. In this paper, we investigate the impact of a pre-training step on deep learning models for brain age prediction. More precisely, instead of the common approach of pre-training on natural imaging classification, we propose pre-training the models on brain-related tasks, which led to state-of-the-art results in our experiments on ADNI data. Furthermore, we validate the resulting brain age biomarker on images of patients with mild cognitive impairment and Alzheimer's disease. Interestingly, our results indicate that better-performing deep learning models in terms of brain age prediction on healthy patients do not result in more reliable biomarkers.

Generative Pretraining in Multimodality

  • paper_url: http://arxiv.org/abs/2307.05222
  • repo_url: https://github.com/baaivision/emu
  • paper_authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang
  • for: Developing a Transformer-based multimodal foundation model that can take in any single-modality or multimodal data without special setups and is trained with a one-model-for-all autoregressive process.
  • methods: Visual signals are encoded into embeddings and interleaved with text tokens to form a unified input sequence; the model is trained end-to-end with a single objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence.
  • results: Across a broad range of zero-shot/few-shot tasks, including image captioning, visual question answering, video question answering, and text-to-image generation, Emu outperforms state-of-the-art large multimodal models; with instruction tuning, it also serves as a capable multimodal assistant.
    Abstract We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
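The unified objective described above (classify the next text token, regress the next visual embedding) reduces to a sum of two standard losses, sketched below; the tensor shapes and the relative weight `vis_weight` are assumptions, not values reported by the paper.

```python
import torch
import torch.nn.functional as F

def unified_autoregressive_loss(text_logits, text_targets, vis_pred, vis_targets, vis_weight=1.0):
    """Single objective over an interleaved multimodal sequence.

    text_logits : (N_text, V) predictions at positions whose next element is a text token
    text_targets: (N_text,)   ground-truth next-token ids
    vis_pred    : (N_vis, D)  predicted next visual embeddings
    vis_targets : (N_vis, D)  ground-truth next visual embeddings
    """
    loss_text = F.cross_entropy(text_logits, text_targets)   # next-token classification
    loss_vis = F.mse_loss(vis_pred, vis_targets)              # next-embedding regression
    return loss_text + vis_weight * loss_vis
```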

The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework

  • paper_url: http://arxiv.org/abs/2307.05201
  • repo_url: None
  • paper_authors: Chao Wang, Zheng Tang
  • for: Improving the efficiency and accuracy of knowledge distillation for video classification.
  • methods: A weakly supervised framework distills knowledge based on the combination of student substages and the correlation of corresponding substages, and uses progressive cascade training to address the accuracy loss caused by the large capacity gap between teacher and student.
  • results: Extensive experiments on both real and simulated datasets show that the proposed approach outperforms existing distillation methods on video classification tasks.
    Abstract In the context of label-efficient learning on video data, the distillation method and the structural design of the teacher-student architecture have a significant impact on knowledge distillation. However, the relationship between these factors has been overlooked in previous research. To address this gap, we propose a new weakly supervised learning framework for knowledge distillation in video classification that is designed to improve the efficiency and accuracy of the student model. Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages. We also employ the progressive cascade training method to address the accuracy loss caused by the large capacity gap between the teacher and the student. Additionally, we propose a pseudo-label optimization strategy to improve the initial data label. To optimize the loss functions of different distillation substages during the training process, we introduce a new loss method based on feature distribution. We conduct extensive experiments on both real and simulated data sets, demonstrating that our proposed approach outperforms existing distillation methods in terms of knowledge distillation for video classification tasks. Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.

ResMatch: Residual Attention Learning for Local Feature Matching

  • paper_url: http://arxiv.org/abs/2307.05180
  • repo_url: https://github.com/acuooooo/resmatch
  • paper_authors: Yuxin Deng, Jiayi Ma
  • for: Understanding how attention mechanisms work for feature matching, in order to better learn matching and filtering functions.
  • methods: Rethinks cross- and self-attention from the viewpoint of traditional feature matching and filtering by injecting descriptor similarity and relative-position correlation into the cross- and self-attention scores, so that the attention learns residual matching and filtering functions. A sparse attention strategy further restricts attention for each point to its neighborhood for higher computational efficiency.
  • results: Extensive experiments on feature matching, pose estimation, and visual localization confirm the superiority of the proposed networks (ResMatch and sResMatch).
    Abstract Attention-based graph neural networks have made great progress in feature matching learning. However, insight of how attention mechanism works for feature matching is lacked in the literature. In this paper, we rethink cross- and self-attention from the viewpoint of traditional feature matching and filtering. In order to facilitate the learning of matching and filtering, we inject the similarity of descriptors and relative positions into cross- and self-attention score, respectively. In this way, the attention can focus on learning residual matching and filtering functions with reference to the basic functions of measuring visual and spatial correlation. Moreover, we mine intra- and inter-neighbors according to the similarity of descriptors and relative positions. Then sparse attention for each point can be performed only within its neighborhoods to acquire higher computation efficiency. Feature matching networks equipped with our full and sparse residual attention learning strategies are termed ResMatch and sResMatch respectively. Extensive experiments, including feature matching, pose estimation and visual localization, confirm the superiority of our networks.
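A sketch of injecting descriptor similarity and relative-position correlation into an attention score as additive biases, in the spirit of the residual attention described above; the coefficients `alpha` and `beta` and the precomputed bias matrices are illustrative, not the ResMatch formulation.

```python
import torch

def biased_attention(q, k, v, desc_sim, pos_corr, alpha=1.0, beta=1.0):
    """Attention whose scores are biased by descriptor similarity and relative position.

    q, k, v  : (B, N, D) queries/keys/values from the matching network
    desc_sim : (B, N, N) precomputed descriptor similarity (e.g. cosine of raw descriptors)
    pos_corr : (B, N, N) precomputed relative-position correlation between keypoints
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d ** 0.5             # standard scaled dot-product term
    scores = scores + alpha * desc_sim + beta * pos_corr    # inject the matching/filtering priors
    attn = torch.softmax(scores, dim=-1)
    return attn @ v
```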

HistoColAi: An Open-Source Web Platform for Collaborative Digital Histology Image Annotation with AI-Driven Predictive Integration

  • paper_url: http://arxiv.org/abs/2307.07525
  • repo_url: None
  • paper_authors: Cristian Camilo Pulgarín-Ospina, Rocío del Amor, Adrián Colomera, Julio Silva-Rodríguez, Valery Naranjo
  • for: Providing an efficient open web tool for visualizing and annotating digitized histological images in digital pathology.
  • methods: A web service that combines collaborative visualization and annotation of digital histological images with deep learning-based predictive support for image analysis.
  • results: A use case on the diagnosis of spindle cell skin neoplasm with multiple annotators, together with a usability study, demonstrates the feasibility of the developed tool.
    Abstract Digital pathology has become a standard in the pathology workflow due to its many benefits. These include the level of detail of the whole slide images generated and the potential immediate sharing of cases between hospitals. Recent advances in deep learning-based methods for image analysis make them of potential aid in digital pathology. However, a major limitation in developing computer-aided diagnostic systems for pathology is the lack of an intuitive and open web application for data annotation. This paper proposes a web service that efficiently provides a tool to visualize and annotate digitized histological images. In addition, to show and validate the tool, in this paper we include a use case centered on the diagnosis of spindle cell skin neoplasm for multiple annotators. A usability study of the tool is also presented, showing the feasibility of the developed tool.

A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings

  • paper_url: http://arxiv.org/abs/2307.05158
  • repo_url: https://github.com/idiap/multimodal_gaze_target_prediction
  • paper_authors: Anshul Gupta, Samy Tafasca, Jean-Marc Odobez
  • for: Predicting where a person is looking, a complex task that requires understanding the person's gaze and the scene content, the 3D scene structure, and the person's situation (manipulating? interacting? observing others? attentive?) in order to detect obstructions in the line of sight or apply the attention priors humans typically have when observing others.
  • methods: A modular multimodal architecture that combines explicitly derived depth and pose cues through an attention mechanism; the architecture can be deployed in privacy-sensitive settings such as surveillance and health, where personally identifiable information cannot be released.
  • results: Extensive experiments on the public GazeFollow and VideoAttentionTarget datasets achieve state-of-the-art performance and very competitive results in the privacy-preserving setting.
    Abstract Predicting where a person is looking is a complex task, requiring to understand not only the person's gaze and scene content, but also the 3D scene structure and the person's situation (are they manipulating? interacting or observing others? attentive?) to detect obstructions in the line of sight or apply attention priors that humans typically have when observing others. In this paper, we hypothesize that identifying and leveraging such priors can be better achieved through the exploitation of explicitly derived multimodal cues such as depth and pose. We thus propose a modular multimodal architecture allowing to combine these cues using an attention mechanism. The architecture can naturally be exploited in privacy-sensitive situations such as surveillance and health, where personally identifiable information cannot be released. We perform extensive experiments on the GazeFollow and VideoAttentionTarget public datasets, obtaining state-of-the-art performance and demonstrating very competitive results in the privacy setting case.

ExFaceGAN: Exploring Identity Directions in GAN’s Learned Latent Space for Synthetic Identity Generation

  • paper_url: http://arxiv.org/abs/2307.05151
  • repo_url: https://github.com/fdbtrs/exfacegan
  • paper_authors: Fadi Boutros, Marcel Klemt, Meiling Fang, Arjan Kuijper, Naser Damer
  • for: Proposing ExFaceGAN, a framework that disentangles identity information in the latent space of pretrained GANs in order to generate multiple samples of any synthetic identity.
  • methods: ExFaceGAN learns an identity directional boundary that splits the latent space into sub-spaces of samples that are identity-similar or identity-dissimilar to a reference image; sampling from either side generates multiple samples of a synthetic identity without a dedicated architecture or supervision from attribute classifiers.
  • results: Integrated into the learned latent spaces of three SOTA GAN approaches, ExFaceGAN demonstrates generalizability and effectiveness; data generated by ExFaceGAN can also be used to successfully train face recognition models.
    Abstract Deep generative models have recently presented impressive results in generating realistic face images of random synthetic identities. To generate multiple samples of a certain synthetic identity, previous works proposed to disentangle the latent space of GANs by incorporating additional supervision or regularization, enabling the manipulation of certain attributes. Others proposed to disentangle specific factors in unconditional pretrained GANs latent spaces to control their output, which also requires supervision by attribute classifiers. Moreover, these attributes are entangled in GAN's latent space, making it difficult to manipulate them without affecting the identity information. We propose in this work a framework, ExFaceGAN, to disentangle identity information in pretrained GANs latent spaces, enabling the generation of multiple samples of any synthetic identity. Given a reference latent code of any synthetic image and latent space of pretrained GAN, our ExFaceGAN learns an identity directional boundary that disentangles the latent space into two sub-spaces, with latent codes of samples that are either identity similar or dissimilar to a reference image. By sampling from each side of the boundary, our ExFaceGAN can generate multiple samples of synthetic identity without the need for designing a dedicated architecture or supervision from attribute classifiers. We demonstrate the generalizability and effectiveness of ExFaceGAN by integrating it into learned latent spaces of three SOTA GAN approaches. As an example of the practical benefit of our ExFaceGAN, we empirically prove that data generated by ExFaceGAN can be successfully used to train face recognition models (\url{https://github.com/fdbtrs/ExFaceGAN}).
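Once an identity directional boundary is learned, generating identity-similar or identity-dissimilar samples reduces to stepping a reference latent code along (or against) the boundary normal, as sketched below. How the normal `n` is obtained (e.g. as the weight vector of a linear identity classifier trained in the latent space) is an assumption here, not a detail taken from the paper.

```python
import torch

def sample_across_identity_boundary(w_ref, n, num=8, max_step=3.0, similar=True):
    """Sample latent codes on one side of a learned identity boundary.

    w_ref : (D,) reference latent code of a synthetic face
    n     : (D,) normal of the identity directional boundary (assumed learned)
    """
    n = n / n.norm()
    sign = 1.0 if similar else -1.0
    steps = torch.linspace(0.1, max_step, num)
    # Move the reference code along (or against) the boundary normal.
    return torch.stack([w_ref + sign * s * n for s in steps])   # (num, D)
```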

Unveiling the Invisible: Enhanced Detection and Analysis of Deteriorated Areas in Solar PV Modules Using Unsupervised Sensing Algorithms and 3D Augmented Reality

  • paper_url: http://arxiv.org/abs/2307.05136
  • repo_url: None
  • paper_authors: Adel Oulefki, Yassine Himeur, Thaweesak Trongtiraku, Kahina Amara, Sos Agaian, Samir Benbelkacem, Mohamed Amine Guerroudji, Mohamed Zemmouri, Sahla Ferhat, Nadia Zenati, Shadi Atalla, Wathiq Mansoor
  • for: Improving the maintenance efficiency and energy yield of solar photovoltaic (PV) modules.
  • methods: Unsupervised sensing algorithms combined with 3D Augmented Reality (AR) visualization automatically identify and analyze anomalies such as hot spots and snail trails in PV modules.
  • results: Validated through computer simulations and real-world image datasets, the proposed framework accurately identifies deteriorated regions and can substantially cut the cost of PV system maintenance.
    Abstract Solar Photovoltaic (PV) is increasingly being used to address the global concern of energy security. However, hot spot and snail trails in PV modules caused mostly by crakes reduce their efficiency and power capacity. This article presents a groundbreaking methodology for automatically identifying and analyzing anomalies like hot spots and snail trails in Solar Photovoltaic (PV) modules, leveraging unsupervised sensing algorithms and 3D Augmented Reality (AR) visualization. By transforming the traditional methods of diagnosis and repair, our approach not only enhances efficiency but also substantially cuts down the cost of PV system maintenance. Validated through computer simulations and real-world image datasets, the proposed framework accurately identifies dirty regions, emphasizing the critical role of regular maintenance in optimizing the power capacity of solar PV modules. Our immediate objective is to leverage drone technology for real-time, automatic solar panel detection, significantly boosting the efficacy of PV maintenance. The proposed methodology could revolutionize solar PV maintenance, enabling swift, precise anomaly detection without human intervention. This could result in significant cost savings, heightened energy production, and improved overall performance of solar PV systems. Moreover, the novel combination of unsupervised sensing algorithms with 3D AR visualization heralds new opportunities for further research and development in solar PV maintenance.

DFR: Depth from Rotation by Uncalibrated Image Rectification with Latitudinal Motion Assumption

  • paper_url: http://arxiv.org/abs/2307.05129
  • repo_url: https://github.com/zhangtaxue/dfr
  • paper_authors: Yongcong Zhang, Yifei Xue, Ming Liao, Huiqing Zhang, Yizhen Lao
  • for: Performing stereo rectification for uncalibrated rotating cameras (e.g., surveillance cameras), improving both the accuracy and efficiency of rectification and subsequent depth estimation.
  • methods: Depth-from-Rotation (DfR) analytically rectifies two images from two-point correspondences by modeling the camera as rotating on a sphere with fixed latitude (the latitudinal motion assumption), and uses a self-adaptive strategy to reduce geometric distortion after rectification.
  • results: Extensive experiments on synthetic and real data show that the proposed method outperforms existing approaches in both effectiveness and efficiency by a significant margin.
    Abstract Despite the increasing prevalence of rotating-style capture (e.g., surveillance cameras), conventional stereo rectification techniques frequently fail due to the rotation-dominant motion and small baseline between views. In this paper, we tackle the challenge of performing stereo rectification for uncalibrated rotating cameras. To that end, we propose Depth-from-Rotation (DfR), a novel image rectification solution that analytically rectifies two images with two-point correspondences and serves for further depth estimation. Specifically, we model the motion of a rotating camera as the camera rotates on a sphere with fixed latitude. The camera's optical axis lies perpendicular to the sphere's surface. We call this latitudinal motion assumption. Then we derive a 2-point analytical solver from directly computing the rectified transformations on the two images. We also present a self-adaptive strategy to reduce the geometric distortion after rectification. Extensive synthetic and real data experiments demonstrate that the proposed method outperforms existing works in effectiveness and efficiency by a significant margin.

One-Shot Learning for Periocular Recognition: Exploring the Effect of Domain Adaptation and Data Bias on Deep Representations

  • paper_url: http://arxiv.org/abs/2307.05128
  • repo_url: None
  • paper_authors: Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, Josef Bigun
  • for: This paper focuses on the challenge of biometric recognition under extreme data scarcity, specifically One-Shot periocular recognition.
  • methods: The authors use widely used CNN models and analyze the behavior of deep representations in these models under data scarcity. They also employ Domain Adaptation and evaluate the method's robustness concerning data normalization and generalization.
  • results: The authors achieve state-of-the-art results on the Cross-Eyed dataset, reducing the EER by 67% and 79% in the Close-World and Open-World protocols, respectively. They also demonstrate that traditional algorithms like SIFT can outperform CNNs in certain situations, such as limited data or unseen classes.
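A sketch of the kind of pipeline the abstract implies: use an off-the-shelf ImageNet-pretrained CNN as a fixed feature extractor and compare periocular images by cosine similarity of deep features. The choice of backbone and layer (penultimate pooled features) is an assumption; the paper analyzes several layers and models.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# An ImageNet-pretrained backbone used as an off-the-shelf feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()   # take the penultimate (pooled) features as the descriptor
backbone.eval()

@torch.no_grad()
def periocular_score(img_a, img_b):
    """Cosine similarity between two preprocessed periocular images (B, 3, 224, 224)."""
    fa = F.normalize(backbone(img_a), dim=1)
    fb = F.normalize(backbone(img_b), dim=1)
    return (fa * fb).sum(dim=1)      # higher = more likely the same identity
```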

Hyperspherical Embedding for Point Cloud Completion

  • paper_url: http://arxiv.org/abs/2307.05634
  • repo_url: https://github.com/haomengz/hyperpc
  • paper_authors: Junming Zhang, Haomeng Zhang, Ram Vasudevan, Matthew Johnson-Roberson
  • for: Improving the completeness and accuracy of 3D point cloud completion.
  • methods: Proposes a hyperspherical module that transforms and normalizes the embeddings extracted by the encoder onto a unit hypersphere, decoupling magnitude from direction so that only directional information is optimized; this yields more stable training and more compact embedding distributions.
  • results: Experiments show consistent improvements in point cloud completion for both single-task and multi-task learning, demonstrating the effectiveness of the proposed method.
    Abstract Most real-world 3D measurements from depth sensors are incomplete, and to address this issue the point cloud completion task aims to predict the complete shapes of objects from partial observations. Previous works often adapt an encoder-decoder architecture, where the encoder is trained to extract embeddings that are used as inputs to generate predictions from the decoder. However, the learned embeddings have sparse distribution in the feature space, which leads to worse generalization results during testing. To address these problems, this paper proposes a hyperspherical module, which transforms and normalizes embeddings from the encoder to be on a unit hypersphere. With the proposed module, the magnitude and direction of the output hyperspherical embedding are decoupled and only the directional information is optimized. We theoretically analyze the hyperspherical embedding and show that it enables more stable training with a wider range of learning rates and more compact embedding distributions. Experiment results show consistent improvement of point cloud completion in both single-task and multi-task learning, which demonstrates the effectiveness of the proposed method.
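The core operation of the hyperspherical module, projecting encoder embeddings onto the unit hypersphere so that only their direction is optimized, can be sketched in a few lines; the optional linear projection and the dimensions below are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphericalModule(nn.Module):
    """Map encoder embeddings onto the unit hypersphere (direction only)."""
    def __init__(self, in_dim=1024, out_dim=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)   # optional transform before normalization

    def forward(self, z):
        z = self.proj(z)
        return F.normalize(z, p=2, dim=-1)        # unit norm: magnitude is decoupled and discarded

# e = HypersphericalModule()(torch.randn(4, 1024)); print(e.norm(dim=-1))  # all ones
```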

Offline and Online Optical Flow Enhancement for Deep Video Compression

  • paper_url: http://arxiv.org/abs/2307.05092
  • repo_url: None
  • paper_authors: Chuanbo Tang, Xihua Sheng, Zhuoyuan Li, Haotian Zhang, Li Li, Dong Liu
  • for: Improving the efficiency of deep video compression networks by better exploiting the temporal redundancy between video frames.
  • methods: Optical flows are enhanced in two stages: offline, a trained optical flow estimation network is fine-tuned with motion information from a traditional video compression scheme (e.g., H.266/VVC); online, the latent features of the optical flows are further optimized with a gradient descent-based algorithm for the specific video being compressed.
  • results: On the deep video compression scheme DCVC, combining the offline and online enhancements achieves an average bitrate saving of 12.8% on the tested videos, without increasing the model or computational complexity of the decoder.
    Abstract Video compression relies heavily on exploiting the temporal redundancy between video frames, which is usually achieved by estimating and using the motion information. The motion information is represented as optical flows in most of the existing deep video compression networks. Indeed, these networks often adopt pre-trained optical flow estimation networks for motion estimation. The optical flows, however, may be less suitable for video compression due to the following two factors. First, the optical flow estimation networks were trained to perform inter-frame prediction as accurately as possible, but the optical flows themselves may cost too many bits to encode. Second, the optical flow estimation networks were trained on synthetic data, and may not generalize well enough to real-world videos. We address the twofold limitations by enhancing the optical flows in two stages: offline and online. In the offline stage, we fine-tune a trained optical flow estimation network with the motion information provided by a traditional (non-deep) video compression scheme, e.g. H.266/VVC, as we believe the motion information of H.266/VVC achieves a better rate-distortion trade-off. In the online stage, we further optimize the latent features of the optical flows with a gradient descent-based algorithm for the video to be compressed, so as to enhance the adaptivity of the optical flows. We conduct experiments on a state-of-the-art deep video compression scheme, DCVC. Experimental results demonstrate that the proposed offline and online enhancement together achieves on average 12.8% bitrate saving on the tested videos, without increasing the model or computational complexity of the decoder side.
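A hedged sketch of the online stage: treat the flow latents as free variables and minimize a rate-distortion objective on the video being compressed, leaving the decoder untouched. The codec hooks `decode_with_flow_latent` and `estimate_bits` are hypothetical placeholders for the corresponding parts of a learned codec such as DCVC, not real API calls.

```python
import torch

def refine_flow_latent(flow_latent, frame, ref_frame, decode_with_flow_latent, estimate_bits,
                       lmbda=0.01, steps=50, lr=1e-3):
    """Online, per-video optimization of the encoder-side flow latents (decoder unchanged).

    decode_with_flow_latent(latent, ref_frame) -> reconstructed frame   (hypothetical codec hook)
    estimate_bits(latent) -> differentiable rate estimate in bits       (hypothetical codec hook)
    """
    latent = flow_latent.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        recon = decode_with_flow_latent(latent, ref_frame)
        rd_loss = torch.mean((recon - frame) ** 2) + lmbda * estimate_bits(latent)
        opt.zero_grad()
        rd_loss.backward()
        opt.step()
    return latent.detach()
```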

SAR-NeRF: Neural Radiance Fields for Synthetic Aperture Radar Multi-View Representation

  • paper_url: http://arxiv.org/abs/2307.05087
  • repo_url: None
  • paper_authors: Zhengxin Lei, Feng Xu, Jiangtao Wei, Feng Cai, Feng Wang, Ya-Qiu Jin
  • for: Proposing a NeRF-based generation model for synthetic aperture radar (SAR) images to strengthen their multi-view representation and generalization.
  • methods: Combines the SAR imaging mechanism with neural networks, modeling SAR images implicitly as a function of attenuation coefficients and scattering intensities in 3D space through a differentiable rendering equation.
  • results: Quantitative experiments show that SAR-NeRF effectively represents the multi-view characteristics of SAR images and improves SAR target classification under a few-shot learning setup; with only 12 images per class, a 10-class classification accuracy of 91.6% is achieved.
    Abstract SAR images are highly sensitive to observation configurations, and they exhibit significant variations across different viewing angles, making it challenging to represent and learn their anisotropic features. As a result, deep learning methods often generalize poorly across different view angles. Inspired by the concept of neural radiance fields (NeRF), this study combines SAR imaging mechanisms with neural networks to propose a novel NeRF model for SAR image generation. Following the mapping and projection pinciples, a set of SAR images is modeled implicitly as a function of attenuation coefficients and scattering intensities in the 3D imaging space through a differentiable rendering equation. SAR-NeRF is then constructed to learn the distribution of attenuation coefficients and scattering intensities of voxels, where the vectorized form of 3D voxel SAR rendering equation and the sampling relationship between the 3D space voxels and the 2D view ray grids are analytically derived. Through quantitative experiments on various datasets, we thoroughly assess the multi-view representation and generalization capabilities of SAR-NeRF. Additionally, it is found that SAR-NeRF augumented dataset can significantly improve SAR target classification performance under few-shot learning setup, where a 10-type classification accuracy of 91.6\% can be achieved by using only 12 images per class.

Estimating label quality and errors in semantic segmentation data via any model

  • paper_url: http://arxiv.org/abs/2307.05080
  • repo_url: None
  • paper_authors: Vedang Lad, Jonas Mueller
  • for: Improving the label quality of semantic segmentation datasets by automatically detecting human annotation errors.
  • methods: Probabilistic predictions from any trained segmentation model are used to score the label quality of each image, so that the data most likely to be mislabeled can be prioritized for review.
  • results: A study of seven label quality scoring methods with DeepLabV3+ and FPN segmentation models finds that the soft-minimum of the model-estimated likelihoods of each pixel's annotated class is the most effective at detecting multiple types of annotation error.
    Abstract The labor-intensive annotation process of semantic segmentation datasets is often prone to errors, since humans struggle to label every pixel correctly. We study algorithms to automatically detect such annotation errors, in particular methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled. This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset, which is critical in sensitive applications such as medical imaging and autonomous vehicles. Widely applicable, our label quality scores rely on probabilistic predictions from a trained segmentation model -- any model architecture and training procedure can be utilized. Here we study 7 different label quality scoring methods used in conjunction with a DeepLabV3+ or a FPN segmentation model to detect annotation errors in a version of the SYNTHIA dataset. Precision-recall evaluations reveal a score -- the soft-minimum of the model-estimated likelihoods of each pixel's annotated class -- that is particularly effective to identify images that are mislabeled, across multiple types of annotation error.
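The best-performing score, the soft-minimum of the model-estimated likelihoods of each pixel's annotated class, can be sketched as follows. The exponential-weighting form and the temperature used here are one common soft-minimum formulation and may differ from the paper's exact choice.

```python
import numpy as np

def softmin_label_quality(pred_probs, annotation, temperature=0.1):
    """Score an image's label quality as a soft-minimum over per-pixel likelihoods.

    pred_probs : (H, W, C) per-pixel class probabilities from any segmentation model
    annotation : (H, W)    integer annotated class index per pixel
    Returns a scalar in (0, 1]; lower = more likely the annotation contains errors.
    """
    h, w = annotation.shape
    # Likelihood the model assigns to each pixel's annotated class.
    lik = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], annotation]
    # Soft-minimum: a smooth surrogate of min(lik), dominated by the worst pixels.
    weights = np.exp(-lik / temperature)
    return float((weights * lik).sum() / weights.sum())
```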

Disentangled Contrastive Image Translation for Nighttime Surveillance

  • paper_url: http://arxiv.org/abs/2307.05038
  • repo_url: None
  • paper_authors: Guanzhou Lan, Bin Zhao, Xuelong Li
  • for: Improving the quality of nighttime surveillance and enhancing security.
  • methods: A night-to-day (Night2Day) translation approach built on Disentangled Contrastive (DiCo) learning, combining a learnable physical prior (the color invariant) with a disentangled representation learned through an auxiliary contrastive pretext task that separates surveillance scenes into foreground and background.
  • results: Compared with existing methods, the proposed approach achieves high-fidelity translation and extracts semantics automatically without supervision.
    Abstract Nighttime surveillance suffers from degradation due to poor illumination and arduous human annotations. It is challengable and remains a security risk at night. Existing methods rely on multi-spectral images to perceive objects in the dark, which are troubled by low resolution and color absence. We argue that the ultimate solution for nighttime surveillance is night-to-day translation, or Night2Day, which aims to translate a surveillance scene from nighttime to the daytime while maintaining semantic consistency. To achieve this, this paper presents a Disentangled Contrastive (DiCo) learning method. Specifically, to address the poor and complex illumination in the nighttime scenes, we propose a learnable physical prior, i.e., the color invariant, which provides a stable perception of a highly dynamic night environment and can be incorporated into the learning pipeline of neural networks. Targeting the surveillance scenes, we develop a disentangled representation, which is an auxiliary pretext task that separates surveillance scenes into the foreground and background with contrastive learning. Such a strategy can extract the semantics without supervision and boost our model to achieve instance-aware translation. Finally, we incorporate all the modules above into generative adversarial networks and achieve high-fidelity translation. This paper also contributes a new surveillance dataset called NightSuR. It includes six scenes to support the study on nighttime surveillance. This dataset collects nighttime images with different properties of nighttime environments, such as flare and extreme darkness. Extensive experiments demonstrate that our method outperforms existing works significantly. The dataset and source code will be released on GitHub soon.
    摘要 夜间监控由于光照不足而画质退化,人工标注也十分繁琐,这仍是夜间的一个安全隐患。现有方法依赖多光谱图像在黑暗中感知物体,但受限于低分辨率和颜色缺失。我们认为夜间监控的最终解决方案是夜晚到白天的翻译(Night2Day),即在保持语义一致的前提下,将监控场景从夜间翻译到白天。为此,本文提出了一种分离对比(DiCo)学习方法。具体而言,针对夜间场景中光照差且复杂的问题,我们提出了一个可学习的物理先验——颜色不变量,它能对高度动态的夜间环境提供稳定的感知,并可融入神经网络的学习流程。针对监控场景,我们设计了一种分离表示,作为辅助预训练任务,通过对比学习将监控场景分离为前景和背景。这一策略可以在无监督的情况下提取语义,并使模型实现实例感知的翻译。最后,我们将上述模块集成到生成对抗网络中,实现高保真度的翻译。本文还贡献了一个新的监控数据集 NightSuR,包含六个场景,用于支持夜间监控研究;该数据集收集了具有不同夜间环境特性(如眩光和极端黑暗)的夜间图像。大量实验表明,我们的方法显著优于现有工作。数据集和源代码即将在 GitHub 上发布。

Towards Anytime Optical Flow Estimation with Event Cameras

  • paper_url: http://arxiv.org/abs/2307.05033
  • repo_url: https://github.com/yaozhuwa/eva-flow
  • paper_authors: Yaozu Ye, Hao Shi, Kailun Yang, Ze Wang, Xiaoting Yin, Yaonan Wang, Kaiwei Wang
  • for: 该研究旨在使用事件摄像头实现高 Frame Rate 低延迟的光流估计,并提供高 Frame Rate 事件摄像头的数据集。
  • methods: 该研究使用了 Unified Voxel Grid 和 EVent-based Anytime Flow estimation 网络(EVA-Flow),以及 Stacked Spatiotemporal Motion Refinement(SMR)模块来实现高 Frame Rate 低延迟的光流估计。
  • results: 该研究实现了竞争性的性能,包括超低延迟(5毫秒)、最快的推理(9.2毫秒)、时间密集的运动估计(200Hz)和强大的总体化。
    Abstract Event cameras are capable of responding to log-brightness changes in microseconds. Its characteristic of producing responses only to the changing region is particularly suitable for optical flow estimation. In contrast to the super low-latency response speed of event cameras, existing datasets collected via event cameras, however, only provide limited frame rate optical flow ground truth, (e.g., at 10Hz), greatly restricting the potential of event-driven optical flow. To address this challenge, we put forward a high-frame-rate, low-latency event representation Unified Voxel Grid, sequentially fed into the network bin by bin. We then propose EVA-Flow, an EVent-based Anytime Flow estimation network to produce high-frame-rate event optical flow with only low-frame-rate optical flow ground truth for supervision. The key component of our EVA-Flow is the stacked Spatiotemporal Motion Refinement (SMR) module, which predicts temporally-dense optical flow and enhances the accuracy via spatial-temporal motion refinement. The time-dense feature warping utilized in the SMR module provides implicit supervision for the intermediate optical flow. Additionally, we introduce the Rectified Flow Warp Loss (RFWL) for the unsupervised evaluation of intermediate optical flow in the absence of ground truth. This is, to the best of our knowledge, the first work focusing on anytime optical flow estimation via event cameras. A comprehensive variety of experiments on MVSEC, DESC, and our EVA-FlowSet demonstrates that EVA-Flow achieves competitive performance, super-low-latency (5ms), fastest inference (9.2ms), time-dense motion estimation (200Hz), and strong generalization. Our code will be available at https://github.com/Yaozhuwa/EVA-Flow.
    摘要 事件相机能够在微秒级响应对数亮度变化,且只对变化区域产生响应,这一特性特别适合光流估计。与事件相机超低延迟的响应速度相比,现有由事件相机采集的数据集只提供有限帧率的光流真值(例如10Hz),极大限制了事件驱动光流的潜力。为解决这一挑战,我们提出了一种高帧率、低延迟的事件表示——统一体素网格(Unified Voxel Grid),按 bin 依次输入网络。随后我们提出了 EVA-Flow,一个基于事件的任意时刻光流估计网络,仅用低帧率光流真值作为监督即可产生高帧率事件光流。EVA-Flow 的关键组件是堆叠的时空运动细化(SMR)模块,它预测时间上密集的光流,并通过时空运动细化提升精度。SMR 模块中使用的时间密集特征扭曲为中间光流提供了隐式监督。此外,我们提出了校正光流扭曲损失(RFWL),用于在缺乏真值的情况下对中间光流进行无监督评估。据我们所知,这是首个关注基于事件相机的任意时刻光流估计的工作。在 MVSEC、DESC 以及我们的 EVA-FlowSet 上的大量实验表明,EVA-Flow 具有有竞争力的性能、超低延迟(5ms)、最快推理(9.2ms)、时间密集的运动估计(200Hz)以及强泛化能力。代码将在 https://github.com/Yaozhuwa/EVA-Flow 发布。
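
As a rough illustration of the kind of event representation described above, the sketch below bins a stream of (t, x, y, polarity) events into a fixed number of temporal bins with bilinear interpolation along time. It is a generic event voxel grid, not necessarily the paper's Unified Voxel Grid; all names and the bin count are placeholders.

```python
import numpy as np

def events_to_voxel_grid(events: np.ndarray, num_bins: int, height: int, width: int) -> np.ndarray:
    """events: (N, 4) array of (t, x, y, p) with polarity p in {-1, +1}.
    Returns a (num_bins, height, width) grid; each event's polarity is split
    between the two nearest temporal bins (bilinear interpolation in time)."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    # Normalize timestamps to [0, num_bins - 1].
    t_norm = (num_bins - 1) * (t - t.min()) / max(t.max() - t.min(), 1e-9)
    t0 = np.floor(t_norm).astype(int)
    frac = t_norm - t0
    for b, w_b in ((t0, 1.0 - frac), (np.clip(t0 + 1, 0, num_bins - 1), frac)):
        np.add.at(grid, (b, y, x), p * w_b)  # accumulate polarity into the two neighbor bins
    return grid
```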

TRansPose: Large-Scale Multispectral Dataset for Transparent Object

  • paper_url: http://arxiv.org/abs/2307.05016
  • repo_url: None
  • paper_authors: Jeongyun Kim, Myung-Hwan Jeon, Sangwoo Jung, Wooseong Yang, Minwoo Jung, Jaeho Shin, Ayoung Kim
  • for: 本研究旨在提供一个大规模多光谱数据集,以促进透明物体研究。
  • methods: 本研究使用立体 RGB-D 相机、热红外(TIR)相机以及物体位姿数据构建了大规模的 TRansPose 数据集。
  • results: 本研究提供了 333,819 帧图像和 4,000,056 条标注,包括实例级分割掩码、真值位姿和补全的深度信息等。
    Abstract Transparent objects are encountered frequently in our daily lives, yet recognizing them poses challenges for conventional vision sensors due to their unique material properties, not being well perceived from RGB or depth cameras. Overcoming this limitation, thermal infrared cameras have emerged as a solution, offering improved visibility and shape information for transparent objects. In this paper, we present TRansPose, the first large-scale multispectral dataset that combines stereo RGB-D, thermal infrared (TIR) images, and object poses to promote transparent object research. The dataset includes 99 transparent objects, encompassing 43 household items, 27 recyclable trashes, 29 chemical laboratory equivalents, and 12 non-transparent objects. It comprises a vast collection of 333,819 images and 4,000,056 annotations, providing instance-level segmentation masks, ground-truth poses, and completed depth information. The data was acquired using a FLIR A65 thermal infrared (TIR) camera, two Intel RealSense L515 RGB-D cameras, and a Franka Emika Panda robot manipulator. Spanning 87 sequences, TRansPose covers various challenging real-life scenarios, including objects filled with water, diverse lighting conditions, heavy clutter, non-transparent or translucent containers, objects in plastic bags, and multi-stacked objects. TRansPose dataset can be accessed from the following link: https://sites.google.com/view/transpose-dataset
    摘要 TRansPose 是首个结合立体 RGB-D、热红外(TIR)图像与物体位姿的大规模多光谱数据集,包含333,819张图像和4,000,056条标注,用于促进透明物体研究。该数据集涵盖99个透明物体(包括43件家用品、27件可回收垃圾和29件化学实验室器皿)以及12个非透明物体。数据由FLIR A65热红外相机、两台Intel RealSense L515 RGB-D相机和一台Franka Emika Panda机械臂采集,共87个序列,涵盖各种具有挑战性的真实场景,如装水的物体、多样的光照条件、严重的杂乱、非透明或半透明容器、装在塑料袋中的物体以及多层堆叠的物体。TRansPose数据集可从以下链接获取:https://sites.google.com/view/transpose-dataset。

Test-Time Training on Video Streams

  • paper_url: http://arxiv.org/abs/2307.05014
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
  • for: further improve a trained model at test time
  • methods: online test-time training (TTT) with masked autoencoders
  • results: significant improvement (45%-66%) in instance and panoptic segmentation tasks compared to fixed-model baseline, and outperformed offline TTT with more information
    Abstract Prior work has established test-time training (TTT) as a general framework to further improve a trained model at test time. Before making a prediction on each test instance, the model is trained on the same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances - video frames in our case - arrive in temporal order. Our extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses more information, training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on bias-variance trade-off.
    摘要 先前的工作已将测试时训练(TTT)确立为在测试阶段进一步改进已训练模型的通用框架:在对每个测试实例做出预测之前,模型先在同一实例上使用自监督任务(例如基于掩码自编码器的图像重建)进行训练。我们将TTT扩展到流式设定,即多个测试实例(在我们的场景中是视频帧)按时间顺序到达。我们的扩展是在线TTT:当前模型由上一帧的模型初始化,然后在当前帧及其前面的一个小窗口的帧上进行训练。在线TTT在三个真实数据集的四个任务上显著优于固定模型基线,在实例分割和全景分割上的相对提升分别为45%和66%。令人意外的是,在线TTT还优于其离线变体,尽管后者获得了更多信息(不考虑时间顺序地在整个测试视频的所有帧上训练)。这与之前基于合成视频的结论不同。我们将局部性视为在线TTT相对离线TTT的优势,并通过消融实验和基于偏差-方差权衡的理论分析了局部性的作用。
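
The online test-time training loop described above can be sketched as follows, assuming a toy masked-autoencoding task as the self-supervised objective; the window size, learning rate, and module definitions are illustrative stand-ins, not the paper's settings.

```python
import copy
import torch
import torch.nn as nn

class MaskedAE(nn.Module):
    """Toy stand-in for a masked autoencoder used as the self-supervised task."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        mask = (torch.rand_like(frames[:, :1]) > 0.5).float()   # random pixel mask
        recon = self.decoder(self.encoder(frames * mask))
        return ((recon - frames) ** 2 * (1 - mask)).mean()       # reconstruct only masked pixels

def online_ttt(frames, window: int = 4, steps: int = 1, lr: float = 1e-4):
    """Online TTT: before predicting on frame t, adapt the model on a small window of
    recent frames; each step continues from the previous model's weights."""
    model = MaskedAE()
    history = []
    for frame in frames:                       # frames arrive in temporal order, shape (3, H, W)
        history.append(frame)
        clip = torch.stack(history[-window:])  # current frame + a few frames right before it
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = model(clip)
            loss.backward()
            opt.step()
        yield copy.deepcopy(model)             # snapshot used for the prediction on this frame
```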

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

  • paper_url: http://arxiv.org/abs/2307.05000
  • repo_url: None
  • paper_authors: Cong Wang, Di Kang, Yanpei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang
  • for: 提高 AR/VR 和视频会议应用中人头的写实感和动态运动
  • methods: 使用神经点表示和神经体积渲染过程,避免使用 mesh-based 方法的固定连接和硬对应
  • results: 在 Multiface 数据集的三个对象上进行实验,表现优于此前的先进方法,特别是在处理困难的面部区域时。
    Abstract Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose Neural Point-based Volumetric Avatar, a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, our method is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions.
    摘要 渲染照片级真实且能自然运动的人头,对于在AR/VR和视频会议应用中提供愉悦而沉浸的体验至关重要。然而,现有方法往往难以建模具有挑战性的面部区域(如口腔内部、眼睛、头发/胡须),导致结果不真实且模糊。本文提出一种方法,采用神经点表示和神经体积渲染过程,摒弃了基于网格的方法所施加的预定义连接关系和硬对应。具体而言,神经点通过高分辨率UV位移图被约束在目标表情的表面附近,从而获得更强的建模能力和更精确的控制。我们引入三项技术创新以提升渲染和训练效率:基于图块的深度引导(着色点)采样策略、轻量级的辐射解码过程,以及训练时的Grid-Error-Patch(GEP)光线采样策略。得益于这一设计,我们的方法更善于处理拓扑变化区域和细薄结构,同时在驱动虚拟形象时保证精确的表情控制。在Multiface数据集的三个对象上进行的实验验证了我们设计的有效性,其表现优于此前的先进方法,尤其是在处理困难的面部区域时。

$\mathrm{SAM^{Med}}$: A medical image annotation framework based on large vision model

  • paper_url: http://arxiv.org/abs/2307.05617
  • repo_url: None
  • paper_authors: Chenglong Wang, Dexuan Li, Sucheng Wang, Chengxiu Zhang, Yida Wang, Yun Liu, Guang Yang
  • for: 这个研究旨在应用大量数据量计算机视觉模型,尤其是Segment Anything Model(SAM),以提高医疗影像标注的效率和精度。
  • methods: 本研究提出了一个增强的框架,名为 $\mathrm{SAM^{Med}}$,它利用 SAM 的能力以及提示学习的方法来处理医疗影像标注的下游任务。$\mathrm{SAM^{Med}}$ 框架包括两个子模组,即 $\mathrm{SAM^{assist}}$ 和 $\mathrm{SAM^{auto}}$。
  • results: 研究结果显示,$\mathrm{SAM^{Med}}$ 在医疗影像标注任务中具有优秀的效率和精度。所提出的 SAP-Net 模型仅用五张标注切片,即在肾脏和肝脏分割上分别取得 0.80 和 0.82 的平均 Dice 系数。
    Abstract Recently, the large vision model Segment Anything Model (SAM) has revolutionized the computer vision field, especially for image segmentation. SAM presented a new promptable segmentation paradigm that exhibits remarkable zero-shot generalization ability. Extensive research has explored the potential and limits of SAM in various downstream tasks. In this study, we present $\mathrm{SAM^{Med}}$, an enhanced framework for medical image annotation that leverages the capabilities of SAM. The $\mathrm{SAM^{Med}}$ framework consists of two submodules, namely $\mathrm{SAM^{assist}}$ and $\mathrm{SAM^{auto}}$. $\mathrm{SAM^{assist}}$ demonstrates the generalization ability of SAM to the downstream medical segmentation task using the prompt-learning approach. Results show a significant improvement in segmentation accuracy with only approximately 5 input points. The $\mathrm{SAM^{auto}}$ model aims to accelerate the annotation process by automatically generating input prompts. The proposed SAP-Net model achieves superior segmentation performance with only five annotated slices, achieving an average Dice coefficient of 0.80 and 0.82 for kidney and liver segmentation, respectively. Overall, $\mathrm{SAM^{Med}}$ demonstrates promising results in medical image annotation. These findings highlight the potential of leveraging large-scale vision models in medical image annotation tasks.
    摘要 最近,大型视觉模型 Segment Anything Model(SAM)在计算机视觉领域引起了革命性的变革,特别是在图像分割方面。SAM 提出了一种新的可提示分割范式,展现出强大的零样本泛化能力。已有大量研究探讨了 SAM 在各类下游任务中的潜力和局限。在这项研究中,我们提出了基于 SAM 的医疗图像标注框架 $\mathrm{SAM^{Med}}$。该框架由两个子模块组成:$\mathrm{SAM^{assist}}$ 和 $\mathrm{SAM^{auto}}$。$\mathrm{SAM^{assist}}$ 通过提示学习方法展示了 SAM 在下游医疗分割任务中的泛化能力,结果表明只需约 5 个输入点即可显著提高分割精度。$\mathrm{SAM^{auto}}$ 模型则旨在通过自动生成输入提示来加速标注过程。所提出的 SAP-Net 模型仅用 5 张标注切片,就在肾脏和肝脏分割上分别取得了 0.80 和 0.82 的平均 Dice 系数。总体而言,$\mathrm{SAM^{Med}}$ 在医疗图像标注任务中表现出色,这些发现凸显了大型视觉模型在医疗图像标注任务中的潜力。
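
For reference, the Dice coefficient reported above is the standard overlap measure between a predicted and an annotated binary mask:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks (1 = foreground)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```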

A Multi-view Impartial Decision Network for Frontotemporal Dementia Diagnosis

  • paper_url: http://arxiv.org/abs/2307.04981
  • repo_url: None
  • paper_authors: Guoyao Deng, Ke Zou, Meng Wang, Xuedong Yuan, Sancong Ying, Huazhu Fu
  • for: 本研究旨在提出一种可靠的、基于多视图功能磁共振成像(fMRI)的额颞叶痴呆(FTD)诊断方法,以解决现有 FTD 诊断方法存在的两个局限。
  • methods: 我们提出了一种可靠的多视图公正决策网络(MID-Net),使用多个专家模型从 fMRI 图像中提取丰富的神经网络信息作为证据,并使用 Dirichlet 分布在证据层面刻画专家的类别概率分布。我们还提出了一种新颖的公正决策器(IDer),在不增加额外计算成本的情况下归纳融合不同专家的意见。
  • results: 我们在高质量的 FTD fMRI 数据集上进行了大量实验,证明 MID-Net 优于以往方法,并对难以分类的样本给出较高的不确定性。我们认为,该方法朝着多专家条件下可靠的 FTD 决策迈出了重要一步。
    Abstract Frontotemporal Dementia (FTD) diagnosis has progressed successfully using deep learning techniques. However, current FTD identification methods suffer from two limitations. Firstly, they do not exploit the potential of multi-view functional magnetic resonance imaging (fMRI) for classifying FTD. Secondly, they do not consider the reliability of the multi-view FTD diagnosis. To address these limitations, we propose a reliable multi-view impartial decision network (MID-Net) for FTD diagnosis in fMRI. Our MID-Net provides confidence for each view and generates a reliable prediction without any conflict. To achieve this, we employ multiple expert models to extract evidence from the abundant neural network information contained in fMRI images. We then introduce the Dirichlet Distribution to characterize the expert class probability distribution from an evidence level. Additionally, a novel Impartial Decision Maker (IDer) is proposed to combine the different opinions inductively to arrive at an unbiased prediction without additional computation cost. Overall, our MID-Net dynamically integrates the decisions of different experts on FTD disease, especially when dealing with multi-view high-conflict cases. Extensive experiments on a high-quality FTD fMRI dataset demonstrate that our model outperforms previous methods and provides high uncertainty for hard-to-classify examples. We believe that our approach represents a significant step toward the deployment of reliable FTD decision-making under multi-expert conditions. We will release the codes for reproduction after acceptance.
    摘要 借助深度学习技术,额颞叶痴呆(FTD)的诊断已取得了长足进展。然而,当前的 FTD 识别方法存在两个局限:其一,它们没有利用多视图功能磁共振成像(fMRI)来对 FTD 进行分类;其二,它们没有考虑多视图 FTD 诊断的可靠性。为了解决这些局限,我们提出了一种可靠的多视图公正决策网络(MID-Net),用于基于 fMRI 的 FTD 诊断。MID-Net 为每个视图提供置信度,并在不产生冲突的情况下给出可靠的预测。为此,我们采用多个专家模型从 fMRI 图像所蕴含的丰富神经网络信息中提取证据,并引入 Dirichlet 分布在证据层面刻画专家的类别概率分布。此外,我们提出了一种新颖的公正决策器(IDer),以归纳方式融合不同意见,在不增加额外计算成本的情况下得到无偏的预测。总体而言,MID-Net 能够动态整合不同专家对 FTD 疾病的判断,尤其是在处理多视图高冲突的案例时。在高质量 FTD fMRI 数据集上的大量实验表明,我们的模型优于以往方法,并对难以分类的样本给出较高的不确定性。我们认为,该方法朝着多专家条件下可靠的 FTD 决策部署迈出了重要一步。代码将在论文录用后发布以供复现。

Diffusion idea exploration for art generation

  • paper_url: http://arxiv.org/abs/2307.04978
  • repo_url: None
  • paper_authors: Nikhil Verma
  • for: 这篇论文旨在使用文本和粗略草图作为引导信息,生成具有创意的艺术图像。
  • methods: 这篇论文使用当前最先进的扩散模型来生成图像:模型从一片随机点开始,依据输入的引导信息逐步将其转化为设计图像。
  • results: 初步实验展现出了有前景的定性结果。
    Abstract Cross-modal learning tasks have picked up pace in recent times. With a plethora of applications in diverse areas, generation of novel content using multiple modalities of data has remained a challenging problem. To address this, various generative modelling techniques have been proposed for specific tasks. Novel and creative image generation is an important aspect for industrial applications and could serve as an arm for novel content generation. Techniques proposed previously used Generative Adversarial Networks (GANs), autoregressive models and Variational Autoencoders (VAEs) for accomplishing similar tasks. These approaches are limited in their capability to produce images guided by either text instructions or rough sketch images, which decreases the overall performance of the image generator. We used state-of-the-art diffusion models to generate creative art by primarily leveraging text with additional support of rough sketches. Diffusion starts with a pattern of random dots and slowly converts that pattern into a design image using the guiding information fed into the model. Diffusion models have recently outperformed other generative models in image generation tasks using cross-modal data as guiding information. The initial experiments for this task of novel image generation demonstrated promising qualitative results.
    摘要 近来,跨模态学习任务发展迅速。尽管在各个领域有大量应用,利用多种数据模态生成新颖内容仍是一个具有挑战性的问题。为此,针对特定任务已提出了多种生成建模技术。新颖而富有创意的图像生成是工业应用的一个重要方面,可以作为新内容生成的助力。以往的方法使用生成对抗网络(GAN)、自回归模型和变分自编码器(VAE)来完成类似任务,但这些方法在根据文本指令或粗略草图生成图像方面能力有限,降低了图像生成器的整体性能。我们使用最先进的扩散模型,主要依靠文本并辅以粗略草图来生成创意艺术。扩散过程从一片随机点开始,利用输入模型的引导信息逐步将其转化为设计图像。近期,在以跨模态数据作为引导信息的图像生成任务中,扩散模型的表现已超越其他生成模型。针对这一新颖图像生成任务的初步实验展现出了有前景的定性结果。

SAM-U: Multi-box prompts triggered uncertainty estimation for reliable SAM in medical image

  • paper_url: http://arxiv.org/abs/2307.04973
  • repo_url: None
  • paper_authors: Guoyao Deng, Ke Zou, Kai Ren, Meng Wang, Xuedong Yuan, Sancong Ying, Huazhu Fu
  • for: 这个研究旨在提高Segmenting Anything(SAM)的可靠性和公平性,特别在医疗领域。
  • methods: 该研究提出了多个框架触发的uncertainty估计方法,通过Monte Carlo方法使用不同的提示参数来估计SAM预测结果的分布。
  • results: 实验结果表明,多个框架触发的augmentation可以提高SAM性能,并为每个像素提供不确定性。这成为了第一种可靠的SAM paradigm。
    Abstract Recently, Segmenting Anything has taken an important step towards general artificial intelligence. At the same time, its reliability and fairness have also attracted great attention, especially in the field of health care. In this study, we propose multi-box prompts triggered uncertainty estimation for SAM cues to demonstrate the reliability of segmented lesions or tissues. We estimate the distribution of SAM predictions via Monte Carlo with prior distribution parameters, which employs different prompts as formulation of test-time augmentation. Our experimental results found that multi-box prompts augmentation improve the SAM performance, and endowed each pixel with uncertainty. This provides the first paradigm for a reliable SAM.
    摘要 最近,Segment Anything(SAM)向通用人工智能迈出了重要一步。与此同时,其可靠性和公平性也受到了广泛关注,特别是在医疗领域。在这项研究中,我们提出了由多框提示触发的不确定性估计,用于证明 SAM 分割的病灶或组织的可靠性。我们通过蒙特卡洛方法并结合先验分布参数来估计 SAM 预测结果的分布,其中不同的提示构成了测试时增广。实验结果显示,多框提示增广可以提高 SAM 的性能,并赋予每个像素不确定性。这为可靠的 SAM 提供了第一种范式。
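
The multi-box Monte Carlo idea can be illustrated generically: run the same promptable segmenter with several perturbed box prompts and summarize the per-pixel disagreement. The `segment` callable below is a placeholder interface, not the actual SAM API, and the Gaussian box jitter is an assumption; only the averaging and per-pixel entropy are the point.

```python
import numpy as np

def mc_box_uncertainty(segment, image, box, n_samples=10, jitter=0.05, rng=None):
    """segment(image, box) -> (H, W) foreground probability map (placeholder interface).
    Returns the mean probability map and a per-pixel binary-entropy map over jittered boxes."""
    rng = rng or np.random.default_rng(0)
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    probs = []
    for _ in range(n_samples):
        # Perturb each box coordinate by a fraction of the box size.
        dx0, dy0, dx1, dy1 = rng.normal(0.0, jitter, 4) * (w, h, w, h)
        probs.append(segment(image, (x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1)))
    p = np.clip(np.mean(probs, axis=0), 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # per-pixel uncertainty
    return p, entropy
```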

Image Reconstruction using Enhanced Vision Transformer

  • paper_url: http://arxiv.org/abs/2307.05616
  • repo_url: None
  • paper_authors: Nikhil Verma, Deepkamal Kaur, Lydia Chau
  • for: 这个项目的目标是提高计算机视觉领域中图像去噪的能力,以便提高图像的量化测量精度。
  • methods: 该项目提出了一种基于视图转换器(ViT)的图像重建框架,该框架可以用于图像去噪、抑噪和填充等任务。项目还integrated了四种优化技术来提高模型的重建能力,包括局部敏感注意力(LSA)、偏移patch Tokenization(SPT)、旋转位嵌入(RoPE)以及基于生成对抗网络(GANs)的敌对损失函数。这些优化技术使得 transformer 更加有效地学习 dataset,而且增强了重建图像的分辨率。
  • results: 根据我们的实验,提出的架构在图像去噪和填充任务上的重建性能比对比(U-Net)模型高出3.5%的结构相似指标(SSIM)。而在添加了LSA、SPT和RoPE优化技术后,提出的架构在两个任务上的重建性能增加了大约5%的SSIM。
    Abstract Removing noise from images is a challenging and fundamental problem in the field of computer vision. Images captured by modern cameras are inevitably degraded by noise which limits the accuracy of any quantitative measurements on those images. In this project, we propose a novel image reconstruction framework which can be used for tasks such as image denoising, deblurring or inpainting. The model proposed in this project is based on Vision Transformer (ViT) that takes 2D images as input and outputs embeddings which can be used for reconstructing denoised images. We incorporate four additional optimization techniques in the framework to improve the model reconstruction capability, namely Locality Sensitive Attention (LSA), Shifted Patch Tokenization (SPT), Rotary Position Embeddings (RoPE) and adversarial loss function inspired from Generative Adversarial Networks (GANs). LSA, SPT and RoPE enable the transformer to learn from the dataset more efficiently, while the adversarial loss function enhances the resolution of the reconstructed images. Based on our experiments, the proposed architecture outperforms the benchmark U-Net model by more than 3.5% structural similarity (SSIM) for the reconstruction tasks of image denoising and inpainting. The proposed enhancements further show an improvement of ~5% SSIM over the benchmark for both tasks.
    摘要 去除图像中的噪声是计算机视觉领域一个基础且具有挑战性的问题。现代相机拍摄的图像不可避免地受到噪声影响,从而限制了在这些图像上进行定量测量的精度。在本项目中,我们提出了一种新的图像重建框架,可用于图像去噪、去模糊或图像修复等任务。该模型基于视觉 Transformer(ViT),以2D图像为输入,输出可用于重建去噪图像的嵌入。我们在框架中引入了四种优化技术以提升模型的重建能力:局部敏感注意力(LSA)、移位图块标记化(SPT)、旋转位置嵌入(RoPE)以及受生成对抗网络(GAN)启发的对抗损失函数。LSA、SPT 和 RoPE 使 Transformer 能更高效地从数据集中学习,而对抗损失函数则提升了重建图像的分辨率。实验表明,所提架构在图像去噪和图像修复的重建任务上,结构相似性指标(SSIM)比基准 U-Net 模型高出超过 3.5%;加入上述优化后,在两个任务上又较基准提升约 5% 的 SSIM。
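
The adversarial term mentioned above is typically added to a pixel-wise reconstruction loss on the generator side. A hedged sketch follows; the L1 reconstruction term, the non-saturating GAN form, and the weighting are illustrative choices, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def generator_loss(recon: torch.Tensor, target: torch.Tensor,
                   disc_logits_fake: torch.Tensor, adv_weight: float = 0.01) -> torch.Tensor:
    """Pixel reconstruction loss plus a non-saturating adversarial term.
    disc_logits_fake: discriminator logits on the reconstructed images."""
    pixel = F.l1_loss(recon, target)
    adversarial = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))  # push D to call fakes "real"
    return pixel + adv_weight * adversarial
```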

PKU-GoodsAD: A Supermarket Goods Dataset for Unsupervised Anomaly Detection and Segmentation

  • paper_url: http://arxiv.org/abs/2307.04956
  • repo_url: https://github.com/jianzhang96/goodsad
  • paper_authors: Jian Zhang, Runwei Ding, Miaoju Ban, Ge Yang
  • for: 这个研究是为了实现自动化超市商品异常检测,以扩展计算机视觉领域内的异常检测应用和研究。
  • methods: 这个研究使用了现有的无监督异常检测方法,并对其进行了评估。
  • results: 研究发现,一些在工业异常检测数据集(例如 MVTec AD)上表现良好的方法,在这个面向真实应用、涵盖多类物体的超市商品异常检测数据集上表现不佳。
    Abstract Visual anomaly detection is essential and commonly used for many tasks in the field of computer vision. Recent anomaly detection datasets mainly focus on industrial automated inspection, medical image analysis and video surveillance. In order to broaden the application and research of anomaly detection in unmanned supermarkets and smart manufacturing, we introduce the supermarket goods anomaly detection (GoodsAD) dataset. It contains 6124 high-resolution images of 484 different appearance goods divided into 6 categories. Each category contains several common different types of anomalies such as deformation, surface damage and opened. Anomalies contain both texture changes and structural changes. It follows the unsupervised setting and only normal (defect-free) images are used for training. Pixel-precise ground truth regions are provided for all anomalies. Moreover, we also conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods. This initial benchmark indicates that some methods which perform well on the industrial anomaly detection dataset (e.g., MVTec AD), show poor performance on our dataset. This is a comprehensive, multi-object dataset for supermarket goods anomaly detection that focuses on real-world applications.
    摘要 “视觉异常检测是计算机视觉领域中非常重要的任务之一。现有的异常检测数据集主要集中在自动化生产、医疗影像分析和视频监测等领域。为扩展无人超市和智能制造领域中的异常检测应用和研究,我们介绍了超市商品异常检测(GoodsAD)数据集。该数据集包含6124个高分辨率图像,分为6类不同的商品类型,每类含有多种常见的异常类型,如扭曲、表面损害和开启等。异常包括Texture变化和结构变化。数据集遵循无监督设置,只有无损图像用于训练。每个异常都有精确的像素精度的真实区域标注。此外,我们还进行了现有状态的权威无监督异常检测方法的完整评估。这个初始的比较表明,一些在工业异常检测数据集(例如MVTec AD)中表现出色的方法,在我们的数据集中表现不佳。这是一个complete、多对象的超市商品异常检测数据集,关注实际应用。”

Compact Twice Fusion Network for Edge Detection

  • paper_url: http://arxiv.org/abs/2307.04952
  • repo_url: https://github.com/li-yachuan/ctfn-pytorch-master
  • paper_authors: Yachuan Li, Zongmin Li, Xavier Soria P., Chaozhi Yang, Qian Xiao, Yun Bai, Hua Li, Xiangdong Wang
  • for: 这篇论文旨在提出一种参数量和计算成本都较低的多尺度特征融合网络,以实现高精度的边缘检测。
  • methods: 该方法包含两个轻量级多尺度特征融合模块:语义增强模块(SEM)利用粗尺度特征中的语义信息引导细尺度特征的学习;伪像素级加权(PPW)模块通过为所有特征分配权重来汇聚多尺度特征的互补优势。针对难以正确分类的像素,还提出了动态焦点损失(Dynamic Focal Loss)。
  • results: 与最先进的方法相比,CTFN 在 BSDS500、NYUDv2 和 BIPEDv2 三个数据集上取得了有竞争力的精度,同时参数更少、计算成本更低。除骨干网络外,CTFN 仅需 0.1M 额外参数,其计算成本仅为其他最先进方法的 60%。代码可在 https://github.com/Li-yachuan/CTFN-pytorch-master 获取。
    Abstract The significance of multi-scale features has been gradually recognized by the edge detection community. However, the fusion of multi-scale features increases the complexity of the model, which is not friendly to practical application. In this work, we propose a Compact Twice Fusion Network (CTFN) to fully integrate multi-scale features while maintaining the compactness of the model. CTFN includes two lightweight multi-scale feature fusion modules: a Semantic Enhancement Module (SEM) that can utilize the semantic information contained in coarse-scale features to guide the learning of fine-scale features, and a Pseudo Pixel-level Weighting (PPW) module that aggregate the complementary merits of multi-scale features by assigning weights to all features. Notwithstanding all this, the interference of texture noise makes the correct classification of some pixels still a challenge. For these hard samples, we propose a novel loss function, coined Dynamic Focal Loss, which reshapes the standard cross-entropy loss and dynamically adjusts the weights to correct the distribution of hard samples. We evaluate our method on three datasets, i.e., BSDS500, NYUDv2, and BIPEDv2. Compared with state-of-the-art methods, CTFN achieves competitive accuracy with less parameters and computational cost. Apart from the backbone, CTFN requires only 0.1M additional parameters, which reduces its computation cost to just 60% of other state-of-the-art methods. The codes are available at https://github.com/Li-yachuan/CTFN-pytorch-master.
    摘要 “多尺度特征的重要性逐渐被边缘检测社区所认可。然而,将多尺度特征融合到模型中增加了模型的复杂度,这不符合实际应用的需求。在这个工作中,我们提出了一个名为Compact Twice Fusion Network(CTFN)的方法,可以充分融合多尺度特征,同时保持模型的简洁性。CTFN包括两个轻量级多尺度特征融合模组:一个具有Semantic Enhancement Module(SEM),可以利用粗细度特征中的 semantics信息来引导细节特征的学习;另一个则是一个名为Pseudo Pixel-level Weighting(PPW)模组,可以将多尺度特征的补偿特点相互融合。不过,由于Texture noise的干扰,使得某些像素的正确分类仍然是一个挑战。为了解决这个问题,我们提出了一个新的损失函数,即Dynamic Focal Loss,它可以重新定义标准的交叉熵损失函数,并在适当的情况下动态地调整权重,以正确地对待困难的样本。我们在BSDS500、NYUDv2和BIPEDv2三个dataset上评估了我们的方法,与现有的方法相比,CTFN实现了竞争的精度,仅需额外0.1M参数,对应的计算成本只有60%。代码可以在https://github.com/Li-yachuan/CTFN-pytorch-master中找到。”
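
For reference, the standard focal loss that the Dynamic Focal Loss above reshapes can be written as follows; the dynamic re-weighting of hard samples is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL = -(1 - p_t)^gamma * log(p_t), averaged over pixels.
    logits and targets have the same shape; targets are in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    p_t = torch.exp(-bce)                    # model probability of the true class
    return ((1.0 - p_t) ** gamma * bce).mean()
```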

DDGM: Solving inverse problems by Diffusive Denoising of Gradient-based Minimization

  • paper_url: http://arxiv.org/abs/2307.04946
  • repo_url: None
  • paper_authors: Kyle Luther, H. Sebastian Seung
  • for: 这篇论文的目的是提出一种求解逆问题的简单方法,将传统的基于梯度的重建误差最小化与去噪相结合。
  • methods: 该方法在每次迭代中先对重建误差做梯度下降,再用去噪网络进行去噪,并在每一步注入噪声,使迭代动态类似于朗之万/扩散过程;所加噪声的水平和去噪步长均随时间呈指数衰减。
  • results: 研究发现,该方法只需 50 个去噪步骤即可获得高精度的重建结果,并且在该断层重建问题上比 DDRM 和 DPS 等更复杂的扩散方法精度更高(以 MSE 和 SSIM 衡量)。此外,该方法还可扩展到任意尺寸图像的重建。
    Abstract Inverse problems generally require a regularizer or prior for a good solution. A recent trend is to train a convolutional net to denoise images, and use this net as a prior when solving the inverse problem. Several proposals depend on a singular value decomposition of the forward operator, and several others backpropagate through the denoising net at runtime. Here we propose a simpler approach that combines the traditional gradient-based minimization of reconstruction error with denoising. Noise is also added at each step, so the iterative dynamics resembles a Langevin or diffusion process. Both the level of added noise and the size of the denoising step decay exponentially with time. We apply our method to the problem of tomographic reconstruction from electron micrographs acquired at multiple tilt angles. With empirical studies using simulated tilt views, we find parameter settings for our method that produce good results. We show that high accuracy can be achieved with as few as 50 denoising steps. We also compare with DDRM and DPS, more complex diffusion methods of the kinds mentioned above. These methods are less accurate (as measured by MSE and SSIM) for our tomography problem, even after the generation hyperparameters are optimized. Finally we extend our method to reconstruction of arbitrary-sized images and show results on 128 $\times$ 1568 pixel images
    摘要 逆问题通常需要正则项或先验才能得到良好的解。近来的一种趋势是训练卷积网络来给图像去噪,并在求解逆问题时将该网络用作先验。一些方案依赖于前向算子的奇异值分解,另一些则在运行时对去噪网络进行反向传播。本文提出一种更简单的方法,将传统的基于梯度的重建误差最小化与去噪相结合。每一步还会注入噪声,因此迭代动态类似于朗之万或扩散过程。所加噪声的水平和去噪步长均随时间呈指数衰减。我们将该方法应用于由多个倾斜角度采集的电子显微图像的断层重建问题。通过使用模拟倾斜视图的实验研究,我们找到了能产生良好结果的参数设置,并证明只需 50 个去噪步骤即可达到高精度。我们还与 DDRM 和 DPS 等上述更复杂的扩散方法进行了比较:即使优化了它们的生成超参数,这些方法在我们的断层重建问题上精度仍较低(以 MSE 和 SSIM 衡量)。最后,我们将方法扩展到任意尺寸图像的重建,并展示了在 128×1568 像素图像上的结果。
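
The iterative scheme described above (a gradient step on the reconstruction error, a denoising step, then noise injection, with both schedules decaying exponentially) can be sketched as below. The denoiser and forward operator are placeholders, and the default step sizes and decay rates are made up, not the paper's settings.

```python
import numpy as np

def ddgm_reconstruct(y, A, denoise, x0, n_steps=50, lr=0.1,
                     sigma0=0.5, noise_decay=0.9, step_decay=0.95, rng=None):
    """y: measurements, A: (M, N) forward operator, x0: initial flattened image estimate.
    denoise(x, sigma) -> x is a placeholder for a trained denoising network."""
    rng = rng or np.random.default_rng(0)
    x, sigma, step = x0.copy(), sigma0, 1.0
    for _ in range(n_steps):
        grad = A.T @ (A @ x - y)                        # gradient of 0.5 * ||A x - y||^2
        x = x - lr * grad                               # data-fidelity step
        x = x + step * (denoise(x, sigma) - x)          # move toward the denoised estimate
        x = x + sigma * rng.standard_normal(x.shape)    # Langevin-style noise injection
        sigma *= noise_decay                            # noise level decays exponentially
        step *= step_decay                              # denoising step size decays exponentially
    return x
```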

Count-Free Single-Photon 3D Imaging with Race Logic

  • paper_url: http://arxiv.org/abs/2307.04924
  • repo_url: None
  • paper_authors: Atul Ingle, David Maier
  • for: The paper develops an online approach for distance estimation with single-photon cameras (SPCs) that reduces bandwidth and power consumption while maintaining distance reconstruction accuracy similar to conventional processing methods.
  • methods: The paper uses race logic to process photon streams in the time-delay domain and constructs count-free equi-depth histograms using a binner element that converges on the median of a distribution.
  • results: The proposed approach provides an order-of-magnitude reduction in bandwidth and power consumption while maintaining similar distance reconstruction accuracy to conventional processing methods.
    Abstract Single-photon cameras (SPCs) have emerged as a promising technology for high-resolution 3D imaging. A single-photon 3D camera determines the round-trip time of a laser pulse by capturing the arrival of individual photons at each camera pixel. Constructing photon-timestamp histograms is a fundamental operation for a single-photon 3D camera. However, in-pixel histogram processing is computationally expensive and requires large amount of memory per pixel. Digitizing and transferring photon timestamps to an off-sensor histogramming module is bandwidth and power hungry. Here we present an online approach for distance estimation without explicitly storing photon counts. The two key ingredients of our approach are (a) processing photon streams using race logic, which maintains photon data in the time-delay domain, and (b) constructing count-free equi-depth histograms. Equi-depth histograms are a succinct representation for ``peaky'' distributions, such as those obtained by an SPC pixel from a laser pulse reflected by a surface. Our approach uses a binner element that converges on the median (or, more generally, to another quantile) of a distribution. We cascade multiple binners to form an equi-depth histogrammer that produces multi-bin histograms. Our evaluation shows that this method can provide an order of magnitude reduction in bandwidth and power consumption while maintaining similar distance reconstruction accuracy as conventional processing methods.
    摘要 单光子相机(SPC)已成为一种很有前景的高分辨率3D成像技术。单光子3D相机通过记录每个像素上单个光子的到达来确定激光脉冲的往返时间。构建光子时间戳直方图是单光子3D相机的一项基本操作,但在像素内进行直方图处理计算代价高,且每个像素需要大量内存;将光子时间戳数字化并传输到传感器外的直方图模块则非常耗费带宽和功耗。本文提出一种无需显式存储光子计数的在线距离估计方法,其两个关键要素是:(a)使用竞速逻辑(race logic)处理光子流,使光子数据保持在时间延迟域;(b)构建无计数的等深直方图。等深直方图是对"峰状"分布(例如 SPC 像素接收到表面反射的激光脉冲时得到的分布)的一种简洁表示。我们的方法使用一个收敛于分布中位数(或更一般地,其他分位数)的分箱单元,并级联多个分箱单元构成等深直方图器,以生成多箱直方图。评估表明,该方法可将带宽和功耗降低一个数量级,同时保持与传统处理方法相近的距离重建精度。
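
The binner element that "converges on the median (or another quantile)" can be illustrated with a count-free streaming estimator: keep one running estimate and nudge it up or down depending on which side each new sample falls. This is a generic frugal-style software sketch, not the paper's race-logic hardware design.

```python
def streaming_quantile(samples, q: float = 0.5, step: float = 1.0, init: float = 0.0) -> float:
    """Track the q-quantile of a stream with O(1) state and no stored counts.
    For q = 0.5 the estimate converges toward the median of the distribution."""
    estimate = init
    for s in samples:
        if s > estimate:
            estimate += step * q          # nudged up more often while below the quantile
        elif s < estimate:
            estimate -= step * (1.0 - q)  # nudged down more often while above it
    return estimate
```

At equilibrium the upward and downward nudges balance, which happens exactly when a fraction q of the samples lies below the estimate; cascading several such binners yields the multi-bin equi-depth histogram described in the abstract.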

Kinematically-Decoupled Impedance Control for Fast Object Visual Servoing and Grasping on Quadruped Manipulators

  • paper_url: http://arxiv.org/abs/2307.04918
  • repo_url: None
  • paper_authors: Riccardo Parosi, Mattia Risiglione, Darwin G. Caldwell, Claudio Semini, Victor Barasuol
  • for: 该论文旨在提出一个基于解耦机械臂运动链和阻抗控制的物体搜寻、接近和抓取(SAG)控制管道,并与基于图像的视觉伺服(IBVS)集成。
  • methods: 该管道利用运动学解耦实现末端执行器的快速运动与恢复,从而实现稳健的视觉伺服;阻抗控制的柔顺性使机器人与人和环境的交互更加安全。
  • results: 在我们140公斤HyQReal四脚机器人上测试了该管道,并在不同的动态移动、外部干扰和快速目标物移动等情况下,表现出了高效和稳定的性能。
    Abstract We propose a control pipeline for SAG (Searching, Approaching, and Grasping) of objects, based on a decoupled arm kinematic chain and impedance control, which integrates image-based visual servoing (IBVS). The kinematic decoupling allows for fast end-effector motions and recovery that leads to robust visual servoing. The whole approach and pipeline can be generalized for any mobile platform (wheeled or tracked vehicles), but is most suitable for dynamically moving quadruped manipulators thanks to their reactivity against disturbances. The compliance of the impedance controller makes the robot safer for interactions with humans and the environment. We demonstrate the performance and robustness of the proposed approach with various experiments on our 140 kg HyQReal quadruped robot equipped with a 7-DoF manipulator arm. The experiments consider dynamic locomotion, tracking under external disturbances, and fast motions of the target object.
    摘要 我们提出了一个用于物体搜寻、接近和抓取(SAG)的控制管道,基于解耦的机械臂运动链和阻抗控制,并集成了基于图像的视觉伺服(IBVS)。运动学解耦使末端执行器能够快速运动和恢复,从而实现稳健的视觉伺服。整套方法和管道可推广到任何移动平台(轮式或履带式车辆),但得益于对扰动的快速响应,它最适合动态运动的四足操作机器人。阻抗控制器的柔顺性使机器人在与人和环境交互时更加安全。我们在装有7自由度机械臂、重140公斤的 HyQReal 四足机器人上开展了多项实验,验证了所提方法的性能和鲁棒性。实验涵盖动态行走、外部扰动下的跟踪以及目标物体的快速运动。

Rapid Deforestation and Burned Area Detection using Deep Multimodal Learning on Satellite Imagery

  • paper_url: http://arxiv.org/abs/2307.04916
  • repo_url: https://github.com/h2oai/cvpr-multiearth-deforestation-segmentation
  • paper_authors: Gabor Fodor, Marcos V. Conde
  • for: 这项研究的目的是提出一种基于多模态卫星影像和远程感知技术的方法,用于估计亚马逊盆地的森林破坏和野火检测。
  • methods: 该研究使用了深度学习方法和全面数据处理技术,并开发了一个新的准备过程,以提高森林破坏和野火检测的精度。
  • results: 该研究成功地实现了高精度的森林破坏估计和野火检测,并且在未见图像上也能够达到高精度。code、模型和数据集都是开源的:https://github.com/h2oai/cvpr-multiearth-deforestation-segmentation。
    Abstract Deforestation estimation and fire detection in the Amazon forest poses a significant challenge due to the vast size of the area and the limited accessibility. However, these are crucial problems that lead to severe environmental consequences, including climate change, global warming, and biodiversity loss. To effectively address this problem, multimodal satellite imagery and remote sensing offer a promising solution for estimating deforestation and detecting wildfire in the Amazonia region. This research paper introduces a new curated dataset and a deep learning-based approach to solve these problems using convolutional neural networks (CNNs) and comprehensive data processing techniques. Our dataset includes curated images and diverse channel bands from Sentinel, Landsat, VIIRS, and MODIS satellites. We design the dataset considering different spatial and temporal resolution requirements. Our method successfully achieves high-precision deforestation estimation and burned area detection on unseen images from the region. Our code, models and dataset are open source: https://github.com/h2oai/cvpr-multiearth-deforestation-segmentation
    摘要 亚马逊雨林的森林耗损和野火探测存在巨大的挑战,主要是因为这个地区的面积非常广阔,同时访问受限。但这些问题对环境造成严重的影响,包括气候变化、全球暖化和生物多样性损失。为了有效解决这个问题,多模态卫星影像和远程感知技术提供了一个有前途的解决方案。本研究论文介绍了一个新的准备过的数据集和深度学习基于 convolutional neural networks (CNNs) 和全面的数据处理技术来解决森林耗损和野火探测问题。我们的数据集包括准备过的图像和多种通道频谱的卫星数据,包括 Sentinel、Landsat、VIIRS 和 MODIS 卫星。我们设计了数据集,考虑了不同的空间和时间分辨率要求。我们的方法在未看到图像上实现了高精度的森林耗损和烧毁地带探测。我们的代码、模型和数据集都是开源的,可以在 GitHub 上找到:https://github.com/h2oai/cvpr-multiearth-deforestation-segmentation。

Planar Curve Registration using Bayesian Inversion

  • paper_url: http://arxiv.org/abs/2307.04909
  • repo_url: None
  • paper_authors: Andreas Bock, Colin J. Cotter, Robert C. Kirby
  • for: 该研究将与参数化无关的平面闭合曲线匹配问题表述为贝叶斯逆问题。
  • methods: 该研究用作用于环境空间的微分同胚群上的曲线来建模曲线的运动,并使用 Wu-Xu 单元求解曲线匹配问题的 Hamilton 方程;随后利用贝叶斯反演求解动量的逆问题。
  • results: 研究采用集合卡尔曼反演(ensemble Kalman inversion),以负 Sobolev 范数失配惩罚来衡量目标形状与集合均值形状之间的差异,并给出了若干数值算例验证该方法。
    Abstract We study parameterisation-independent closed planar curve matching as a Bayesian inverse problem. The motion of the curve is modelled via a curve on the diffeomorphism group acting on the ambient space, leading to a large deformation diffeomorphic metric mapping (LDDMM) functional penalising the kinetic energy of the deformation. We solve Hamilton's equations for the curve matching problem using the Wu-Xu element [S. Wu, J. Xu, Nonconforming finite element spaces for $2m^\text{th}$ order partial differential equations on $\mathbb{R}^n$ simplicial grids when $m=n+1$, Mathematics of Computation 88 (316) (2019) 531-551] which provides mesh-independent Lipschitz constants for the forward motion of the curve, and solve the inverse problem for the momentum using Bayesian inversion. Since this element is not affine-equivalent we provide a pullback theory which expedites the implementation and efficiency of the forward map. We adopt ensemble Kalman inversion using a negative Sobolev norm mismatch penalty to measure the discrepancy between the target and the ensemble mean shape. We provide several numerical examples to validate the approach.
    摘要 我们将与参数化无关的平面闭合曲线匹配问题作为贝叶斯逆问题来研究。曲线的运动通过作用于环境空间的微分同胚群上的一条曲线来建模,由此得到对形变动能进行惩罚的大形变微分同胚度量映射(LDDMM)泛函。我们使用 Wu-Xu 单元[S. Wu, J. Xu, Mathematics of Computation 88 (316) (2019) 531-551]求解曲线匹配问题的 Hamilton 方程,该单元为曲线的前向运动提供了与网格无关的 Lipschitz 常数,并利用贝叶斯反演求解动量的逆问题。由于该单元不具备仿射等价性,我们给出了一套拉回(pullback)理论,以简化前向映射的实现并提高其效率。我们采用集合卡尔曼反演,以负 Sobolev 范数失配惩罚来衡量目标形状与集合均值形状之间的差异,并提供了若干数值算例来验证该方法。

Unsupervised Domain Adaptation with Deep Neural-Network

  • paper_url: http://arxiv.org/abs/2307.05601
  • repo_url: https://github.com/jetwev/domain-adaptation
  • paper_authors: Artem Bituitskii
  • for: 本研究为Unsupervised Domain Adaptation领域提供了一个分析现有方法、引入新方法,并在不同领域下进行视觉识别任务的改进。
  • methods: 本研究使用了现有的方法和新提出的方法来进行领域适应。
  • results: 本研究的结果表明,通过采用新的方法和技术,可以在不同领域下进行视觉识别任务的改进。
    Abstract This report contributes to the field of unsupervised domain adaptation by providing an analysis of existing methods, introducing a new approach, and demonstrating the potential for improving visual recognition tasks across different domains. The results of this study open up opportunities for further study and development of advanced methods in the field of domain adaptation.
    摘要 这份报告对无监督领域自适应的现有方法进行了分析,提出了一种新方法,并展示了在不同领域间提升视觉识别任务的潜力。这些研究结果为领域自适应领域中先进方法的进一步研究与发展开辟了新的机会。

Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.04859
  • repo_url: None
  • paper_authors: Alexander W. Bergman, Wang Yifan, Gordon Wetzstein
  • for: 文章旨在提供一种基于文本描述的3D人物头像生成方法,以满足数字化人物创建、虚拟现实等领域的需求。
  • methods: 该方法基于预训练的2D文本到图像扩散模型,直接利用这些模型生成3D多视角准确的辐射场,以实现3D人物头像的生成。新的优化策略引入了 geometry和 texture 的约束,以保证生成的3D人物头像与文本描述保持一致。
  • results: 实验结果表明, compared to CLIP 等其他方法,我们的扩散基于的3D人物头像生成方法可以提供更高的多样性和准确性。
    Abstract The ability to generate diverse 3D articulated head avatars is vital to a plethora of applications, including augmented reality, cinematography, and education. Recent work on text-guided 3D object generation has shown great promise in addressing these needs. These methods directly leverage pre-trained 2D text-to-image diffusion models to generate 3D-multi-view-consistent radiance fields of generic objects. However, due to the lack of geometry and texture priors, these methods have limited control over the generated 3D objects, making it difficult to operate inside a specific domain, e.g., human heads. In this work, we develop a new approach to text-guided 3D head avatar generation to address this limitation. Our framework directly operates on the geometry and texture of an articulable 3D morphable model (3DMM) of a head, and introduces novel optimization procedures to update the geometry and texture while keeping the 2D and 3D facial features aligned. The result is a 3D head avatar that is consistent with the text description and can be readily articulated using the deformation model of the 3DMM. We show that our diffusion-based articulated head avatars outperform state-of-the-art approaches for this task. The latter are typically based on CLIP, which is known to provide limited diversity of generation and accuracy for 3D object generation.
    摘要 生成多样化的、可驱动的3D头部虚拟形象,对增强现实、电影制作和教育等诸多应用至关重要。近期的文本引导3D物体生成研究已展现出很大的潜力:这些方法直接利用预训练的2D文本到图像扩散模型,为一般物体生成在多视角下一致的3D辐射场。然而,由于缺乏几何和纹理先验,这些方法对所生成3D物体的控制有限,难以在特定领域(如人类头部)内工作。在这项工作中,我们提出了一种新的文本引导3D头部虚拟形象生成方法来解决这一局限。我们的框架直接在可驱动的3D可形变头部模型(3DMM)的几何与纹理上进行操作,并引入新的优化流程,在更新几何和纹理的同时保持2D与3D面部特征的对齐。最终得到的3D头部虚拟形象与文本描述一致,并可直接利用3DMM的形变模型进行驱动。实验表明,我们基于扩散的可驱动头部虚拟形象优于现有最先进的方法;后者通常基于CLIP,而CLIP在3D物体生成中所能提供的多样性和准确性有限。

AmadeusGPT: a natural language interface for interactive animal behavioral analysis

  • paper_url: http://arxiv.org/abs/2307.04858
  • repo_url: https://github.com/adaptivemotorcontrollab/amadeusgpt
  • paper_authors: Shaokai Ye, Jessy Lauer, Mu Zhou, Alexander Mathis, Mackenzie W. Mathis
  • for: 这 paper 的目的是提供一种自然语言界面,使得动物行为分析可以轻松地转化为机器可读代码。
  • methods: 这 paper 使用了大型自然语言模型(LLM),如 GPT3.5 和 GPT4,以及一种新的双存储机制,以解决自然语言界面中的上下文窗口限制。
  • results: 作者通过基准测试表明,AmadeusGPT 可以在 MABE 2022 行为挑战任务上达到最先进的性能。
    Abstract The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To limit this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for making interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents it from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users then can interactively refine results, and seamlessly add new behavioral modules as needed. We benchmark AmadeusGPT and show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks. Note, an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT.
    摘要 量化和分析动物行为的过程,需要把对其动作的自然语言描述转化为机器可读的代码。然而,如果缺乏对动物行为的深入理解和机器学习技术知识,编写行为分析代码往往十分困难。为缩小这一差距,我们推出了 AmadeusGPT:一个能够将行为的自然语言描述转化为可执行代码的自然语言接口。GPT3.5 和 GPT4 等大语言模型(LLM)支持基于语言的交互式查询,非常适合进行交互式行为分析;但这些模型的理解能力受上下文窗口大小的限制,无法记住较早的对话。为克服这一限制,我们实现了一种新颖的双记忆机制,以符号作为上下文指针进行检索和保存,实现短期记忆与长期记忆之间的通信。具体而言,用户直接用自然语言定义行为,我们增强后的 GPT 基于核心 AmadeusGPT API(包含机器学习、计算机视觉、时空推理和可视化模块)生成代码;用户随后可以交互式地细化结果,并按需无缝添加新的行为模块。我们对 AmadeusGPT 进行了基准测试,证明其在 MABE 2022 行为挑战任务上可达到最先进的性能,而最终用户无需编写任何代码。总的来说,AmadeusGPT 提供了一种将深厚的生物学知识、大语言模型和核心计算机视觉模块融合为更自然的智能系统的新途径。代码和演示见:https://github.com/AdaptiveMotorControlLab/AmadeusGPT。

CREPE: Learnable Prompting With CLIP Improves Visual Relationship Prediction

  • paper_url: http://arxiv.org/abs/2307.04838
  • repo_url: https://github.com/llnl/crepe
  • paper_authors: Rakshith Subramanyam, T. S. Jayram, Rushil Anirudh, Jayaraman J. Thiagarajan
  • for: 这paper是为了探讨视觉语言模型(VLM),具体来说是CLIP,在预测视觉对象关系方面的潜力。
  • methods: 这paper使用的方法是UVTransE关系预测框架,该框架学习关系为一个翻译嵌入,包括主体、 объек和联合盒 embeddings。
  • results: 该论文将 CLIP 表示与 UVTransE 框架结合,在 Visual Genome 基准上取得了谓词估计的最先进性能(mR@5 27.79,mR@20 31.95),在 mR@20 上比近期最先进方法提升了 15.3%。
    Abstract In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, which involves interpreting visual features from images into language-based relations. Current state-of-the-art methods use complex graphical models that utilize language cues and visual features to address this challenge. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models paving for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding with subject, object, and union box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based representations for all three bounding boxes and introduces a novel contrastive training strategy to automatically infer the text prompt for union-box. Our approach achieves state-of-the-art performance in predicate estimation, mR@5 27.79, and mR@20 31.95 on the Visual Genome benchmark, achieving a 15.3\% gain in performance over recent state-of-the-art at mR@20. This work demonstrates CLIP's effectiveness in object relation prediction and encourages further research on VLMs in this challenging domain.
    摘要 在这篇论文中,我们探索了视觉语言模型(VLM),尤其是 CLIP,在预测视觉对象关系方面的潜力,即把图像中的视觉特征解读为基于语言的关系。当前最先进的方法使用复杂的图模型,结合语言线索和视觉特征来应对这一挑战。我们假设 CLIP 嵌入中强大的语言先验可以简化这些图模型,从而实现更简单的方法。我们采用 UVTransE 关系预测框架,该框架利用场景中的主体、客体和并集框嵌入,将关系学习为一种平移式嵌入。我们系统地探索了在 UVTransE 框架中基于 CLIP 的主体、客体和并集框表示的设计,并提出 CREPE(CLIP Representation Enhanced Predicate Estimation)。CREPE 对三个包围框均采用基于文本的表示,并引入一种新颖的对比训练策略来自动推断并集框的文本提示。我们的方法在 Visual Genome 基准上取得了谓词估计的最先进性能(mR@5 27.79,mR@20 31.95),在 mR@20 上比近期最先进方法提升 15.3%。这项工作证明了 CLIP 在对象关系预测中的有效性,并鼓励在这一具有挑战性的领域进一步研究 VLM。
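
The UVTransE formulation referenced above treats the predicate as a translational embedding derived from the union, subject, and object representations (roughly union minus subject minus object), followed by a classifier. Below is a toy sketch with CLIP-like feature vectors standing in for the box representations; the projections, dimensions, and classifier are illustrative, and CREPE's text-prompting strategy is not modeled.

```python
import torch
import torch.nn as nn

class TranslationalPredicateHead(nn.Module):
    """Toy UVTransE-style head: predicate embedding ~ union - subject - object,
    followed by a linear classifier over predicate classes."""
    def __init__(self, feat_dim: int = 512, embed_dim: int = 256, num_predicates: int = 50):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, embed_dim)
        self.proj_o = nn.Linear(feat_dim, embed_dim)
        self.proj_u = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_predicates)

    def forward(self, subj_feat, obj_feat, union_feat):
        pred_embed = self.proj_u(union_feat) - self.proj_s(subj_feat) - self.proj_o(obj_feat)
        return self.classifier(pred_embed)   # logits over predicate classes

# Usage with CLIP-like 512-d box features (random stand-ins here):
head = TranslationalPredicateHead()
logits = head(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
```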

Semantic-SAM: Segment and Recognize Anything at Any Granularity

  • paper_url: http://arxiv.org/abs/2307.04767
  • repo_url: https://github.com/ux-decoder/semantic-sam
  • paper_authors: Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, Jianfeng Gao
  • for: 本研究旨在开发一种通用的图像分割模型,以实现任何粒度的分割和识别。
  • methods: 我们的模型具有两个优势:具备semantic-awareness和granularity-abundance。我们通过将多个数据集集成到三个粒度上,并引入分离类别的object和part分类来实现semantic-awareness。为实现多粒度能力,我们提出了多选学习方案,使得每个鼠标单击可以生成多个粒度的mask,与多个真实粒度的mask相匹配。
  • results: 实验结果和视觉化表明,我们的模型成功实现了semantic-awareness和granularity-abundance。此外,将SA-1B训练与其他分割任务相结合,如全景(panoptic)分割和部件(part)分割,会提高性能。我们将提供代码和demo,以便进一步的探索和评估。
    Abstract In this paper, we introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels that correspond to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that our model successfully achieves semantic-awareness and granularity-abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements. We will provide code and a demo for further exploration and evaluation.
    摘要 在本文中,我们介绍Semantic-SAM模型,一种通用的图像分割模型,可以在任何希望的粒度上进行分割和识别。我们的模型具有两个优势:具有 semantic-awareness 和 granularity-abundance。为实现 semantic-awareness,我们将多个数据集合并,并引入分离类别对象和部件。这使得我们的模型能够捕捉丰富的Semantic信息。为实现多粒度能力,我们提议在训练时使用多选学习方案,使每个鼠标单击可以生成多个粒度的mask,与多个真实粒度的mask相匹配。值得注意的是,这是首次将SA-1B、通用和部分分割数据集合并训练模型。实验结果和视觉化显示,我们的模型成功实现 semantic-awareness 和 granularity-abundance。此外,将SA-1B训练与其他分割任务,如全景(panoptic)分割和部件分割,结合可以提高性能。我们将提供代码和示例,以便进一步探索和评估。

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

  • paper_url: http://arxiv.org/abs/2307.04760
  • repo_url: None
  • paper_authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman
  • for: 学习基于 Egocentric 视频中的空间声视关系的表示,提高 Egocentric 视频中的社交场景中的空间理解。
  • methods: 使用masked auto-encoding框架,通过声视同步学习,学习声视之间的有用关系。
  • results: 大量实验表明,我们的特征足够通用,能够在 EgoCom 和 EasyCom 这两个具有挑战性的公开自我中心视频数据集上,超越多个最先进的基线,提升说话者检测和空间音频去噪这两个视频任务的性能。
    Abstract We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. In particular, our method leverages a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. We show through extensive experiments that our features are generic enough to improve over multiple state-of-the-art baselines on two public challenging egocentric video datasets, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
    摘要 我们提出一种自监督方法,基于自我中心视频中的空间视听对应关系来学习表示。具体而言,我们的方法利用掩码自编码框架,借助音频与视觉的协同来合成被掩码的双耳音频,从而学习两种模态之间有用的空间关系。我们使用预训练得到的特征来处理两个需要在社交场景中进行空间理解的下游视频任务:活跃说话者检测和空间音频去噪。大量实验表明,我们的特征足够通用,能够在 EgoCom 和 EasyCom 这两个具有挑战性的公开自我中心视频数据集上超越多个最先进的基线。项目主页:http://vision.cs.utexas.edu/projects/ego_av_corr。

Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

  • paper_url: http://arxiv.org/abs/2307.04751
  • repo_url: None
  • paper_authors: Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, Dieter Fox
  • for: 该系统用于重新排列场景中的物体,以实现期望的物体-场景放置关系,例如把一本书插入书架上的空槽中。
  • methods: 该系统在3D点云上训练,采用迭代式位姿去噪训练过程,能够拟合多模态的示教数据并产生多模态输出,同时保持精确与准确;它还以相关的局部几何特征为条件,忽略会损害泛化与精度的无关全局结构。
  • results: 该系统能够泛化到场景和物体的新几何形状、位姿与布局,并在三个需要处理多模态性和泛化能力的重排任务上(仿真与真实环境)得到了验证。
    Abstract We propose a system for rearranging objects in a scene to achieve a desired object-scene placing relationship, such as a book inserted in an open slot of a bookshelf. The pipeline generalizes to novel geometries, poses, and layouts of both scenes and objects, and is trained from demonstrations to operate directly on 3D point clouds. Our system overcomes challenges associated with the existence of many geometrically-similar rearrangement solutions for a given scene. By leveraging an iterative pose de-noising training procedure, we can fit multi-modal demonstration data and produce multi-modal outputs while remaining precise and accurate. We also show the advantages of conditioning on relevant local geometric features while ignoring irrelevant global structure that harms both generalization and precision. We demonstrate our approach on three distinct rearrangement tasks that require handling multi-modality and generalization over object shape and pose in both simulation and the real world. Project website, code, and videos: https://anthonysimeonov.github.io/rpdiff-multi-modal/
    摘要 我们提出了一个系统,用于重新排列场景中的物体,以实现期望的物体-场景放置关系,例如将一本书插入书架上的空槽中。该管道可以泛化到场景和物体的新几何形状、位姿与布局,并通过示教训练直接在3D点云上运行。对于给定场景往往存在许多几何上相似的重排解,我们的系统克服了由此带来的挑战:借助迭代式位姿去噪训练过程,我们能够拟合多模态示教数据并产生多模态输出,同时保持精确与准确。我们还展示了以相关的局部几何特征为条件、同时忽略会损害泛化与精度的无关全局结构所带来的优势。我们在三个不同的重排任务上演示了该方法,这些任务要求在仿真和真实环境中处理多模态性以及对物体形状和位姿的泛化。项目主页、代码与视频:https://anthonysimeonov.github.io/rpdiff-multi-modal/

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

  • paper_url: http://arxiv.org/abs/2307.04725
  • repo_url: https://github.com/guoyww/animatediff
  • paper_authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai
  • for: 这个论文的目的是提出一种实用的框架,以便使得大多数现有的个性化文本到图模型都可以一键animate。
  • methods: 该框架的核心是插入一个新初始化的动态模型模块,并在视频剪辑中培养它以抽取合理的动态约束。
  • results: 在多个具有代表性的公开个性化文本到图像模型上进行评估,研究发现该框架可以使这些模型生成时间上平滑的动画片段,同时保持其原有领域和输出多样性。
    Abstract With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at https://animatediff.github.io/ .
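A minimal sketch of the frozen-spatial/trainable-temporal split described above: a temporal attention layer (an assumed form of the motion module) is appended to a frozen spatial layer, so only the newly added parameters receive gradients during video training. The module names, shapes, and wiring are illustrative assumptions, not the released AnimateDiff code.

```python
# Illustrative sketch: insert a trainable temporal "motion module" after a
# frozen spatial block of a text-to-image UNet. Only the temporal parameters
# are updated when training on video clips.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention across the frame axis; assumed structure for the motion module."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, channels) -> attend over frames per spatial token.
        b, f, n, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * n, f, c)
        q = self.norm(h)
        h = h + self.attn(q, q, q)[0]       # residual temporal attention
        return h.reshape(b, n, f, c).permute(0, 2, 1, 3)


class AnimatedBlock(nn.Module):
    """Frozen spatial layer followed by a trainable temporal layer."""

    def __init__(self, spatial: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial.requires_grad_(False)  # keep the T2I weights frozen
        self.temporal = TemporalAttention(channels)   # only this part is trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, n, c = x.shape
        x = self.spatial(x.reshape(b * f, n, c)).reshape(b, f, n, c)
        return self.temporal(x)


if __name__ == "__main__":
    spatial = nn.Linear(320, 320)              # placeholder for a frozen spatial layer
    block = AnimatedBlock(spatial, channels=320)
    video_feats = torch.randn(2, 16, 64, 320)  # (batch, frames, tokens, channels)
    out = block(video_feats)
    trainable = [name for name, p in block.named_parameters() if p.requires_grad]
    print(out.shape, trainable)                # only temporal.* parameters are trainable
```

Because the spatial weights stay frozen, the same trained temporal module can in principle be dropped into any personalized variant derived from the same base model, which is the property the paper exploits.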

CVPR MultiEarth 2023 Deforestation Estimation Challenge: SpaceVision4Amazon

  • paper_url: http://arxiv.org/abs/2307.04715
  • repo_url: None
  • paper_authors: Sunita Arya, S Manthira Moorthi, Debajyoti Dhar
  • for: This work develops a deforestation estimation method based on an attention-guided UNet architecture, using Electro-Optical (EO) and Synthetic Aperture Radar (SAR) satellite imagery.
  • methods: Landsat-8 (EO) and Sentinel-1 (SAR) imagery are used for training and validation; because temporally and spatially collocated data were unavailable, a separate model is trained for each sensor (a sketch of the attention-gating idea follows this entry).
  • results: During training, the Landsat-8 model reached 93.45% training and validation pixel accuracy and the Sentinel-1 model reached 83.87%. On the test set, the model achieved 84.70% pixel accuracy with an F1-Score of 0.79 and an IoU of 0.69.
    Abstract In this paper, we present a deforestation estimation method based on attention guided UNet architecture using Electro-Optical (EO) and Synthetic Aperture Radar (SAR) satellite imagery. For optical images, Landsat-8 and for SAR imagery, Sentinel-1 data have been used to train and validate the proposed model. Due to the unavailability of temporally and spatially collocated data, individual model has been trained for each sensor. During training time Landsat-8 model achieved training and validation pixel accuracy of 93.45% and Sentinel-2 model achieved 83.87% pixel accuracy. During the test set evaluation, the model achieved pixel accuracy of 84.70% with F1-Score of 0.79 and IoU of 0.69.
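One common reading of "attention guided UNet" is the Attention U-Net style gating of encoder skip connections by the coarser decoder signal. The sketch below shows such an attention gate; it is an assumed interpretation for illustration, not the authors' implementation, and a separate instance of the full network would be trained per sensor (Landsat-8 and Sentinel-1) as described above.

```python
# Minimal sketch of an attention gate in the Attention U-Net style: the gate
# re-weights encoder skip features using the coarser decoder (gating) signal
# before they are concatenated in the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int):
        super().__init__()
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # project skip features
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)    # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)          # scalar attention map

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Upsample the gating signal to the skip resolution, compute an additive
        # attention map, and scale the skip features with it.
        g = F.interpolate(self.phi(gate), size=skip.shape[-2:],
                          mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.theta(skip) + g)))
        return skip * attn


if __name__ == "__main__":
    gate = AttentionGate(skip_ch=64, gate_ch=128, inter_ch=32)
    skip_feats = torch.randn(1, 64, 128, 128)  # encoder features (e.g., a Landsat-8 patch)
    gating = torch.randn(1, 128, 64, 64)       # coarser decoder features
    print(gate(skip_feats, gating).shape)      # torch.Size([1, 64, 128, 128])
```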

FreeDrag: Point Tracking is Not What You Need for Interactive Point-based Image Editing

  • paper_url: http://arxiv.org/abs/2307.04684
  • repo_url: https://github.com/lpengyang/freedrag
  • paper_authors: Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin
  • for: Improve the precision and flexibility of interactive point-based image editing by addressing DragGAN's miss-tracking and ambiguous-tracking problems.
  • methods: Adopt a feature-oriented approach that combines adaptive template features, line search, and fuzzy localization to achieve stable and efficient point-based image editing (a conceptual sketch of one drag step follows this entry).
  • results: Outperforms DragGAN and enables stable, precise point-based editing in challenging scenarios with similar structures, fine details, or multi-point targets.
    Abstract To serve the intricate and varied demands of image editing, precise and flexible manipulation of image content is indispensable. Recently, DragGAN has achieved impressive editing results through point-based manipulation. However, we have observed that DragGAN struggles with miss tracking, where DragGAN encounters difficulty in effectively tracking the desired handle points, and ambiguous tracking, where the tracked points are situated within other regions that bear resemblance to the handle points. To deal with the above issues, we propose FreeDrag, which adopts a feature-oriented approach to free the burden on point tracking within the point-oriented methodology of DragGAN. The FreeDrag incorporates adaptive template features, line search, and fuzzy localization techniques to perform stable and efficient point-based image editing. Extensive experiments demonstrate that our method is superior to the DragGAN and enables stable point-based editing in challenging scenarios with similar structures, fine details, or under multi-point targets.
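The feature-oriented editing step can be sketched conceptually: keep an adaptive template feature for each handle point, move the point along the straight line toward its target (line search), and only update the template when the sampled feature still matches it (a fuzzy acceptance test). The functions, thresholds, and update rule below are simplifying assumptions for illustration, not the released FreeDrag implementation.

```python
# Conceptual sketch of a feature-oriented drag step in the spirit of FreeDrag:
# instead of re-localizing the handle by nearest-feature search, maintain an
# adaptive template feature and move toward the target along the handle->target line.
import torch
import torch.nn.functional as F


def sample_feature(feat: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample a (C, H, W) feature map at a pixel coordinate (x, y)."""
    _, h, w = feat.shape
    gx = 2 * xy[0] / (w - 1) - 1          # normalize to [-1, 1] for grid_sample
    gy = 2 * xy[1] / (h - 1) - 1
    grid = torch.stack([gx, gy]).view(1, 1, 1, 2)
    return F.grid_sample(feat.unsqueeze(0), grid, align_corners=True).view(-1)


def drag_step(feat, handle, target, template, step=2.0, momentum=0.9, tol=1.0):
    """One editing step: move the handle along the line to the target and adapt
    the template feature only when the local feature still matches it."""
    direction = target - handle
    dist = direction.norm()
    if dist < tol:                                    # already at the target point
        return handle, template
    new_handle = handle + step * direction / dist     # line search along handle->target
    current = sample_feature(feat, new_handle)
    if F.mse_loss(current, template) < 1.0:           # fuzzy acceptance threshold (assumed)
        template = momentum * template + (1 - momentum) * current  # adaptive template update
    return new_handle, template


if __name__ == "__main__":
    feat = torch.randn(256, 64, 64)          # generator feature map (C, H, W)
    handle = torch.tensor([20.0, 30.0])      # current handle point (x, y)
    target = torch.tensor([40.0, 30.0])      # desired target point
    template = sample_feature(feat, handle)  # initial template feature
    handle, template = drag_step(feat, handle, target, template)
    print(handle)                            # tensor([22., 30.])
```

In a full system this step would be repeated inside the generator's latent optimization loop; the point here is only how an adaptive template plus a line-constrained update can replace explicit point tracking.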