cs.CV - 2023-08-09

Density Crop-guided Semi-supervised Object Detection in Aerial Images

  • paper_url: http://arxiv.org/abs/2308.05032
  • repo_url: https://github.com/akhilpm/dronessod
  • paper_authors: Akhil Meethal, Eric Granger, Marco Pedersoli
  • for: This paper addresses the cost of training object detectors on aerial images, where annotators must label small objects that are often distributed in clusters over high-resolution images.
  • methods: A mean-teacher detector is trained with pseudo-labels and weak-strong augmentation consistency, guided by density crops: clusters of small objects are identified in both labeled and unlabeled images, cropped, and added to the training set, which increases the chance of detecting small objects and of producing good pseudo-labels for them. At inference, the detector also predicts high-density regions (density crops), and detections from these crops are merged with detections on the full image (see the merging sketch after the abstract below).
  • results: On VisDrone and DOTA, the density crop-guided semi-supervised detector improves COCO-style AP by more than 2% on average over the basic mean-teacher method.
    Abstract One of the important bottlenecks in training modern object detectors is the need for labeled images where bounding box annotations have to be produced for each object present in the image. This bottleneck is further exacerbated in aerial images where the annotators have to label small objects often distributed in clusters on high-resolution images. In recent days, the mean-teacher approach trained with pseudo-labels and weak-strong augmentation consistency is gaining popularity for semi-supervised object detection. However, a direct adaptation of such semi-supervised detectors for aerial images where small clustered objects are often present, might not lead to optimal results. In this paper, we propose a density crop-guided semi-supervised detector that identifies the cluster of small objects during training and also exploits them to improve performance at inference. During training, image crops of clusters identified from labeled and unlabeled images are used to augment the training set, which in turn increases the chance of detecting small objects and creating good pseudo-labels for small objects on the unlabeled images. During inference, the detector is not only able to detect the objects of interest but also regions with a high density of small objects (density crops) so that detections from the input image and detections from image crops are combined, resulting in an overall more accurate object prediction, especially for small objects. Empirical studies on the popular benchmarks of VisDrone and DOTA datasets show the effectiveness of our density crop-guided semi-supervised detector with an average improvement of more than 2\% over the basic mean-teacher method in COCO style AP. Our code is available at: https://github.com/akhilpm/DroneSSOD.
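    A minimal sketch of the inference-time fusion: detections found inside each density crop are shifted back to full-image coordinates and merged with the full-image detections by non-maximum suppression. The box layout (x1, y1, x2, y2, score), the greedy NMS, and the crop offsets below are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def nms(boxes, iou_thr=0.5):
            """Greedy non-maximum suppression over boxes of shape (N, 5): x1, y1, x2, y2, score."""
            order = boxes[:, 4].argsort()[::-1]
            keep = []
            while order.size > 0:
                i = order[0]
                keep.append(i)
                xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
                yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
                xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
                yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
                inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
                area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
                area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
                iou = inter / (area_i + area_r - inter + 1e-9)
                order = order[1:][iou <= iou_thr]
            return boxes[keep]

        def merge_crop_detections(full_dets, crop_dets, crop_origins):
            """full_dets: (N, 5) boxes on the full image; crop_dets[i]: boxes inside crop i;
            crop_origins[i]: (x_off, y_off) of crop i in the full image."""
            mapped = []
            for dets, (x_off, y_off) in zip(crop_dets, crop_origins):
                d = dets.copy()
                d[:, [0, 2]] += x_off   # shift x coordinates back to the full-image frame
                d[:, [1, 3]] += y_off   # shift y coordinates
                mapped.append(d)
            all_dets = np.concatenate([full_dets] + mapped, axis=0)
            return nms(all_dets)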

An End-to-End Framework of Road User Detection, Tracking, and Prediction from Monocular Images

  • paper_url: http://arxiv.org/abs/2308.05026
  • repo_url: None
  • paper_authors: Hao Cheng, Mengmeng Liu, Lin Chen
  • for: This work proposes an end-to-end detection, tracking, and trajectory prediction framework that helps autonomous vehicles achieve more accurate trajectory prediction.
  • methods: The framework, ODTP (Online Detection, Tracking and Prediction), adopts the state-of-the-art online multi-object tracking model QD-3DT for perception and trains the trajectory predictor DCENet++ directly on the detection results instead of relying purely on ground-truth trajectories.
  • results: Extensive experiments on the nuScenes dataset show that ODTP achieves strong end-to-end trajectory prediction; DCENet++, with enhanced dynamic maps, predicts more accurate trajectories than its base model and is more robust than other generative and deterministic predictors trained on noisy detection results.
    Abstract Perception that involves multi-object detection and tracking, and trajectory prediction are two major tasks of autonomous driving. However, they are currently mostly studied separately, which results in most trajectory prediction modules being developed based on ground truth trajectories without taking into account that trajectories extracted from the detection and tracking modules in real-world scenarios are noisy. These noisy trajectories can have a significant impact on the performance of the trajectory predictor and can lead to serious prediction errors. In this paper, we build an end-to-end framework for detection, tracking, and trajectory prediction called ODTP (Online Detection, Tracking and Prediction). It adopts the state-of-the-art online multi-object tracking model, QD-3DT, for perception and trains the trajectory predictor, DCENet++, directly based on the detection results without purely relying on ground truth trajectories. We evaluate the performance of ODTP on the widely used nuScenes dataset for autonomous driving. Extensive experiments show that ODTP achieves high performance end-to-end trajectory prediction. DCENet++, with the enhanced dynamic maps, predicts more accurate trajectories than its base model. It is also more robust when compared with other generative and deterministic trajectory prediction models trained on noisy detection results.

Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.05022
  • repo_url: https://github.com/avc2-uestc/craft-sr
  • paper_authors: Ao Li, Le Zhang, Yun Liu, Ce Zhu
  • for: Improving the performance of single image super-resolution (SISR) with transformer-based methods.
  • methods: The paper proposes CRAFT, which integrates the strengths of both convolutional and transformer structures. CRAFT consists of three key components: the high-frequency enhancement residual block (HFERB), the shift rectangle window attention block (SRWAB), and the hybrid fusion block (HFB) (a high-frequency decomposition sketch follows the abstract below).
  • results: CRAFT outperforms state-of-the-art methods by up to 0.29 dB while using fewer parameters, as demonstrated through experiments on multiple datasets.
    Abstract Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. Our experiments on multiple datasets demonstrate that CRAFT outperforms state-of-the-art methods by up to 0.29dB while using fewer parameters. The source code will be made available at: https://github.com/AVC2-UESTC/CRAFT-SR.git.
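    The high-frequency prior behind HFERB can be made concrete with a simple decomposition: a low-pass (blurred) copy of a feature map carries the low-frequency content that transformers capture well, and the residual left after subtracting it is the high-frequency detail the convolutional branch is meant to enhance. The box-filter blur below is an assumed stand-in, not the actual HFERB design.

        import numpy as np

        def box_blur(x, k=3):
            """Box filter over a 2D array; a crude low-pass approximation."""
            pad = k // 2
            padded = np.pad(x, pad, mode="edge")
            out = np.zeros_like(x, dtype=np.float64)
            for dy in range(k):
                for dx in range(k):
                    out += padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
            return out / (k * k)

        def split_frequencies(feature_map, k=3):
            """Return (low_freq, high_freq) where high_freq = input - low-pass(input)."""
            low = box_blur(feature_map, k)
            return low, feature_map - low

        # Example: the high-frequency part highlights the edge of a step image.
        img = np.zeros((8, 8)); img[:, 4:] = 1.0
        low, high = split_frequencies(img)
        print(np.abs(high).max())  # the largest response sits on the step edge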

Robust Object Modeling for Visual Tracking

  • paper_url: http://arxiv.org/abs/2308.05140
  • repo_url: https://github.com/dawnyc/romtrack
  • paper_authors: Yidong Cai, Jie Liu, Jie Tang, Gangshan Wu
  • for: Improving the robustness and performance of visual tracking, especially under object deformation and appearance variations in cluttered environments.
  • methods: Proposes ROMTrack, a robust object modeling framework that simultaneously models inherent template features and hybrid template features. Combining the inherent features of the target with guidance from the search region suppresses harmful distractors, while the hybrid template extracts additional target-related features; novel variation tokens further describe the ever-changing appearance of the target at negligible computational cost.
  • results: Experiments show that ROMTrack sets a new state-of-the-art on multiple benchmarks, indicating that the proposed framework effectively improves the robustness and performance of visual tracking.
    Abstract Object modeling has become a core part of recent tracking frameworks. Current popular trackers use Transformer attention to extract the template feature separately or interactively with the search region. However, separate template learning lacks communication between the template and search regions, which brings difficulty in extracting discriminative target-oriented features. On the other hand, interactive template learning produces hybrid template features, which may introduce potential distractors to the template via the cluttered search regions. To enjoy the merits of both methods, we propose a robust object modeling framework for visual tracking (ROMTrack), which simultaneously models the inherent template and the hybrid template features. As a result, harmful distractors can be suppressed by combining the inherent features of target objects with search regions' guidance. Target-related features can also be extracted using the hybrid template, thus resulting in a more robust object modeling framework. To further enhance robustness, we present novel variation tokens to depict the ever-changing appearance of target objects. Variation tokens are adaptable to object deformation and appearance variations, which can boost overall performance with negligible computation. Experiments show that our ROMTrack sets a new state-of-the-art on multiple benchmarks.

Do Diffusion Models Suffer Error Propagation? Theoretical Analysis and Consistency Regularization

  • paper_url: http://arxiv.org/abs/2308.05021
  • repo_url: None
  • paper_authors: Yangming Li, Zhaozhi Qian, Mihaela van der Schaar
  • for: This paper aims to address the error propagation issue in diffusion models, which can cause the cascade structure to magnify distributional mismatches.
  • methods: The paper proposes a regularization scheme to address error propagation in diffusion models, which is based on a consistency constraint that ensures the forward and backward processes have similar distributions.
  • results: The paper shows through theoretical analysis and experimental results that the proposed regularization scheme can effectively reduce error propagation in diffusion models, leading to improved performance on multiple image datasets.
    Abstract While diffusion models have achieved promising performances in data synthesis, they might suffer error propagation because of their cascade structure, where the distributional mismatch spreads and magnifies through the chain of denoising modules. However, a strict analysis is expected since many sequential models such as Conditional Random Field (CRF) are free from error propagation. In this paper, we empirically and theoretically verify that diffusion models are indeed affected by error propagation and we then propose a regularization to address this problem. Our theoretical analysis reveals that the question can be reduced to whether every denoising module of the diffusion model is fault-tolerant. We derive insightful transition equations, indicating that the module can't recover from input errors and even propagates additional errors to the next module. Our analysis directly leads to a consistency regularization scheme for diffusion models, which explicitly reduces the distribution gap between forward and backward processes. We further introduce a bootstrapping algorithm to reduce the computation cost of the regularizer. Our experimental results on multiple image datasets show that our regularization effectively handles error propagation and significantly improves the performance of vanilla diffusion models.

Deep Learning Model Transfer in Forest Mapping using Multi-source Satellite SAR and Optical Images

  • paper_url: http://arxiv.org/abs/2308.05005
  • repo_url: None
  • paper_authors: Shaojia Ge, Oleg Antropov, Tuomas Häme, Ronald E. McRoberts, Jukka Miettinen
  • for: The paper targets forest variable prediction with deep learning models from Earth Observation images; in practical forest inventories, reference data are usually plot- or stand-level measurements, and high-quality wall-to-wall reference data for end-to-end training of DL models are rarely available.
  • methods: Transfer learning ("model transfer", or domain adaptation) of a pretrained deep learning model to a target area using plot-level measurements. An earlier developed UNet-based model (SeUNet) demonstrates the approach on two distinct taiga sites with varying forest structure and composition, using multi-source Earth Observation data: Copernicus Sentinel-1 C-band SAR, Sentinel-2 multispectral images, a JAXA ALOS-2 PALSAR-2 SAR mosaic, and TanDEM-X bistatic interferometric radar data (a fine-tuning sketch follows the abstract below).
  • results: With transfer learning, the SeUNet prediction achieved a root mean squared error (RMSE) of 2.70 m and an R$^2$ of 0.882, considerably more accurate than traditional benchmark methods. The authors expect such forest-specific DL model transfer to also suit other forest variables and other Earth Observation data sources that are sensitive to forest structure.
    Abstract Deep learning (DL) models are gaining popularity in forest variable prediction using Earth Observation images. However, in practical forest inventories, reference datasets are often represented by plot- or stand-level measurements, while high-quality representative wall-to-wall reference data for end-to-end training of DL models are rarely available. Transfer learning facilitates expansion of the use of deep learning models into areas with sub-optimal training data by allowing pretraining of the model in areas where high-quality teaching data are available. In this study, we perform a "model transfer" (or domain adaptation) of a pretrained DL model into a target area using plot-level measurements and compare performance versus other machine learning models. We use an earlier developed UNet based model (SeUNet) to demonstrate the approach on two distinct taiga sites with varying forest structure and composition. Multisource Earth Observation (EO) data are represented by a combination of Copernicus Sentinel-1 C-band SAR and Sentinel-2 multispectral images, JAXA ALOS-2 PALSAR-2 SAR mosaic and TanDEM-X bistatic interferometric radar data. The training study site is located in Finnish Lapland, while the target site is located in Southern Finland. By leveraging transfer learning, the prediction of SeUNet achieved root mean squared error (RMSE) of 2.70 m and R$^2$ of 0.882, considerably more accurate than traditional benchmark methods. We expect such forest-specific DL model transfer can be suitable also for other forest variables and other EO data sources that are sensitive to forest structure.
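    One way to picture the "model transfer" step is as fine-tuning a small part of a pretrained network on the sparse plot-level measurements of the target site. The sketch below freezes an assumed feature backbone and refits a light regression head with an MSE loss; the module layout, optimizer settings, and the choice to freeze the backbone are illustrative assumptions, not details from the paper (SeUNet itself is an image-to-image model).

        import torch
        import torch.nn as nn

        def adapt_to_target_site(pretrained_backbone: nn.Module, feat_dim: int,
                                 target_inputs: torch.Tensor, target_values: torch.Tensor,
                                 epochs: int = 100, lr: float = 1e-3) -> nn.Module:
            """Freeze the source-site backbone and fit a small regression head on
            plot-level reference measurements (target_values) from the target site."""
            for p in pretrained_backbone.parameters():
                p.requires_grad = False                      # keep the source-domain representation fixed
            head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            optimizer = torch.optim.Adam(head.parameters(), lr=lr)
            loss_fn = nn.MSELoss()
            with torch.no_grad():
                feats = pretrained_backbone(target_inputs)   # assumed per-plot EO features, shape (N, feat_dim)
            for _ in range(epochs):
                optimizer.zero_grad()
                loss = loss_fn(head(feats).squeeze(-1), target_values)
                loss.backward()
                optimizer.step()
            return nn.Sequential(pretrained_backbone, head)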

Discrepancy-based Active Learning for Weakly Supervised Bleeding Segmentation in Wireless Capsule Endoscopy Images

  • paper_url: http://arxiv.org/abs/2308.05137
  • repo_url: None
  • paper_authors: Fan Bai, Xiaohan Xing, Yutian Shen, Han Ma, Max Q. -H. Meng
  • for: Proposes a new Discrepancy-based Active Learning (DEAL) approach that bridges the gap between class activation map (CAM) labels and ground truths in medical images while requiring only a few manual annotations.
  • methods: DEAL combines a discrepancy decoder model, trained with a unique scheme to generate standard, coarse, and fine predictions, with a CAMPUS (CAM, Pseudo-label and groUnd-truth Selection) criterion that predicts the gap between CAMs and ground truths from model divergence and CAM divergence; the noisy CAM labels are replaced with accurate model predictions and a few human labels (a selection sketch follows the abstract below).
  • results: On the WCE dataset, the method outperforms state-of-the-art active learning methods and reaches performance comparable to models trained on the fully annotated dataset while labeling only 10% of the training data.
    Abstract Weakly supervised methods, such as class activation maps (CAM) based, have been applied to achieve bleeding segmentation with low annotation efforts in Wireless Capsule Endoscopy (WCE) images. However, the CAM labels tend to be extremely noisy, and there is an irreparable gap between CAM labels and ground truths for medical images. This paper proposes a new Discrepancy-basEd Active Learning (DEAL) approach to bridge the gap between CAMs and ground truths with a few annotations. Specifically, to liberate labor, we design a novel discrepancy decoder model and a CAMPUS (CAM, Pseudo-label and groUnd-truth Selection) criterion to replace the noisy CAMs with accurate model predictions and a few human labels. The discrepancy decoder model is trained with a unique scheme to generate standard, coarse and fine predictions. And the CAMPUS criterion is proposed to predict the gaps between CAMs and ground truths based on model divergence and CAM divergence. We evaluate our method on the WCE dataset and results show that our method outperforms the state-of-the-art active learning methods and reaches comparable performance to those trained with full annotated datasets with only 10% of the training data labeled.
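    The CAMPUS-style selection can be pictured as a scoring loop over unlabeled images: where the CAM and the model's own prediction disagree most, the image is sent to the annotator; elsewhere the model prediction replaces the noisy CAM as a pseudo-label. The per-pixel disagreement score and the fixed budget below are illustrative assumptions.

        import numpy as np

        def mask_divergence(cam_mask: np.ndarray, pred_mask: np.ndarray) -> float:
            """Fraction of pixels where the binarized CAM and the model prediction disagree."""
            return float(np.mean(cam_mask.astype(bool) != pred_mask.astype(bool)))

        def select_for_annotation(cam_masks, pred_masks, budget):
            """Return (indices to annotate, indices that keep the model prediction as pseudo-label)."""
            scores = np.array([mask_divergence(c, p) for c, p in zip(cam_masks, pred_masks)])
            order = np.argsort(scores)[::-1]          # most disagreement first
            return order[:budget], order[budget:]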

IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.04995
  • repo_url: https://github.com/fdbtrs/IDiff-Face
  • paper_authors: Fadi Boutros, Jonas Henry Grebe, Arjan Kuijper, Naser Damer
  • for: This paper aims to address the issue of limited intra-class diversity and cross-class discrimination in synthetic face datasets, which hinders the performance of face recognition models trained on these datasets.
  • methods: The proposed approach, IDiff-Face, uses conditional latent diffusion models to generate synthetic identities with realistic identity variations for face recognition training.
  • results: The proposed approach achieved state-of-the-art performance on the LFW benchmark, with an accuracy of 98.00%, significantly outperforming recent synthetic-based face recognition solutions (95.40%) and bridging the gap to authentic-based face recognition (99.82%).
    Abstract The availability of large-scale authentic face databases has been crucial to the significant advances made in face recognition research over the past decade. However, legal and ethical concerns led to the recent retraction of many of these databases by their creators, raising questions about the continuity of future face recognition research without one of its key resources. Synthetic datasets have emerged as a promising alternative to privacy-sensitive authentic data for face recognition development. However, recent synthetic datasets that are used to train face recognition models suffer either from limitations in intra-class diversity or cross-class (identity) discrimination, leading to less optimal accuracies, far away from the accuracies achieved by models trained on authentic data. This paper targets this issue by proposing IDiff-Face, a novel approach based on conditional latent diffusion models for synthetic identity generation with realistic identity variations for face recognition training. Through extensive evaluations, our proposed synthetic-based face recognition approach pushed the limits of state-of-the-art performances, achieving, for example, 98.00% accuracy on the Labeled Faces in the Wild (LFW) benchmark, far ahead from the recent synthetic-based face recognition solutions with 95.40% and bridging the gap to authentic-based face recognition with 99.82% accuracy.

Foreground Object Search by Distilling Composite Image Feature

  • paper_url: http://arxiv.org/abs/2308.04990
  • repo_url: https://github.com/bcmi/foreground-object-search-dataset-fosd
  • paper_authors: Bo Zhang, Jiacheng Sui, Li Niu
  • for: Improving foreground object search (FOS), i.e. finding foreground objects compatible with a given background image, by distilling composite image features.
  • methods: A teacher network (a discriminator that predicts the compatibility of composite images) distills its knowledge into a student network with two encoders that extract the foreground feature and the background feature; the interaction output of the two encoders is enforced to match the composite image feature predicted by the teacher (a distillation sketch follows the abstract below).
  • results: The proposed method outperforms previous approaches on the FOS task, and two new datasets are contributed: S-FOSD with synthetic composite images and R-FOSD with real composite images.
    Abstract Foreground object search (FOS) aims to find compatible foreground objects for a given background image, producing realistic composite image. We observe that competitive retrieval performance could be achieved by using a discriminator to predict the compatibility of composite image, but this approach has unaffordable time cost. To this end, we propose a novel FOS method via distilling composite feature (DiscoFOS). Specifically, the abovementioned discriminator serves as teacher network. The student network employs two encoders to extract foreground feature and background feature. Their interaction output is enforced to match the composite image feature from the teacher network. Additionally, previous works did not release their datasets, so we contribute two datasets for FOS task: S-FOSD dataset with synthetic composite images and R-FOSD dataset with real composite images. Extensive experiments on our two datasets demonstrate the superiority of the proposed method over previous approaches. The dataset and code are available at https://github.com/bcmi/Foreground-Object-Search-Dataset-FOSD.
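    The distillation objective can be written as a single term: the interaction of the student's foreground and background features is trained to match the composite-image feature produced by the teacher discriminator. The interaction operator below (element-wise product followed by a linear layer) is an assumption; the paper's actual module may differ.

        import torch
        import torch.nn as nn

        class CompositeFeatureDistiller(nn.Module):
            """Student side of a DiscoFOS-style setup: two encoders plus an assumed interaction head."""
            def __init__(self, fg_encoder: nn.Module, bg_encoder: nn.Module, dim: int):
                super().__init__()
                self.fg_encoder = fg_encoder
                self.bg_encoder = bg_encoder
                self.interact = nn.Linear(dim, dim)   # placeholder interaction module

            def forward(self, foreground, background):
                fg = self.fg_encoder(foreground)
                bg = self.bg_encoder(background)
                return self.interact(fg * bg)         # assumed element-wise interaction

        def distillation_loss(student_out, teacher_composite_feature):
            """Match the student's interaction output to the teacher's composite-image feature."""
            return nn.functional.mse_loss(student_out, teacher_composite_feature.detach())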

Self-supervised Landmark Learning with Deformation Reconstruction and Cross-subject Consistency Objectives

  • paper_url: http://arxiv.org/abs/2308.04987
  • repo_url: None
  • paper_authors: Chun-Hung Chao, Marc Niethammer
  • for: Proposes a self-supervised method to extract landmark points for building the Point Distribution Model (PDM) of a Statistical Shape Model (SSM).
  • methods: Landmarks are extracted from a given registration model tailored to the target data, rather than from a point-based registration model with few points, yielding more accurate correspondences; a landmark discovery loss explicitly encourages the predicted landmarks to be anatomically consistent across subjects.
  • results: On an osteoarthritis progression prediction task, the method outperforms existing image-based and point-based approaches.
    Abstract A Point Distribution Model (PDM) is the basis of a Statistical Shape Model (SSM) that relies on a set of landmark points to represent a shape and characterize the shape variation. In this work, we present a self-supervised approach to extract landmark points from a given registration model for the PDMs. Based on the assumption that the landmarks are the points that have the most influence on registration, existing works learn a point-based registration model with a small number of points to estimate the landmark points that influence the deformation the most. However, such approaches assume that the deformation can be captured by point-based registration and quality landmarks can be learned solely with the deformation capturing objective. We argue that data with complicated deformations can not easily be modeled with point-based registration when only a limited number of points is used to extract influential landmark points. Further, landmark consistency is not assured in existing approaches In contrast, we propose to extract landmarks based on a given registration model, which is tailored for the target data, so we can obtain more accurate correspondences. Secondly, to establish the anatomical consistency of the predicted landmarks, we introduce a landmark discovery loss to explicitly encourage the model to predict the landmarks that are anatomically consistent across subjects. We conduct experiments on an osteoarthritis progression prediction task and show our method outperforms existing image-based and point-based approaches.

ACE-HetEM for ab initio Heterogenous Cryo-EM 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.04956
  • repo_url: None
  • paper_authors: Weijie Chen, Lin Yao, Zeqing Xia, Yuhang Wang
  • for: Addresses the extremely low signal-to-noise ratio and the unknown poses (projection angles and image translations) in cryo-EM experiments, with the goal of reconstructing 3D structures from 2D images while also classifying conformations.
  • methods: Proposes ACE-HetEM, an unsupervised deep learning architecture based on amortized inference. To explicitly enforce the disentanglement of conformation classification and pose estimation, training alternates between two tasks: an image-to-image task and a pose-to-pose task.
  • results: On simulated datasets, ACE-HetEM matches non-amortized methods in pose-estimation accuracy while producing even better reconstruction resolution; it is also applicable to real experimental datasets.
    Abstract Due to the extremely low signal-to-noise ratio (SNR) and unknown poses (projection angles and image translation) in cryo-EM experiments, reconstructing 3D structures from 2D images is very challenging. On top of these challenges, heterogeneous cryo-EM reconstruction also has an additional requirement: conformation classification. An emerging solution to this problem is called amortized inference, implemented using the autoencoder architecture or its variants. Instead of searching for the correct image-to-pose/conformation mapping for every image in the dataset as in non-amortized methods, amortized inference only needs to train an encoder that maps images to appropriate latent spaces representing poses or conformations. Unfortunately, standard amortized-inference-based methods with entangled latent spaces have difficulty learning the distribution of conformations and poses from cryo-EM images. In this paper, we propose an unsupervised deep learning architecture called "ACE-HetEM" based on amortized inference. To explicitly enforce the disentanglement of conformation classifications and pose estimations, we designed two alternating training tasks in our method: image-to-image task and pose-to-pose task. Results on simulated datasets show that ACE-HetEM has comparable accuracy in pose estimation and produces even better reconstruction resolution than non-amortized methods. Furthermore, we show that ACE-HetEM is also applicable to real experimental datasets.

Branches Mutual Promotion for End-to-End Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.04949
  • repo_url: None
  • paper_authors: Lei Zhu, Hangzhou He, Xinliang Zhang, Qian Chen, Shuang Zeng, Qiushi Ren, Yanye Lu
  • for: Improving end-to-end weakly supervised semantic segmentation trained in a single stage, without letting the classification branch dominate the concurrent training process.
  • methods: The classification and segmentation branches are treated as two equal ways of generating the segmentation map; a bidirectional supervision mechanism enforces consistency between their outputs, and interaction operations let the two branches exchange knowledge and assist each other.
  • results: Experiments show that the method outperforms existing end-to-end weakly supervised segmentation approaches.
    Abstract End-to-end weakly supervised semantic segmentation aims at optimizing a segmentation model in a single-stage training process based on only image annotations. Existing methods adopt an online-trained classification branch to provide pseudo annotations for supervising the segmentation branch. However, this strategy makes the classification branch dominate the whole concurrent training process, hindering these two branches from assisting each other. In our work, we treat these two branches equally by viewing them as diverse ways to generate the segmentation map, and add interactions on both their supervision and operation to achieve mutual promotion. For this purpose, a bidirectional supervision mechanism is elaborated to force the consistency between the outputs of these two branches. Thus, the segmentation branch can also give feedback to the classification branch to enhance the quality of localization seeds. Moreover, our method also designs interaction operations between these two branches to exchange their knowledge to assist each other. Experiments indicate our work outperforms existing end-to-end weakly supervised segmentation methods.

SelectNAdapt: Support Set Selection for Few-Shot Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.04946
  • repo_url: https://github.com/yussef93/selectnadapticcvw
  • paper_authors: Youssef Dawoud, Gustavo Carneiro, Vasileios Belagiannis
  • for: Addresses the vulnerability of deep neural networks to distribution shift between training (source) and test (target) domains, specifically the few-shot adaptation of a source-pretrained model to a target domain.
  • methods: SelectNAdapt curates the support set used for few-shot domain adaptation: self-supervised learning produces features of the target-domain data, a per-class clustering scheme on these features (grounded in pseudo-labels) groups semantically similar samples, and K representative target samples are selected per class with a distance-based scoring function for annotation (a simplified selection sketch follows the abstract below).
  • results: On three few-shot domain adaptation benchmarks for image recognition, the method outperforms related approaches and the standard random selection of the support set.
    Abstract Generalisation of deep neural networks becomes vulnerable when distribution shifts are encountered between train (source) and test (target) domain data. Few-shot domain adaptation mitigates this issue by adapting deep neural networks pre-trained on the source domain to the target domain using a randomly selected and annotated support set from the target domain. This paper argues that randomly selecting the support set can be further improved for effectively adapting the pre-trained source models to the target domain. Alternatively, we propose SelectNAdapt, an algorithm to curate the selection of the target domain samples, which are then annotated and included in the support set. In particular, for the K-shot adaptation problem, we first leverage self-supervision to learn features of the target domain data. Then, we propose a per-class clustering scheme of the learned target domain features and select K representative target samples using a distance-based scoring function. Finally, we bring our selection setup towards a practical ground by relying on pseudo-labels for clustering semantically similar target domain samples. Our experiments show promising results on three few-shot domain adaptation benchmarks for image recognition compared to related approaches and the standard random selection.
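    The support-set curation can be approximated in a few lines: group the self-supervised target features by pseudo-label and, within each group, keep the K samples closest to the group centroid. This replaces the paper's clustering scheme and distance-based scoring function with the simplest variant for illustration.

        import numpy as np

        def select_support_set(features: np.ndarray, pseudo_labels: np.ndarray, k: int):
            """features: (N, D) self-supervised target-domain features; pseudo_labels: (N,) ints.
            Returns indices of K representative samples per pseudo-class."""
            selected = []
            for c in np.unique(pseudo_labels):
                idx = np.where(pseudo_labels == c)[0]
                centroid = features[idx].mean(axis=0)
                dists = np.linalg.norm(features[idx] - centroid, axis=1)
                selected.extend(idx[np.argsort(dists)[:k]].tolist())   # closest-to-centroid samples
            return np.array(selected)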

JEDI: Joint Expert Distillation in a Semi-Supervised Multi-Dataset Student-Teacher Scenario for Video Action Recognition

  • paper_url: http://arxiv.org/abs/2308.04934
  • repo_url: None
  • paper_authors: Lucian Bicsi, Bogdan Alexe, Radu Tudor Ionescu, Marius Leordeanu
  • for: Proposes JEDI, a multi-dataset semi-supervised learning method that improves the performance of individual, per-dataset student models.
  • methods: Knowledge from multiple experts, each pretrained on its own dataset, is combined: teachers are derived by concatenating the feature representations from the penultimate layers of the students, and all models are then trained jointly and end-to-end in a student-teacher semi-supervised learning scenario until convergence (see the teacher sketch after the abstract below).
  • results: Validated on four video action recognition datasets; simultaneously considering all datasets within a unified semi-supervised setting yields significant improvements over the initial experts.
    Abstract We propose JEDI, a multi-dataset semi-supervised learning method, which efficiently combines knowledge from multiple experts, learned on different datasets, to train and improve the performance of individual, per dataset, student models. Our approach achieves this by addressing two important problems in current machine learning research: generalization across datasets and limitations of supervised training due to scarcity of labeled data. We start with an arbitrary number of experts, pretrained on their own specific dataset, which form the initial set of student models. The teachers are immediately derived by concatenating the feature representations from the penultimate layers of the students. We then train all models in a student-teacher semi-supervised learning scenario until convergence. In our efficient approach, student-teacher training is carried out jointly and end-to-end, showing that both students and teachers improve their generalization capacity during training. We validate our approach on four video action recognition datasets. By simultaneously considering all datasets within a unified semi-supervised setting, we demonstrate significant improvements over the initial experts.
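    The teacher construction is essentially a concatenation: each per-dataset student contributes its penultimate-layer representation, and a head operates on the joined vector. The sketch assumes each student exposes a penultimate(x) accessor and that a linear head sits on top; both are illustrative choices.

        import torch
        import torch.nn as nn

        class ConcatTeacher(nn.Module):
            """Teacher built from the penultimate features of several per-dataset student experts."""
            def __init__(self, students, feat_dims, num_classes):
                super().__init__()
                self.students = nn.ModuleList(students)
                self.head = nn.Linear(sum(feat_dims), num_classes)

            def forward(self, x):
                feats = [s.penultimate(x) for s in self.students]   # assumed accessor per student
                return self.head(torch.cat(feats, dim=-1))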

GeodesicPSIM: Predicting the Quality of Static Mesh with Texture Map via Geodesic Patch Similarity

  • paper_url: http://arxiv.org/abs/2308.04928
  • repo_url: https://github.com/Qi-Yangsjtu/GeodesicPSIM
  • paper_authors: Qi Yang, Joel Jung, Xiaozhong Xu, Shan Liu
  • for: GeodesicPSIM is proposed to accurately predict human perception quality for static meshes with texture maps.
  • methods: A two-step patch cropping algorithm and a patch texture mapping module refine the size of 1-hop geodesic patches and build the relationship between mesh geometry and color information; three types of features quantify the distortion: patch color smoothness, patch discrete mean curvature, and patch pixel color average and variance (a feature sketch follows the abstract below).
  • results: GeodesicPSIM provides state-of-the-art performance in comparison with image-based, point-based, and video-based metrics on a newly created and challenging database; robustness to hyperparameter settings and the effectiveness of the three features and the patch cropping algorithm are shown through ablation studies.
    Abstract Static meshes with texture maps have attracted considerable attention in both industrial manufacturing and academic research, leading to an urgent requirement for effective and robust objective quality evaluation. However, current model-based static mesh quality metrics have obvious limitations: most of them only consider geometry information, while color information is ignored, and they have strict constraints for the meshes' geometrical topology. Other metrics, such as image-based and point-based metrics, are easily influenced by the prepossessing algorithms, e.g., projection and sampling, hampering their ability to perform at their best. In this paper, we propose Geodesic Patch Similarity (GeodesicPSIM), a novel model-based metric to accurately predict human perception quality for static meshes. After selecting a group keypoints, 1-hop geodesic patches are constructed based on both the reference and distorted meshes cleaned by an effective mesh cleaning algorithm. A two-step patch cropping algorithm and a patch texture mapping module refine the size of 1-hop geodesic patches and build the relationship between the mesh geometry and color information, resulting in the generation of 1-hop textured geodesic patches. Three types of features are extracted to quantify the distortion: patch color smoothness, patch discrete mean curvature, and patch pixel color average and variance. To the best of our knowledge, GeodesicPSIM is the first model-based metric especially designed for static meshes with texture maps. GeodesicPSIM provides state-of-the-art performance in comparison with image-based, point-based, and video-based metrics on a newly created and challenging database. We also prove the robustness of GeodesicPSIM by introducing different settings of hyperparameters. Ablation studies also exhibit the effectiveness of three proposed features and the patch cropping algorithm.
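    Two of the three distortion features have a direct form on a patch of per-vertex colors: the color average and variance, and a smoothness score, here taken as the mean color difference along edges inside the patch. The edge-based smoothness definition is an assumption for illustration; the discrete mean curvature feature is omitted.

        import numpy as np

        def patch_color_stats(colors: np.ndarray):
            """colors: (V, 3) RGB values of the vertices (or texels) in one geodesic patch."""
            return colors.mean(axis=0), colors.var(axis=0)

        def patch_color_smoothness(colors: np.ndarray, edges: np.ndarray) -> float:
            """edges: (E, 2) vertex-index pairs connecting neighbours inside the patch.
            Lower values mean smoother color variation across the patch."""
            diffs = colors[edges[:, 0]] - colors[edges[:, 1]]
            return float(np.linalg.norm(diffs, axis=1).mean())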

Deep Learning-Based Prediction of Fractional Flow Reserve along the Coronary Artery

  • paper_url: http://arxiv.org/abs/2308.04923
  • repo_url: None
  • paper_authors: Nils Hampe, Sanne G. M. van Velzen, Jean-Paul Aben, Carlos Collet, Ivana Išgum
  • for: This paper aims to develop a deep learning-based method to predict the fractional flow reserve (FFR) along the coronary artery from coronary CT angiography (CCTA) scans, which can help doctors identify functionally significant coronary artery disease (CAD) and determine the best treatment strategy.
  • methods: The proposed method uses a combination of a variational autoencoder (VAE) and a convolutional neural network (CNN) to predict the FFR along the artery. The VAE is used to characterize the artery and generate an unsupervised artery encoding, while the CNN uses this encoding and other inputs to predict the FFR. The CNN is supervised by multiple loss functions, including a loss function inspired by the Earth Mover’s Distance (EMD) to predict the correct location of FFR drops and a histogram-based loss to explicitly supervise the slope of the FFR curve. A 1-D EMD sketch follows the abstract below.
  • results: The proposed method was evaluated using eight-fold cross-validation on a dataset of 110 patients who underwent invasive FFR pullback measurement in 112 arteries. The resulting FFR curves showed good agreement with the reference, allowing the distinction between diffuse and focal CAD distributions in most cases. Quantitative evaluation yielded a mean absolute difference in the area under the FFR pullback curve (AUPC) of 1.7. The method has the potential to provide fast, accurate, and automatic prediction of FFR along the artery from CCTA, which may help doctors make more informed decisions about treatment strategies for CAD patients.
    Abstract Functionally significant coronary artery disease (CAD) is caused by plaque buildup in the coronary arteries, potentially leading to narrowing of the arterial lumen, i.e. coronary stenosis, that significantly obstructs blood flow to the myocardium. The current reference for establishing the presence of a functionally significant stenosis is invasive fractional flow reserve (FFR) measurement. To avoid invasive measurements, non-invasive prediction of FFR from coronary CT angiography (CCTA) has emerged. For this, machine learning approaches, characterized by fast inference, are increasingly developed. However, these methods predict a single FFR value per artery i.e. they don't provide information about the stenosis location or treatment strategy. We propose a deep learning-based method to predict the FFR along the artery from CCTA scans. This study includes CCTA images of 110 patients who underwent invasive FFR pullback measurement in 112 arteries. First, a multi planar reconstruction (MPR) of the artery is fed to a variational autoencoder to characterize the artery, i.e. through the lumen area and unsupervised artery encodings. Thereafter, a convolutional neural network (CNN) predicts the FFR along the artery. The CNN is supervised by multiple loss functions, notably a loss function inspired by the Earth Mover's Distance (EMD) to predict the correct location of FFR drops and a histogram-based loss to explicitly supervise the slope of the FFR curve. To train and evaluate our model, eight-fold cross-validation was performed. The resulting FFR curves show good agreement with the reference allowing the distinction between diffuse and focal CAD distributions in most cases. Quantitative evaluation yielded a mean absolute difference in the area under the FFR pullback curve (AUPC) of 1.7. The method may pave the way towards fast, accurate, automatic prediction of FFR along the artery from CCTA.
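    The Earth Mover's Distance-inspired loss has a simple closed form in one dimension: normalize the predicted and reference FFR-drop profiles along the artery and compare their cumulative sums, so a drop predicted at the wrong position is penalized in proportion to how far it moved. How the paper derives a drop profile from the FFR curve is not reproduced here.

        import numpy as np

        def emd_1d(pred_drops: np.ndarray, ref_drops: np.ndarray) -> float:
            """1-D Earth Mover's Distance between two non-negative profiles sampled along the artery."""
            p = pred_drops / (pred_drops.sum() + 1e-9)   # normalize to unit mass
            r = ref_drops / (ref_drops.sum() + 1e-9)
            return float(np.abs(np.cumsum(p) - np.cumsum(r)).mean())

        # A drop predicted near the true location costs less than one predicted far away.
        ref = np.zeros(100);  ref[30] = 1.0
        near = np.zeros(100); near[32] = 1.0
        far = np.zeros(100);  far[70] = 1.0
        print(emd_1d(near, ref) < emd_1d(far, ref))  # True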

Cross-view Semantic Alignment for Livestreaming Product Recognition

  • paper_url: http://arxiv.org/abs/2308.04912
  • repo_url: https://github.com/adxcreative/rice
  • paper_authors: Wenjie Yang, Yiyi Chen, Yan Li, Yanhua Cheng, Xudong Liu, Quan Chen, Han Li
  • for: Presents LPR4M, a large-scale multimodal dataset for livestreaming product recognition, and proposes RICE, a cross-view semantic alignment method for learning product features.
  • methods: RICE learns discriminative instance features from the image and video views of products via instance-level contrastive learning and cross-view patch-level feature propagation (an InfoNCE sketch follows the abstract below); in addition, a patch feature reconstruction loss penalizes semantic misalignment between cross-view patches.
  • results: RICE achieves state-of-the-art performance on LPR4M, and the experiments provide insights into the importance of dataset diversity and expressivity.
    Abstract Live commerce is the act of selling products online through live streaming. The customer's diverse demands for online products introduce more challenges to Livestreaming Product Recognition. Previous works have primarily focused on fashion clothing data or utilize single-modal input, which does not reflect the real-world scenario where multimodal data from various categories are present. In this paper, we present LPR4M, a large-scale multimodal dataset that covers 34 categories, comprises 3 modalities (image, video, and text), and is 50x larger than the largest publicly available dataset. LPR4M contains diverse videos and noise modality pairs while exhibiting a long-tailed distribution, resembling real-world problems. Moreover, a cRoss-vIew semantiC alignmEnt (RICE) model is proposed to learn discriminative instance features from the image and video views of the products. This is achieved through instance-level contrastive learning and cross-view patch-level feature propagation. A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between cross-view patches. Extensive experiments demonstrate the effectiveness of RICE and provide insights into the importance of dataset diversity and expressivity. The dataset and code are available at https://github.com/adxcreative/RICE
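    Instance-level contrastive learning between the image and video views of the same product can be written as a standard InfoNCE objective over a batch: matching image/video pairs are pulled together and all other pairings are pushed apart. The temperature and the plain InfoNCE form are assumptions; the cross-view patch-level propagation and the reconstruction loss are not shown.

        import numpy as np

        def info_nce(image_emb: np.ndarray, video_emb: np.ndarray, temperature: float = 0.07) -> float:
            """image_emb, video_emb: (B, D) L2-normalized embeddings; row i of each is the same product."""
            logits = image_emb @ video_emb.T / temperature          # (B, B) similarity matrix
            logits -= logits.max(axis=1, keepdims=True)             # numerical stability
            log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
            return float(-np.mean(np.diag(log_prob)))               # positives sit on the diagonal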

StableVQA: A Deep No-Reference Quality Assessment Model for Video Stability

  • paper_url: http://arxiv.org/abs/2308.04904
  • repo_url: https://github.com/qmme/stablevqa
  • paper_authors: Tengchuan Kou, Xiaohong Liu, Wei Sun, Jun Jia, Xiongkuo Min, Guangtao Zhai, Ning Liu
  • for: Proposes a no-reference quality assessment model for video stability together with a large-scale shaky-video database, since existing VQA models cannot measure video stability explicitly and precisely when severe shakes are present.
  • methods: The StableVQA model consists of three feature extractors that acquire optical flow, semantic, and blur features respectively, followed by a regression layer that predicts the final stability score; the accompanying StableDB database contains 1,952 diversely shaky UGC videos, each with a Mean Opinion Score on the degree of video stability rated by 34 subjects.
  • results: StableVQA achieves a higher correlation with subjective opinions than existing VQA-S models and generic VQA models.
    Abstract Video shakiness is an unpleasant distortion of User Generated Content (UGC) videos, which is usually caused by the unstable hold of cameras. In recent years, many video stabilization algorithms have been proposed, yet no specific and accurate metric enables comprehensively evaluating the stability of videos. Indeed, most existing quality assessment models evaluate video quality as a whole without specifically taking the subjective experience of video stability into consideration. Therefore, these models cannot measure the video stability explicitly and precisely when severe shakes are present. In addition, there is no large-scale video database in public that includes various degrees of shaky videos with the corresponding subjective scores available, which hinders the development of Video Quality Assessment for Stability (VQA-S). To this end, we build a new database named StableDB that contains 1,952 diversely-shaky UGC videos, where each video has a Mean Opinion Score (MOS) on the degree of video stability rated by 34 subjects. Moreover, we elaborately design a novel VQA-S model named StableVQA, which consists of three feature extractors to acquire the optical flow, semantic, and blur features respectively, and a regression layer to predict the final stability score. Extensive experiments demonstrate that the StableVQA achieves a higher correlation with subjective opinions than the existing VQA-S models and generic VQA models. The database and codes are available at https://github.com/QMME/StableVQA.

Histogram-guided Video Colorization Structure with Spatial-Temporal Connection

  • paper_url: http://arxiv.org/abs/2308.04899
  • repo_url: None
  • paper_authors: Zheyuan Liu, Pan Mu, Hanning Xu, Cong Bai
  • for: Video colorization, aiming at obtaining colorful and plausible results from grayish frames.
  • methods: ST-HVC, a histogram-guided video colorization structure with spatial-temporal connection, integrates histogram and flow features through a joint flow and histogram module, handles blur and artifacts with a combination scheme attending to temporal detail and flow-feature fusion, and recombines histogram, flow, and sharpness features via a U-shaped network.
  • results: Compared with several state-of-the-art image- and video-based methods, the approach achieves excellent quantitative and qualitative performance on two video datasets.
    Abstract Video colorization, aiming at obtaining colorful and plausible results from grayish frames, has aroused a lot of interest recently. Nevertheless, how to maintain temporal consistency while keeping the quality of colorized results remains challenging. To tackle the above problems, we present a Histogram-guided Video Colorization with Spatial-Temporal connection structure (named ST-HVC). To fully exploit the chroma and motion information, the joint flow and histogram module is tailored to integrate the histogram and flow features. To manage the blurred and artifact, we design a combination scheme attending to temporal detail and flow feature combination. We further recombine the histogram, flow and sharpness features via a U-shape network. Extensive comparisons are conducted with several state-of-the-art image and video-based methods, demonstrating that the developed method achieves excellent performance both quantitatively and qualitatively in two video datasets.

Transmission and Color-guided Network for Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.04892
  • repo_url: None
  • paper_authors: Pan Mu, Jing Fang, Haotian Qian, Cong Bai
  • for: Enhancing underwater images by correcting the color deviation and low contrast caused by light absorption in water and scattering by suspended particles.
  • methods: Proposes ATDCnet, an Adaptive Transmission and Dynamic Color guided network: an Adaptive Transmission-directed Module (ATM) exploits physics knowledge to better guide the network, a Dynamic Color-guided Module (DCM) post-processes the enhanced image color, and an Encoder-Decoder-based Compensation (EDC) structure with attention and a multi-stage feature fusion mechanism performs color restoration and contrast enhancement simultaneously.
  • results: Extensive experiments demonstrate state-of-the-art performance on multiple benchmark datasets.
    Abstract In recent years, with the continuous development of the marine industry, underwater image enhancement has attracted plenty of attention. Unfortunately, the propagation of light in water will be absorbed by water bodies and scattered by suspended particles, resulting in color deviation and low contrast. To solve these two problems, we propose an Adaptive Transmission and Dynamic Color guided network (named ATDCnet) for underwater image enhancement. In particular, to exploit the knowledge of physics, we design an Adaptive Transmission-directed Module (ATM) to better guide the network. To deal with the color deviation problem, we design a Dynamic Color-guided Module (DCM) to post-process the enhanced image color. Further, we design an Encoder-Decoder-based Compensation (EDC) structure with attention and a multi-stage feature fusion mechanism to perform color restoration and contrast enhancement simultaneously. Extensive experiments demonstrate the state-of-the-art performance of the ATDCnet on multiple benchmark datasets.

Deep Generative Networks for Heterogeneous Augmentation of Cranial Defects

  • paper_url: http://arxiv.org/abs/2308.04883
  • repo_url: None
  • paper_authors: Kamil Kwarciak, Marek Wodzinski
  • for: Aims to automate the design of personalized cranial implants using deep learning techniques, where the high diversity of possible cranial defects and the lack of appropriate data sources are the main obstacles.
  • methods: Three volumetric deep generative models augment the dataset with synthetic skulls: a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), a WGAN-GP hybrid with Variational Autoencoder pretraining (VAE/WGAN-GP), and an Introspective Variational Autoencoder (IntroVAE) (a gradient-penalty sketch follows the abstract below).
  • results: Dozens of thousands of defective skulls with compatible defects can be generated, trading off defect heterogeneity against realistic skull shape; the synthetic data substantially improve defect segmentation (evaluated with V-Net) compared to using only the original unaugmented data and may improve the automatic design of personalized cranial implants for real medical cases.
    Abstract The design of personalized cranial implants is a challenging and tremendous task that has become a hot topic in terms of process automation with the use of deep learning techniques. The main challenge is associated with the high diversity of possible cranial defects. The lack of appropriate data sources negatively influences the data-driven nature of deep learning algorithms. Hence, one of the possible solutions to overcome this problem is to rely on synthetic data. In this work, we propose three volumetric variations of deep generative models to augment the dataset by generating synthetic skulls, i.e. Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP), WGAN-GP hybrid with Variational Autoencoder pretraining (VAE/WGAN-GP) and Introspective Variational Autoencoder (IntroVAE). We show that it is possible to generate dozens of thousands of defective skulls with compatible defects that achieve a trade-off between defect heterogeneity and the realistic shape of the skull. We evaluate obtained synthetic data quantitatively by defect segmentation with the use of V-Net and qualitatively by their latent space exploration. We show that the synthetically generated skulls highly improve the segmentation process compared to using only the original unaugmented data. The generated skulls may improve the automatic design of personalized cranial implants for real medical cases.
    摘要 个性化颅骨植入物的设计是一项复杂而艰巨的任务,借助深度学习技术实现其流程自动化已成为热点话题。主要挑战在于可能出现的颅骨缺陷种类繁多,而缺乏合适的数据源又削弱了深度学习算法的数据驱动特性。因此,一种可行的解决方案是依赖合成数据。在这项工作中,我们提出了三种体积式深度生成模型来增强数据集、生成合成头骨,即带梯度惩罚的 Wasserstein 生成对抗网络(WGAN-GP)、结合变分自编码器预训练的 WGAN-GP(VAE/WGAN-GP)以及内省式变分自编码器(IntroVAE)。我们证明可以生成数以万计、缺陷彼此兼容的缺陷头骨,在缺陷多样性与头骨形状逼真度之间取得平衡。我们通过 V-Net 缺陷分割对合成数据进行定量评估,并通过潜在空间探索进行定性评估。结果显示,相比仅使用原始未增强数据,合成头骨显著改进了分割效果。生成的头骨有望提升实际医疗案例中个性化颅骨植入物的自动设计。
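
The first of the three generative models above is a standard WGAN-GP. The sketch below shows the usual gradient-penalty term that gives WGAN-GP its name, applied to toy 3D volumes standing in for skull data; it is a generic formulation, not the authors' training code, and all module names and shapes are illustrative.

```python
# Illustrative sketch of the WGAN-GP gradient penalty (standard formulation,
# not the paper's code). The 3D "skull" volumes here are random tensors and the
# critic is a placeholder.
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """Standard WGAN-GP penalty: (||grad_x D(x_hat)||_2 - 1)^2 on interpolates."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, 1, 1, device=real.device)   # per-sample mix
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

if __name__ == "__main__":
    critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16 ** 3, 1))
    real = torch.rand(2, 1, 16, 16, 16)   # toy binary skull volumes
    fake = torch.rand(2, 1, 16, 16, 16)
    print(gradient_penalty(critic, real, fake).item())
```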

Learning multi-domain feature relation for visible and Long-wave Infrared image patch matching

  • paper_url: http://arxiv.org/abs/2308.04880
  • repo_url: None
  • paper_authors: Xiuwei Zhang, Yanping Li, Zhaoshuai Qi, Yi Sun, Yanning Zhang
  • for: 本研究的目的是提高 Cross-spectral 图像块匹配的性能,特别是在实际应用中。
  • methods: 本研究使用了一种多域特征关系学习网络(MD-FRN),该网络以四分支网络提取的特征为输入,通过空间相关模块(SCM)和多尺度自适应聚合模块(MSAG)分别学习空间域和尺度域的特征关系。此外,还应用了深度域交互机制(DIM)来学习交互式的跨域特征关系,从而提高对不同模态带来的显著外观变化的鲁棒性。
  • results: 研究发现,使用 MD-FRN 网络可以提高 Cross-spectral 图像块匹配的性能,特别是在面对不同模式的应用变化时。
    Abstract Recently, learning-based algorithms have achieved promising performance on cross-spectral image patch matching, which, however, is still far from satisfactory for practical application. On the one hand, a lack of large-scale dataset with diverse scenes haunts its further improvement for learning-based algorithms, whose performances and generalization rely heavily on the dataset size and diversity. On the other hand, more emphasis has been put on feature relation in the spatial domain whereas the scale dependency between features has often been ignored, leading to performance degeneration especially when encountering significant appearance variations for cross-spectral patches. To address these issues, we publish, to the best of our knowledge, the largest visible and Long-wave Infrared (LWIR) image patch matching dataset, termed VL-CMIM, which contains 1300 pairs of strictly aligned visible and LWIR images and over 2 million patch pairs covering diverse scenes such as asteroid, field, country, build, street and water. In addition, a multi-domain feature relation learning network (MD-FRN) is proposed. Input by the features extracted from a four-branch network, both feature relations in spatial and scale domains are learned via a spatial correlation module (SCM) and multi-scale adaptive aggregation module (MSAG), respectively. To further aggregate the multi-domain relations, a deep domain interactive mechanism (DIM) is applied, where the learnt spatial-relation and scale-relation features are exchanged and further input into MSCRM and SCM. This mechanism allows our model to learn interactive cross-domain feature relations, leading to improved robustness to significant appearance changes due to different modality.
    摘要 近来,基于学习的算法在跨光谱图像块匹配上取得了可观的表现,但距离实际应用仍有差距。一方面,缺乏覆盖多样场景的大规模数据集限制了基于学习的算法的进一步提升,因为其表现和泛化能力严重依赖数据集的规模与多样性。另一方面,已有工作更多关注空间域内的特征关系,而往往忽略特征之间的尺度依赖,导致在跨光谱图像块出现明显外观变化时性能下降。为解决这些问题,我们发布了(据我们所知)规模最大的可见光与长波红外(LWIR)图像块匹配数据集 VL-CMIM,其中包含 1300 对严格对齐的可见光与 LWIR 图像,以及超过 200 万个覆盖多样场景的图像块对。此外,我们提出了一种多域特征关系学习网络(MD-FRN),该网络以四分支网络提取的特征为输入,分别通过空间相关模块(SCM)和多尺度自适应聚合模块(MSAG)学习空间域与尺度域的特征关系。为进一步聚合多域关系,我们还应用了深度域交互机制(DIM),使模型能够学习交互式的跨域特征关系,从而提高对不同模态导致的显著外观变化的鲁棒性。

Tracking Players in a Badminton Court by Two Cameras

  • paper_url: http://arxiv.org/abs/2308.04872
  • repo_url: None
  • paper_authors: Young-Ching Chou, Shen-Ru Zhang, Bo-Wei Chen, Hong-Qi Chen, Cheng-Kuan Lin, Yu-Chee Tseng
  • for: 这个研究旨在提供一种简单的多对象跟踪(MOT)方法,用于跟踪羽毛球场上的球员。
  • methods: 该方法利用了两台现成(off-the-shelf)相机,一台位于球场上方,另一台位于球场侧面。上方相机用于跟踪球员的轨迹,侧面相机用于分析球员的像素特征。通过计算相邻帧之间的相关性并融合两台相机的信息,实现了球员的多目标跟踪。
  • results: 该方法可以减轻球员 occlusion 和重叠问题,提供了球员轨迹跟踪和多视角分析。系统提供了球员位置和运动姿势的信息,可以作为羽毛球教练或自我训练工具,帮助球员提高游戏策略。
    Abstract This study proposes a simple method for multi-object tracking (MOT) of players in a badminton court. We leverage two off-the-shelf cameras, one on the top of the court and the other on the side of the court. The one on the top is to track players' trajectories, while the one on the side is to analyze the pixel features of players. By computing the correlations between adjacent frames and engaging the information of the two cameras, MOT of badminton players is obtained. This two-camera approach addresses the challenge of player occlusion and overlapping in a badminton court, providing player trajectory tracking and multi-angle analysis. The presented system offers insights into the positions and movements of badminton players, thus serving as a coaching or self-training tool for badminton players to improve their gaming strategies.
    摘要 这项研究提出了一种简单的多目标跟踪(MOT)方法,用于跟踪羽毛球场上球员的运动轨迹。我们利用了两台现成的摄像头,一台置于球场上方,另一台置于球场侧边。前者用于跟踪球员的轨迹,后者用于分析球员的像素特征。通过计算相邻帧之间的相关性并结合两台摄像头的信息,即可实现球员的多目标跟踪。这种双摄像头方案缓解了羽毛球场上球员相互遮挡与重叠的难题,同时提供了轨迹跟踪与多角度分析。该系统能够给出球员的位置与运动信息,可作为教练或自我训练工具,帮助球员改进比赛策略。
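
The abstract above relies on correlations between adjacent frames to keep player identities consistent across the top-view camera. The toy sketch below shows one simple way such adjacent-frame association could be done (greedy nearest-centroid matching); it is only a plausible stand-in for the paper's unpublished implementation, and the distance threshold and names are assumptions.

```python
# Toy sketch (not the authors' code) of an adjacent-frame association step:
# detections from the top-view camera are linked across frames by greedy
# nearest-centroid matching. A real system would also fuse side-view appearance
# features; the threshold here is hypothetical.
import numpy as np

def associate(prev_centroids, curr_centroids, max_dist=80.0):
    """Greedily match previous-frame tracks to current-frame detections."""
    matches, used = {}, set()
    for i, p in enumerate(prev_centroids):
        dists = [np.linalg.norm(p - c) for c in curr_centroids]
        for j in np.argsort(dists):
            if int(j) not in used and dists[j] <= max_dist:
                matches[i] = int(j)
                used.add(int(j))
                break
    return matches  # {prev_track_index: curr_detection_index}

if __name__ == "__main__":
    prev = np.array([[100.0, 200.0], [400.0, 250.0]])   # two players, frame t-1
    curr = np.array([[405.0, 255.0], [110.0, 190.0]])   # frame t, order shuffled
    print(associate(prev, curr))   # {0: 1, 1: 0}
```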

InstantAvatar: Efficient 3D Head Reconstruction via Surface Rendering

  • paper_url: http://arxiv.org/abs/2308.04868
  • repo_url: None
  • paper_authors: Antonio Canela, Pol Caselles, Ibrar Malik, Eduard Ramon, Jaime García, Jordi Sánchez-Riera, Gil Triginer, Francesc Moreno-Noguer
  • for: 从少量图像(最少仅一张)中快速重建全头虚拟形象(full-head avatar)。
  • methods: combines a voxel-grid neural field representation with a surface renderer, and uses a novel statistical model to learn a prior distribution over 3D head signed distance functions.
  • results: achieves 3D head reconstructions with comparable accuracy as the state-of-the-art, with a 100x speed-up.
    Abstract Recent advances in full-head reconstruction have been obtained by optimizing a neural field through differentiable surface or volume rendering to represent a single scene. While these techniques achieve an unprecedented accuracy, they take several minutes, or even hours, due to the expensive optimization process required. In this work, we introduce InstantAvatar, a method that recovers full-head avatars from few images (down to just one) in a few seconds on commodity hardware. In order to speed up the reconstruction process, we propose a system that combines, for the first time, a voxel-grid neural field representation with a surface renderer. Notably, a naive combination of these two techniques leads to unstable optimizations that do not converge to valid solutions. In order to overcome this limitation, we present a novel statistical model that learns a prior distribution over 3D head signed distance functions using a voxel-grid based architecture. The use of this prior model, in combination with other design choices, results into a system that achieves 3D head reconstructions with comparable accuracy as the state-of-the-art with a 100x speed-up.
    摘要 最近的全头重建技术通过可微的表面或体积渲染来优化神经场以表示单个场景,取得了很大进步。尽管这些技术达到了前所未有的精度,但由于优化过程代价高昂,往往需要几分钟甚至几个小时。在这项工作中,我们介绍了 InstantAvatar 方法,可以在普通硬件上、仅用几秒钟就从少量图像(甚至一张)中恢复全头模型。为了加速重建过程,我们首次将体素网格神经场表示与表面渲染器结合起来。值得注意的是,这两种技术的朴素组合会导致优化不稳定、无法收敛到有效解。为克服这一限制,我们提出了一种新的统计模型,基于体素网格结构学习 3D 头部符号距离函数(SDF)的先验分布。借助该先验模型以及其他设计选择,我们的系统在精度与现有最优方法相当的同时实现了约 100 倍的加速。

Are Sex-based Physiological Differences the Cause of Gender Bias for Chest X-ray Diagnosis?

  • paper_url: http://arxiv.org/abs/2308.05129
  • repo_url: None
  • paper_authors: Nina Weng, Siavash Bigdeli, Eike Petersen, Aasa Feragen
  • for: 这种研究旨在解释胸部X射线诊断中的性别偏见的原因。
  • methods: 该研究提出了一种新的采样方法,以解决两个公共数据集中每个病人记录的极度不均衡分布,同时减少标签错误的影响。
  • results: 研究发现,数据集不均衡并非性别间表现差异的唯一原因,数据集特有的因素才是主要驱动。此外,男女群体之间的相对表现差异在不同疾病和数据集之间差别很大。最后,研究发现,将乳腺组织从影像中裁剪掉并不能消除观察到的表现差异。
    Abstract While many studies have assessed the fairness of AI algorithms in the medical field, the causes of differences in prediction performance are often unknown. This lack of knowledge about the causes of bias hampers the efficacy of bias mitigation, as evidenced by the fact that simple dataset balancing still often performs best in reducing performance gaps but is unable to resolve all performance differences. In this work, we investigate the causes of gender bias in machine learning-based chest X-ray diagnosis. In particular, we explore the hypothesis that breast tissue leads to underexposure of the lungs and causes lower model performance. Methodologically, we propose a new sampling method which addresses the highly skewed distribution of recordings per patient in two widely used public datasets, while at the same time reducing the impact of label errors. Our comprehensive analysis of gender differences across diseases, datasets, and gender representations in the training set shows that dataset imbalance is not the sole cause of performance differences. Moreover, relative group performance differs strongly between datasets, indicating important dataset-specific factors influencing male/female group performance. Finally, we investigate the effect of breast tissue more specifically, by cropping out the breasts from recordings, finding that this does not resolve the observed performance gaps. In conclusion, our results indicate that dataset-specific factors, not fundamental physiological differences, are the main drivers of male--female performance gaps in chest X-ray analyses on widely used NIH and CheXpert Dataset.
    摘要 尽管许多研究已经评估了医疗领域中人工智能算法的公平性,但预测性能差异的成因往往不明。对偏差成因缺乏了解会限制偏差缓解措施的效果:简单的数据集平衡往往仍是缩小性能差距的最佳手段,却无法消除所有差异。在这项工作中,我们研究了基于机器学习的胸部X光诊断中性别偏差的成因。特别地,我们检验了"乳腺组织导致肺部曝光不足、从而降低模型性能"这一假设。在方法上,我们提出了一种新的采样方法,用于处理两个常用公共数据集中每位患者影像数量高度偏斜的分布,同时减少标签错误的影响。我们对不同疾病、数据集以及训练集中性别构成进行了全面分析,结果表明数据集不均衡并非性能差异的唯一原因;不同数据集之间的相对群体性能差异也很大,说明存在影响男女群体表现的数据集特有因素。最后,我们通过将乳腺从影像中裁剪掉来更具体地考察乳腺组织的影响,发现这并不能消除观察到的性能差距。综上,我们的结果表明,在广泛使用的 NIH 和 CheXpert 数据集上,胸部X光分析中男女性能差距的主要驱动因素是数据集特有因素,而非根本的生理差异。

View while Moving: Efficient Video Recognition in Long-untrimmed Videos

  • paper_url: http://arxiv.org/abs/2308.04834
  • repo_url: None
  • paper_authors: Ye Tian, Mengyu Yang, Lanshan Zhang, Zhizhen Zhang, Yang Liu, Xiaohui Xie, Xirong Que, Wendong Wang
  • for: 本研究旨在提出一种高效的长视频认知方法,以提高视频识别的效率和准确率。
  • methods: 本研究使用了人类认知的“视而移”理念,提出了一种新的识别方法,即将粗粒度和细粒度的预览和识别合并到一个整体模型中,从而实现了一次性访问原始帧的目的。
  • results: 实验结果表明,本方法在长视频和短视频识别任务上均达到了状态革命的性能,同时也提供了新的效率和准确率的贸易OFF。
    Abstract Recent adaptive methods for efficient video recognition mostly follow the two-stage paradigm of "preview-then-recognition" and have achieved great success on multiple video benchmarks. However, this two-stage paradigm involves two visits of raw frames from coarse-grained to fine-grained during inference (cannot be parallelized), and the captured spatiotemporal features cannot be reused in the second stage (due to varying granularity), being not friendly to efficiency and computation optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm of "View while Moving" for efficient long-untrimmed video recognition. In contrast to the two-stage paradigm, our paradigm only needs to access the raw frame once. The two phases of coarse-grained sampling and fine-grained recognition are combined into unified spatiotemporal modeling, showing great performance. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism to efficiently capture and reason about the unit-level and video-level temporal semantics in long-untrimmed videos respectively. Extensive experiments on both long-untrimmed and short-trimmed videos demonstrate that our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency, yielding new efficiency and accuracy trade-offs for video spatiotemporal modeling.
    摘要 近期用于高效视频识别的自适应方法大多采用"先预览、再识别"的两阶段范式,并在多个视频基准上取得了很大成功。然而,该范式在推理时需要从粗粒度到细粒度两次访问原始帧(无法并行化),且第一阶段捕获的时空特征由于粒度不同无法在第二阶段复用,不利于效率与计算优化。为此,我们受人类认知启发,提出了一种新的识别范式——"边移动边观察"(View while Moving),只需访问原始帧一次,将粗粒度采样与细粒度识别两个阶段统一到同一个时空建模中,表现优异。此外,我们还研究了视频中语义单元的性质,并提出了一种层次机制,分别高效地捕捉和推理长视频中单元级与视频级的时间语义。在长视频和短视频上的大量实验表明,我们的方法在精度和效率两方面均优于当前最优方法,为视频时空建模提供了新的精度-效率权衡。

VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer

  • paper_url: http://arxiv.org/abs/2308.04830
  • repo_url: None
  • paper_authors: Liyang Chen, Zhiyong Wu, Runnan Li, Weihong Bao, Jun Ling, Xu Tan, Sheng Zhao
  • for: 论文旨在提高现代人工智能讲话人物的生动性和多样性,使其能够从任意视频提示中提取表达性的人脸样式。
  • methods: 该论文提出了一种无监督的变分风格迁移模型(VAST),包括三个关键组件:风格编码器、混合面部表情解码器和变分风格增强器。这三个组件共同使模型能够建模准确的语音相关运动,并从视频提示中提取富有表现力的人脸风格。
  • results: 实验结果表明,提出的方法能够在零shot情况下,从任意视频提示中提取表达性的人脸样式,并将其转移到个性化的图像渲染器上,以获得更加生动、有authenticity和丰富的讲话人物。
    Abstract Current talking face generation methods mainly focus on speech-lip synchronization. However, insufficient investigation on the facial talking style leads to a lifeless and monotonous avatar. Most previous works fail to imitate expressive styles from arbitrary video prompts and ensure the authenticity of the generated video. This paper proposes an unsupervised variational style transfer model (VAST) to vivify the neutral photo-realistic avatars. Our model consists of three key components: a style encoder that extracts facial style representations from the given video prompts; a hybrid facial expression decoder to model accurate speech-related movements; a variational style enhancer that enhances the style space to be highly expressive and meaningful. With our essential designs on facial style learning, our model is able to flexibly capture the expressive facial style from arbitrary video prompts and transfer it onto a personalized image renderer in a zero-shot manner. Experimental results demonstrate the proposed approach contributes to a more vivid talking avatar with higher authenticity and richer expressiveness.
    摘要 当前的说话人脸生成方法主要关注语音与唇形的同步,然而对面部说话风格研究不足,导致生成的虚拟形象缺乏生气、表情单调。此前的大多数工作既无法模仿任意视频提示中富有表现力的风格,也难以保证生成视频的真实感。本文提出一种无监督的变分风格迁移模型(VAST),用于为中性的真实感虚拟形象注入生气。我们的模型包括三个关键组件:风格编码器,用于从给定的视频提示中提取面部风格表示;混合面部表情解码器,用于建模准确的语音相关运动;变分风格增强器,用于增强风格空间,使其更具表现力和语义。凭借针对面部风格学习的关键设计,我们的模型能够以零样本方式灵活地捕捉任意视频提示中的表现性面部风格,并将其迁移到个性化的图像渲染器上。实验结果表明,所提方法能够生成更生动的说话虚拟形象,具有更高的真实感和更丰富的表现力。

MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.04829
  • repo_url: None
  • paper_authors: Kaixin Cai, Pengzhen Ren, Yi Zhu, Hang Xu, Jianzhuang Liu, Changlin Li, Guangrun Wang, Xiaodan Liang
  • for: 提高语义分割模型在开放世界场景中的像素级细粒度语义对齐与物体掩码预测能力。
  • methods: 使用 MixReorg 准则,通过混合图像patches并保持patch和文本之间匹配性,提高模型对图像区域的精细Semantic alignment能力。
  • results: 在多个Zero-shot semantic segmentation benchmark上达到了显著的提高,相比GroupViT, MixReorg 模型在 PASCAL VOC2012、PASCAL Context、MS COCO 和 ADE20K 上的mIoU提高为5.0%、6.2%、2.5% 和 3.4%。
    Abstract Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability, which is crucial for open-world segmentation. After training with large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories, without the need for further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K, respectively.
    摘要 最近,基于图像级文本监督训练的语义分割模型在具有挑战性的开放世界场景中表现出色,但这些模型在像素级细粒度语义对齐与准确的物体掩码预测方面仍有困难。为解决这一问题,我们提出了 MixReorg——一种新颖而简单的语义分割预训练范式,通过重组混合在不同图像间的图像块(patch),同时挖掘局部视觉相关性与全局语义一致性,来增强模型能力。我们的方法在保持 patch 与文本对应关系的前提下混合图像 patch,生成细粒度的 patch-文本对数据,并训练模型最小化混合图像的分割损失以及原始特征与恢复特征的两个对比损失。以 MixReorg 作为掩码学习器,传统的文本监督语义分割模型可以获得高度可泛化的像素级语义对齐能力,这对开放世界分割至关重要。在大规模图像-文本数据上训练后,MixReorg 模型无需进一步微调即可直接用于分割任意类别的视觉对象。我们的框架在流行的零样本语义分割基准上表现强劲,在 PASCAL VOC2012、PASCAL Context、MS COCO 和 ADE20K 上分别比 GroupViT 高出 5.0%、6.2%、2.5% 和 3.4% mIoU。
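
MixReorg's key data construction is mixing patches across captioned images while remembering which caption each patch belongs to. The sketch below shows that step in isolation for two images; it is a hedged reconstruction from the abstract, not the authors' code, and the patch size, mixing ratio and all names are assumptions.

```python
# Minimal sketch (assumption-laden, not the released implementation) of the
# patch-mixing step behind MixReorg: patches of two captioned images are mixed,
# and an index map remembers which caption each patch belongs to, so a model
# can be supervised to "reorganize" the mixed image at patch level.
import torch

def mix_patches(img_a, img_b, patch=4, seed=0):
    """Mix two CxHxW images patch-wise; return the mixed image and a source map."""
    c, h, w = img_a.shape
    gh, gw = h // patch, w // patch

    def to_patches(x):  # (num_patches, C*patch*patch)
        return torch.nn.functional.unfold(x.unsqueeze(0), patch, stride=patch)[0].T

    pa, pb = to_patches(img_a), to_patches(img_b)
    g = torch.Generator().manual_seed(seed)
    take_from_a = torch.rand(gh * gw, generator=g) < 0.5       # True -> image A
    mixed = torch.where(take_from_a.unsqueeze(1), pa, pb)
    mixed_img = torch.nn.functional.fold(mixed.T.unsqueeze(0), (h, w),
                                         patch, stride=patch)[0]
    source_map = take_from_a.long().reshape(gh, gw)            # 1 = caption A
    return mixed_img, source_map

if __name__ == "__main__":
    a, b = torch.zeros(3, 8, 8), torch.ones(3, 8, 8)
    mixed, src = mix_patches(a, b)
    print(mixed.shape, src)    # torch.Size([3, 8, 8]) and a 2x2 map of 0/1 labels
```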

Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning

  • paper_url: http://arxiv.org/abs/2308.04828
  • repo_url: None
  • paper_authors: Qiang Wang, Junlong Du, Ke Yan, Shouhong Ding
  • for: 实现更高效和泛化的动作识别方法
  • methods: 提出了一个双流运动建模模块,同时捕捉视频帧中的运动与空间信息;所得到的运动线索用于驱动动态提示学习器,生成包含丰富人类动作语义的运动感知提示;此外还提出了多模态通信模块以实现协同学习、进一步提升性能
  • results: 在 HMDB-51、UCF-101 和 Kinetics-400 数据集上进行了广泛的实验,该方法在"few-shot"和"zero-shot"设定下明显超越了大多数现有的最优方法,并在"closed-set"设定下以极少的可训练参数和额外计算成本取得了有竞争力的性能。
    Abstract The Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable generalization on "zero-shot" training and has applied to many downstream tasks. We explore the adaptation of CLIP to achieve a more efficient and generalized action recognition method. We propose that the key lies in explicitly modeling the motion cues flowing in video frames. To that end, we design a two-stream motion modeling block to capture motion and spatial information at the same time. And then, the obtained motion cues are utilized to drive a dynamic prompts learner to generate motion-aware prompts, which contain much semantic information concerning human actions. In addition, we propose a multimodal communication block to achieve a collaborative learning and further improve the performance. We conduct extensive experiments on HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin on "few-shot" and "zero-shot" training. We also achieve competitive performance on "closed-set" training with extremely few trainable parameters and additional computational costs.
    摘要 CLIP(对照语言图像预训)在最近的应用中显示了很好的普遍化能力,特别是在“零shot”训练中。我们想要探索CLIP的改进,以实现更有效和普遍的动作识别方法。我们认为关键在于Explicitly 模型影像中的动作讯号。为此,我们设计了两条流动模型对应块,以同时捕捉影像中的动作和空间信息。然后,所获得的动作讯号被用来驱动动态提示学习者生成动作感知的提示,这些提示具有人类动作的含义信息。此外,我们提议了多模式通信对应块,以实现多模式学习和进一步提高性能。我们对HMDB-51、UCF-101和Kinetics-400 dataset进行了广泛的实验。我们的方法在“几shot”和“零shot”训练中比大多数现有的方法表现出了明显的超越。我们还在“关闭集”训练中取得了非常有效的性能,仅需要极少的可训练参数和额外的计算成本。

WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.04826
  • repo_url: None
  • paper_authors: Muyu Xu, Fangneng Zhan, Jiahui Zhang, Yingchen Yu, Xiaoqin Zhang, Christian Theobalt, Ling Shao, Shijian Lu
  • for: 基于隐式场景表示的新视角合成(NeRF)通常可扩展性差,需要为每个新场景采集密集图像并进行昂贵的逐场景优化;本文旨在实现无需逐场景微调的可泛化高质量合成。
  • methods: 将小波频率分解引入 MVS 与 NeRF:通过把离散小波变换整合进级联 MVS,在小波域构建多视角立体以显式解耦高频信息;再经由混合神经渲染器将解耦的频率特征注入经典 NeRF,并设计频率引导的采样策略以抑制高频区域的伪影。
  • results: 仅需三张输入图像即可实现高质量且可泛化的新视角合成,无需针对每个新场景进行微调,在三个常用基准上优于现有方法。
    Abstract Neural Radiance Field (NeRF) has shown impressive performance in novel view synthesis via implicit scene representation. However, it usually suffers from poor scalability as requiring densely sampled images for each new scene. Several studies have attempted to mitigate this problem by integrating Multi-View Stereo (MVS) technique into NeRF while they still entail a cumbersome fine-tuning process for new scenes. Notably, the rendering quality will drop severely without this fine-tuning process and the errors mainly appear around the high-frequency features. In the light of this observation, we design WaveNeRF, which integrates wavelet frequency decomposition into MVS and NeRF to achieve generalizable yet high-quality synthesis without any per-scene optimization. To preserve high-frequency information when generating 3D feature volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating the discrete wavelet transform into the classical cascade MVS, which disentangles high-frequency information explicitly. With that, disentangled frequency features can be injected into classic NeRF via a novel hybrid neural renderer to yield faithful high-frequency details, and an intuitive frequency-guided sampling strategy can be designed to suppress artifacts around high-frequency regions. Extensive experiments over three widely studied benchmarks show that WaveNeRF achieves superior generalizable radiance field modeling when only given three images as input.
    摘要 神经辐射场(NeRF)通过隐式场景表示在新视角合成上展现出令人印象深刻的能力,但其可扩展性较差,通常需要为每个新场景采集密集采样的图像。一些研究尝试将多视角立体(MVS)技术整合进 NeRF 来缓解该问题,但仍需对新场景进行繁琐的微调;若省去微调,渲染质量会严重下降,且误差主要出现在高频特征附近。基于这一观察,我们设计了 WaveNeRF,将小波频率分解整合进 MVS 与 NeRF,在不进行任何逐场景优化的情况下实现可泛化且高质量的合成。为了在生成 3D 特征体时保留高频信息,WaveNeRF 将离散小波变换整合进经典的级联 MVS,在小波域构建多视角立体,显式解耦高频信息;随后通过一种新颖的混合神经渲染器将解耦的频率特征注入经典 NeRF,以呈现忠实的高频细节,并设计直观的频率引导采样策略来抑制高频区域附近的伪影。在三个广泛研究的基准上进行的大量实验表明,仅给定三张输入图像时,WaveNeRF 即可实现更优的可泛化辐射场建模。
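
The frequency decomposition WaveNeRF injects into its MVS branch is a discrete wavelet transform. The snippet below implements a single-level 2D Haar DWT of a feature map from scratch (to stay dependency-free); it only illustrates the kind of low-/high-frequency separation involved, not WaveNeRF's actual pipeline.

```python
# A minimal single-level 2D Haar wavelet decomposition (not WaveNeRF's code):
# it splits a feature map into a low-frequency band (LL) and three
# high-frequency detail bands (LH, HL, HH), i.e. the explicit frequency
# separation the paper builds its wavelet-domain MVS on.
import torch

def haar_dwt2(x):
    """x: (B, C, H, W) with even H, W -> dict of four (B, C, H/2, W/2) bands."""
    a = x[..., 0::2, 0::2]   # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    return {
        "LL": (a + b + c + d) / 2.0,   # low-pass in both directions
        "LH": (a - b + c - d) / 2.0,   # horizontal detail
        "HL": (a + b - c - d) / 2.0,   # vertical detail
        "HH": (a - b - c + d) / 2.0,   # diagonal detail
    }

if __name__ == "__main__":
    feat = torch.randn(1, 8, 32, 32)
    bands = haar_dwt2(feat)
    print({k: tuple(v.shape) for k, v in bands.items()})  # all (1, 8, 16, 16)
```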

HyperCoil-Recon: A Hypernetwork-based Adaptive Coil Configuration Task Switching Network for MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2308.04821
  • repo_url: https://github.com/sriprabhar/hypercoil-recon
  • paper_authors: Sriprabha Ramanarayanan, Mohammad Al Fahim, Rahul G. S., Amrit Kumar Jethi, Keerthi Ram, Mohanasankar Sivaprakasam
  • for: HyperCoil-Recon is proposed to address the challenge of training deep learning-based image reconstruction models for multi-coil MRI reconstruction, which requires adapting to diverse coil configurations.
  • methods: The approach uses a hypernetwork-based coil configuration task-switching network, which encodes varying configurations of the number of coils in a multi-tasking perspective. The hypernetworks infer and embed task-specific weights into the reconstruction network, leveraging contextual knowledge of common and varying image features among the various fields-of-view of the coils.
  • results: The approach adapts on the fly to various unseen configurations up to 32 coils when trained on lower numbers (i.e. 7 to 11) of randomly varying coils, and to 120 deviated unseen configurations when trained on 18 configurations in a single model. It matches the performance of coil configuration-specific models and outperforms configuration-invariant models with improvement margins of around 1 dB / 0.03 and 0.3 dB / 0.02 in PSNR / SSIM for knee and brain data.
    Abstract Parallel imaging, a fast MRI technique, involves dynamic adjustments based on the configuration i.e. number, positioning, and sensitivity of the coils with respect to the anatomy under study. Conventional deep learning-based image reconstruction models have to be trained or fine-tuned for each configuration, posing a barrier to clinical translation, given the lack of computational resources and machine learning expertise for clinicians to train models at deployment. Joint training on diverse datasets learns a single weight set that might underfit to deviated configurations. We propose, HyperCoil-Recon, a hypernetwork-based coil configuration task-switching network for multi-coil MRI reconstruction that encodes varying configurations of the numbers of coils in a multi-tasking perspective, posing each configuration as a task. The hypernetworks infer and embed task-specific weights into the reconstruction network, 1) effectively utilizing the contextual knowledge of common and varying image features among the various fields-of-view of the coils, and 2) enabling generality to unseen configurations at test time. Experiments reveal that our approach 1) adapts on the fly to various unseen configurations up to 32 coils when trained on lower numbers (i.e. 7 to 11) of randomly varying coils, and to 120 deviated unseen configurations when trained on 18 configurations in a single model, 2) matches the performance of coil configuration-specific models, and 3) outperforms configuration-invariant models with improvement margins of around 1 dB / 0.03 and 0.3 dB / 0.02 in PSNR / SSIM for knee and brain data. Our code is available at https://github.com/sriprabhar/HyperCoil-Recon
    摘要 并行成像是一种快速 MRI 技术,需要根据线圈相对于被扫描解剖结构的配置(数量、位置、灵敏度)进行动态调整。传统的基于深度学习的图像重建模型需要针对每种线圈配置进行训练或微调,这给临床转化带来障碍,因为临床医生在部署时往往缺乏训练模型所需的计算资源和机器学习经验;而在多样化数据上联合训练得到的单一权重又可能对偏离的配置欠拟合。我们提出 HyperCoil-Recon,一种基于超网络(hypernetwork)的线圈配置任务切换网络,从多任务视角将不同线圈数量的配置编码为不同任务。超网络推断任务特定的权重并嵌入重建网络,一方面有效利用不同线圈视野间共有与差异图像特征的上下文知识,另一方面使模型在测试时能够泛化到未见过的配置。实验表明:在仅用 7 到 11 个随机变化线圈训练时,我们的方法可即时适应多达 32 个线圈的各种未见配置;在单一模型中用 18 种配置训练时,可适应 120 种偏离的未见配置;其性能与针对特定线圈配置训练的模型相当,并在膝关节和脑部数据上以约 1 dB / 0.03 与 0.3 dB / 0.02 的 PSNR / SSIM 优势超过配置不变的模型。我们的代码可以在 https://github.com/sriprabhar/HyperCoil-Recon 找到。
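
The core mechanism above — a hypernetwork that infers task-specific weights from the coil configuration — can be illustrated with a toy layer: a small MLP maps a coil-count descriptor to the weights of one convolution in the reconstruction network. The context encoding, layer sizes and all names below are assumptions, not the released implementation (see the linked repository for that).

```python
# Hedged sketch of the hypernetwork idea behind HyperCoil-Recon (not the
# released code): a small MLP maps a coil-configuration descriptor (here just
# the normalized number of coils) to the weights of one conv layer of the
# reconstruction network, so the same model "switches task" per configuration.
import torch
import torch.nn.functional as F

class HyperConv(torch.nn.Module):
    def __init__(self, in_ch=1, out_ch=8, k=3, ctx_dim=1, hidden=32):
        super().__init__()
        self.shape = (out_ch, in_ch, k, k)
        n_weights = out_ch * in_ch * k * k
        self.hyper = torch.nn.Sequential(
            torch.nn.Linear(ctx_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_weights + out_ch),   # conv weights + biases
        )
        self.out_ch = out_ch

    def forward(self, x, coil_context):
        # coil_context: (ctx_dim,) descriptor, e.g. num_coils / 32.0
        params = self.hyper(coil_context)
        w = params[: -self.out_ch].view(self.shape)
        b = params[-self.out_ch:]
        return F.conv2d(x, w, b, padding=1)

if __name__ == "__main__":
    layer = HyperConv()
    img = torch.randn(2, 1, 64, 64)                 # toy coil-combined input
    for n_coils in (8, 16, 32):                     # "task switching" at test time
        ctx = torch.tensor([n_coils / 32.0])
        print(n_coils, layer(img, ctx).shape)       # (2, 8, 64, 64) each
```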

Joint-Relation Transformer for Multi-Person Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.04808
  • repo_url: https://github.com/mediabrain-sjtu/jrtransformer
  • paper_authors: Qingyao Xu, Weibo Mao, Jingze Gong, Chenxin Xu, Siheng Chen, Weidi Xie, Ya Zhang, Yanfeng Wang
  • for: 提高人体动作预测精度,具体是通过关注人体关节和关系信息来提高人体交互模型化。
  • methods: 提出了关节关系变换器(Joint-Relation Transformer),利用关系信息来增强交互模型化,并通过关系意识注意力来融合关节信息和关系信息。
  • results: 实验表明,我们的方法在3DPW-SoMoF/RC和CMU-Mpcap/MuPoTS-3D数据集上 achieved a 13.4% improvement of 900ms VIM and 17.8%/12.0% improvement of 3s MPJPE。
    Abstract Multi-person motion prediction is a challenging problem due to the dependency of motion on both individual past movements and interactions with other people. Transformer-based methods have shown promising results on this task, but they miss the explicit relation representation between joints, such as skeleton structure and pairwise distance, which is crucial for accurate interaction modeling. In this paper, we propose the Joint-Relation Transformer, which utilizes relation information to enhance interaction modeling and improve future motion prediction. Our relation information contains the relative distance and the intra-/inter-person physical constraints. To fuse relation and joint information, we design a novel joint-relation fusion layer with relation-aware attention to update both features. Additionally, we supervise the relation information by forecasting future distance. Experiments show that our method achieves a 13.4% improvement of 900ms VIM on 3DPW-SoMoF/RC and 17.8%/12.0% improvement of 3s MPJPE on CMU-Mpcap/MuPoTS-3D dataset.
    摘要 多人运动预测是一个具有挑战性的问题,因为运动既依赖于个体自身的历史运动,也依赖于与他人的交互。基于 Transformer 的方法在该任务上展现出良好前景,但它们缺少关节之间显式的关系表示(如骨架结构和关节对之间的距离),而这些信息对准确建模人际交互至关重要。在本文中,我们提出了关节-关系变换器(Joint-Relation Transformer),利用关系信息来增强交互建模并改进未来运动预测。我们的关系信息包含相对距离以及人体内部/人与人之间的物理约束。为了融合关系信息与关节信息,我们设计了一种新颖的关节-关系融合层,通过关系感知注意力来同时更新两类特征。此外,我们还以预测未来距离的方式对关系信息进行监督。实验表明,我们的方法在 3DPW-SoMoF/RC 上将 900ms VIM 提高了 13.4%,在 CMU-Mocap/MuPoTS-3D 数据集上将 3s MPJPE 分别提高了 17.8% / 12.0%。
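
The joint-relation fusion layer is described as relation-aware attention over joint and relation features. The sketch below shows one plausible form of such a layer — attention logits biased by a learned function of pairwise joint distances. Dimensions and names are assumptions; this is not the authors' exact layer (their code is in the linked repository).

```python
# Hedged sketch (not the paper's fusion layer) of the core idea: attention over
# joints whose logits are biased by a learned function of pairwise joint
# distances, so relation information modulates the joint features.
import torch

class DistanceBiasedAttention(torch.nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)
        self.dist_bias = torch.nn.Linear(1, 1)   # maps a distance to a logit bias
        self.scale = dim ** -0.5

    def forward(self, joint_feats, joint_xyz):
        # joint_feats: (B, J, dim) per-joint features; joint_xyz: (B, J, 3)
        q, k, v = self.q(joint_feats), self.k(joint_feats), self.v(joint_feats)
        logits = torch.einsum("bid,bjd->bij", q, k) * self.scale
        dist = torch.cdist(joint_xyz, joint_xyz)               # (B, J, J)
        logits = logits + self.dist_bias(dist.unsqueeze(-1)).squeeze(-1)
        attn = logits.softmax(dim=-1)
        return attn @ v                                         # (B, J, dim)

if __name__ == "__main__":
    layer = DistanceBiasedAttention()
    feats, xyz = torch.randn(2, 15, 32), torch.randn(2, 15, 3)  # 15 joints each
    print(layer(feats, xyz).shape)     # torch.Size([2, 15, 32])
```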

Generalized Unbiased Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.04802
  • repo_url: None
  • paper_authors: Xinyu Lyu, Lianli Gao, Junlin Xie, Pengpeng Zeng, Yulu Tian, Jie Shao, Heng Tao Shen
  • for: 解决 predicate-level 和 concept-level 不均衡问题,提高Scene Graph Generation(SGG)模型的可靠性和 Compositional 能力。
  • methods: 提出了一种新的研究问题——Generalized Unbiased Scene Graph Generation(G-USGG),并提出了Multi-Concept Learning(MCL)框架,以确保学习过程中具有不同概念的均衡。同时,还引入了Balanced Prototypical Memory(BPM)来实现不同概念的平衡学习。
  • results: 在 VG-SGG 和 OI-SGG 数据集上进行了广泛的实验,证明了这一与模型无关的策略能显著提升基准模型的性能,并在谓词级无偏关系识别和概念级组合生成能力两个关键方面达到了新的最优(state-of-the-art)水平。
    Abstract Existing Unbiased Scene Graph Generation (USGG) methods only focus on addressing the predicate-level imbalance that high-frequency classes dominate predictions of rare ones, while overlooking the concept-level imbalance. Actually, even if predicates themselves are balanced, there is still a significant concept-imbalance within them due to the long-tailed distribution of contexts (i.e., subject-object combinations). This concept-level imbalance poses a more pervasive and challenging issue compared to the predicate-level imbalance since subject-object pairs are inherently complex in combinations. Hence, we introduce a novel research problem: Generalized Unbiased Scene Graph Generation (G-USGG), which takes into account both predicate-level and concept-level imbalance. To the end, we propose the Multi-Concept Learning (MCL) framework, which ensures a balanced learning process across rare/ uncommon/ common concepts. MCL first quantifies the concept-level imbalance across predicates in terms of different amounts of concepts, representing as multiple concept-prototypes within the same class. It then effectively learns concept-prototypes by applying the Concept Regularization (CR) technique. Furthermore, to achieve balanced learning over different concepts, we introduce the Balanced Prototypical Memory (BPM), which guides SGG models to generate balanced representations for concept-prototypes. Extensive experiments demonstrate the remarkable efficacy of our model-agnostic strategy in enhancing the performance of benchmark models on both VG-SGG and OI-SGG datasets, leading to new state-of-the-art achievements in two key aspects: predicate-level unbiased relation recognition and concept-level compositional generability.
    摘要 现有的无偏场景图生成(USGG)方法只关注谓词级(predicate-level)偏差,即高频类别主导罕见类别的预测,而忽略了概念级(concept-level)偏差。实际上,即使谓词本身是平衡的,由于上下文(即主语-宾语组合)呈长尾分布,谓词内部仍存在显著的概念级不平衡。由于主语-宾语组合本身复杂多样,这种概念级偏差比谓词级偏差更普遍、也更难处理。因此,我们提出了一个新的研究问题:广义无偏场景图生成(G-USGG),同时考虑谓词级与概念级偏差。为此,我们提出了多概念学习(MCL)框架,确保在罕见/不常见/常见概念之间进行平衡学习。MCL 首先以同一类别内不同数量的概念原型来量化谓词中的概念级不平衡,再通过概念正则化(CR)技术有效学习概念原型。此外,为实现不同概念之间的平衡学习,我们引入了平衡原型记忆(BPM),引导 SGG 模型为概念原型生成平衡的表示。大量实验表明,这一与模型无关的策略能显著提升基准模型在 VG-SGG 和 OI-SGG 数据集上的表现,并在谓词级无偏关系识别与概念级组合生成能力两个关键方面取得新的最优成绩。

High-Level Features Parallelization for Inference Cost Reduction Through Selective Attention

  • paper_url: http://arxiv.org/abs/2308.05128
  • repo_url: None
  • paper_authors: André Peter Kelm, Lucas Schmidt, Tim Rolff, Christian Wilms, Ehsan Yaghoubi, Simone Frintrop
  • for: 降低深度学习模型的执行成本,特别是适用于移动设备、工业应用和机器人应用等场景。
  • methods: 将深度网络中的高级特征并行化,选择性地跳过或选取类特定特征,以降低推理成本。该方法与神经科学中观察到的人脑空间上、语境上分离的神经激活现象相呼应。
  • results: 可以保持高性能,但可以减少参数数量、计算复杂度和电力消耗。在一些示例中,可以减少参数数量的75%,并且可以避免重新训练。此外,该方法还具有可以根据增强或抑制高级类型特征来直接影响处理的能力,类似于人脑中的选择性注意力机制。
    Abstract In this work, we parallelize high-level features in deep networks to selectively skip or select class-specific features to reduce inference costs. This challenges most deep learning methods due to their limited ability to efficiently and effectively focus on selected class-specific features without retraining. We propose a serial-parallel hybrid architecture with serial generic low-level features and parallel high-level features. This accounts for the fact that many high-level features are class-specific rather than generic, and has connections to recent neuroscientific findings that observe spatially and contextually separated neural activations in the human brain. Our approach provides the unique functionality of cutouts: selecting parts of the network to focus on only relevant subsets of classes without requiring retraining. High performance is maintained, but the cost of inference can be significantly reduced. In some of our examples, up to $75\,\%$ of parameters are skipped and $35\,\%$ fewer GMACs (Giga multiply-accumulate) operations are used as the approach adapts to a change in task complexity. This is important for mobile, industrial, and robotic applications where reducing the number of parameters, the computational complexity, and thus the power consumption can be paramount. Another unique functionality is that it allows processing to be directly influenced by enhancing or inhibiting high-level class-specific features, similar to the mechanism of selective attention in the human brain. This can be relevant for cross-modal applications, the use of semantic prior knowledge, and/or context-aware processing.
    摘要 在这项工作中,我们将深度网络中的高级特征并行化,以选择性地跳过或选取类特定特征,从而降低推理成本。这对大多数深度学习方法是一种挑战,因为它们很难在不重新训练的情况下高效且有效地聚焦于选定的类特定特征。我们提出一种串行-并行混合架构:串行部分提取通用的低级特征,并行部分承载高级特征。这样设计是因为许多高级特征是类特定的而非通用的,并与最近的神经科学发现相呼应,即人脑中可观察到空间上和语境上分离的神经激活。我们的方法提供了"裁剪"(cutouts)这一独特功能:无需重新训练即可只激活网络中与当前相关的类别子集对应的部分。在保持高性能的同时,推理成本可以显著降低:在我们的部分示例中,随着任务复杂度的变化,最多可跳过 75% 的参数,并减少 35% 的 GMACs(十亿次乘加)运算。这对移动端、工业和机器人应用尤为重要,因为减少参数量、计算复杂度乃至功耗往往是首要考虑。此外,该方法还允许通过增强或抑制高级类特定特征来直接影响处理过程,类似于人脑中的选择性注意力机制,这对跨模态应用、语义先验知识的利用以及上下文感知处理都可能具有价值。
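
The serial-parallel idea above can be sketched as a shared low-level backbone feeding several parallel class-group heads, with only the heads of interest executed at inference (the "cutout"). This is a schematic stand-in, not the paper's architecture; the grouping, sizes and names are arbitrary assumptions.

```python
# Hedged sketch of the serial-parallel idea (not the authors' architecture):
# a shared low-level backbone feeds several parallel class-group heads, and at
# inference only the heads whose classes are currently of interest are run,
# skipping the parameters and GMACs of the others.
import torch

class SerialParallelNet(torch.nn.Module):
    def __init__(self, n_groups=4, classes_per_group=5, feat_dim=64):
        super().__init__()
        self.backbone = torch.nn.Sequential(            # serial, generic low-level part
            torch.nn.Conv2d(3, feat_dim, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        )
        self.heads = torch.nn.ModuleList(                # parallel high-level part
            torch.nn.Linear(feat_dim, classes_per_group) for _ in range(n_groups)
        )

    def forward(self, x, active_groups=None):
        feat = self.backbone(x)
        if active_groups is None:
            active_groups = range(len(self.heads))
        # only the selected heads are evaluated; the others are skipped entirely
        return {g: self.heads[g](feat) for g in active_groups}

if __name__ == "__main__":
    net = SerialParallelNet()
    x = torch.randn(1, 3, 32, 32)
    print({g: v.shape for g, v in net(x, active_groups=[0, 2]).items()})
    # only groups 0 and 2 are computed: {0: (1, 5), 2: (1, 5)}
```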

Enhancing Mobile Privacy and Security: A Face Skin Patch-Based Anti-Spoofing Approach

  • paper_url: http://arxiv.org/abs/2308.04798
  • repo_url: None
  • paper_authors: Qiushi Guo
  • for: 提高人脸识别系统的安全性,防御人脸欺骗(spoofing)攻击。
  • methods: 基于facial skin patches的方法,使用无隐私信息的图像作为输入,不需要加密或解密。
  • results: 在多个公共数据集上进行实验,结果表明我们的算法在准确率和速度两个方面具有优势。
    Abstract As Facial Recognition System(FRS) is widely applied in areas such as access control and mobile payments due to its convenience and high accuracy. The security of facial recognition is also highly regarded. The Face anti-spoofing system(FAS) for face recognition is an important component used to enhance the security of face recognition systems. Traditional FAS used images containing identity information to detect spoofing traces, however there is a risk of privacy leakage during the transmission and storage of these images. Besides, the encryption and decryption of these privacy-sensitive data takes too long compared to inference time by FAS model. To address the above issues, we propose a face anti-spoofing algorithm based on facial skin patches leveraging pure facial skin patch images as input, which contain no privacy information, no encryption or decryption is needed for these images. We conduct experiments on several public datasets, the results prove that our algorithm has demonstrated superiority in both accuracy and speed.
    摘要 人脸识别系统(FRS)因其便捷性和高准确率被广泛应用于门禁控制和移动支付等领域,其安全性也因此备受重视。人脸反欺骗系统(FAS)是人脸识别系统中的重要组件,用于增强其安全性。传统的 FAS 使用包含身份信息的图像来检测欺骗痕迹,但这些图像在传输和存储过程中存在隐私泄露风险;此外,相比 FAS 模型的推理时间,对这些隐私敏感数据进行加解密耗时过长。为解决上述问题,我们提出一种基于面部皮肤块(facial skin patch)的人脸反欺骗算法,仅以不含隐私信息的纯面部皮肤块图像作为输入,因而无需加密或解密。我们在多个公共数据集上进行了实验,结果表明,我们的算法在准确率和速度两方面均具有优势。

Multi-Scale Memory Comparison for Zero-/Few-Shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.04789
  • repo_url: None
  • paper_authors: Chaoqin Huang, Aofan Jiang, Ya Zhang, Yanfeng Wang
  • for: 这篇论文主要面向异常检测,特别是工业缺陷检测中常见的多物体复杂场景。
  • methods: 本研究提出了一个简单而强大的多尺度记忆比较框架,用于零样本/少样本异常检测。该方法使用一个捕捉整幅图像特征的全局记忆库,以及一个针对仅含单一物体的简化场景的个体记忆库。
  • results: 本研究在 Visual Anomaly and Novelty Detection(VAND)竞赛中分别获得零样本赛道第 4 名和少样本赛道第 2 名的佳绩。
    Abstract Anomaly detection has gained considerable attention due to its broad range of applications, particularly in industrial defect detection. To address the challenges of data collection, researchers have introduced zero-/few-shot anomaly detection techniques that require minimal normal images for each category. However, complex industrial scenarios often involve multiple objects, presenting a significant challenge. In light of this, we propose a straightforward yet powerful multi-scale memory comparison framework for zero-/few-shot anomaly detection. Our approach employs a global memory bank to capture features across the entire image, while an individual memory bank focuses on simplified scenes containing a single object. The efficacy of our method is validated by its remarkable achievement of 4th place in the zero-shot track and 2nd place in the few-shot track of the Visual Anomaly and Novelty Detection (VAND) competition.
    摘要 异常检测因其广泛的应用(尤其是工业缺陷检测)而受到了大量关注。为了应对数据收集的挑战,研究人员提出了零样本/少样本异常检测技术,这些技术对每个类别只需要极少的正常图像。然而,复杂的工业场景中往往包含多个物体,这带来了显著的挑战。为此,我们提出了一种简单而强大的多尺度记忆比较框架,用于零样本/少样本异常检测。我们的方法使用全局记忆库来捕捉整幅图像的特征,而个体记忆库则专注于仅含单一物体的简化场景。该方法的有效性体现在 VAND 竞赛中的优异成绩:零样本赛道第四名、少样本赛道第二名。
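
A memory-comparison anomaly score is typically a nearest-neighbour distance between test patch features and a bank of normal features. The sketch below combines such scores at a coarse (global) and a fine (object-level) scale; it follows the spirit of the abstract but is not the authors' implementation, and all feature dimensions and sizes are toy values.

```python
# Hedged sketch of memory-comparison anomaly scoring (in the spirit of the
# paper, not its implementation): patch features of a test image are compared
# to memory banks of normal features by nearest-neighbour distance at two
# scales, and the distances are combined into an anomaly map.
import torch

def anomaly_map(test_feats, memory, size):
    """test_feats: (N, D) patch features; memory: (M, D); size: (h, w)."""
    dists = torch.cdist(test_feats, memory)        # (N, M) pairwise distances
    score = dists.min(dim=1).values                # distance to nearest normal feature
    return score.reshape(size)

if __name__ == "__main__":
    torch.manual_seed(0)
    # toy "global" (coarse) and "individual" (fine) memory banks of normal features
    mem_coarse, mem_fine = torch.randn(100, 16), torch.randn(400, 16)
    feats_coarse, feats_fine = torch.randn(8 * 8, 16), torch.randn(16 * 16, 16)
    coarse = anomaly_map(feats_coarse, mem_coarse, (8, 8))
    fine = anomaly_map(feats_fine, mem_fine, (16, 16))
    # upsample the coarse map and average the two scales
    coarse_up = torch.nn.functional.interpolate(
        coarse[None, None], size=(16, 16), mode="bilinear", align_corners=False)[0, 0]
    combined = 0.5 * (coarse_up + fine)
    print(combined.shape)     # torch.Size([16, 16])
```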

PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2308.04782
  • repo_url: None
  • paper_authors: Mingzhi Yuan, Kexue Fu, Zhihao Li, Yucong Meng, Manning Wang
  • for: 本研究旨在提出一种基于学习的点云注册方法,以便更好地利用RGB-D数据,提高注册精度。
  • methods: 我们提出了一种基于多尺度双向融合的点云注册网络,通过双向融合视觉和 геометрические特征,从多个尺度获取更多的特征,提高对应关系估计的精度。
  • results: 我们在ScanNet和3DMatch上进行了广泛的实验,结果显示,我们的方法可以达到新的州OF-THE-ART性能水平。
    Abstract Point cloud registration is a task to estimate the rigid transformation between two unaligned scans, which plays an important role in many computer vision applications. Previous learning-based works commonly focus on supervised registration, which have limitations in practice. Recently, with the advance of inexpensive RGB-D sensors, several learning-based works utilize RGB-D data to achieve unsupervised registration. However, most of existing unsupervised methods follow a cascaded design or fuse RGB-D data in a unidirectional manner, which do not fully exploit the complementary information in the RGB-D data. To leverage the complementary information more effectively, we propose a network implementing multi-scale bidirectional fusion between RGB images and point clouds generated from depth images. By bidirectionally fusing visual and geometric features in multi-scales, more distinctive deep features for correspondence estimation can be obtained, making our registration more accurate. Extensive experiments on ScanNet and 3DMatch demonstrate that our method achieves new state-of-the-art performance. Code will be released at https://github.com/phdymz/PointMBF
    摘要 点云配准旨在估计两个未对齐扫描之间的刚体变换,在许多计算机视觉应用中扮演着关键角色。先前的基于学习的工作通常关注监督式配准,这在实践中存在局限。随着廉价 RGB-D 传感器的普及,近年来有不少基于学习的工作利用 RGB-D 数据实现无监督配准。然而,现有的无监督方法大多采用级联设计,或仅以单向方式融合 RGB-D 数据,并未充分利用 RGB-D 数据中的互补信息。为了更有效地利用互补信息,我们提出了一种在 RGB 图像与由深度图生成的点云之间进行多尺度双向融合的网络。通过在多个尺度上双向融合视觉特征与几何特征,可以获得更具判别力的深度特征用于对应关系估计,从而提高配准精度。我们在 ScanNet 和 3DMatch 上进行了广泛的实验,结果表明我们的方法达到了新的最优性能。代码将在 https://github.com/phdymz/PointMBF 发布。

SUnAA: Sparse Unmixing using Archetypal Analysis

  • paper_url: http://arxiv.org/abs/2308.04771
  • repo_url: https://github.com/behnoodrasti/sunaa
  • paper_authors: Behnood Rasti, Alexandre Zouaoui, Julien Mairal, Jocelyn Chanussot
  • for: 这篇论文提出了一种基于原型分析(archetypal analysis)的新型稀疏解混技术 SUnAA,用于估计感兴趣端元的稀疏混合比例(丰度)。
  • methods: 该技术首先基于原型分析设计了一个新模型,假设感兴趣的端元是光谱库中端元的凸组合,且感兴趣端元的数量已知;然后提出一个最小化问题。与大多数传统稀疏解混方法不同,这里的最小化问题是非凸的,我们使用活动集算法迭代地最小化优化目标。
  • results: 对于两个 simulated 数据集,结果表明 SUnAA 的性能比传统和先进的方法更好,具体来说是 signal-to-reconstruction error 下降。此外,SUnAA 还应用于 Cuprite 数据集,并与可用的地质地图进行比较。 Qualitative 评估表明 SUnAA 可以成功估计矿物含量,并在主要矿物的检测方面提供了显著改进。
    Abstract This paper introduces a new sparse unmixing technique using archetypal analysis (SUnAA). First, we design a new model based on archetypal analysis. We assume that the endmembers of interest are a convex combination of endmembers provided by a spectral library and that the number of endmembers of interest is known. Then, we propose a minimization problem. Unlike most conventional sparse unmixing methods, here the minimization problem is non-convex. We minimize the optimization objective iteratively using an active set algorithm. Our method is robust to the initialization and only requires the number of endmembers of interest. SUnAA is evaluated using two simulated datasets for which results confirm its better performance over other conventional and advanced techniques in terms of signal-to-reconstruction error. SUnAA is also applied to Cuprite dataset and the results are compared visually with the available geological map provided for this dataset. The qualitative assessment demonstrates the successful estimation of the minerals abundances and significantly improves the detection of dominant minerals compared to the conventional regression-based sparse unmixing methods. The Python implementation of SUnAA can be found at: https://github.com/BehnoodRasti/SUnAA.
    摘要 这篇论文介绍了一种基于原型分析的新型稀疏解混技术(SUnAA)。我们首先基于原型分析设计了一个新模型,假设感兴趣的端元是光谱库中端元的凸组合,且感兴趣端元的数量已知;随后提出一个最小化问题。与大多数传统稀疏解混方法不同,该最小化问题是非凸的,我们使用活动集算法迭代地最小化优化目标。该方法对初始化鲁棒,仅需给定感兴趣端元的数量。SUnAA 在两个模拟数据集上进行了评估,结果证实其在信号-重建误差方面优于传统方法和先进方法。SUnAA 还被应用于 Cuprite 数据集,并与该数据集现有的地质图进行了可视化比较;定性评估表明,与传统的基于回归的稀疏解混方法相比,SUnAA 能成功估计矿物丰度,并显著改进了主要矿物的检测。SUnAA 的 Python 实现见:https://github.com/BehnoodRasti/SUnAA。
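
The SUnAA model constrains both the endmember mixing matrix B and the abundances A to the probability simplex and minimises ||Y - D B A||_F^2. The sketch below solves that non-convex problem with plain alternating projected gradient rather than the authors' active-set algorithm, so it only illustrates the model; step sizes, iteration counts and matrix sizes are arbitrary assumptions.

```python
# Hedged sketch of the SUnAA model (not the released active-set solver): the
# endmembers of interest are convex combinations E = D @ B of library spectra D,
# abundances A also lie on the simplex, and ||Y - D B A||_F^2 is minimised here
# with simple alternating projected gradient steps.
import numpy as np

def project_simplex_cols(m):
    """Project each column of m onto the probability simplex (sum = 1, >= 0)."""
    v = m.T                                           # work row-wise
    u = np.sort(v, axis=1)[:, ::-1]                   # sorted descending
    css = np.cumsum(u, axis=1) - 1.0
    ind = np.arange(1, v.shape[1] + 1)
    rho = (u - css / ind > 0).sum(axis=1, keepdims=True)
    theta = np.take_along_axis(css, rho - 1, axis=1) / rho
    return np.maximum(v - theta, 0.0).T

def sunaa_sketch(Y, D, p, iters=500, lr=1e-4):
    """Alternating projected-gradient sketch with column-simplex B and A."""
    m, n = D.shape[1], Y.shape[1]
    rng = np.random.default_rng(0)
    B = project_simplex_cols(rng.random((m, p)))      # library -> endmembers of interest
    A = project_simplex_cols(rng.random((p, n)))      # abundances per pixel
    for _ in range(iters):
        R = D @ B @ A - Y                             # residual
        A = project_simplex_cols(A - lr * (D @ B).T @ R)
        B = project_simplex_cols(B - lr * D.T @ R @ A.T)
    return B, A

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    D = rng.random((50, 20))                          # library: 50 bands, 20 spectra
    B_true = project_simplex_cols(rng.random((20, 3)))
    A_true = project_simplex_cols(rng.random((3, 100)))
    Y = D @ B_true @ A_true                           # synthetic mixed pixels
    B, A = sunaa_sketch(Y, D, p=3)
    print(np.linalg.norm(Y - D @ B @ A) / np.linalg.norm(Y))  # relative reconstruction error
```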

Objects do not disappear: Video object detection by single-frame object location anticipation

  • paper_url: http://arxiv.org/abs/2308.04770
  • repo_url: https://github.com/l-kid/video-object-detection-by-location-anticipation
  • paper_authors: Xin Liu, Fatemeh Karimi Nejadasl, Jan C. van Gemert, Olaf Booij, Silvia L. Pintea
  • for: 提高视频对象检测精度和效率,以及减少注释成本。
  • methods: 利用视频中对象的连续平滑运动,提高对象检测精度和效率,并减少注释成本。
  • results: 在四个 dataset 上达到了比state-of-the-art更高的 mean average precision,并且提高了计算效率和注释效率。
    Abstract Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes, and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation.
    摘要 视频中的对象通常具有连续的平滑运动。我们利用连续的平滑运动来提高检测精度,并且在三种方面进行利用:1. 使用对象运动作为额外的监督来源,我们通过预测对象位置从静止关键帧中获取。2. 提高效率,只在一小部分帧上进行昂贵的特征计算。因为邻近帧往往是重复的,所以只计算关键帧上的特征,并预测后续帧中对象的位置。3. 降低注释成本,只需注释关键帧,并使用平滑 Pseudo-运动 между关键帧来预测后续帧中对象的位置。我们在四个数据集上进行了比较:ImageNet VID、EPIC KITCHENS-55、YouTube-BoundingBoxes 和 Waymo Open dataset,并demonstrate了计算效率、注释效率以及改进的平均准确率。我们的源代码可以在https://github.com/L-KID/Videoobject-detection-by-location-anticipation上获取。

FaceSkin: A Privacy Preserving Facial skin patch Dataset for multi Attributes classification

  • paper_url: http://arxiv.org/abs/2308.04765
  • repo_url: None
  • paper_authors: Qiushi Guo, Shisha Liao
  • for: attribute classification, such as age, race, and gender
  • methods: utilizes a dataset called FaceSkin, which includes diverse ages and races, as well as synthetic skin-patches from 2D and 3D attack images
  • results: effective in attribute classification and has potential for various downstream tasks, such as Face anti-spoofing and Age estimation.
    Abstract Human facial skin images contain abundant textural information that can serve as valuable features for attribute classification, such as age, race, and gender. Additionally, facial skin images offer the advantages of easy collection and minimal privacy concerns. However, the availability of well-labeled human skin datasets with a sufficient number of images is limited. To address this issue, we introduce a dataset called FaceSkin, which encompasses a diverse range of ages and races. Furthermore, to broaden the application scenarios, we incorporate synthetic skin-patches obtained from 2D and 3D attack images, including printed paper, replays, and 3D masks. We evaluate the FaceSkin dataset across distinct categories and present experimental results demonstrating its effectiveness in attribute classification, as well as its potential for various downstream tasks, such as Face anti-spoofing and Age estimation.
    摘要 人脸皮肤图像含有丰富的文本特征,可以作为年龄、种族和性别等特征的有价值特征。此外,人脸皮肤图像具有易收集和低隐私问题的优点。然而,有限的人脸皮肤数据集的可用性是一个问题。为解决这个问题,我们介绍了一个名为FaceSkin的数据集,该数据集包含多个年龄和种族的多样化图像。此外,为扩展应用场景,我们添加了由2D和3D攻击图像生成的人工皮肤质感补充。我们在不同类别上评估了FaceSkin数据集,并提供了对attribute分类、Face anti-spoofing和年龄估计等下游任务的实验结果,以及其潜在应用场景。

SAfER: Layer-Level Sensitivity Assessment for Efficient and Robust Neural Network Inference

  • paper_url: http://arxiv.org/abs/2308.04753
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: 这个论文的目的是研究深度神经网络(DNN)的行为和决策的原因。
  • methods: 这篇论文借鉴 DNN 归因方法研究输入与预测之间的关系:归因方法可以突出最相关的权重或神经元,从而更高效地选择可被剪枝的权重。在此基础上,本文提出评估各层的重要性,即估计精度对施加在层级上的扰动的敏感性,并对多种评估准则进行了基准比较。
  • results: 本文构建了一个新的数据集来评估所提方法及后续工作,并就如何评估 DNN 层的重要性得出结论,进而指导按层分配资源,以提高 DNN 效率(可用于剪枝与量化)并增强对硬件故障(如比特翻转)的鲁棒性。
    Abstract Deep neural networks (DNNs) demonstrate outstanding performance across most computer vision tasks. Some critical applications, such as autonomous driving or medical imaging, also require investigation into their behavior and the reasons behind the decisions they make. In this vein, DNN attribution consists in studying the relationship between the predictions of a DNN and its inputs. Attribution methods have been adapted to highlight the most relevant weights or neurons in a DNN, allowing to more efficiently select which weights or neurons can be pruned. However, a limitation of these approaches is that weights are typically compared within each layer separately, while some layers might appear as more critical than others. In this work, we propose to investigate DNN layer importance, i.e. to estimate the sensitivity of the accuracy w.r.t. perturbations applied at the layer level. To do so, we propose a novel dataset to evaluate our method as well as future works. We benchmark a number of criteria and draw conclusions regarding how to assess DNN layer importance and, consequently, how to budgetize layers for increased DNN efficiency (with applications for DNN pruning and quantization), as well as robustness to hardware failure (e.g. bit swaps).
    摘要 深度神经网络(DNN)在大多数计算机视觉任务上表现出色。在自动驾驶或医学影像等关键应用中,还需要研究其行为以及做出决策的原因。在这一方向上,DNN 归因研究的是网络预测与输入之间的关系;归因方法已被用于突出 DNN 中最相关的权重或神经元,从而更高效地选择可被剪枝的权重或神经元。然而,这类方法的一个局限是通常只在每一层内部比较权重,而某些层可能比其他层更为关键。在这项工作中,我们提出研究 DNN 层的重要性,即估计精度对施加在层级上的扰动的敏感性。为此,我们构建了一个新的数据集,用于评估我们的方法及后续工作。我们对多种评估准则进行了基准比较,并就如何评估 DNN 层重要性得出结论,进而指导按层分配资源,以提高 DNN 效率(可应用于剪枝与量化)以及对硬件故障(如比特翻转)的鲁棒性。
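
The paper defines layer importance as the sensitivity of accuracy to layer-level perturbations and benchmarks several criteria. The sketch below implements one simple such criterion — loss increase under Gaussian weight noise, layer by layer — as a hedged stand-in for the paper's protocol; the noise scale and probe batch are assumptions.

```python
# Hedged sketch of one simple layer-sensitivity criterion (the paper benchmarks
# several; this is not their exact protocol): perturb each layer's weights with
# Gaussian noise, measure the loss increase on a probe batch, then restore the
# weights. Larger increases indicate more critical layers.
import torch

def layer_sensitivity(model, inputs, targets, noise_std=0.05):
    criterion = torch.nn.CrossEntropyLoss()
    with torch.no_grad():
        base = criterion(model(inputs), targets).item()
    scores = {}
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue
        backup = param.detach().clone()
        with torch.no_grad():
            param.add_(noise_std * param.std() * torch.randn_like(param))
            scores[name] = criterion(model(inputs), targets).item() - base
            param.copy_(backup)                      # restore original weights
    return scores

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
    x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
    for layer, delta in layer_sensitivity(model, x, y).items():
        print(f"{layer}: loss increase {delta:+.4f}")
```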

TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design

  • paper_url: http://arxiv.org/abs/2308.04733
  • repo_url: None
  • paper_authors: Yifan Gao, Jinpeng Lin, Min Zhou, Chuanbin Liu, Hongtao Xie, Tiezheng Ge, Yuning Jiang
  • for: This paper is written for researchers and practitioners in the field of text design and multimodal processing, with a focus on generating visually-and-semantically-harmonious text images for posters.
  • methods: The paper proposes a novel multimodal approach called TextPainter, which leverages contextual visual information and corresponding text semantics to generate text images. The approach takes the global-local background image as a hint of style and guides the text image generation with visual harmony. Additionally, the paper introduces a text comprehension module to achieve both sentence-level and word-level style variations.
  • results: The paper presents extensive quantitative and qualitative experiments that demonstrate the effectiveness of TextPainter in generating visually-and-semantically-harmonious text images for posters. The results show that TextPainter can generate high-quality text images that are both aesthetically pleasing and semantically consistent with the context.
    Abstract Text design is one of the most critical procedures in poster design, as it relies heavily on the creativity and expertise of humans to design text images considering the visual harmony and text-semantic. This study introduces TextPainter, a novel multimodal approach that leverages contextual visual information and corresponding text semantics to generate text images. Specifically, TextPainter takes the global-local background image as a hint of style and guides the text image generation with visual harmony. Furthermore, we leverage the language model and introduce a text comprehension module to achieve both sentence-level and word-level style variations. Besides, we construct the PosterT80K dataset, consisting of about 80K posters annotated with sentence-level bounding boxes and text contents. We hope this dataset will pave the way for further research on multimodal text image generation. Extensive quantitative and qualitative experiments demonstrate that TextPainter can generate visually-and-semantically-harmonious text images for posters.
    摘要 文本设计是海报设计中最重要的过程之一,因为它几乎完全依赖人类的创造力和专业知识来设计文本图像,考虑到视觉和文本 semantics。本研究介绍了 TextPainter,一种新的多Modal方法,利用上下文ual visual information和相应的文本 semantics来生成文本图像。具体来说,TextPainter 利用全局-局部背景图像作为风格的提示,引导文本图像生成,同时还利用语言模型和引入文本理解模块,实现句子级和单词级样式变化。此外,我们构建了 PosterT80K 数据集,包含约80K 海报,每个海报都有 sentence-level bounding box 和文本内容。我们希望这个数据集能够推动未来的多Modal文本图像生成研究。EXTENSIVE 量化和质量实验表明,TextPainter 可以生成视觉和semantically 和谐的文本图像。

Self-supervised Learning of Rotation-invariant 3D Point Set Features using Transformer and its Self-distillation

  • paper_url: http://arxiv.org/abs/2308.04725
  • repo_url: None
  • paper_authors: Takahiko Furuya, Zhoujie Chen, Ryutarou Ohbuchi, Zhenzhong Kuang
  • for: 本研究提出了一种自监督学习框架,用于从大量未标注的3D点云数据中学习精确且旋转不变的物体级3D形状特征。
  • methods: 我们提出的轻量级网络将输入3D点云分解为多个保持局部形状空间布局的全局尺度区域(token),并使用自注意力机制细化这些 token,再将其汇聚成表示整个3D点云的旋转不变特征;网络通过自蒸馏框架生成的伪标签进行训练,并结合 multi-crop 与 cut-mix 数据增强来丰富训练点云的多样性。
  • results: 实验表明,为监督学习设计的现有旋转不变网络在自监督场景下未必能学到精确的3D形状特征,而我们的算法学到的旋转不变3D点云特征比现有算法更精确。
    Abstract Invariance against rotations of 3D objects is an important property in analyzing 3D point set data. Conventional 3D point set DNNs having rotation invariance typically obtain accurate 3D shape features via supervised learning by using labeled 3D point sets as training samples. However, due to the rapid increase in 3D point set data and the high cost of labeling, a framework to learn rotation-invariant 3D shape features from numerous unlabeled 3D point sets is required. This paper proposes a novel self-supervised learning framework for acquiring accurate and rotation-invariant 3D point set features at object-level. Our proposed lightweight DNN architecture decomposes an input 3D point set into multiple global-scale regions, called tokens, that preserve the spatial layout of partial shapes composing the 3D object. We employ a self-attention mechanism to refine the tokens and aggregate them into an expressive rotation-invariant feature per 3D point set. Our DNN is effectively trained by using pseudo-labels generated by a self-distillation framework. To facilitate the learning of accurate features, we propose to combine multi-crop and cut-mix data augmentation techniques to diversify 3D point sets for training. Through a comprehensive evaluation, we empirically demonstrate that, (1) existing rotation-invariant DNN architectures designed for supervised learning do not necessarily learn accurate 3D shape features under a self-supervised learning scenario, and (2) our proposed algorithm learns rotation-invariant 3D point set features that are more accurate than those learned by existing algorithms. Code will be available at https://github.com/takahikof/RIPT_SDMM
    摘要 “三维点云集数据中的不变性对三维物体的分析是非常重要的属性。传统的三维点云集DNN通常通过监督学习使用标签的三维点云集来获取精确的三维形状特征。但由于三维点云集数据的快速增长和标签成本的高昂,需要一个框架可以从大量未标签的三维点云集中学习精确的三维形状特征。本文提出了一个新的自我监督学习框架,可以从许多未标签的三维点云集中学习精确且不变性的三维形状特征。我们的提案的轻量级DNN架构可以将输入的三维点云集分解为多个全球缩尺的区域,称为“token”,这些区域可以保持三维物体中的空间布局。我们还使用自我注意力机制来精确地调整token,并将其聚合为一个表达三维点云集不变性的特征。我们的DNN可以通过使用自我养分析框架生成的pseudo-labels进行有效地训练。为了让学习精确的特征,我们提议使用多个拼接和切割资料增强技术来让训练集中的3D点云集更加多样化。经过实验验证,我们证明了以下两点:(1)现有的不变性DNN架构,设计来进行监督学习情况下,不一定会学习精确的三维形状特征;(2)我们的提案的算法可以从未标签的三维点云集中学习精确且不变性的三维形状特征,并且比现有的算法更精确。代码将会在https://github.com/takahikof/RIPT_SDMM中公开。”

Continual Road-Scene Semantic Segmentation via Feature-Aligned Symmetric Multi-Modal Network

  • paper_url: http://arxiv.org/abs/2308.04702
  • repo_url: None
  • paper_authors: Francesco Barbato, Elena Camuffo, Simone Milani, Pietro Zanuttigh
  • for: 这个研究旨在探讨多modal semantic segmentation的紧密结构和Symmetric information-sharing scheme,以实现当一个输入模式缺失时仍能正确显示结果。
  • methods: 本研究使用紧密结构和Symmetric information-sharing scheme,实现多modal semantic segmentation的稳定性和可靠性。
  • results: 在 SemanticKITTI 数据集上与最接近的竞争方法进行了比较评估,取得了良好的结果。此外,还引入了一个专门的持续学习方案,并在类增量持续学习场景中验证了该方法的有效性。
    Abstract State-of-the-art multimodal semantic segmentation approaches combining LiDAR and color data are usually designed on top of asymmetric information-sharing schemes and assume that both modalities are always available. Regrettably, this strong assumption may not hold in real-world scenarios, where sensors are prone to failure or can face adverse conditions (night-time, rain, fog, etc.) that make the acquired information unreliable. Moreover, these architectures tend to fail in continual learning scenarios. In this work, we re-frame the task of multimodal semantic segmentation by enforcing a tightly-coupled feature representation and a symmetric information-sharing scheme, which allows our approach to work even when one of the input modalities is missing. This makes our model reliable even in safety-critical settings, as is the case of autonomous driving. We evaluate our approach on the SemanticKITTI dataset, comparing it with our closest competitor. We also introduce an ad-hoc continual learning scheme and show results in a class-incremental continual learning scenario that prove the effectiveness of the approach also in this setting.
    摘要 现代多模态Semantic segmentation方法通常基于不均衡信息分享模式和假设所有感知数据都可用。可惜,这强制假设在实际场景中可能不成立,感知器容易出现故障或面临不良天气(夜晚、雨、雾等),导致获取到的信息不可靠。此外,这些架构在连续学习场景下也存在问题。在这项工作中,我们重新定义多模态Semantic segmentation任务,强制实施紧密相关的特征表示和 symmetrical information-sharing模式,使我们的方法能够在一个模式缺失时仍然可靠。这使我们的模型在安全关键的应用场景中可靠,如自动驾驶。我们在SemanticKITTI数据集上评估我们的方法,与 closest competitor进行比较。我们还引入了特殊的连续学习方案,并在类增量连续学习场景中展示结果,证明了我们的方法在这种场景中的有效性。

GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization

  • paper_url: http://arxiv.org/abs/2308.04699
  • repo_url: https://github.com/ffhibnese/gifd
  • paper_authors: Hao Fang, Bin Chen, Xuan Wang, Zhi Wang, Shu-Tao Xia
  • for: This paper studies the privacy risk of Federated Learning (FL) by proposing a stronger gradient inversion attack that recovers clients' sensitive data from the shared gradients.
  • methods: The proposed method, called Gradient Inversion over Feature Domains (GIFD), disassembles the Generative Adversarial Network (GAN) model and searches for feature domains in the intermediate layers. It also includes a regularizer to avoid unreal image generation.
  • results: The proposed method achieves pixel-level reconstruction and outperforms existing methods. It also demonstrates great generalizability under different defense strategy settings and batch sizes.
  • for: 这篇论文通过提出更强的梯度反演攻击,研究联邦学习(Federated Learning,FL)中共享梯度所带来的客户端隐私泄露风险。
  • methods: 提议的方法是Gradient Inversion over Feature Domains(GIFD),它将GAN模型分解成多个层次结构,并在这些层次结构中搜索特征领域。它还包括一个正则项来避免生成不实际的图像。
  • results: 提议的方法可以实现像素级重建,并超越现有的方法。它还在不同的防御策略设置和批处大小下展现出了优秀的一致性。
    Abstract Federated Learning (FL) has recently emerged as a promising distributed machine learning framework to preserve clients' privacy, by allowing multiple clients to upload the gradients calculated from their local data to a central server. Recent studies find that the exchanged gradients also take the risk of privacy leakage, e.g., an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. However, performing gradient inversion attacks in the latent space of the GAN model limits their expression ability and generalizability. To tackle these challenges, we propose \textbf{G}radient \textbf{I}nversion over \textbf{F}eature \textbf{D}omains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unreal image generation by adding a small ${l_1}$ ball constraint to the searching range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and is superior to the existing methods. Notably, GIFD also shows great generalizability under different defense strategy settings and batch sizes.
    摘要 联邦学习(FL)近来成为一种有前景的分布式机器学习框架,允许多个客户端将基于本地数据计算的梯度上传到中央服务器,从而保护客户端隐私。然而,最新研究表明,交换的梯度同样存在隐私泄露风险:攻击者可以借助预训练的生成对抗网络(GAN)作为先验知识,对共享梯度进行反演并恢复敏感数据。但在 GAN 的潜空间中进行梯度反演攻击会限制其表达能力和泛化性。为了解决这些问题,我们提出了 GIFD(Gradient Inversion over Feature Domains),它将 GAN 模型拆解开来,在中间层的特征域中进行搜索:不再只优化初始潜码,而是从初始潜空间逐步转向更接近输出图像的中间层。我们还设计了一个正则项,在搜索范围上施加一个小的 $l_1$ 球约束,以避免生成不真实的图像。此外,我们将 GIFD 扩展到分布外(OOD)设置,放宽了 GAN 训练集与 FL 任务数据同分布的假设。大量实验表明,我们的方法可以实现像素级重建,优于现有方法,并且在不同的防御策略设置和批大小下表现出很强的泛化能力。
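For readers unfamiliar with gradient inversion, the following is a minimal sketch of the generic gradient-matching attack that GIFD builds on, not the paper's GAN feature-domain search: a dummy input is optimized so that its gradients match the gradients shared in FL. The tiny linear model, the cosine matching loss, the assumption that the label is already known to the attacker, and the clamp used as a stand-in feasibility constraint are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Victim model and one private (image, label) pair on the "client".
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
x_true = torch.rand(1, 3, 32, 32)
y_true = torch.tensor([3])  # assume the label is known/recovered by the attacker
true_grads = torch.autograd.grad(F.cross_entropy(model(x_true), y_true),
                                 model.parameters())

# Attacker: optimize a dummy input so its gradients match the shared ones.
x_hat = torch.rand(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([x_hat], lr=0.1)
for step in range(200):
    opt.zero_grad()
    grads = torch.autograd.grad(F.cross_entropy(model(x_hat), y_true),
                                model.parameters(), create_graph=True)
    # Cosine gradient-matching loss (a common choice; GIFD additionally searches
    # GAN feature domains and keeps them inside a small l1 ball).
    loss = sum(1 - F.cosine_similarity(g.flatten(), t.flatten(), dim=0)
               for g, t in zip(grads, true_grads))
    loss.backward()
    opt.step()
    x_hat.data.clamp_(0, 1)  # stand-in feasibility constraint on the search space

print("final gradient-matching loss:", float(loss))
```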

Score Priors Guided Deep Variational Inference for Unsupervised Real-World Single Image Denoising

  • paper_url: http://arxiv.org/abs/2308.04682
  • repo_url: None
  • paper_authors: Jun Cheng, Tao Liu, Shan Tan
  • for: 这个论文主要关注的是实际世界单一图像去噪的问题。
  • methods: 该论文提出了一种由分数先验引导的深度变分推断框架,即 ScoreDVI,用于解决实际场景中的单幅图像去噪问题。该方法利用易于获取的最小 MSE 非 $i.i.d$ 高斯去噪器和变分样本来提取分数先验,从而简化了后验推断的过程。
  • results: 该论文的方法比其他单一图像基于的实际世界去噪方法表现更好,并且与 dataset-based 无监督方法相似。
    Abstract Real-world single image denoising is crucial and practical in computer vision. Bayesian inversions combined with score priors now have proven effective for single image denoising but are limited to white Gaussian noise. Moreover, applying existing score-based methods for real-world denoising requires not only the explicit train of score priors on the target domain but also the careful design of sampling procedures for posterior inference, which is complicated and impractical. To address these limitations, we propose a score priors-guided deep variational inference, namely ScoreDVI, for practical real-world denoising. By considering the deep variational image posterior with a Gaussian form, score priors are extracted based on easily accessible minimum MSE Non-$i.i.d$ Gaussian denoisers and variational samples, which in turn facilitate optimizing the variational image posterior. Such a procedure adaptively applies cheap score priors to denoising. Additionally, we exploit a Non-$i.i.d$ Gaussian mixture model and variational noise posterior to model the real-world noise. This scheme also enables the pixel-wise fusion of multiple image priors and variational image posteriors. Besides, we develop a noise-aware prior assignment strategy that dynamically adjusts the weight of image priors in the optimization. Our method outperforms other single image-based real-world denoising methods and achieves comparable performance to dataset-based unsupervised methods.
    摘要 实际场景中的单幅图像去噪在计算机视觉中具有重要的实用价值。结合分数先验的贝叶斯反演已被证明对单幅图像去噪有效,但仅限于白高斯噪声;而且将现有基于分数的方法用于真实噪声,不仅需要在目标域上显式训练分数先验,还需要精心设计后验推断的采样过程,既复杂又不实用。为了解决这些局限,我们提出了由分数先验引导的深度变分推断方法 ScoreDVI,用于实际场景的去噪。通过将深度变分图像后验设为高斯形式,分数先验可以基于易于获取的最小 MSE 非 $i.i.d$ 高斯去噪器和变分样本提取,并进而用于优化变分图像后验,从而以较低代价自适应地将分数先验用于去噪。此外,我们使用非 $i.i.d$ 高斯混合模型和变分噪声后验来建模真实噪声,该方案还支持多个图像先验与变分图像后验的逐像素融合。我们还设计了噪声感知的先验分配策略,在优化过程中动态调整各图像先验的权重。我们的方法优于其他基于单幅图像的真实场景去噪方法,并取得了与基于数据集的无监督方法相当的性能。

A General Implicit Framework for Fast NeRF Composition and Rendering

  • paper_url: http://arxiv.org/abs/2308.04669
  • repo_url: None
  • paper_authors: Xinyu Gao, Ziyi Yang, Yunlu Zhao, Yuxiang Sun, Xiaogang Jin, Changqing Zou
  • for: 这篇论文主要是为了提高NeRF对象的速度compositing,使其能够在实时中进行多个NeRF对象的组合和预览。
  • methods: 该方法使用了一种新的表面表示方式 called Neural Depth Fields (NeDF), 它可以快速确定物体之间的空间关系,并且可以使用抽象光源来渲染动态阴影。
  • results: 该方法可以快速地进行NeRF对象的组合和预览,并且可以在实时中进行多个NeRF对象的组合和预览。此外,该方法还可以作为现有NeRF作品的预览插件使用。
    Abstract A variety of Neural Radiance Fields (NeRF) methods have recently achieved remarkable success in high render speed. However, current accelerating methods are specialized and incompatible with various implicit methods, preventing real-time composition over various types of NeRF works. Because NeRF relies on sampling along rays, it is possible to provide general guidance for acceleration. To that end, we propose a general implicit pipeline for composing NeRF objects quickly. Our method enables the casting of dynamic shadows within or between objects using analytical light sources while allowing multiple NeRF objects to be seamlessly placed and rendered together with any arbitrary rigid transformations. Mainly, our work introduces a new surface representation known as Neural Depth Fields (NeDF) that quickly determines the spatial relationship between objects by allowing direct intersection computation between rays and implicit surfaces. It leverages an intersection neural network to query NeRF for acceleration instead of depending on an explicit spatial structure.Our proposed method is the first to enable both the progressive and interactive composition of NeRF objects. Additionally, it also serves as a previewing plugin for a range of existing NeRF works.
    摘要 各种神经辐射场(NeRF)方法在最近几年内取得了显著的成功,但现有的加速方法具有特定的限制,无法与各种隐式方法兼容,因此在实时组合不同类型的 NeRF 作品中存在限制。由于 NeRF 通过抽象线段进行抽象,因此可以提供一般的指导方针 для加速。为了实现这一目标,我们提议一种通用的隐式管道,用于快速组合 NeRF 对象。我们的方法允许在动态阴影中投射analytical 光源,并允许多个 NeRF 对象在任意的旋转变换下进行平铺渲染。主要地,我们的工作引入了一种新的表面表示方式,称为神经深度场(NeDF),它快速确定了物体之间的空间关系,通过直接计算抽象线段与隐式表面之间的交点。它利用了交叉神经网络来查询 NeRF 的加速而不是依赖于显式空间结构。我们的提议方法是首个允许 NeRF 对象进行进度式和交互式组合。此外,它还可以作为许多现有 NeRF 作品的预览插件。

Classification of lung cancer subtypes on CT images with synthetic pathological priors

  • paper_url: http://arxiv.org/abs/2308.04663
  • repo_url: None
  • paper_authors: Wentao Zhu, Yuan Jin, Gege Ma, Geng Chen, Jan Egger, Shaoting Zhang, Dimitris N. Metaxas
  • for: 针对肺癌分型的精准诊断,以提高后续治疗和诊断管理的重要性。
  • methods: 提出了自生成混合特征网络(SGHF-Net):其病理特征合成模块(PFSM)使用深度神经网络定量建模跨模态关联,从 CT 图像中推导对应病理图像所蕴含的“金标准”信息;再结合放射学特征提取模块(RFEM)直接获取 CT 图像信息,在有效的特征融合框架下整合两者,使模型生成更具指示性和特异性的病理相关特征,最终输出更准确的预测结果。
  • results: 对于肺癌分型的类型,SGHF-Net 模型比 SOTA 模型具有显著的高精准度,包括准确率(ACC)、曲线面积(AUC)和 F1 分数等指标均有显著提高。
    Abstract The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns between the same case's CT images and its pathological images, we innovatively developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks, to derive the "gold standard" information contained in the corresponding pathological images from CT images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathologically related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model and a series of state-of-the-art (SOTA) classification models. The experimental results demonstrated the superiority of our model for lung cancer subtypes classification with significant accuracy improvements in terms of accuracy (ACC), area under the curve (AUC), and F1 score.
    摘要 肺癌病理亚型的精准诊断对后续治疗和预后管理具有重要意义。本文提出一种自生成混合特征网络(SGHF-Net),用于在计算机断层扫描(CT)图像上精准分类肺癌亚型。研究表明,同一患者的 CT 图像与病理图像之间存在跨尺度、跨模态的关联,因此我们设计了病理特征合成模块(PFSM),通过深度神经网络将 CT 图像中蕴含的病理“金标准”信息定量地提取出来,并与放射学特征提取模块(RFEM)相结合,在有效的特征融合框架下生成更具指示性和特异性的病理相关特征,最终输出更准确的预测结果。该模型的优势在于可以基于单模态输入自生成包含多模态图像信息的混合特征。为评估模型的有效性、适应性和泛化能力,我们在三家医院的大规模多中心数据集(共 829 例)上与一系列最新(SOTA)分类模型进行了比较。实验结果表明,我们的模型在肺癌亚型分类中取得了显著的精度提升,在准确率(ACC)、曲线下面积(AUC)和 F1 分数等指标上均优于现有方法。

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.04657
  • repo_url: https://github.com/JoakimHaurum/TokenReduction
  • paper_authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
  • for: 这篇论文旨在理解不同 token reduction 方法在不同图像分类任务中的削减模式。
  • methods: 论文在四个图像分类数据集上对 10 种不同的 token reduction 方法进行了系统比较。
  • results: 研究发现 Top-K 剪枝方法是一个出人意料的强基线。深入分析表明:当骨干模型容量变化时,削减模式通常并不一致;基于剪枝的方法的削减模式与固定的径向模式差异显著,且在不同分类数据集之间相互关联;削减模式的相似度是模型性能的中等至强代理指标。项目页面:https://vap.aau.dk/tokens。
    Abstract Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs more efficient by removing redundant information in the processed tokens. While different methods have been explored to achieve this goal, we still lack understanding of the resulting reduction patterns and how those patterns differ across token reduction methods and datasets. To close this gap, we set out to understand the reduction patterns of 10 different token reduction methods using four image classification datasets. By systematically comparing these methods on the different classification tasks, we find that the Top-K pruning method is a surprisingly strong baseline. Through in-depth analysis of the different methods, we determine that: the reduction patterns are generally not consistent when varying the capacity of the backbone model, the reduction patterns of pruning-based methods significantly differ from fixed radial patterns, and the reduction patterns of pruning-based methods are correlated across classification datasets. Finally we report that the similarity of reduction patterns is a moderate-to-strong proxy for model performance. Project page at https://vap.aau.dk/tokens.
    摘要 自 Vision Transformer(ViT)提出以来,研究者一直试图通过去除被处理 token 中的冗余信息来提升 ViT 的效率。尽管已经探索了多种方法,我们仍缺乏对由此产生的削减模式、以及这些模式在不同削减方法和数据集之间差异的理解。为填补这一空白,我们在四个图像分类数据集上研究了 10 种不同的 token reduction 方法的削减模式。通过在不同分类任务上系统比较这些方法,我们发现:1)Top-K 剪枝方法是一个出人意料的强基线;2)当骨干模型容量变化时,削减模式通常并不一致;3)基于剪枝的方法的削减模式与固定的径向模式差异显著;4)基于剪枝的方法的削减模式在不同分类数据集之间相互关联;5)削减模式之间的相似度是模型性能的中等至强代理指标。更多信息请访问项目页面:https://vap.aau.dk/tokens。
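The Top-K pruning baseline highlighted above can be sketched in a few lines: score each patch token (for example by its attention from the CLS token) and keep only the K highest-scoring tokens. The tensor shapes and the use of CLS attention as the score are assumptions for illustration.

```python
import torch

def topk_prune(tokens, cls_attn, keep_ratio=0.5):
    """Keep the patch tokens with the highest CLS-attention score.

    tokens:   (B, 1 + N, D)  CLS token followed by N patch tokens.
    cls_attn: (B, N)         attention from the CLS token to each patch,
                             e.g. averaged over heads of one ViT block.
    Returns (B, 1 + K, D) with the CLS token kept and K = keep_ratio * N.
    """
    B, _, D = tokens.shape
    n_patches = tokens.shape[1] - 1
    k = max(1, int(n_patches * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                 # (B, K) indices into patches
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)      # (B, K, D)
    kept = torch.gather(tokens[:, 1:], 1, gather_idx)     # selected patch tokens
    return torch.cat([tokens[:, :1], kept], dim=1)

tokens = torch.randn(2, 1 + 196, 384)       # e.g. ViT-S with 14x14 patches
cls_attn = torch.rand(2, 196)
print(topk_prune(tokens, cls_attn, 0.25).shape)  # torch.Size([2, 50, 384])
```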

Assessing the performance of deep learning-based models for prostate cancer segmentation using uncertainty scores

  • paper_url: http://arxiv.org/abs/2308.04653
  • repo_url: None
  • paper_authors: Pablo Cesar Quihui-Rubio, Daniel Flores-Araiza, Gilberto Ochoa-Ruiz, Miguel Gonzalez-Mendoza, Christian Mata
  • for: 这项研究旨在比较深度学习方法在前列腺磁共振(MRI)图像上的分割性能及不确定性量化,以改进前列腺癌检测与诊断流程。
  • methods: 研究评估了七种不同的 U-Net 架构,均加入了 Monte Carlo dropout 以进行不确定性估计。
  • results: 研究发现 Attention R2U-Net 模型取得了最高的平均交并比(IoU)和 Dice 相似系数(DSC),能够准确分割所有区域,并且在移行区(transition zone)和肿瘤边界处的不确定性最低。
    Abstract This study focuses on comparing deep learning methods for the segmentation and quantification of uncertainty in prostate segmentation from MRI images. The aim is to improve the workflow of prostate cancer detection and diagnosis. Seven different U-Net-based architectures, augmented with Monte-Carlo dropout, are evaluated for automatic segmentation of the central zone, peripheral zone, transition zone, and tumor, with uncertainty estimation. The top-performing model in this study is the Attention R2U-Net, achieving a mean Intersection over Union (IoU) of 76.3% and Dice Similarity Coefficient (DSC) of 85% for segmenting all zones. Additionally, Attention R2U-Net exhibits the lowest uncertainty values, particularly in the boundaries of the transition zone and tumor, when compared to the other models.
    摘要 本研究比较了多种深度学习方法在前列腺 MRI 图像上的分割与不确定性量化性能,旨在改进前列腺癌检测与诊断的工作流程。我们评估了七种加入 Monte Carlo dropout 的 U-Net 架构,用于自动分割中央区、外周区、移行区和肿瘤,并进行不确定性估计。其中表现最佳的是 Attention R2U-Net,其分割所有区域的平均交并比(IoU)为 76.3%,Dice 相似系数(DSC)为 85%;与其他模型相比,Attention R2U-Net 的不确定性也最低,尤其是在移行区和肿瘤的边界处。
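A minimal sketch of the Monte-Carlo-dropout uncertainty estimation described above (the toy network, dropout rate, and number of forward passes are assumptions): dropout is kept active at test time, several stochastic predictions are averaged, and the predictive entropy serves as a per-pixel uncertainty map.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy segmentation net with dropout so MC dropout can be illustrated."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(p=0.5),
            nn.Conv2d(16, n_classes, 1),
        )
    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    model.train()  # keep dropout active at test time (MC dropout)
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                                   # (B, C, H, W)
    # Predictive entropy as a per-pixel uncertainty score.
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)
    return mean.argmax(dim=1), entropy

model = TinySegNet()
seg, unc = mc_dropout_predict(model, torch.randn(1, 1, 64, 64))
print(seg.shape, unc.shape)  # torch.Size([1, 64, 64]) for both outputs
```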

Long-Distance Gesture Recognition using Dynamic Neural Networks

  • paper_url: http://arxiv.org/abs/2308.04643
  • repo_url: None
  • paper_authors: Shubhang Bhatnagar, Sharath Gopal, Narendra Ahuja, Liu Ren
  • for: 本研究旨在提出一种新的、准确和高效的手势识别方法,可以在更远的距离上识别手势。
  • methods: 该方法使用动态神经网络选择手势包含的空间区域中的特征进行进一步处理,以提高识别精度和计算效率。
  • results: 在LD-ConGR长距离数据集上,该方法与前一代方法相比,在识别精度和计算效率两个方面均有显著提高。
    Abstract Gestures form an important medium of communication between humans and machines. An overwhelming majority of existing gesture recognition methods are tailored to a scenario where humans and machines are located very close to each other. This short-distance assumption does not hold true for several types of interactions, for example gesture-based interactions with a floor cleaning robot or with a drone. Methods made for short-distance recognition are unable to perform well on long-distance recognition due to gestures occupying only a small portion of the input data. Their performance is especially worse in resource constrained settings where they are not able to effectively focus their limited compute on the gesturing subject. We propose a novel, accurate and efficient method for the recognition of gestures from longer distances. It uses a dynamic neural network to select features from gesture-containing spatial regions of the input sensor data for further processing. This helps the network focus on features important for gesture recognition while discarding background features early on, thus making it more compute efficient compared to other techniques. We demonstrate the performance of our method on the LD-ConGR long-distance dataset where it outperforms previous state-of-the-art methods on recognition accuracy and compute efficiency.
    摘要 手势是人与机器之间沟通的重要媒介。现有的绝大多数手势识别方法都是针对人与机器距离很近的场景设计的,而这一假设并不适用于某些交互,例如与扫地机器人或无人机的手势交互。为短距离设计的方法在长距离识别中表现不佳,因为手势只占输入数据中的一小部分;在计算资源受限的情况下,它们尤其难以将有限的算力有效集中在做手势的人身上。我们提出了一种新的、准确且高效的长距离手势识别方法:利用动态神经网络,从输入传感器数据中选择包含手势的空间区域进行后续处理。这使网络能够专注于对手势识别重要的特征,并尽早丢弃背景特征,从而比其他技术更节省算力。我们在 LD-ConGR 长距离数据集上验证了该方法,其在识别精度和计算效率上均优于之前的最先进方法。

GeoAdapt: Self-Supervised Test-Time Adaption in LiDAR Place Recognition Using Geometric Priors

  • paper_url: http://arxiv.org/abs/2308.04638
  • repo_url: https://github.com/csiro-robotics/geoadapt
  • paper_authors: Joshua Knights, Stephen Hausler, Sridha Sridharan, Clinton Fookes, Peyman Moghadam
  • for: 提高 LiDAR 场景认知系统在不同环境下的性能,尤其是在训练和测试数据集中存在域名shift的情况下。
  • methods: 提出了一种基于深度学习的自动生成 pseudo-标签的方法,通过自我监督学习来提高模型在不同环境下的性能和可靠性。
  • results: 实验表明,GeoAdapt 可以在中度至严重的域名shift情况下显著提高场景认知性能,并与完全监督的测试时适应方法相比赛得竞争力。
    Abstract LiDAR place recognition approaches based on deep learning suffer a significant degradation in performance when there is a shift between the distribution of the training and testing datasets, with re-training often required to achieve top performance. However, obtaining accurate ground truth on new environments can be prohibitively expensive, especially in complex or GPS-deprived environments. To address this issue we propose GeoAdapt, which introduces a novel auxiliary classification head to generate pseudo-labels for re-training on unseen environments in a self-supervised manner. GeoAdapt uses geometric consistency as a prior to improve the robustness of our generated pseudo-labels against domain shift, improving the performance and reliability of our Test-Time Adaptation approach. Comprehensive experiments show that GeoAdapt significantly boosts place recognition performance across moderate to severe domain shifts, and is competitive with fully supervised test-time adaptation approaches. Our code will be available at https://github.com/csiro-robotics/GeoAdapt.
    摘要 基于深度学习的 LiDAR 场景识别方法在训练集与测试集分布存在偏移时会出现显著的性能下降,通常需要重新训练才能恢复最佳性能。然而,在新环境(尤其是复杂或缺乏 GPS 的环境)中获取准确的真值标注代价高昂。为解决这一问题,我们提出了 GeoAdapt,它引入一个新的辅助分类头,以自监督的方式为未见过的环境生成伪标签用于再训练。GeoAdapt 以几何一致性作为先验,提高所生成伪标签对域偏移的鲁棒性,从而提升测试时自适应方法的性能与可靠性。大量实验表明,GeoAdapt 在中度到重度域偏移下显著提升了场景识别性能,并且可以与完全监督的测试时自适应方法相竞争。我们的代码将在 https://github.com/csiro-robotics/GeoAdapt 上公开。

Rendering Humans from Object-Occluded Monocular Videos

  • paper_url: http://arxiv.org/abs/2308.04622
  • repo_url: https://github.com/tiangexiang/OccNeRF
  • paper_authors: Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, Li Fei-Fei
  • for: 本研究旨在解决三维人体重建和渲染从单目视频中的问题,尤其是在实际场景中,其中可能会有障碍物阻挡相机视野并导致人体部分遮挡。现有方法无法处理这些缺陷,主要是因为标准渲染策略依赖点对点映射,可能导致人体部分的不一致。
  • methods: 我们提出了一种名为OccNeRF的神经渲染方法,可以更好地渲染人体在受到干扰的场景中。我们直接解决了两个缺陷,一是使用点对点映射的标准渲染策略可能导致人体部分的不一致,二是直接 regression approach 不考虑任何可行性条件(即先验信息) для 渲染下遮挡。为解决这两个缺陷,我们提出了基于表面和可见性先验的渲染方法。
  • results: 我们验证了我们的方法在both simulated和实际 occlusions 中的超过人体渲染和渲染效果,并证明了我们的方法的优越性。
    Abstract 3D understanding and rendering of moving humans from monocular videos is a challenging task. Despite recent progress, the task remains difficult in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects due to two reasons. First, the standard rendering strategy relies on point-point mapping, which could lead to dramatic disparities between the visible and occluded areas of the body. Second, the naive direct regression approach does not consider any feasibility criteria (ie, prior information) for rendering under occlusions. To tackle the above drawbacks, we present OccNeRF, a neural rendering method that achieves better rendering of humans in severely occluded scenes. As direct solutions to the two drawbacks, we propose surface-based rendering by integrating geometry and visibility priors. We validate our method on both simulated and real-world occlusions and demonstrate our method's superiority.
    摘要 三维理解和渲染移动人体从单目视频中是一项具有挑战性的任务。尽管最近有所进步,但在真实世界场景中,障碍物可能会阻挡摄像头视野,导致视频中的部分遮挡。现有方法无法处理这些缺陷,主要因两点:首先,标准渲染策略基于点对点映射,可能导致人体部分遮挡和可见部分之间的差异极大。其次,直接回归方法不考虑任何可行性条件(即先验知识),在遮挡下进行渲染。为解决以上缺陷,我们提出OccNeRF方法,实现在严重遮挡场景中更好的人体渲染。为直接解决两个缺陷,我们提议基于几何和可见约束的表面渲染。我们在模拟和实际遮挡场景中验证了我们的方法,并证明其超越性。

PSRFlow: Probabilistic Super Resolution with Flow-Based Models for Scientific Data

  • paper_url: http://arxiv.org/abs/2308.04605
  • repo_url: None
  • paper_authors: Jingyi Shen, Han-Wei Shen
  • for: 这个论文的目的是提出一种基于正则化流的生成模型,用于科学数据超分辨化,并在超分辨化过程中进行不确定性评估。
  • methods: 该模型使用正则化流来学习高分辨度数据的Conditional分布,并通过随机抽取各个维度的GaussianLatent空间来实现不确定性评估。
  • results: 对比其他方法如 interpolate 和 GAN 基本的超分辨化网络,PSRFlow 模型在不确定性评估方面表现出色,并且在不同的数据比例下进行灵活的超分辨化。
    Abstract Although many deep-learning-based super-resolution approaches have been proposed in recent years, because no ground truth is available in the inference stage, few can quantify the errors and uncertainties of the super-resolved results. For scientific visualization applications, however, conveying uncertainties of the results to scientists is crucial to avoid generating misleading or incorrect information. In this paper, we propose PSRFlow, a novel normalizing flow-based generative model for scientific data super-resolution that incorporates uncertainty quantification into the super-resolution process. PSRFlow learns the conditional distribution of the high-resolution data based on the low-resolution counterpart. By sampling from a Gaussian latent space that captures the missing information in the high-resolution data, one can generate different plausible super-resolution outputs. The efficient sampling in the Gaussian latent space allows our model to perform uncertainty quantification for the super-resolved results. During model training, we augment the training data with samples across various scales to make the model adaptable to data of different scales, achieving flexible super-resolution for a given input. Our results demonstrate superior performance and robust uncertainty quantification compared with existing methods such as interpolation and GAN-based super-resolution networks.
    摘要 尽管最近几年内提出了许多深度学习基于超分辨率方法,但由于无法在推理阶段获得测试数据,因此只能很难量化和不确定性的超分辨率结果。在科学视觉应用中,却是非常重要的,通过传递结果的不确定性给科学家,以避免生成错误或不准确的信息。在这篇论文中,我们提出了PSRFlow,一种基于Normalizing Flow的生成模型,用于科学数据超分辨率中的不确定性评估。PSRFlow学习了高分辨率数据的Conditional分布,基于低分辨率数据。通过在Gaussian准则空间中采样,可以生成不同的可能的超分辨率输出。我们的模型可以在Gaussian准则空间中高效采样,从而实现对超分辨率结果的不确定性评估。在模型训练时,我们将训练数据扩展到不同的尺度,使模型适应不同的数据尺度,实现数据的灵活超分辨率。我们的结果表明,PSRFlow比既有 interpolate和GAN基于超分辨率网络的方法具有更高的性能和稳定性。
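The sample-based uncertainty idea can be illustrated with a toy conditional generator standing in for the normalizing flow (this is not PSRFlow itself; the decoder, latent size, and 4x upscaling factor are assumptions): draw several latents, decode several plausible high-resolution fields, and report the per-cell spread as uncertainty.

```python
import torch
import torch.nn as nn

class ToyConditionalDecoder(nn.Module):
    """Stand-in for a conditional generative SR model (not a real normalizing
    flow): maps (low-res field, latent z) to a 4x super-resolved field."""
    def __init__(self, z_dim=8):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Conv2d(1 + z_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, 1), nn.PixelShuffle(4),  # 16 channels -> 1 channel, 4x size
        )
    def forward(self, lr, z):
        zmap = z.view(z.shape[0], self.z_dim, 1, 1).expand(-1, -1, *lr.shape[-2:])
        return self.net(torch.cat([lr, zmap], dim=1))

@torch.no_grad()
def super_resolve_with_uncertainty(model, lr, n_samples=16):
    outs = torch.stack([model(lr, torch.randn(lr.shape[0], model.z_dim))
                        for _ in range(n_samples)])
    return outs.mean(dim=0), outs.std(dim=0)   # prediction and per-cell uncertainty

model = ToyConditionalDecoder()
mean_hr, std_hr = super_resolve_with_uncertainty(model, torch.randn(1, 1, 16, 16))
print(mean_hr.shape, std_hr.shape)  # torch.Size([1, 1, 64, 64]) for both
```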

1st Place Solution for CVPR2023 BURST Long Tail and Open World Challenges

  • paper_url: http://arxiv.org/abs/2308.04598
  • repo_url: None
  • paper_authors: Kaer Huang
  • for: The paper addresses video instance segmentation (VIS) in long-tailed and open-world scenarios.
  • methods: The authors train on a combination of LVISv0.5 and the COCO dataset with repeat factor sampling: the detector with segmentation and CEM is trained on LVISv0.5 + COCO, and the instance appearance similarity head is then trained on the TAO dataset.
  • results: The method achieves 14.9 HOTAall on the BURST test set and 61.4 OWTAall in the open-world challenge, ranking 1st on both benchmarks.
    Abstract Currently, Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories that contain only a few dozen of categories, lacking the ability to handle diverse objects in real-world videos. As TAO and BURST datasets release, we have the opportunity to research VIS in long-tailed and open-world scenarios. Traditional VIS methods are evaluated on benchmarks limited to a small number of common classes, But practical applications require trackers that go beyond these common classes, detecting and tracking rare and even never-before-seen objects. Inspired by the latest MOT paper for the long tail task (Tracking Every Thing in the Wild, Siyuan Li et), for the BURST long tail challenge, we train our model on a combination of LVISv0.5 and the COCO dataset using repeat factor sampling. First, train the detector with segmentation and CEM on LVISv0.5 + COCO dataset. And then, train the instance appearance similarity head on the TAO dataset. at last, our method (LeTracker) gets 14.9 HOTAall in the BURST test set, ranking 1st in the benchmark. for the open-world challenges, we only use 64 classes (Intersection classes of BURST Train subset and COCO dataset, without LVIS dataset) annotations data training, and testing on BURST test set data and get 61.4 OWTAall, ranking 1st in the benchmark. Our code will be released to facilitate future research.
    摘要 当前,视频实例分割(VIS)目标是将视频中的对象分类和分割,但现有的方法仅能处理固定的训练类别,无法涵盖实际世界中的多样化对象。随着TAO和BURST数据集的发布,我们有机会进行VIS在长尾和开放世界enario中的研究。传统的VIS方法通常被评估在限制于一些常见类别的benchmark上,但实际应用需要满足更多的类别,检测和跟踪 rare和even never-before-seen对象。受latest MOT论文的长尾任务(Tracking Every Thing in the Wild,Siyuan Li et al)的启发,我们在BURST长尾挑战中使用repeat factor sampling训练我们的模型。首先,我们使用LVISv0.5和COCO数据集训练探测器的segmentation和CEM。然后,我们在TAO数据集上训练实例外观相似度头。最后,我们的方法(LeTracker)在BURST测试集上得到14.9 HOTAall,排名第一名。对于开放世界挑战,我们只使用64个类别(BURST训练集和COCO数据集的交集类别,不包括LVIS数据集)的注释数据训练,并在BURST测试集上进行测试,得到61.4 OWTAall,排名第一名。我们的代码将被释出,以便未来的研究。
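The repeat factor sampling mentioned above follows the well-known LVIS rule: each category gets a repeat factor r(c) = max(1, sqrt(t / f(c))), where f(c) is the fraction of training images containing category c and t is a frequency threshold, and each image repeats according to the largest factor among its categories. A small sketch (the dataset format is an assumption):

```python
import math
from collections import defaultdict

def repeat_factors(image_to_cats, t=0.001):
    """image_to_cats: dict image_id -> set of category ids present in it.
    Returns per-image repeat factors r(I) = max_{c in I} max(1, sqrt(t / f(c)))."""
    n_images = len(image_to_cats)
    cat_freq = defaultdict(int)
    for cats in image_to_cats.values():
        for c in cats:
            cat_freq[c] += 1
    cat_rep = {c: max(1.0, math.sqrt(t / (n / n_images)))
               for c, n in cat_freq.items()}
    return {img: max(cat_rep[c] for c in cats) if cats else 1.0
            for img, cats in image_to_cats.items()}

# Tiny example: category 99 is rare, so the image containing it is oversampled.
toy = {0: {1}, 1: {1}, 2: {1, 99}, 3: {1}}
print(repeat_factors(toy, t=0.5))  # image 2 gets a factor of about 1.41
```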

LATR: 3D Lane Detection from Monocular Images with Transformer

  • paper_url: http://arxiv.org/abs/2308.04583
  • repo_url: https://github.com/jmoonr/latr
  • paper_authors: Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, Zhen Li
  • for: 本研究旨在解决自动驾驶中3D车道检测问题,具体来说是从单视图图像中检测3D车道。
  • methods: 本研究使用了3D感知前视图特征,而不是基于转换后的视图表示。 Specifically, LATR使用了跨注意力的查询和键值对来检测3D车道,其中查询基于2D车道感知特征,并采用混合嵌入来增强车道信息。另一方面,3D空间信息通过位坐标嵌入从iteratively更新的3D地面来注入。
  • results: LATR在synthetic Apollo、realistic OpenLane和ONCE-3DLanes等数据集上表现出优于之前的状态 искусственный方法(例如,OpenLane上的F1分数提高11.4个)。
    Abstract 3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances primarily rely on structural 3D surrogates (e.g., bird's eye view) built from front-view image features and camera parameters. However, the depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, posing a great challenge for accurate lane detection. To address the above issue, we present a novel LATR model, an end-to-end 3D lane detector that uses 3D-aware front-view features without transformed view representation. Specifically, LATR detects 3D lanes via cross-attention based on query and key-value pairs, constructed using our lane-aware query generator and dynamic 3D ground positional embedding. On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance lane information. On the other hand, 3D space information is injected as positional embedding from an iteratively-updated 3D ground plane. LATR outperforms previous state-of-the-art methods on both synthetic Apollo, realistic OpenLane and ONCE-3DLanes by large margins (e.g., 11.4 gain in terms of F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR .
    摘要 三维车道检测从单视图图像是自主驾驶中的基本 yet 挑战性任务。当前的进步主要基于结构三维代理(如鸟瞰视图),从前视图图像特征和摄像头参数构建。然而,单视图图像中的深度不确定性无法准确对应原始图像,这对准确车道检测 pose 大问题。为解决上述问题,我们提出了一种新的LATR模型,一个端到端的三维车道检测器,使用三维感知的前视图特征而不需要转换视图表示。具体来说,LATR通过对查询和关键值对进行跨注意力的注意力机制来检测三维车道。一方面,每个查询基于二维车道意识特征,采用混合嵌入以增强车道信息。另一方面,3D空间信息通过循环更新的3D地面嵌入注入到扩展特征中。LATR在 Apollo 和 OpenLane 等实际数据集上的性能明显超过了之前的状态对照方法(例如,OpenLane 上的 F1 分数提高11.4)。代码将在 GitHub 上发布。

Optimizing Algorithms From Pairwise User Preferences

  • paper_url: http://arxiv.org/abs/2308.04571
  • repo_url: https://github.com/leonidk/pairwise
  • paper_authors: Leonid Keselman, Katherine Shih, Martial Hebert, Aaron Steinfeld
  • for: 本研究旨在优化机器人算法参数配置,以便更好地适应人类中心的情境。
  • methods: 本研究提出了 SortCMA 算法,通过对用户喜好进行排序,以不直接模型奖励函数的方式,高效地和稳定地优化算法参数。
  • results: 研究中应用 SortCMA 算法成功地优化了无地平 truth 的深度探测仪器和人工社交导航问题,并进行了用户研究以评估社交导航结果。
    Abstract Typical black-box optimization approaches in robotics focus on learning from metric scores. However, that is not always possible, as not all developers have ground truth available. Learning appropriate robot behavior in human-centric contexts often requires querying users, who typically cannot provide precise metric scores. Existing approaches leverage human feedback in an attempt to model an implicit reward function; however, this reward may be difficult or impossible to effectively capture. In this work, we introduce SortCMA to optimize algorithm parameter configurations in high dimensions based on pairwise user preferences. SortCMA efficiently and robustly leverages user input to find parameter sets without directly modeling a reward. We apply this method to tuning a commercial depth sensor without ground truth, and to robot social navigation, which involves highly complex preferences over robot behavior. We show that our method succeeds in optimizing for the user's goals and perform a user study to evaluate social navigation results.
    摘要 传统的黑盒优化方法在机器人学中通常是通过学习度量分数来进行。但是,不 все开发者拥有地面 truth,而学习合适的机器人行为在人类中心的上下文中经常需要询问用户,用户通常无法提供精确的度量分数。现有方法利用用户反馈来尝试模型一个隐式奖励函数,但这个奖励可能很难或者无法有效地捕捉。在这项工作中,我们介绍 SortCMA 来优化算法参数配置在高维度基于用户偏好的情况下。SortCMA 能够高效地和稳定地利用用户输入来找到参数集,而不需直接模型一个奖励。我们在没有地面 truth 的情况下使用 SortCMA 来调整一个商业深度探测器,以及机器人社交导航,这些导航包括机器人行为的复杂偏好。我们示示了我们的方法可以在用户的目标下进行优化,并进行了用户研究来评估社交导航结果。
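As an illustrative aside, the sketch below shows how parameters can be optimized from pairwise preferences alone, using win counts as the only ranking signal in a simple (mu, lambda)-style evolutionary step. This is a simplified stand-in, not the authors' SortCMA; the toy preference oracle, population sizes, and update rule are assumptions.

```python
import numpy as np

def user_prefers(a, b, target=np.array([0.3, -1.2])):
    """Stand-in for a human judgment: prefers the candidate closer to a hidden
    target. A real system would show the two outputs to a user instead."""
    return np.linalg.norm(a - target) < np.linalg.norm(b - target)

def pairwise_rank_step(mean, sigma, n_pop=8, n_elite=3, rng=np.random):
    """One (mu, lambda)-style update driven purely by pairwise preferences."""
    pop = mean + sigma * rng.randn(n_pop, mean.size)
    wins = np.zeros(n_pop)
    for i in range(n_pop):
        for j in range(i + 1, n_pop):
            if user_prefers(pop[i], pop[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    elite = pop[np.argsort(-wins)[:n_elite]]   # rank candidates by preference wins only
    return elite.mean(axis=0)

mean = np.zeros(2)
for _ in range(30):
    mean = pairwise_rank_step(mean, sigma=0.5)
print("estimated optimum:", mean)   # approaches the hidden target without metric scores
```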

FocalFormer3D : Focusing on Hard Instance for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.04556
  • repo_url: https://github.com/NVlabs/FocalFormer3D
  • paper_authors: Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, Animashree Anandkumar, Jiaya Jia, Jose Alvarez
  • for: 减少3D目标检测中的漏检(False Negative),例如漏检行人、车辆或其他障碍物,这类漏检在自动驾驶场景中可能导致危险情况。
  • methods: 提出了一种通用的管道方法 Hard Instance Probing (HIP),可以在多个阶段进行缺失预测的识别和提高模型的预测精度。在3D对象检测方面,我们实现了这个方法为FocalFormer3D,一种简单又高效的检测器,能够高效地检测难对象并提高预测精度。FocalFormer3D使用多个阶段生成查询来找到困难对象,并使用箱级变换器解码器来快速分辨对象和庞大对象候选。
  • results: 在nuScenes和Waymo datasets上进行实验, validate FocalFormer3D的超越性性能。FocalFormer3D在LiDAR和多模态设置下都具有出色的检测和跟踪性能,其中 nuScenes检测benchmark中的70.5 mAP和73.9 NDS都在1位的LiDAR领导板块上。
    Abstract False negatives (FN) in 3D object detection, {\em e.g.}, missing predictions of pedestrians, vehicles, or other obstacles, can lead to potentially dangerous situations in autonomous driving. While being fatal, this issue is understudied in many current 3D detection methods. In this work, we propose Hard Instance Probing (HIP), a general pipeline that identifies \textit{FN} in a multi-stage manner and guides the models to focus on excavating difficult instances. For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall. FocalFormer3D features a multi-stage query generation to discover hard objects and a box-level transformer decoder to efficiently distinguish objects from massive object candidates. Experimental results on the nuScenes and Waymo datasets validate the superior performance of FocalFormer3D. The advantage leads to strong performance on both detection and tracking, in both LiDAR and multi-modal settings. Notably, FocalFormer3D achieves a 70.5 mAP and 73.9 NDS on nuScenes detection benchmark, while the nuScenes tracking benchmark shows 72.1 AMOTA, both ranking 1st place on the nuScenes LiDAR leaderboard. Our code is available at \url{https://github.com/NVlabs/FocalFormer3D}.
    摘要 3D目标检测中的假阴性(FN),例如漏检行人、车辆或其他障碍物,可能在自动驾驶中导致潜在的危险情况。尽管后果严重,这一问题在许多现有的3D检测方法中仍未得到充分研究。在这项工作中,我们提出了 Hard Instance Probing(HIP),这是一种以多阶段方式识别假阴性、并引导模型专注于挖掘困难实例的通用流程。针对3D目标检测,我们将该方法实例化为 FocalFormer3D,一种简单而有效的检测器,擅长挖掘困难目标并提升预测召回率。FocalFormer3D 包含用于发现困难目标的多阶段查询生成,以及用于从大量候选目标中高效区分真实目标的框级 Transformer 解码器。在 nuScenes 和 Waymo 数据集上的实验验证了 FocalFormer3D 的优越性能:无论在 LiDAR 还是多模态设置下,它在检测与跟踪上都表现出色。值得一提的是,FocalFormer3D 在 nuScenes 检测基准上取得 70.5 mAP 和 73.9 NDS,在 nuScenes 跟踪基准上取得 72.1 AMOTA,均位列 nuScenes LiDAR 榜单第一。我们的代码可在 https://github.com/NVlabs/FocalFormer3D 获取。

Towards Automatic Scoring of Spinal X-ray for Ankylosing Spondylitis

  • paper_url: http://arxiv.org/abs/2308.05123
  • repo_url: None
  • paper_authors: Yuanhan Mo, Yao Chen, Aimee Readie, Gregory Ligozio, Thibaud Coroller, Bartłomiej W. Papież
  • for: 这份研究旨在开发一个自动分配 modified Stoke Ankylosing Spondylitis Spinal Score (mSASSS) 的运算流程,以便对骨架X射像中的脊椎影像进行自动评分。
  • methods: 这个研究使用了一个2步骤的自动评分管线,称为VertXGradeNet,将骨架X射像中的脊椎影像转换为 modified Stoke Ankylosing Spondylitis Spinal Score (mSASSS) 分数。
  • results: 研究结果显示,VertXGradeNet 可以对有限量和不均匀的数据进行自动评分,并且在两个试验数据集上实现了0.56和0.51的平衡精度。
    Abstract Manually grading structural changes with the modified Stoke Ankylosing Spondylitis Spinal Score (mSASSS) on spinal X-ray imaging is costly and time-consuming due to bone shape complexity and image quality variations. In this study, we address this challenge by prototyping a 2-step auto-grading pipeline, called VertXGradeNet, to automatically predict mSASSS scores for the cervical and lumbar vertebral units (VUs) in X-ray spinal imaging. The VertXGradeNet utilizes VUs generated by our previously developed VU extraction pipeline (VertXNet) as input and predicts mSASSS based on those VUs. VertXGradeNet was evaluated on an in-house dataset of lateral cervical and lumbar X-ray images for axial spondylarthritis patients. Our results show that VertXGradeNet can predict the mSASSS score for each VU when the data is limited in quantity and imbalanced. Overall, it can achieve a balanced accuracy of 0.56 and 0.51 for 4 different mSASSS scores (i.e., a score of 0, 1, 2, 3) on two test datasets. The accuracy of the presented method shows the potential to streamline the spinal radiograph readings and therefore reduce the cost of future clinical trials.
    摘要 使用改良 Stoke 强直性脊柱炎脊柱评分(mSASSS)在脊柱X光影像上人工评定结构改变既昂贵又耗时,原因在于骨骼形态复杂且图像质量差异较大。在本研究中,我们构建了一个两步自动评分流程 VertXGradeNet,用于自动预测X光脊柱影像中颈椎和腰椎椎体单元(VU)的 mSASSS 评分。VertXGradeNet 以我们先前开发的 VU 提取流程(VertXNet)生成的 VU 作为输入,并基于这些 VU 预测 mSASSS。我们在一个内部的轴性脊柱关节炎患者侧位颈椎和腰椎X光数据集上评估了 VertXGradeNet。结果显示,即使数据量有限且分布不均衡,VertXGradeNet 也能预测每个 VU 的 mSASSS 评分:在两个测试集上,对四个不同的 mSASSS 等级(0、1、2、3 分)分别取得 0.56 和 0.51 的平衡准确率。该方法的准确率表明其有望简化脊柱X光片的阅片流程,从而降低未来临床试验的成本。

Copy Number Variation Informs fMRI-based Prediction of Autism Spectrum Disorder

  • paper_url: http://arxiv.org/abs/2308.05122
  • repo_url: None
  • paper_authors: Nicha C. Dvornek, Catherine Sullivan, James S. Duncan, Abha R. Gupta
  • for: This paper aims to develop a more integrative model for combining genetic, demographic, and neuroimaging data to better understand the multifactorial etiology of autism spectrum disorder (ASD).
  • methods: The proposed approach uses an attention-based model that guides attention to neuroimaging features of importance for model prediction based on genetic data derived from copy number variation parameters.
  • results: The attention-based model combining genetic information, demographic data, and functional magnetic resonance imaging results in superior prediction performance compared to other multimodal approaches, as demonstrated on ASD classification and severity prediction tasks using a sex-balanced dataset of 228 ASD and typically developing subjects.
    Abstract The multifactorial etiology of autism spectrum disorder (ASD) suggests that its study would benefit greatly from multimodal approaches that combine data from widely varying platforms, e.g., neuroimaging, genetics, and clinical characterization. Prior neuroimaging-genetic analyses often apply naive feature concatenation approaches in data-driven work or use the findings from one modality to guide posthoc analysis of another, missing the opportunity to analyze the paired multimodal data in a truly unified approach. In this paper, we develop a more integrative model for combining genetic, demographic, and neuroimaging data. Inspired by the influence of genotype on phenotype, we propose using an attention-based approach where the genetic data guides attention to neuroimaging features of importance for model prediction. The genetic data is derived from copy number variation parameters, while the neuroimaging data is from functional magnetic resonance imaging. We evaluate the proposed approach on ASD classification and severity prediction tasks, using a sex-balanced dataset of 228 ASD and typically developing subjects in a 10-fold cross-validation framework. We demonstrate that our attention-based model combining genetic information, demographic data, and functional magnetic resonance imaging results in superior prediction performance compared to other multimodal approaches.
    摘要 自闭症谱系障碍(ASD)的多因素病因表明,其研究将大大受益于整合多种来源数据的多模态方法,例如神经影像、遗传学和临床表征数据。以往的神经影像与遗传学分析往往在数据驱动的工作中采用简单的特征拼接,或用一种模态的发现来指导另一种模态的事后分析,从而错失了以真正统一的方式分析成对多模态数据的机会。在本文中,我们提出了一种更具整合性的模型,用于结合遗传、人口统计学和神经影像数据。受基因型对表型影响的启发,我们提出一种基于注意力的方法,让遗传数据引导模型关注对预测重要的神经影像特征。遗传数据来自拷贝数变异(CNV)参数,神经影像数据来自功能磁共振成像(fMRI)。我们在一个性别均衡、包含 228 名 ASD 与典型发育被试的数据集上,采用 10 折交叉验证评估所提方法在 ASD 分类和严重程度预测任务上的表现。结果表明,结合遗传信息、人口统计学数据和功能磁共振成像的注意力模型,其预测性能优于其他多模态方法。
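A toy sketch of the general idea of letting one modality guide attention over another (not the paper's architecture; region count, feature sizes, and the pooling scheme are assumptions): CNV-derived features produce attention weights over regional fMRI features, and the attended features drive the prediction.

```python
import torch
import torch.nn as nn

class GeneticGuidedAttention(nn.Module):
    """Toy cross-modal model: genetic (CNV-derived) features produce attention
    weights over regional fMRI features; the attended features are classified."""
    def __init__(self, n_regions=116, fmri_dim=32, gene_dim=16):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(gene_dim, n_regions), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(fmri_dim, 2)   # e.g. ASD vs. typically developing

    def forward(self, fmri_feats, gene_feats):
        # fmri_feats: (B, n_regions, fmri_dim), gene_feats: (B, gene_dim)
        w = self.attn(gene_feats).unsqueeze(-1)     # (B, n_regions, 1) attention weights
        pooled = (w * fmri_feats).sum(dim=1)        # attention-weighted pooling
        return self.classifier(pooled), w.squeeze(-1)

model = GeneticGuidedAttention()
logits, attn = model(torch.randn(4, 116, 32), torch.randn(4, 16))
print(logits.shape, attn.shape)   # torch.Size([4, 2]) torch.Size([4, 116])
```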

From Fake to Real (FFR): A two-stage training pipeline for mitigating spurious correlations with synthetic data

  • paper_url: http://arxiv.org/abs/2308.04553
  • repo_url: None
  • paper_authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer
  • for: 减少图像识别模型学到的虚假关联(偏见),这类偏见源于训练集中某些群体(如女性)在某些类别(如程序员)中代表性不足。
  • methods: 使用生成模型为代表性不足的少数样本生成合成数据来平衡训练集,从而减少模型学到的偏见。
  • results: 提出了一种两阶段管道:先在平衡的合成数据集上预训练模型,再在真实数据上微调。这样可以避免同时在真实数据和合成数据上训练,从而规避真实与合成数据之间产生的虚假关联;同时第一阶段学到的对偏见更鲁棒的特征也有助于在第二阶段缓解偏见。此外,该管道可以自然地与现有的偏见缓解方法结合,这些方法只需应用于微调阶段。实验表明,该管道能进一步提升偏见缓解方法的性能,在三个大规模数据集上达到最先进水平。
    Abstract Visual recognition models are prone to learning spurious correlations induced by an imbalanced training set where certain groups (\eg Females) are under-represented in certain classes (\eg Programmers). Generative models offer a promising direction in mitigating this bias by generating synthetic data for the minority samples and thus balancing the training set. However, prior work that uses these approaches overlooks that visual recognition models could often learn to differentiate between real and synthetic images and thus fail to unlearn the bias in the original dataset. In our work, we propose a novel two-stage pipeline to mitigate this issue where 1) we pre-train a model on a balanced synthetic dataset and then 2) fine-tune on the real data. Using this pipeline, we avoid training on both real and synthetic data, thus avoiding the bias between real and synthetic data. Moreover, we learn robust features against the bias in the first step that mitigate the bias in the second step. Moreover, our pipeline naturally integrates with bias mitigation methods; they can be simply applied to the fine-tuning step. As our experiments prove, our pipeline can further improve the performance of bias mitigation methods obtaining state-of-the-art performance on three large-scale datasets.
    摘要 视觉识别模型容易学到由不均衡训练集引入的虚假关联,例如某些群体(如女性)在某些类别(如程序员)中代表性不足。生成模型提供了一个有前景的方向:为少数样本生成合成数据,从而平衡训练集。然而,先前使用这类方法的工作忽略了一个问题:视觉识别模型往往能学会区分真实图像与合成图像,因而无法消除原始数据集中的偏见。在本工作中,我们提出一种新的两阶段管道来缓解这一问题:1)先在平衡的合成数据集上预训练模型;2)再在真实数据上微调。借助该管道,我们避免了同时在真实与合成数据上训练,从而规避真实数据与合成数据之间的偏差;并且在第一阶段学到的对偏见鲁棒的特征有助于在第二阶段缓解偏见。此外,我们的管道可以自然地与偏见缓解方法集成,只需将它们应用于微调阶段。实验证明,我们的管道能进一步提升偏见缓解方法的性能,在三个大规模数据集上取得最先进的表现。

Improving Medical Image Classification in Noisy Labels Using Only Self-supervised Pretraining

  • paper_url: http://arxiv.org/abs/2308.04551
  • repo_url: https://github.com/bbrattoli/JigsawPuzzlePytorch
  • paper_authors: Bidur Khanal, Binod Bhattarai, Bishesh Khanal, Cristian A. Linte
  • for: 这个研究旨在测试自我超vised学习初始化方法可以改善随机标签影像分类性能。
  • methods: 研究使用了两种自我超vised学习方法:contrastive自我超vised学习和预texte任务基于的自我超vised学习。
  • results: 研究发现,使用自我超vised学习初始化的模型可以更好地学习随机标签影像,并提高分类性能。
    Abstract Noisy labels hurt deep learning-based supervised image classification performance as the models may overfit the noise and learn corrupted feature extractors. For natural image classification training with noisy labeled data, model initialization with contrastive self-supervised pretrained weights has shown to reduce feature corruption and improve classification performance. However, no works have explored: i) how other self-supervised approaches, such as pretext task-based pretraining, impact the learning with noisy label, and ii) any self-supervised pretraining methods alone for medical images in noisy label settings. Medical images often feature smaller datasets and subtle inter class variations, requiring human expertise to ensure correct classification. Thus, it is not clear if the methods improving learning with noisy labels in natural image datasets such as CIFAR would also help with medical images. In this work, we explore contrastive and pretext task-based self-supervised pretraining to initialize the weights of a deep learning classification model for two medical datasets with self-induced noisy labels -- NCT-CRC-HE-100K tissue histological images and COVID-QU-Ex chest X-ray images. Our results show that models initialized with pretrained weights obtained from self-supervised learning can effectively learn better features and improve robustness against noisy labels.
    摘要 噪声标签会对深度学习基于监督图像分类的性能产生负面影响,因为模型可能会适应噪声并学习损坏的特征提取器。在自然图像分类训练中使用噪声标签数据,使用对比自我超vised预训练的初始化方法可以减少特征损坏并提高分类性能。然而,没有任何研究探讨了:i) 其他自我超视任务基本预训练方法对噪声标签学习的影响,ii) 任何自我超视任务基本预训练方法在医学图像上的效果。医学图像通常具有较小的数据集和柔微的间类差异,需要人类专业知识来确保正确的分类。因此,不清楚自然图像数据集CIFAR中的方法会在医学图像上有效。在这项工作中,我们探讨了对比和预 Text Task-based自我超视预训练来初始化深度学习分类模型的两个医学数据集——NCT-CRC-HE-100K组织组织肿瘤图像和COVID-QU-Ex胸部X射图像。我们的结果表明,使用自我超视预训练获得的预 initialize的模型可以更好地学习特征并提高对噪声标签的Robustness。

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

  • paper_url: http://arxiv.org/abs/2308.04549
  • repo_url: None
  • paper_authors: Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong, Qi Tian
  • for: 提高视频识别领域中Transformers的速度-准确性贸易 balance。
  • methods: 提出Semantic-aware Temporal Accumulation score(STA)模块,通过考虑两个关键因素: temporal redundancy和semantic importance,进行token权重调整和减少。
  • results: 在Kinetics-400和Something-Something V2 datasets上,使用STA模块可以实现约30%的计算减少,而准确率下降幅度只有0.2%。
    Abstract Transformers have become the primary backbone of the computer vision community due to their impressive performance. However, the unfriendly computation cost impedes their potential in the video recognition domain. To optimize the speed-accuracy trade-off, we propose Semantic-aware Temporal Accumulation score (STA) to prune spatio-temporal tokens integrally. STA score considers two critical factors: temporal redundancy and semantic importance. The former depicts a specific region based on whether it is a new occurrence or a seen entity by aggregating token-to-token similarity in consecutive frames while the latter evaluates each token based on its contribution to the overall prediction. As a result, tokens with higher scores of STA carry more temporal redundancy as well as lower semantics thus being pruned. Based on the STA score, we are able to progressively prune the tokens without introducing any additional parameters or requiring further re-training. We directly apply the STA module to off-the-shelf ViT and VideoSwin backbones, and the empirical results on Kinetics-400 and Something-Something V2 achieve over 30% computation reduction with a negligible ~0.2% accuracy drop. The code is released at https://github.com/Mark12Ding/STA.
    摘要 凭借出色的性能,Transformer 已成为计算机视觉领域的主要骨干网络,但高昂的计算开销限制了其在视频识别领域的潜力。为了优化速度与精度之间的权衡,我们提出语义感知的时序累积分数(Semantic-aware Temporal Accumulation score,STA),以整体方式裁剪时空 token。STA 分数考虑两个关键因素:时间冗余与语义重要性。前者通过聚合连续帧之间 token 与 token 的相似度,刻画某一区域是新出现的内容还是已见过的实体;后者根据每个 token 对整体预测的贡献进行评估。因此,STA 分数较高的 token 具有更高的时间冗余和更低的语义重要性,将被裁剪掉。基于 STA 分数,我们可以逐步裁剪 token,而无需引入额外参数或重新训练。我们将 STA 模块直接应用于现成的 ViT 和 VideoSwin 骨干网络,在 Kinetics-400 和 Something-Something V2 上的实验结果表明,计算量减少约 30%,而精度仅下降约 0.2%。代码已发布于 https://github.com/Mark12Ding/STA。
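A rough sketch of the kind of scoring STA performs (not the paper's exact formulation): combine a temporal-redundancy term, here the cosine similarity with the co-located token in the previous frame, with a semantic-importance term, here the CLS attention, and prune the most redundant, least important tokens per frame. Shapes and the linear combination are assumptions.

```python
import torch
import torch.nn.functional as F

def prune_spatiotemporal_tokens(tokens, cls_attn, keep_ratio=0.7):
    """tokens:   (T, N, D) patch tokens per frame (CLS excluded)
       cls_attn: (T, N)    attention from CLS to each patch token
    Scores each token as temporal redundancy minus semantic importance, then
    drops the worst tokens per frame (frame 0 has zero redundancy, so it is
    ranked by importance alone)."""
    T, N, D = tokens.shape
    redundancy = torch.zeros(T, N)
    redundancy[1:] = F.cosine_similarity(tokens[1:], tokens[:-1], dim=-1)
    importance = cls_attn / cls_attn.sum(dim=1, keepdim=True)
    score = redundancy - importance               # high = redundant and unimportant
    k = int(N * keep_ratio)
    keep_idx = (-score).topk(k, dim=1).indices    # keep the lowest-score tokens
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx

tokens = torch.randn(8, 196, 384)     # 8 frames of ViT-S patch tokens
cls_attn = torch.rand(8, 196)
kept, idx = prune_spatiotemporal_tokens(tokens, cls_attn, keep_ratio=0.5)
print(kept.shape)                      # torch.Size([8, 98, 384])
```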

YUDO: YOLO for Uniform Directed Object Detection

  • paper_url: http://arxiv.org/abs/2308.04542
  • repo_url: https://github.com/djordjened92/yudo
  • paper_authors: Đorđe Nedeljković
  • for: 本文提出了一种高效的对象探测方法,通过预测对象的中心坐标和方向角来实现。由于对象均匀大,该模型不需要预测对象的宽高。
  • methods: 该方法使用了YoloV7实时对象检测架构,并对其进行了定制。该方法使用了一个非常高效的、小巧的版本,并且只使用了一个无锚头的检测头。
  • results: 该方法可以准确地检测对象的位置和方向,并且可以快速地进行计算。 authors还引入了扩展的Skew Intersection over Union (SkewIoU)计算方法,用于处理旋转的盒子。
    Abstract This paper presents an efficient way of detecting directed objects by predicting their center coordinates and direction angle. Since the objects are of uniform size, the proposed model works without predicting the object's width and height. The dataset used for this problem is presented in Honeybee Segmentation and Tracking Datasets project. One of the contributions of this work is an examination of the ability of the standard real-time object detection architecture like YoloV7 to be customized for position and direction detection. A very efficient, tiny version of the architecture is used in this approach. Moreover, only one of three detection heads without anchors is sufficient for this task. We also introduce the extended Skew Intersection over Union (SkewIoU) calculation for rotated boxes - directed IoU (DirIoU), which includes an absolute angle difference. DirIoU is used both in the matching procedure of target and predicted bounding boxes for mAP calculation, and in the NMS filtering procedure. The code and models are available at https://github.com/djordjened92/yudo.
    摘要 本文提出了一种通过预测目标中心坐标和方向角来高效检测有向目标的方法。由于目标尺寸一致,所提模型无需预测目标的宽和高。该问题所用的数据集来自 Honeybee Segmentation and Tracking Datasets 项目。本工作的贡献之一是考察了像 YoloV7 这样的标准实时目标检测架构能否被定制用于位置和方向检测:我们采用了该架构一个非常高效的小型版本,且仅需一个无锚框的检测头即可完成此任务。我们还为旋转框引入了扩展的斜交并比(SkewIoU)计算,即包含绝对角度差的有向交并比(DirIoU)。DirIoU 同时用于 mAP 计算中目标框与预测框的匹配过程,以及 NMS 过滤过程。代码和模型见 https://github.com/djordjened92/yudo。
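A plausible sketch of a rotated-box IoU with an absolute-angle penalty (the exact DirIoU definition is given in the paper; the multiplicative penalty form and the fixed box size below are assumptions consistent with YUDO's uniform-size setting), using shapely for the polygon intersection:

```python
import math
from shapely.geometry import Polygon

def rotated_box(cx, cy, w, h, theta):
    """Corners of a w x h rectangle centred at (cx, cy), rotated by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(cx + x * c - y * s, cy + x * s + y * c) for x, y in pts])

def dir_iou(pred, gt, w=20.0, h=12.0):
    """pred/gt = (cx, cy, theta). Width/height are fixed because the objects are
    uniform-sized; the angle-penalty form below is an illustrative assumption."""
    p, g = rotated_box(*pred[:2], w, h, pred[2]), rotated_box(*gt[:2], w, h, gt[2])
    inter = p.intersection(g).area
    skew_iou = inter / (p.area + g.area - inter)
    d_theta = abs(pred[2] - gt[2]) % (2 * math.pi)
    d_theta = min(d_theta, 2 * math.pi - d_theta)     # absolute angle difference
    return skew_iou * (1.0 - d_theta / math.pi)

print(dir_iou((50, 50, 0.1), (50, 50, 0.1)))   # 1.0: identical pose
print(dir_iou((50, 50, 0.1), (52, 51, 0.6)))   # penalised for offset and angle error
```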

Facial Prior Based First Order Motion Model for Micro-expression Generation

  • paper_url: http://arxiv.org/abs/2308.04536
  • repo_url: https://github.com/necolizer/facial-prior-based-fomm
  • paper_authors: Yi Zhang, Youjun Zhao, Yuhang Wen, Zixuan Tang, Xinhua Xu, Mengyuan Liu
  • for: 这篇论文旨在提出一个新的任务:从视频中检测和生成微表情,以应对对于脸部表情识别和探索等领域的应用。
  • methods: 本文提出了一个新的模型,具有三个模块:首先,我们从视频中提取脸部优先知识特征;其次,我们使用关键点和本地拟合变数估算脸部运动;最后,我们使用表情生成模块将目标脸部驱动生成微表情视频。
  • results: 本文使用公共的CASME II、SAMM和SMIC datasets进行训练,并使用模型生成新的微表情视频进行评估。模型在Facial Micro-Expression Challenge 2021(MEGC2021)中获得了第一名,并被三位专家认证为Facial Action Coding System认证。
    Abstract Spotting facial micro-expression from videos finds various potential applications in fields including clinical diagnosis and interrogation, meanwhile this task is still difficult due to the limited scale of training data. To solve this problem, this paper tries to formulate a new task called micro-expression generation and then presents a strong baseline which combines the first order motion model with facial prior knowledge. Given a target face, we intend to drive the face to generate micro-expression videos according to the motion patterns of source videos. Specifically, our new model involves three modules. First, we extract facial prior features from a region focusing module. Second, we estimate facial motion using key points and local affine transformations with a motion prediction module. Third, expression generation module is used to drive the target face to generate videos. We train our model on public CASME II, SAMM and SMIC datasets and then use the model to generate new micro-expression videos for evaluation. Our model achieves the first place in the Facial Micro-Expression Challenge 2021 (MEGC2021), where our superior performance is verified by three experts with Facial Action Coding System certification. Source code is provided in https://github.com/Necolizer/Facial-Prior-Based-FOMM.
    摘要 从视频中识别面部微表情在临床诊断和审讯等领域有广泛的潜在应用,但由于训练数据规模有限,这一任务仍然十分困难。为解决这个问题,本文提出了一个新的任务,即微表情生成(micro-expression generation),并给出一个强大的基线模型,它结合了一阶运动模型与面部先验知识。给定一张目标人脸,我们希望根据源视频的运动模式驱动该人脸生成微表情视频。具体而言,我们的新模型包含三个模块:首先,通过区域聚焦模块提取面部先验特征;其次,利用关键点和局部仿射变换,通过运动预测模块估计面部运动;最后,利用表情生成模块驱动目标人脸生成视频。我们在公开的 CASME II、SAMM 和 SMIC 数据集上训练模型,并用其生成新的微表情视频进行评估。我们的模型在 Facial Micro-Expression Challenge 2021(MEGC2021)中获得第一名,其优异表现得到了三位持有面部动作编码系统(FACS)认证的专家的验证。源代码见 https://github.com/Necolizer/Facial-Prior-Based-FOMM。

Estimation of Human Condition at Disaster Site Using Aerial Drone Images

  • paper_url: http://arxiv.org/abs/2308.04535
  • repo_url: None
  • paper_authors: Tomoki Arai, Kenji Iwata, Kensho Hara, Yutaka Satoh
  • for: 这个研究是为了快速理解灾难现场和减少劳动力。
  • methods: 这个研究使用了人体动作在空中无人机图像中自动估计人员受损状况,并使用3D ResNet分类人类动作特征。
  • results: 研究结果显示,可以达到超过80%的准确率来分类特征人类动作状况,而其他类似人类动作状况只能达到约50%的准确率。此外,云端VR演示应用程序表明了使用无人机理解灾难现场和估计人员状况的可能性。
    Abstract Drones are being used to assess the situation in various disasters. In this study, we investigate a method to automatically estimate the damage status of people based on their actions in aerial drone images in order to understand disaster sites faster and save labor. We constructed a new dataset of aerial images of human actions in a hypothetical disaster that occurred in an urban area, and classified the human damage status using 3D ResNet. The results showed that the status with characteristic human actions could be classified with a recall rate of more than 80%, while other statuses with similar human actions could only be classified with a recall rate of about 50%. In addition, a cloud-based VR presentation application suggested the effectiveness of using drones to understand the disaster site and estimate the human condition.
    摘要 无人机正被用于评估各类灾害现场的情况。本研究探讨了一种基于航拍无人机图像中人体动作自动估计人员受灾状况的方法,以便更快地掌握灾害现场并减少人力。我们构建了一个新的航拍图像数据集,模拟发生在城市区域的灾害中的人体动作,并使用 3D ResNet 对人员受灾状况进行分类。结果显示,具有特征性人体动作的状况可以以超过 80% 的召回率被分类,而其他动作相似的状况只能达到约 50% 的召回率。此外,一个基于云端的 VR 演示应用表明了利用无人机理解灾害现场并估计人员状况的有效性。

Unsupervised Camouflaged Object Segmentation as Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.04528
  • repo_url: https://github.com/Jun-Pu/UCOS-DA
  • paper_authors: Yi Zhang, Chengyi Wu
  • for: 本研究探讨了一新任务,即无监督隐形物 segmentation(UCOS),其中目标对象具有罕见的抽象属性,即隐形。不幸地,现有的无监督模型在适应UCOS时会遇到域之间的差距问题。
  • methods: 我们在本研究中提出了一种源自无监督领域的适应UCOS任务(UCOS-DA),其中无源标签和目标标签在整个模型训练过程中缺失。我们定义了一个源模型,即基于ImageNet的自我监督视觉变换器。而目标领域包括一个简单的线性层(我们的目标模型)和无标签的隐形对象。我们 THEN设计了一个图像前景背景对比自我挑战性适应预处理管道,以实现Robust UCOS。
  • results: 我们的基eline模型在UCOS数据集上 achieve superior segmentation性能,与竞争对手无监督模型相比,即使训练集规模只有一半于监督COS counterpart。
    Abstract Deep learning for unsupervised image segmentation remains challenging due to the absence of human labels. The common idea is to train a segmentation head, with the supervision of pixel-wise pseudo-labels generated based on the representation of self-supervised backbones. By doing so, the model performance depends much on the distance between the distributions of target datasets and the pre-training dataset (e.g., ImageNet). In this work, we investigate a new task, namely unsupervised camouflaged object segmentation (UCOS), where the target objects own a common rarely-seen attribute, i.e., camouflage. Unsurprisingly, we find that the state-of-the-art unsupervised models struggle in adapting UCOS, due to the domain gap between the properties of generic and camouflaged objects. To this end, we formulate the UCOS as a source-free unsupervised domain adaptation task (UCOS-DA), where both source labels and target labels are absent during the whole model training process. Specifically, we define a source model consisting of self-supervised vision transformers pre-trained on ImageNet. On the other hand, the target domain includes a simple linear layer (i.e., our target model) and unlabeled camouflaged objects. We then design a pipeline for foreground-background-contrastive self-adversarial domain adaptation, to achieve robust UCOS. As a result, our baseline model achieves superior segmentation performance when compared with competing unsupervised models on the UCOS benchmark, with the training set which's scale is only one tenth of the supervised COS counterpart.
    摘要 深度学习无监督图像分割仍然是挑战,因为缺乏人类标签。常见的想法是训练一个分割头,通过自我监督核心的代码生成的 Pseudo-标签来监督。由于模型性能很大程度取决于目标数据集和预训练集(如ImageNet)之间的分布距离,在这个工作中,我们 investigate一个新任务,即无监督隐形对象分割(UCOS)。不 surprisingly,我们发现现有state-of-the-art无监督模型在适应UCOS时受到域 gap的限制,即隐形对象的特性与普通对象的特性之间的域差异。为此,我们将UCOS定义为一个源无监督领域适应任务(UCOS-DA),其中源标签和目标标签都缺失在模型训练过程中。我们定义了一个源模型,即基于ImageNet自我监督vision transformers预训练的模型。然而,目标领域包括一个简单的线性层(即我们的目标模型)和无标签的隐形对象。我们然后设计了一个干扰对比自我适应领域域适应管道,以实现robust UCOS。结果,我们的基线模型在UCOS benchmark上比同类无监督模型表现出色,训练集的规模只是一个十分之一的超vised COS counterpart。

Large-Scale Multi-Hypotheses Cell Tracking Using Ultrametric Contours Maps

  • paper_url: http://arxiv.org/abs/2308.04526
  • repo_url: https://github.com/royerlab/ultrack
  • paper_authors: Jordão Bragantini, Merlin Lange, Loïc Royer
  • for: 这篇论文是为了描述一种大规模3D细胞跟踪方法,包括一种选择性分割方法和一种基于最大重叠的分割选择算法。
  • methods: 该方法使用了一种层次分割假设来计算细胞跟踪和分割,并通过选择不同层次分割来实现细胞跟踪。
  • results: 该方法在3D图像中实现了状态领先的细胞跟踪结果,并且比使用深度学习方法要快得多。此外,该方法可以支持多种 Cell segmentation 模型,并可以将它们组合成一个 ensemble 来提高跟踪性能。
    Abstract In this work, we describe a method for large-scale 3D cell-tracking through a segmentation selection approach. The proposed method is effective at tracking cells across large microscopy datasets on two fronts: (i) It can solve problems containing millions of segmentation instances in terabyte-scale 3D+t datasets; (ii) It achieves competitive results with or without deep learning, which requires 3D annotated data, that is scarce in the fluorescence microscopy field. The proposed method computes cell tracks and segments using a hierarchy of segmentation hypotheses and selects disjoint segments by maximizing the overlap between adjacent frames. We show that this method achieves state-of-the-art results in 3D images from the cell tracking challenge and has a faster integer linear programming formulation. Moreover, our framework is flexible and supports segmentations from off-the-shelf cell segmentation models and can combine them into an ensemble that improves tracking. The code is available https://github.com/royerlab/ultrack.

Toward unlabeled multi-view 3D pedestrian detection by generalizable AI: techniques and performance analysis

  • paper_url: http://arxiv.org/abs/2308.04515
  • repo_url: None
  • paper_authors: João Paulo Lima, Diego Thomas, Hideaki Uchiyama, Veronica Teichrieb
  • for: Improving the generalization of multi-view 3D pedestrian detection to unlabeled target scenes.
  • methods: Automatically labeling target data, either by pseudo-labeling with a supervised detector or with an untrained detector that can be applied out of the box.
  • results: Automatic labeling based on an untrained detector yields better generalization than using the untrained detector directly or a detector trained on an existing labeled source dataset.
    Abstract We unveil how generalizable AI can be used to improve multi-view 3D pedestrian detection in unlabeled target scenes. One way to increase generalization to new scenes is to automatically label target data, which can then be used for training a detector model. In this context, we investigate two approaches for automatically labeling target data: pseudo-labeling using a supervised detector and automatic labeling using an untrained detector (that can be applied out of the box without any training). We adopt a training framework for optimizing detector models using automatic labeling procedures. This framework encompasses different training sets/modes and multi-round automatic labeling strategies. We conduct our analyses on the publicly-available WILDTRACK and MultiviewX datasets. We show that, by using the automatic labeling approach based on an untrained detector, we can obtain better results than directly using the untrained detector or a detector trained with an existing labeled source dataset. It achieved a MODA about 4% and 1% better than the best existing unlabeled method when using WILDTRACK and MultiviewX as target datasets, respectively.
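A hedged sketch of the multi-round automatic-labeling loop described above. Every function here is a stand-in stub, not the paper's code: the "untrained detector" and "trained detector" just return toy boxes with scores, and the confidence threshold is invented.

```python
import random

def run_untrained_detector(frame):
    # pretend training-free detector: returns (x, y, w, h, score) boxes
    return [(10, 20, 50, 100, random.uniform(0.4, 0.9))]

def train_detector(frames, labels):
    return {"trained_on": len(frames)}                 # stub "model"

def run_detector(model, frame):
    return [(12, 22, 50, 100, random.uniform(0.5, 0.95))]

def keep_confident(dets, thresh):
    return [d for d in dets if d[-1] >= thresh]

def multi_round_auto_labeling(frames, rounds=3, thresh=0.7):
    labels = {f: run_untrained_detector(f) for f in frames}   # round 0: no training needed
    model = None
    for _ in range(rounds):
        model = train_detector(frames, labels)                # fit on automatic labels
        labels = {f: keep_confident(run_detector(model, f), thresh) for f in frames}
    return model, labels

model, labels = multi_round_auto_labeling([f"frame_{i}" for i in range(5)])
```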

When More is Less: Incorporating Additional Datasets Can Hurt Performance By Introducing Spurious Correlations

  • paper_url: http://arxiv.org/abs/2308.04431
  • repo_url: https://github.com/basedrhys/ood-generalization
  • paper_authors: Rhys Compton, Lily Zhang, Aahlad Puli, Rajesh Ranganath
  • for: Examining whether incorporating more data always improves machine learning performance, and the spurious correlations that arise in medical imaging data.
  • methods: A large-scale empirical study over combinations of four open-source chest X-ray datasets and nine labels.
  • results: In 43% of settings, training on data from two hospitals gives worse worst-group accuracy over both hospitals than training on a single hospital's data, even though the added hospital makes the training distribution more similar to the test distribution; the drop stems from spurious correlations induced by hospital-specific image artifacts.
    Abstract In machine learning, incorporating more data is often seen as a reliable strategy for improving model performance; this work challenges that notion by demonstrating that the addition of external datasets in many cases can hurt the resulting model's performance. In a large-scale empirical study across combinations of four different open-source chest x-ray datasets and 9 different labels, we demonstrate that in 43% of settings, a model trained on data from two hospitals has poorer worst group accuracy over both hospitals than a model trained on just a single hospital's data. This surprising result occurs even though the added hospital makes the training distribution more similar to the test distribution. We explain that this phenomenon arises from the spurious correlation that emerges between the disease and hospital, due to hospital-specific image artifacts. We highlight the trade-off one encounters when training on multiple datasets, between the obvious benefit of additional data and insidious cost of the introduced spurious correlation. In some cases, balancing the dataset can remove the spurious correlation and improve performance, but it is not always an effective strategy. We contextualize our results within the literature on spurious correlations to help explain these outcomes. Our experiments underscore the importance of exercising caution when selecting training data for machine learning models, especially in settings where there is a risk of spurious correlations such as with medical imaging. The risks outlined highlight the need for careful data selection and model evaluation in future research and practice.
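Worst-group accuracy, the metric behind the 43% finding, is simply the minimum per-group accuracy. The grouping by (hospital, true label) below is one common convention and may differ from the paper's exact grouping; the data is synthetic.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, hospital, label_group):
    accs = []
    for g in set(zip(hospital, label_group)):
        m = np.array([(h, l) == g for h, l in zip(hospital, label_group)])
        accs.append((y_true[m] == y_pred[m]).mean())
    return min(accs)

y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred   = np.array([1, 0, 0, 1, 0, 1, 1, 0])
hospital = ["A", "A", "A", "A", "B", "B", "B", "B"]
disease  = [1, 0, 1, 1, 0, 0, 1, 0]          # grouping here by (hospital, true label)
print(worst_group_accuracy(y_true, y_pred, hospital, disease))
```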

A Deep-Learning Method Using Auto-encoder and Generative Adversarial Network for Anomaly Detection on Ancient Stone Stele Surfaces

  • paper_url: http://arxiv.org/abs/2308.04426
  • repo_url: None
  • paper_authors: Yikun Liu, Yuning Wang, Cheng Liu
  • for: A deep-learning method for automatically detecting natural deterioration and man-made damage on the surfaces of ancient stone steles.
  • methods: An autoencoder (AE) combined with a generative adversarial network (GAN); the approach needs no extensive anomaly samples and can comprehensively detect unpredictable anomalies.
  • results: In a case study on the Longmen Grottoes' stone steles, the proposed unsupervised model reaches a reconstruction accuracy of 99.74% and accurately detects seven artificially designed anomalies without false alarms.
    Abstract Accurate detection of natural deterioration and man-made damage on the surfaces of ancient steles in the first instance is essential for their preventive conservation. Existing methods for cultural heritage preservation are not able to achieve this goal perfectly due to the difficulty of balancing accuracy, efficiency, timeliness, and cost. This paper presents a deep-learning method to automatically detect the above-mentioned emergencies on ancient stone steles in real time, employing an autoencoder (AE) and a generative adversarial network (GAN). The proposed method overcomes the limitations of existing methods by requiring no extensive anomaly samples while enabling comprehensive detection of unpredictable anomalies. The method includes stages of monitoring, data acquisition, pre-processing, model structuring, and post-processing. Taking the Longmen Grottoes' stone steles as a case study, an unsupervised learning model based on AE and GAN architectures is proposed and validated with a reconstruction accuracy of 99.74%. The method's evaluation revealed the proficient detection of seven artificially designed anomalies and demonstrated precision and reliability without false alarms. This research provides novel ideas and possibilities for the application of deep learning in the field of cultural heritage.
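A minimal sketch of the reconstruction-error idea behind the AE component: train an autoencoder only on normal surface patches, then flag patches whose reconstruction error exceeds a threshold. The GAN branch and all real hyper-parameters are omitted; shapes, data, and the threshold are illustrative.

```python
import torch
import torch.nn as nn

ae = nn.Sequential(                                   # tiny convolutional autoencoder
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

normal_patches = torch.rand(16, 3, 64, 64)            # stand-in for intact surface patches
for _ in range(20):                                    # learn to reconstruct normal data
    recon = ae(normal_patches)
    loss = ((recon - normal_patches) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(patch):                              # mean per-pixel reconstruction error
    with torch.no_grad():
        return ((ae(patch) - patch) ** 2).mean(dim=(1, 2, 3))

test = torch.rand(4, 3, 64, 64)
flags = anomaly_score(test) > 0.02                     # threshold chosen on validation data
```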

DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal from Optical Satellite Images

  • paper_url: http://arxiv.org/abs/2308.04417
  • repo_url: None
  • paper_authors: Xuechao Zou, Kai Li, Junliang Xing, Yu Zhang, Shiying Wang, Lei Jin, Pin Tao
  • for: Proposing a high-performance, diffusion-based cloud removal method for optical satellite images.
  • methods: A conditional guided diffusion model with deep convolutional networks, featuring a decoupled encoder that extracts conditional image features to keep the appearance of the output close to the conditional input, and a novel time-and-condition fusion block that models the correspondence between the conditional and target images at low computational cost.
  • results: Extensive experiments on two commonly used datasets show that DiffCR achieves state-of-the-art performance on all metrics, with only 5.1% of the parameters and 5.4% of the computational complexity of the previous best methods; code and results will be released at https://github.com/XavierJiezou/DiffCR.
    Abstract Optical satellite images are a critical data source; however, cloud cover often compromises their quality, hindering image applications and analysis. Consequently, effectively removing clouds from optical satellite images has emerged as a prominent research direction. While recent advancements in cloud removal primarily rely on generative adversarial networks, which may yield suboptimal image quality, diffusion models have demonstrated remarkable success in diverse image-generation tasks, showcasing their potential in addressing this challenge. This paper presents a novel framework called DiffCR, which leverages conditional guided diffusion with deep convolutional networks for high-performance cloud removal for optical satellite imagery. Specifically, we introduce a decoupled encoder for conditional image feature extraction, providing a robust color representation to ensure the close similarity of appearance information between the conditional input and the synthesized output. Moreover, we propose a novel and efficient time and condition fusion block within the cloud removal model to accurately simulate the correspondence between the appearance in the conditional image and the target image at a low computational cost. Extensive experimental evaluations on two commonly used benchmark datasets demonstrate that DiffCR consistently achieves state-of-the-art performance on all metrics, with parameter and computational complexities amounting to only 5.1% and 5.4%, respectively, of those previous best methods. The source code, pre-trained models, and all the experimental results will be publicly available at https://github.com/XavierJiezou/DiffCR upon the paper's acceptance of this work.
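One conditional-diffusion training step in this spirit can be sketched as follows: the denoiser sees the noisy cloud-free target with the cloudy image injected by channel concatenation and learns to predict the added noise. The paper's decoupled encoder and time/condition fusion block are collapsed into a plain CNN, and the timestep embedding is omitted, so this is an assumption-laden toy rather than DiffCR itself.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                 # input: noisy target (3) + cloudy condition (3)
    nn.Conv2d(6, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)

def training_step(clear, cloudy):
    t = torch.randint(0, T, (clear.shape[0],))
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(clear)
    noisy = a_bar.sqrt() * clear + (1 - a_bar).sqrt() * noise   # forward diffusion
    pred = denoiser(torch.cat([noisy, cloudy], dim=1))          # condition by concatenation
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

clear, cloudy = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
training_step(clear, cloudy)
```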

Digging into Depth Priors for Outdoor Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2308.04413
  • repo_url: https://github.com/cwchenwang/outdoor-nerf-depth
  • paper_authors: Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, Liangjun Zhang
  • for: Investigating the impact of depth priors on outdoor NeRF training, to alleviate the shape-radiance ambiguity of radiance fields.
  • methods: Two representative NeRF methods are combined with four commonly used depth priors and different ways of using depth.
  • results: Experiments reveal how different depth priors and depth-usage strategies affect outdoor NeRF training, yielding practical lessons and promising research directions.
    Abstract Neural Radiance Fields (NeRF) have demonstrated impressive performance in vision and graphics tasks, such as novel view synthesis and immersive reality. However, the shape-radiance ambiguity of radiance fields remains a challenge, especially in the sparse viewpoints setting. Recent work resorts to integrating depth priors into outdoor NeRF training to alleviate the issue. However, the criteria for selecting depth priors and the relative merits of different priors have not been thoroughly investigated. Moreover, the relative merits of selecting different approaches to use the depth priors is also an unexplored problem. In this paper, we provide a comprehensive study and evaluation of employing depth priors to outdoor neural radiance fields, covering common depth sensing technologies and most application ways. Specifically, we conduct extensive experiments with two representative NeRF methods equipped with four commonly-used depth priors and different depth usages on two widely used outdoor datasets. Our experimental results reveal several interesting findings that can potentially benefit practitioners and researchers in training their NeRF models with depth priors. Project Page: https://cwchenwang.github.io/outdoor-nerf-depth
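A common way to inject a depth prior into NeRF training, and one of the usage patterns such a study compares, is to add a masked depth term to the photometric loss for rays where the prior is available. The weighting and masking below are illustrative, not the configurations evaluated in the paper.

```python
import torch

def nerf_loss(pred_rgb, gt_rgb, pred_depth, prior_depth, prior_valid, lam=0.1):
    rgb_term = ((pred_rgb - gt_rgb) ** 2).mean()
    depth_term = (prior_valid * (pred_depth - prior_depth).abs()).sum() / (prior_valid.sum() + 1e-8)
    return rgb_term + lam * depth_term

# per-ray quantities for a batch of 1024 rays
pred_rgb, gt_rgb = torch.rand(1024, 3), torch.rand(1024, 3)
pred_depth, prior_depth = torch.rand(1024) * 50, torch.rand(1024) * 50   # meters
prior_valid = (torch.rand(1024) > 0.3).float()        # e.g., pixels actually hit by LiDAR
loss = nerf_loss(pred_rgb, gt_rgb, pred_depth, prior_depth, prior_valid)
```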

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.04409
  • repo_url: https://github.com/yichaoshen-ms/v-detr
  • paper_authors: Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo
  • for: Proposing a highly performant DETR-based 3D object detector for point clouds.
  • methods: A novel 3D Vertex Relative Position Encoding (3DV-RPE) computes a position encoding for each point from its position relative to the 3D boxes predicted by the queries in each decoder layer, helping the model focus on points near the objects in line with the locality principle.
  • results: On the challenging ScanNetV2 benchmark, the method improves over the previous 3DETR from 65.0%/47.0% to 77.8%/66.0% in AP25/AP50 and sets new records on ScanNetV2 and SUN RGB-D.
    Abstract We introduce a highly performant 3D object detector for point clouds using the DETR framework. The prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training data. In particular, the queries often attend to points that are far away from the target objects, violating the locality principle in object detection. To address the limitation, we introduce a novel 3D Vertex Relative Position Encoding (3DV-RPE) method which computes position encoding for each point based on its relative position to the 3D boxes predicted by the queries in each decoder layer, thus providing clear information to guide the model to focus on points near the objects, in accordance with the principle of locality. In addition, we systematically improve the pipeline from various aspects such as data normalization based on our understanding of the task. We show exceptional results on the challenging ScanNetV2 benchmark, achieving significant improvements over the previous 3DETR in $\rm{AP}_{25}$/$\rm{AP}_{50}$ from 65.0%/47.0% to 77.8%/66.0%, respectively. In addition, our method sets a new record on the ScanNetV2 and SUN RGB-D datasets. Code will be released at http://github.com/yichaoshen-MS/V-DETR.
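A rough sketch of the vertex-relative encoding idea: for every point, compute its offset to the eight vertices of the 3D box predicted by a query, normalize, and embed. Axis-aligned boxes, the normalization, and the small MLP are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def box_vertices(center, size):                       # (B,3),(B,3) -> (B,8,3)
    signs = torch.tensor([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                         dtype=center.dtype)
    return center[:, None, :] + 0.5 * size[:, None, :] * signs[None]

embed = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 64))

def vertex_relative_encoding(points, center, size):
    """points: (B,N,3); returns (B,N,64) per-point position encodings."""
    verts = box_vertices(center, size)                          # (B,8,3)
    rel = points[:, :, None, :] - verts[:, None, :, :]          # (B,N,8,3) offsets
    rel = rel / (size.norm(dim=-1)[:, None, None, None] + 1e-6) # scale-normalize by box size
    return embed(rel.flatten(2))                                # (B,N,24) -> (B,N,64)

pts = torch.rand(2, 1024, 3)
enc = vertex_relative_encoding(pts, center=torch.rand(2, 3), size=torch.rand(2, 3) + 0.5)
print(enc.shape)   # torch.Size([2, 1024, 64])
```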

Person Re-Identification without Identification via Event Anonymization

  • paper_url: http://arxiv.org/abs/2308.04402
  • repo_url: https://github.com/IIT-PAVIS/ReId_without_Id
  • paper_authors: Shafiq Ahmad, Pietro Morerio, Alessio Del Bue
  • for: Protecting the identity of human subjects in event-camera vision applications against image-reconstruction-based privacy attacks.
  • methods: An end-to-end network architecture jointly optimized for the twofold objective of preserving privacy, by learning to scramble events, and performing a downstream task such as person ReId.
  • results: The approach effectively anonymizes event streams, as validated by extensive experiments on synthetic event data simulated from the SoftBio dataset and on the newly collected Event-ReId dataset.
    Abstract Wide-scale use of visual surveillance in public spaces puts individual privacy at stake while increasing resource consumption (energy, bandwidth, and computation). Neuromorphic vision sensors (event-cameras) have been recently considered a valid solution to the privacy issue because they do not capture detailed RGB visual information of the subjects in the scene. However, recent deep learning architectures have been able to reconstruct images from event cameras with high fidelity, reintroducing a potential threat to privacy for event-based vision applications. In this paper, we aim to anonymize event-streams to protect the identity of human subjects against such image reconstruction attacks. To achieve this, we propose an end-to-end network architecture jointly optimized for the twofold objective of preserving privacy and performing a downstream task such as person ReId. Our network learns to scramble events, enforcing the degradation of images recovered from the privacy attacker. In this work, we also bring to the community the first ever event-based person ReId dataset gathered to evaluate the performance of our approach. We validate our approach with extensive experiments and report results on the synthetic event data simulated from the publicly available SoftBio dataset and our proposed Event-ReId dataset.

LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake Extraction from Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2308.04397
  • repo_url: None
  • paper_authors: Ben Chen, Xuechao Zou, Yu Zhang, Jiayu Li, Kai Li, Pin Tao
  • for: Accurate lake extraction from remote sensing imagery.
  • methods: A hybrid CNN-Transformer architecture (LEFormer) with four main modules: CNN encoder, Transformer encoder, cross-encoder fusion, and lightweight decoder.
  • results: Consistently achieves state-of-the-art (SOTA) performance and efficiency on two datasets (Surface Water and Qinghai-Tibet Plateau Lake) with mIoU scores of 90.86% and 97.42%, outperforming existing methods while being 20x more efficient.
    Abstract Lake extraction from remote sensing imagery is challenging due to the complex shapes of lakes and the presence of noise. Existing methods suffer from blurred segmentation boundaries and poor foreground modeling. In this paper, we propose a hybrid CNN-Transformer architecture, called LEFormer, for accurate lake extraction. LEFormer contains four main modules: CNN encoder, Transformer encoder, cross-encoder fusion, and lightweight decoder. The CNN encoder recovers local spatial information and improves fine-scale details. Simultaneously, the Transformer encoder captures long-range dependencies between sequences of any length, allowing it to obtain global features and context information better. Finally, a lightweight decoder is employed for mask prediction. We evaluate the performance and efficiency of LEFormer on two datasets, the Surface Water (SW) and the Qinghai-Tibet Plateau Lake (QTPL). Experimental results show that LEFormer consistently achieves state-of-the-art (SOTA) performance and efficiency on these two datasets, outperforming existing methods. Specifically, LEFormer achieves 90.86% and 97.42% mIoU on the SW and QTPL datasets, respectively, with a parameter count of only 3.61M, roughly 20x smaller than the previous SOTA method.
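A skeleton of a hybrid CNN-Transformer segmenter in this spirit: a CNN path for local detail, a Transformer path for global context, a simple fusion step, and a lightweight decoder. Channel sizes, depths, and the concatenate-then-1x1 fusion are placeholders rather than LEFormer's actual configuration.

```python
import torch
import torch.nn as nn

class HybridLakeSegmenter(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # local spatial details
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)   # global context
        self.fuse = nn.Conv2d(2 * dim, dim, 1)          # cross-encoder fusion (concat + 1x1)
        self.decoder = nn.Sequential(                    # lightweight decoder
            nn.Conv2d(dim, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        f_cnn = self.cnn(x)                              # (B, C, H/4, W/4)
        b, c, h, w = f_cnn.shape
        tokens = f_cnn.flatten(2).transpose(1, 2)        # (B, HW, C)
        f_tr = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        fused = self.fuse(torch.cat([f_cnn, f_tr], dim=1))
        return self.decoder(fused)                       # (B, 1, H, W) lake logits

mask_logits = HybridLakeSegmenter()(torch.rand(1, 3, 128, 128))
```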

Data Augmentation-Based Unsupervised Domain Adaptation In Medical Imaging

  • paper_url: http://arxiv.org/abs/2308.04395
  • repo_url: None
  • paper_authors: Sebastian Nørgaard Llambias, Mads Nielsen, Mostafa Mehdipour Ghazi
  • for: Making deep-learning models for medical imaging generalize robustly to new scans, easing their adoption in clinical practice.
  • methods: An unsupervised domain adaptation approach built on MRI-specific augmentation techniques, evaluated with extensive experiments across datasets, modalities, and segmentation tasks.
  • results: The method achieves high accuracy, broad applicability, and strong robustness to dataset shift, surpassing the state of the art in most cases.
    Abstract Deep learning-based models in medical imaging often struggle to generalize effectively to new scans due to data heterogeneity arising from differences in hardware, acquisition parameters, population, and artifacts. This limitation presents a significant challenge in adopting machine learning models for clinical practice. We propose an unsupervised method for robust domain adaptation in brain MRI segmentation by leveraging MRI-specific augmentation techniques. To evaluate the effectiveness of our method, we conduct extensive experiments across diverse datasets, modalities, and segmentation tasks, comparing against the state-of-the-art methods. The results show that our proposed approach achieves high accuracy, exhibits broad applicability, and showcases remarkable robustness against domain shift in various tasks, surpassing the state-of-the-art performance in the majority of cases.
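MRI-specific augmentation of the kind such a method builds on typically perturbs intensities rather than geometry, mimicking scanner and acquisition differences. The bias-field, gamma, and noise transforms below are illustrative stand-ins with made-up parameter ranges, not the paper's pipeline.

```python
import numpy as np

def random_bias_field(shape, strength=0.3):
    """Smooth multiplicative field approximated by a low-order polynomial."""
    zz, yy, xx = np.meshgrid(*[np.linspace(-1, 1, s) for s in shape], indexing="ij")
    c = np.random.uniform(-strength, strength, size=6)
    return 1.0 + c[0]*xx + c[1]*yy + c[2]*zz + c[3]*xx*yy + c[4]*yy*zz + c[5]*xx*zz

def augment_mri(vol):
    vol = vol * random_bias_field(vol.shape)                     # scanner bias field
    vol = np.clip(vol, 0, None) ** np.random.uniform(0.7, 1.5)   # gamma / contrast shift
    vol = vol + np.random.normal(0, 0.03, vol.shape)             # acquisition noise
    return vol.astype(np.float32)

scan = np.random.rand(32, 64, 64).astype(np.float32)             # toy T1 volume
aug = augment_mri(scan)
```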

DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds

  • paper_url: http://arxiv.org/abs/2308.04383
  • repo_url: https://github.com/IRMVLab/DELFlow
  • paper_authors: Chensheng Peng, Guangming Wang, Xian Wan Lo, Xinrui Wu, Chenfeng Xu, Masayoshi Tomizuka, Wei Zhan, Hesheng Wang
  • for: Addressing the density mismatch between sparse point clouds and dense image pixels in point-wise scene flow estimation, improving both efficiency and accuracy.
  • methods: Raw points are regularized into a dense format by storing 3D coordinates in 2D grids, and a novel warping projection technique mitigates the information loss caused by multiple points falling into the same grid cell.
  • results: The method outperforms prior art in both efficiency and accuracy on the FlyingThings3D and KITTI datasets.
    Abstract Point clouds are naturally sparse, while image pixels are dense. The inconsistency limits feature fusion from both modalities for point-wise scene flow estimation. Previous methods rarely predict scene flow from the entire point clouds of the scene with one-time inference due to the memory inefficiency and heavy overhead from distance calculation and sorting involved in commonly used farthest point sampling, KNN, and ball query algorithms for local feature aggregation. To mitigate these issues in scene flow learning, we regularize raw points to a dense format by storing 3D coordinates in 2D grids. Unlike the sampling operation commonly used in existing works, the dense 2D representation 1) preserves most points in the given scene, 2) brings in a significant boost of efficiency, and 3) eliminates the density gap between points and pixels, allowing us to perform effective feature fusion. We also present a novel warping projection technique to alleviate the information loss problem resulting from the fact that multiple points could be mapped into one grid during projection when computing cost volume. Sufficient experiments demonstrate the efficiency and effectiveness of our method, outperforming the prior-arts on the FlyingThings3D and KITTI dataset.
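The dense 2D format can be pictured as scattering each point's 3D coordinates into the pixel it projects to, keeping the nearest point when several land in the same cell, which is exactly the collision the warping projection is designed to mitigate. Intrinsics and image size below are invented for the example.

```python
import numpy as np

def points_to_grid(points, fx=500, fy=500, cx=320, cy=160, h=320, w=640):
    grid = np.zeros((h, w, 3), dtype=np.float32)       # stores (x, y, z) per pixel
    depth = np.full((h, w), np.inf, dtype=np.float32)
    for x, y, z in points:
        if z <= 0:
            continue
        u, v = int(fx * x / z + cx), int(fy * y / z + cy)
        if 0 <= v < h and 0 <= u < w and z < depth[v, u]:
            depth[v, u] = z                             # keep the closest point per cell
            grid[v, u] = (x, y, z)
    return grid

pts = np.random.uniform([-10, -2, 1], [10, 2, 40], size=(5000, 3))
dense = points_to_grid(pts)
print((dense[..., 2] > 0).sum(), "cells filled")
```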

Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination

  • paper_url: http://arxiv.org/abs/2308.04380
  • repo_url: https://github.com/luminosityx/fne
  • paper_authors: Haoxuan Li, Yi Bin, Junrong Liao, Yang Yang, Heng Tao Shen
  • for: Improving image-text matching by reducing the impact of false negatives during triplet-loss training.
  • methods: A False Negative Elimination (FNE) strategy that samples negatives according to a Bayes-derived false-negative probability, supported by a momentum memory module that maintains a large negative buffer and a cut-down reweighting scheme that keeps the model focused on hard negatives.
  • results: Extensive experiments on Flickr30K and MS-COCO validate the proposal, showing strong matching performance and robustness in the presence of false negatives.
    Abstract Most existing image-text matching methods adopt triplet loss as the optimization objective, and choosing a proper negative sample for the triplet of (anchor, positive, negative) is important for effectively training the model, e.g., hard negatives make the model learn efficiently and effectively. However, we observe that existing methods mainly employ the most similar samples as hard negatives, which may not be true negatives. In other words, the samples with high similarity but not paired with the anchor may reserve positive semantic associations, and we call them false negatives. Repelling these false negatives in triplet loss would mislead the semantic representation learning and result in inferior retrieval performance. In this paper, we propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling, which could alleviate the problem introduced by false negatives. Specifically, we first construct the distributions of positive and negative samples separately via their similarities with the anchor, based on the features extracted from image and text encoders. Then we calculate the false negative probability of a given sample based on its similarity with the anchor and the above distributions via the Bayes' rule, which is employed as the sampling weight during the negative sampling process. Since there may not exist any false negative in a small batch size, we design a memory module with momentum to retain a large negative buffer and implement our negative sampling strategy spanning over the buffer. In addition, to make the model focus on hard negatives, we reassign the sampling weights for the simple negatives with a cut-down strategy. The extensive experiments are conducted on Flickr30K and MS-COCO, and the results demonstrate the superiority of our proposed false negative elimination strategy. The code is available at https://github.com/LuminosityX/FNE.
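The Bayes-rule weighting can be illustrated by modeling the positive and negative similarity distributions as Gaussians and converting each candidate's similarity into a false-negative probability. The Gaussian assumption, the prior, and the down-weighting rule below are simplifications of the paper's procedure, shown only to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
pos_sims = rng.normal(0.6, 0.1, 500)         # similarities of true (paired) samples
neg_sims = rng.normal(0.2, 0.1, 5000)        # similarities of unpaired samples

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_p, sd_p = pos_sims.mean(), pos_sims.std()
mu_n, sd_n = neg_sims.mean(), neg_sims.std()
prior_fn = 0.05                               # assumed prior of a negative being false

def false_negative_prob(sim):
    like_pos = gauss_pdf(sim, mu_p, sd_p) * prior_fn
    like_neg = gauss_pdf(sim, mu_n, sd_n) * (1 - prior_fn)
    return like_pos / (like_pos + like_neg + 1e-12)

candidates = rng.uniform(0, 1, 256)           # similarities of this batch's candidates
weights = 1.0 - false_negative_prob(candidates)   # down-weight likely false negatives
weights /= weights.sum()
sampled = rng.choice(len(candidates), size=32, replace=False, p=weights)
```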

Pelta: Shielding Transformers to Mitigate Evasion Attacks in Federated Learning

  • paper_url: http://arxiv.org/abs/2308.04373
  • repo_url: None
  • paper_authors: Simon Queyrut, Yérom-David Bromberg, Valerio Schiavoni
  • for: Protecting machine learning model updates in federated learning, preserving user data privacy and preventing compromised clients from probing the local model for adversarial examples.
  • methods: Pelta, a novel shielding mechanism that leverages Trusted Execution Environments (TEEs) to mask part of the back-propagation chain rule that attackers typically exploit when crafting malicious samples.
  • results: Evaluated on a state-of-the-art ensemble model, Pelta is shown to be effective against the Self Attention Gradient adversarial attack.
    Abstract The main premise of federated learning is that machine learning model updates are computed locally, in particular to preserve user data privacy, as those never leave the perimeter of their device. This mechanism supposes the general model, once aggregated, to be broadcast to collaborating and non malicious nodes. However, without proper defenses, compromised clients can easily probe the model inside their local memory in search of adversarial examples. For instance, considering image-based applications, adversarial examples consist of imperceptibly perturbed images (to the human eye) misclassified by the local model, which can be later presented to a victim node's counterpart model to replicate the attack. To mitigate such malicious probing, we introduce Pelta, a novel shielding mechanism leveraging trusted hardware. By harnessing the capabilities of Trusted Execution Environments (TEEs), Pelta masks part of the back-propagation chain rule, otherwise typically exploited by attackers for the design of malicious samples. We evaluate Pelta on a state of the art ensemble model and demonstrate its effectiveness against the Self Attention Gradient adversarial Attack.

When Super-Resolution Meets Camouflaged Object Detection: A Comparison Study

  • paper_url: http://arxiv.org/abs/2308.04370
  • repo_url: None
  • paper_authors: Juan Wen, Shupeng Cheng, Peng Xu, Bowen Zhou, Radu Timofte, Weiyan Hou, Luc Van Gool
  • for: Jointly studying super-resolution (SR) and camouflaged object detection (COD); for example, low-resolution surveillance images can be upscaled by SR techniques and then processed by COD models.
  • methods: Benchmarking different SR methods on commonly used COD datasets, and evaluating the robustness of different COD models on COD data processed by SR methods.
  • results: The integrated evaluation bridges the two domains, uncovers new experimental phenomena, and summarizes promising new research directions.
    Abstract Super Resolution (SR) and Camouflaged Object Detection (COD) are two hot topics in computer vision with various joint applications. For instance, low-resolution surveillance images can be successively processed by super-resolution techniques and camouflaged object detection. However, in previous work, these two areas are always studied in isolation. In this paper, we, for the first time, conduct an integrated comparative evaluation of both. Specifically, we benchmark different super-resolution methods on commonly used COD datasets, and meanwhile, we evaluate the robustness of different COD models by using COD data processed by SR methods. Our goal is to bridge these two domains, discover novel experimental phenomena, and summarize new research directions.

SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

  • paper_url: http://arxiv.org/abs/2308.04369
  • repo_url: https://github.com/event-ahu/sstformer
  • paper_authors: Xiao Wang, Zongzhen Wu, Yao Rong, Lin Zhu, Bo Jiang, Jin Tang, Yonghong Tian
  • for: Proposing a new framework that fuses RGB frames and event streams for pattern recognition.
  • methods: A hybrid recognition architecture comprising a memory-support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module, and a prediction head.
  • results: Experiments show that the proposed hybrid framework achieves accurate and energy-aware recognition by fusing RGB frames and event streams; a large-scale PokerEvent dataset (114 classes, 27,102 frame-event pairs) is also introduced.
    Abstract Event camera-based pattern recognition is a newly arising research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, the results may still be limited due to the following two issues. Firstly, they adopt spatially sparse event streams for recognition only, which may fail to capture the color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition. However, few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., memory support Transformer network for RGB frame encoding, spiking neural network for raw event stream encoding, multi-modal bottleneck fusion module for RGB-Event feature aggregation, and prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset which contains 114 classes, and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validated the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at https://github.com/Event-AHU/SSTFormer.
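A hedged sketch of a bottleneck-fusion module of the kind described: a few learnable fusion tokens cross-attend first to RGB-frame tokens and then to event-stream tokens, so the two modalities exchange information only through this narrow bottleneck. Token counts and dimensions are placeholders, not SSTFormer's actual configuration.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim=256, n_bottleneck=8, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.attn_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_evt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, event_tokens):
        b = rgb_tokens.shape[0]
        z = self.tokens.expand(b, -1, -1)
        z, _ = self.attn_rgb(z, rgb_tokens, rgb_tokens)      # gather RGB information
        z, _ = self.attn_evt(z, event_tokens, event_tokens)  # gather event information
        return z                                             # fused bottleneck features

fusion = BottleneckFusion()
rgb = torch.rand(2, 196, 256)      # e.g., Transformer patch tokens from RGB frames
evt = torch.rand(2, 128, 256)      # e.g., spiking-network features per event bin
fused = fusion(rgb, evt)           # (2, 8, 256), fed to the prediction head
```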