cs.CV - 2023-08-07

Improving FHB Screening in Wheat Breeding Using an Efficient Transformer Model

  • paper_url: http://arxiv.org/abs/2308.03670
  • repo_url: None
  • paper_authors: Babak Azad, Ahmed Abdalla, Kwanghee Won, Ali Mirzakhani Nafchi
  • for: This work develops a vision-transformer-based method for Fusarium head blight (FHB) detection, aiming to improve the efficiency and accuracy of FHB resistance screening in wheat and barley breeding programs.
  • methods: A new Context Bridge integrates the local representation capability of the U-Net network with the global self-attention of the vision transformer to improve multi-scale modeling, and the standard attention of the original transformer is replaced with Efficient Self-attention to reduce model complexity (a toy sketch of one such attention variant follows the abstract below).
  • results: Extensive experiments and evaluations demonstrate the effectiveness of the transformer-based approach for FHB detection.
    Abstract Fusarium head blight is a devastating disease that causes significant economic losses annually on small grains. Efficiency, accuracy, and timely detection of FHB in the resistance screening are critical for wheat and barley breeding programs. In recent years, various image processing techniques have been developed using supervised machine learning algorithms for the early detection of FHB. The state-of-the-art convolutional neural network-based methods, such as U-Net, employ a series of encoding blocks to create a local representation and a series of decoding blocks to capture the semantic relations. However, these methods are not often capable of long-range modeling dependencies inside the input data, and their ability to model multi-scale objects with significant variations in texture and shape is limited. Vision transformers as alternative architectures with innate global self-attention mechanisms for sequence-to-sequence prediction, due to insufficient low-level details, may also limit localization capabilities. To overcome these limitations, a new Context Bridge is proposed to integrate the local representation capability of the U-Net network in the transformer model. In addition, the standard attention mechanism of the original transformer is replaced with Efficient Self-attention, which is less complicated than other state-of-the-art methods. To train the proposed network, 12,000 wheat images from an FHB-inoculated wheat field at the SDSU research farm in Volga, SD, were captured. In addition to healthy and unhealthy plants, these images encompass various stages of the disease. A team of expert pathologists annotated the images for training and evaluating the developed model. As a result, the effectiveness of the transformer-based method for FHB-disease detection, through extensive experiments across typical tasks for plant image segmentation, is demonstrated.
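    Code sketch: the abstract does not spell out the exact form of Efficient Self-attention used here; a common linear-complexity variant (Shen et al.) normalizes queries and keys separately so the K^T V product is formed first, avoiding the n x n attention map. The NumPy sketch below illustrates that generic idea only; the shapes and softmax axis choices are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_self_attention(x, wq, wk, wv):
    """Linear-complexity attention: softmax(Q) @ (softmax(K)^T @ V).

    x  : (n, d)  n tokens with d channels
    wq, wk, wv : (d, d) projection matrices
    Memory/compute scale with n*d^2 instead of n^2*d.
    """
    q = softmax(x @ wq, axis=-1)   # normalize each query over channels
    k = softmax(x @ wk, axis=0)    # normalize keys over the token axis
    v = x @ wv
    context = k.T @ v              # (d, d) global context, no n x n map
    return q @ context             # (n, d)

# toy usage: 196 tokens (14x14 patches), 64 channels
rng = np.random.default_rng(0)
x = rng.standard_normal((196, 64))
w = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]
print(efficient_self_attention(x, *w).shape)  # (196, 64)
```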

Distributionally Robust Classification on a Data Budget

  • paper_url: http://arxiv.org/abs/2308.03821
  • repo_url: https://github.com/penfever/vlhub
  • paper_authors: Benjamin Feuer, Ameya Joshi, Minh Pham, Chinmay Hegde
  • For: The paper aims to address the challenge of training robust deep learning models under distribution shifts, specifically in domains with limited data budgets.
  • Methods: The authors introduce a new dataset called JANuS (Joint Annotations and Names Set) and perform a series of carefully controlled investigations of the factors contributing to robustness in image classification. They train a standard ResNet-50 with the cross-entropy loss on 2.4 million image samples and compare the results to a CLIP ResNet-50 trained on 400 million samples.
  • Results: The authors show that the standard ResNet-50 can attain robustness comparable to the CLIP ResNet-50 on limited data budgets, which, to their knowledge, is the first result of its kind.
    Abstract Real world uses of deep learning require predictable model behavior under distribution shifts. Models such as CLIP show emergent natural distributional robustness comparable to humans, but may require hundreds of millions of training samples. Can we train robust learners in a domain where data is limited? To rigorously address this question, we introduce JANuS (Joint Annotations and Names Set), a collection of four new training datasets with images, labels, and corresponding captions, and perform a series of carefully controlled investigations of factors contributing to robustness in image classification, then compare those results to findings derived from a large-scale meta-analysis. Using this approach, we show that standard ResNet-50 trained with the cross-entropy loss on 2.4 million image samples can attain comparable robustness to a CLIP ResNet-50 trained on 400 million samples. To our knowledge, this is the first result showing (near) state-of-the-art distributional robustness on limited data budgets. Our dataset is available at \url{https://huggingface.co/datasets/penfever/JANuS_dataset}, and the code used to reproduce our experiments can be found at \url{https://github.com/penfever/vlhub/}.

WarpEM: Dynamic Time Warping for Accurate Catheter Registration in EM-guided Procedures

  • paper_url: http://arxiv.org/abs/2308.03652
  • repo_url: None
  • paper_authors: Ardit Ramadani, Peter Ewert, Heribert Schunkert, Nassir Navab
  • for: This paper presents an automated electromagnetic (EM) registration method for accurately tracking catheters during minimally invasive endovascular procedures.
  • methods: The method applies 3D signal temporal analysis, specifically Dynamic Time Warping (DTW), to improve registration accuracy and reliability (a generic DTW sketch follows the abstract below).
  • results: DTW accurately warps and matches EM-tracked paths to the vessel centerline, achieving a mean error of 2.22 mm against marker-based registration used as ground truth.
    Abstract Accurate catheter tracking is crucial during minimally invasive endovascular procedures (MIEP), and electromagnetic (EM) tracking is a widely used technology that serves this purpose. However, registration between preoperative images and the EM tracking system is often challenging. Existing registration methods typically require manual interactions, which can be time-consuming, increase the risk of errors and change the procedural workflow. Although several registration methods are available for catheter tracking, such as marker-based and path-based approaches, their limitations can impact the accuracy of the resulting tracking solution, consequently, the outcome of the medical procedure. This paper introduces a novel automated catheter registration method for EM-guided MIEP. The method utilizes 3D signal temporal analysis, such as Dynamic Time Warping (DTW) algorithms, to improve registration accuracy and reliability compared to existing methods. DTW can accurately warp and match EM-tracked paths to the vessel's centerline, making it particularly suitable for registration. The introduced registration method is evaluated for accuracy in a vascular phantom using a marker-based registration as the ground truth. The results indicate that the DTW method yields accurate and reliable registration outcomes, with a mean error of $2.22$mm. The introduced registration method presents several advantages over state-of-the-art methods, such as high registration accuracy, no initialization required, and increased automation.
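    Code sketch: as a concrete illustration of how DTW can align an EM-tracked path with a vessel centerline, here is a minimal dynamic-programming DTW over 3D point sequences. It is a generic textbook implementation, not the WarpEM code; the Euclidean point distance and symmetric step pattern are assumptions.

```python
import numpy as np

def dtw_align(path_em, centerline):
    """Align two 3D polylines with dynamic time warping.

    path_em, centerline : (n, 3) and (m, 3) arrays of points.
    Returns the accumulated warping cost and the optimal index pairs.
    """
    n, m = len(path_em), len(centerline)
    dist = np.linalg.norm(path_em[:, None, :] - centerline[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # backtrack the warping path
    i, j, pairs = n, m, []
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0: i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else: j -= 1
    return acc[n, m], pairs[::-1]

# toy usage: a noisy, resampled copy of a helix stands in for the EM path
t = np.linspace(0, 4 * np.pi, 120)
centerline = np.stack([np.cos(t), np.sin(t), 0.05 * t], axis=1)
em_path = centerline[::2] + 0.01 * np.random.default_rng(1).standard_normal((60, 3))
cost, pairs = dtw_align(em_path, centerline)
print(round(cost, 3), len(pairs))
```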

MOMA-Force: Visual-Force Imitation for Real-World Mobile Manipulation

  • paper_url: http://arxiv.org/abs/2308.03624
  • repo_url: None
  • paper_authors: Taozheng Yang, Ya Jing, Hongtao Wu, Jiafeng Xu, Kuankuan Sima, Guangzeng Chen, Qie Sima, Tao Kong
  • for: This paper presents a novel method for mobile manipulators to perform multiple contact-rich manipulation tasks.
  • methods: The approach, MOMA-Force, combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control to achieve both robustness and controllability (a minimal admittance-control sketch follows the abstract below).
  • results: In a real household setting, the method achieves higher task success rates and smaller contact forces and force variances than baseline methods.
    Abstract In this paper, we present a novel method for mobile manipulators to perform multiple contact-rich manipulation tasks. While learning-based methods have the potential to generate actions in an end-to-end manner, they often suffer from insufficient action accuracy and robustness against noise. On the other hand, classical control-based methods can enhance system robustness, but at the cost of extensive parameter tuning. To address these challenges, we present MOMA-Force, a visual-force imitation method that seamlessly combines representation learning for perception, imitation learning for complex motion generation, and admittance whole-body control for system robustness and controllability. MOMA-Force enables a mobile manipulator to learn multiple complex contact-rich tasks with high success rates and small contact forces. In a real household setting, our method outperforms baseline methods in terms of task success rates. Moreover, our method achieves smaller contact forces and smaller force variances compared to baseline methods without force imitation. Overall, we offer a promising approach for efficient and robust mobile manipulation in the real world. Videos and more details can be found on \url{https://visual-force-imitation.github.io}
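    Code sketch: the abstract relies on admittance whole-body control for robustness; below is a minimal, generic admittance update for a single Cartesian axis, included purely to illustrate the control idea. The virtual mass/damping/stiffness values and the Euler integration are assumptions, not MOMA-Force's parameters.

```python
def admittance_step(x, dx, x_ref, f_ext, M=2.0, D=40.0, K=300.0, dt=0.002):
    """One Euler step of M*ddx + D*dx + K*(x - x_ref) = f_ext.

    The measured external force f_ext (e.g. from a wrist F/T sensor)
    displaces the commanded position x around the imitation reference x_ref,
    so the robot yields to contact instead of fighting it.
    """
    ddx = (f_ext - D * dx - K * (x - x_ref)) / M
    dx = dx + ddx * dt
    x = x + dx * dt
    return x, dx

# toy usage: a constant 5 N contact force pushes the commanded pose off the reference
x, dx = 0.0, 0.0
for _ in range(2000):
    x, dx = admittance_step(x, dx, x_ref=0.0, f_ext=5.0)
print(round(x, 4))  # settles near f_ext / K = 0.0167 m
```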

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

  • paper_url: http://arxiv.org/abs/2308.03620
  • repo_url: None
  • paper_authors: Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, Tao Kong
  • for: This work investigates how visual pre-training strategies affect robot manipulation tasks, conducting a thorough study from three fundamental perspectives: pre-training datasets, model architectures, and training methods.
  • methods: The study uses large-scale real-world data and runs extensive experiments over model architectures and training methods; it further proposes Vi-PRoM, which combines self-supervised contrastive learning with supervised learning of visual semantics and temporal dynamics (a generic contrastive-loss sketch follows the abstract below).
  • results: Experiments in various simulation environments and on a real robot show that the proposed Vi-PRoM scheme yields significant improvements.
    Abstract Visual pre-training with large-scale real-world data has made great progress in recent years, showing great potential in robot learning with pixel observations. However, the recipes of visual pre-training for robot manipulation tasks are yet to be built. In this paper, we thoroughly investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives: pre-training datasets, model architectures and training methods. Several significant experimental findings are provided that are beneficial for robot learning. Further, we propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning. Concretely, the former employs contrastive learning to acquire underlying patterns from large-scale unlabeled data, while the latter aims learning visual semantics and temporal dynamics. Extensive experiments on robot manipulations in various simulation environments and the real robot demonstrate the superiority of the proposed scheme. Videos and more details can be found on \url{https://explore-pretrain-robot.github.io}.
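    Code sketch: Vi-PRoM's self-supervised stage uses contrastive learning on unlabeled data; the exact objective is not given in the abstract, so the sketch below shows a generic InfoNCE loss over two augmented views, the standard formulation this family of methods builds on. The temperature and batch size are placeholders.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Generic InfoNCE: each view-1 embedding should match its own view-2 embedding.

    z1, z2 : (batch, dim) embeddings of two augmentations, L2-normalized here.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature               # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives lie on the diagonal

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 128))
noise = 0.05 * rng.standard_normal((8, 128))
print(round(float(info_nce(feats, feats + noise)), 4))  # low loss: views nearly identical
```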

Adaptive Semi-Supervised Segmentation of Brain Vessels with Ambiguous Labels

  • paper_url: http://arxiv.org/abs/2308.03613
  • repo_url: None
  • paper_authors: Fengming Lin, Yan Xia, Nishant Ravikumar, Qiongyao Liu, Michael MacRaild, Alejandro F Frangi
  • for: This work aims to improve the accuracy of brain vessel segmentation to support the diagnosis and treatment of cerebrovascular disease.
  • methods: The approach employs progressive semi-supervised learning, an adaptive training strategy, and boundary enhancement.
  • results: Experiments on 3DRA datasets show superior mesh-based segmentation results, and the method handles partially or ambiguously annotated data, achieving strong performance on mislabeled fine vessels.
    Abstract Accurate segmentation of brain vessels is crucial for cerebrovascular disease diagnosis and treatment. However, existing methods face challenges in capturing small vessels and handling datasets that are partially or ambiguously annotated. In this paper, we propose an adaptive semi-supervised approach to address these challenges. Our approach incorporates innovative techniques including progressive semi-supervised learning, adaptative training strategy, and boundary enhancement. Experimental results on 3DRA datasets demonstrate the superiority of our method in terms of mesh-based segmentation metrics. By leveraging the partially and ambiguously labeled data, which only annotates the main vessels, our method achieves impressive segmentation performance on mislabeled fine vessels, showcasing its potential for clinical applications.

AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose

  • paper_url: http://arxiv.org/abs/2308.03610
  • repo_url: None
  • paper_authors: Huichao Zhang, Bowen Chen, Hao Yang, Liao Qu, Xu Wang, Li Chen, Chao Long, Feida Zhu, Kang Du, Min Zheng
  • for: High-quality, diverse 3D avatar creation from text descriptions and pose guidance.
  • methods: A 2D diffusion model conditioned on DensePose signals provides 3D pose control of avatars through 2D images, combined with a progressive high-resolution 3D synthesis strategy.
  • results: The pipeline achieves zero-shot generation of expressive, high-quality, and diverse 3D avatars, outperforming previous work in quality and stability.
    Abstract Creating expressive, diverse and high-quality 3D avatars from highly customized text descriptions and pose guidance is a challenging task, due to the intricacy of modeling and texturing in 3D that ensure details and various styles (realistic, fictional, etc). We present AvatarVerse, a stable pipeline for generating expressive high-quality 3D avatars from nothing but text descriptions and pose guidance. In specific, we introduce a 2D diffusion model conditioned on DensePose signal to establish 3D pose control of avatars through 2D images, which enhances view consistency from partially observed scenarios. It addresses the infamous Janus Problem and significantly stablizes the generation process. Moreover, we propose a progressive high-resolution 3D synthesis strategy, which obtains substantial improvement over the quality of the created 3D avatars. To this end, the proposed AvatarVerse pipeline achieves zero-shot 3D modeling of 3D avatars that are not only more expressive, but also in higher quality and fidelity than previous works. Rigorous qualitative evaluations and user studies showcase AvatarVerse's superiority in synthesizing high-fidelity 3D avatars, leading to a new standard in high-quality and stable 3D avatar creation. Our project page is: https://avatarverse3d.github.io

Recurrent Self-Supervised Video Denoising with Denser Receptive Field

  • paper_url: http://arxiv.org/abs/2308.03608
  • repo_url: None
  • paper_authors: Zichun Wang, Yulun Zhang, Debing Zhang, Ying Fu
  • for: Self-supervised video denoising.
  • methods: The method builds on blind-spot networks and proposes a self-supervised recurrent video denoising framework (RDRF) with a denser receptive field.
  • results: Denoising is improved by fully exploiting information from both the reference and neighboring frames and by aggregating local and distant temporal features, yielding strong performance on synthetic and real datasets.
    Abstract Self-supervised video denoising has seen decent progress through the use of blind spot networks. However, under their blind spot constraints, previous self-supervised video denoising methods suffer from significant information loss and texture destruction in either the whole reference frame or neighbor frames, due to their inadequate consideration of the receptive field. Moreover, the limited number of available neighbor frames in previous methods leads to the discarding of distant temporal information. Nonetheless, simply adopting existing recurrent frameworks does not work, since they easily break the constraints on the receptive field imposed by self-supervision. In this paper, we propose RDRF for self-supervised video denoising, which not only fully exploits both the reference and neighbor frames with a denser receptive field, but also better leverages the temporal information from both local and distant neighbor features. First, towards a comprehensive utilization of information from both reference and neighbor frames, RDRF realizes a denser receptive field by taking more neighbor pixels along the spatial and temporal dimensions. Second, it features a self-supervised recurrent video denoising framework, which concurrently integrates distant and near-neighbor temporal features. This enables long-term bidirectional information aggregation, while mitigating error accumulation in the plain recurrent framework. Our method exhibits superior performance on both synthetic and real video denoising datasets. Codes will be available at https://github.com/Wang-XIaoDingdd/RDRF.

FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision

  • paper_url: http://arxiv.org/abs/2308.03594
  • repo_url: None
  • paper_authors: Khurram Azeem Hashmi, Goutham Kallempudi, Didier Stricker, Muhammamd Zeshan Afzal
  • for: Improving performance on low-light vision tasks, in particular extracting visual cues that are useful for downstream tasks.
  • methods: A novel module, FeatEnHancer, hierarchically combines multi-scale features using multi-headed attention guided by task-related loss functions to produce suitable representations.
  • results: FeatEnHancer delivers significant and consistent gains across several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), demonstrating the effectiveness of enhancing hierarchical features.
    Abstract Extracting useful visual cues for the downstream tasks is especially challenging under low-light vision. Prior works create enhanced representations by either correlating visual quality with machine perception or designing illumination-degrading transformation methods that require pre-training on synthetic datasets. We argue that optimizing enhanced image representation pertaining to the loss of the downstream task can result in more expressive representations. Therefore, in this work, we propose a novel module, FeatEnHancer, that hierarchically combines multiscale features using multiheaded attention guided by task-related loss function to create suitable representations. Furthermore, our intra-scale enhancement improves the quality of features extracted at each scale or level, as well as combines features from different scales in a way that reflects their relative importance for the task at hand. FeatEnHancer is a general-purpose plug-and-play module and can be incorporated into any low-light vision pipeline. We show with extensive experimentation that the enhanced representation produced with FeatEnHancer significantly and consistently improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAPon DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC ), and video object detection (+1.8 mAP on DarkVision), highlighting the effectiveness of enhancing hierarchical features under low-light vision.

SoilNet: An Attention-based Spatio-temporal Deep Learning Framework for Soil Organic Carbon Prediction with Digital Soil Mapping in Europe

  • paper_url: http://arxiv.org/abs/2308.03586
  • repo_url: None
  • paper_authors: Nafiseh Kakhani, Moien Rangzan, Ali Jamali, Sara Attarchi, Seyed Kazem Alavipanah, Thomas Scholten
  • for: This study aims to improve the accuracy and reliability of digital soil mapping by using deep learning to predict the spatial distribution of soil organic carbon (SOC).
  • methods: A novel architecture combines a base CNN with a spatial attention mechanism and an LSTM network over climate time series to predict SOC across Europe, taking Landsat-8 imagery, topography, remote sensing indices, and climate time series as input features (a toy spatial-attention sketch follows the abstract below).
  • results: The proposed architecture outperforms commonly used machine learning approaches such as random forest, achieving a lower root mean square error, and provides a robust tool for predicting SOC and other soil properties to support land management and decision-making.
    Abstract Digital soil mapping (DSM) is an advanced approach that integrates statistical modeling and cutting-edge technologies, including machine learning (ML) methods, to accurately depict soil properties and their spatial distribution. Soil organic carbon (SOC) is a crucial soil attribute providing valuable insights into soil health, nutrient cycling, greenhouse gas emissions, and overall ecosystem productivity. This study highlights the significance of spatial-temporal deep learning (DL) techniques within the DSM framework. A novel architecture is proposed, incorporating spatial information using a base convolutional neural network (CNN) model and spatial attention mechanism, along with climate temporal information using a long short-term memory (LSTM) network, for SOC prediction across Europe. The model utilizes a comprehensive set of environmental features, including Landsat-8 images, topography, remote sensing indices, and climate time series, as input features. Results demonstrate that the proposed framework outperforms conventional ML approaches like random forest commonly used in DSM, yielding lower root mean square error (RMSE). This model is a robust tool for predicting SOC and could be applied to other soil properties, thereby contributing to the advancement of DSM techniques and facilitating land management and decision-making processes based on accurate information.
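    Code sketch: the abstract combines a CNN with a spatial attention mechanism but does not detail its design; below is a minimal CBAM-style spatial attention map as one plausible reading, applied to a CNN feature tensor. The mixing weights (collapsed here to two scalars instead of a learned conv) and all shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w=(0.5, 0.5), b=0.0):
    """Minimal spatial attention over a (C, H, W) feature map.

    Channel-wise mean and max maps are mixed by two scalar weights and
    squashed to a (1, H, W) mask that re-weights every spatial location.
    """
    avg_map = feat.mean(axis=0, keepdims=True)           # (1, H, W)
    max_map = feat.max(axis=0, keepdims=True)            # (1, H, W)
    mask = sigmoid(w[0] * avg_map + w[1] * max_map + b)  # (1, H, W)
    return feat * mask, mask

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 16, 16))  # e.g. CNN features of a Landsat-8 patch
weighted, mask = spatial_attention(feat)
print(weighted.shape, mask.shape)  # (64, 16, 16) (1, 16, 16)
```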

Feature Decoupling-Recycling Network for Fast Interactive Segmentation

  • paper_url: http://arxiv.org/abs/2308.03529
  • repo_url: None
  • paper_authors: Huimin Zeng, Weinong Wang, Xin Tao, Zhiwei Xiong, Yu-Wing Tai, Wenjie Pei
  • for: Improving the efficiency of interactive segmentation, especially in challenging scenarios requiring long-term interactions (up to 4.25x faster), while remaining broadly applicable and robust.
  • methods: The proposed Feature Decoupling-Recycling Network (FDRN) decouples the modeling components according to their intrinsic discrepancies and then recycles those components for each user interaction.
  • results: Extensive experiments on six datasets from different domains and modalities show that the method is 1) more efficient than other approaches (up to 4.25x faster), particularly in long-interaction scenarios, while achieving favorable segmentation performance; 2) applicable to various methods as a universal enhancement technique; and 3) well generalizable across tasks and robust to misleading user guidance.
    Abstract Recent interactive segmentation methods iteratively take source image, user guidance and previously predicted mask as the input without considering the invariant nature of the source image. As a result, extracting features from the source image is repeated in each interaction, resulting in substantial computational redundancy. In this work, we propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies and then recycles components for each user interaction. Thus, the efficiency of the whole interactive process can be significantly improved. To be specific, we apply the Decoupling-Recycling strategy from three perspectives to address three types of discrepancies, respectively. First, our model decouples the learning of source image semantics from the encoding of user guidance to process two types of input domains separately. Second, FDRN decouples high-level and low-level features from stratified semantic representations to enhance feature learning. Third, during the encoding of user guidance, current user guidance is decoupled from historical guidance to highlight the effect of current user guidance. We conduct extensive experiments on 6 datasets from different domains and modalities, which demonstrate the following merits of our model: 1) superior efficiency than other methods, particularly advantageous in challenging scenarios requiring long-term interactions (up to 4.25x faster), while achieving favorable segmentation performance; 2) strong applicability to various methods serving as a universal enhancement technique; 3) well cross-task generalizability, e.g., to medical image segmentation, and robustness against misleading user guidance.

Keyword Spotting Simplified: A Segmentation-Free Approach using Character Counting and CTC re-scoring

  • paper_url: http://arxiv.org/abs/2308.03515
  • repo_url: https://github.com/georgeretsi/segfreekws
  • paper_authors: George Retsinas, Giorgos Sfikas, Christophoros Nikou
  • for: This paper targets keyword spotting in document images, treating it as a segmentation-free detection problem and borrowing from state-of-the-art detection systems to jointly propose word bounding boxes and compute the corresponding representations.
  • methods: Unlike methods that rely on complex, large DNN models, the proposed system is simple and compact: it efficiently scans a document image for rectangular areas containing the query, predicting character occurrences over rectangles through an implicitly learned scale map (the integral-image counting step is sketched after the abstract below).
  • results: Experiments show that the method outperforms state-of-the-art approaches despite the simplicity and small footprint of the underlying model.
    Abstract Recent advances in segmentation-free keyword spotting treat this problem w.r.t. an object detection paradigm and borrow from state-of-the-art detection systems to simultaneously propose a word bounding box proposal mechanism and compute a corresponding representation. Contrary to the norm of such methods that rely on complex and large DNN models, we propose a novel segmentation-free system that efficiently scans a document image to find rectangular areas that include the query information. The underlying model is simple and compact, predicting character occurrences over rectangular areas through an implicitly learned scale map, trained on word-level annotated images. The proposed document scanning is then performed using this character counting in a cost-effective manner via integral images and binary search. Finally, the retrieval similarity by character counting is refined by a pyramidal representation and a CTC-based re-scoring algorithm, fully utilizing the trained CNN model. Experimental validation on two widely-used datasets shows that our method achieves state-of-the-art results outperforming the more complex alternatives, despite the simplicity of the underlying model.
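    Code sketch: to make the scanning step concrete, here is a small NumPy illustration of how an integral image lets one sum a predicted character-occurrence map over any rectangle in O(1), which is the mechanism the abstract describes for cost-effective document scanning. The count map is random here; in the paper it would come from the trained CNN.

```python
import numpy as np

def build_integral(count_map):
    """Cumulative-sum table with a zero top row/left column for easy lookups."""
    ii = np.zeros((count_map.shape[0] + 1, count_map.shape[1] + 1))
    ii[1:, 1:] = count_map.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of count_map[top:bottom, left:right] from the integral image, O(1)."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

rng = np.random.default_rng(0)
char_counts = rng.random((64, 256))   # per-pixel expected character occurrences
ii = build_integral(char_counts)

# estimated number of characters inside a candidate word box
box = (10, 40, 30, 120)               # top, left, bottom, right
print(round(box_sum(ii, *box), 2),
      round(char_counts[10:30, 40:120].sum(), 2))  # the two numbers agree
```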

Learning Photometric Feature Transform for Free-form Object Scan

  • paper_url: http://arxiv.org/abs/2308.03492
  • repo_url: None
  • paper_authors: Xiang Feng, Kaizhang Kang, Fan Pei, Huakeng Ding, Jinjiang You, Ping Tan, Kun Zhou, Hongzhi Wu
  • for: Improving the accuracy and speed of 3D reconstruction.
  • methods: A framework automatically learns to aggregate and transform photometric measurements from multiple unstructured views into view-invariant features, with the illumination conditions during acquisition and the feature transform jointly trained on synthetic data.
  • results: The system reconstructs geometry and anisotropic reflectance from hand-held scans with high quality and speed, validated against a professional 3D scanner and photographs and comparing favorably with state-of-the-art techniques.
    Abstract We propose a novel framework to automatically learn to aggregate and transform photometric measurements from multiple unstructured views into spatially distinctive and view-invariant low-level features, which are fed to a multi-view stereo method to enhance 3D reconstruction. The illumination conditions during acquisition and the feature transform are jointly trained on a large amount of synthetic data. We further build a system to reconstruct the geometry and anisotropic reflectance of a variety of challenging objects from hand-held scans. The effectiveness of the system is demonstrated with a lightweight prototype, consisting of a camera and an array of LEDs, as well as an off-the-shelf tablet. Our results are validated against reconstructions from a professional 3D scanner and photographs, and compare favorably with state-of-the-art techniques.

Improving Mass Detection in Mammography Images: A Study of Weakly Supervised Learning and Class Activation Map Methods

  • paper_url: http://arxiv.org/abs/2308.03486
  • repo_url: None
  • paper_authors: Vicente Sampaio, Filipe R. Cordeiro
  • for: This work studies different class activation map methods to improve weakly supervised mass detection models for mammography images.
  • methods: State-of-the-art weakly supervised training is combined with several activation map techniques, namely Class Activation Maps (CAM), GradCAM, GradCAM++, XGradCAM, and LayerCAM, within the GMIC framework (a minimal Grad-CAM sketch follows the abstract below).
  • results: Using different activation map strategies during the training and test stages improves model performance, in particular reducing False Positives Per Image (FPPI) and increasing the True Positive Rate (TPR).
    Abstract In recent years, weakly supervised models have aided in mass detection using mammography images, decreasing the need for pixel-level annotations. However, most existing models in the literature rely on Class Activation Maps (CAM) as the activation method, overlooking the potential benefits of exploring other activation techniques. This work presents a study that explores and compares different activation maps in conjunction with state-of-the-art methods for weakly supervised training in mammography images. Specifically, we investigate CAM, GradCAM, GradCAM++, XGradCAM, and LayerCAM methods within the framework of the GMIC model for mass detection in mammography images. The evaluation is conducted on the VinDr-Mammo dataset, utilizing the metrics Accuracy, True Positive Rate (TPR), False Negative Rate (FNR), and False Positive Per Image (FPPI). Results show that using different strategies of activation maps during training and test stages leads to an improvement of the model. With this strategy, we improve the results of the GMIC method, decreasing the FPPI value and increasing TPR.
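    Code sketch: for readers unfamiliar with the activation maps being compared, the sketch below computes a vanilla Grad-CAM heatmap from a convolutional feature map and its gradients: channel weights are the spatially averaged gradients, and the map is their ReLU'd weighted sum. The inputs are random stand-ins for what a trained detector such as GMIC would provide.

```python
import numpy as np

def grad_cam(features, gradients):
    """Grad-CAM: weight each feature channel by its average gradient, then ReLU.

    features, gradients : (C, H, W) activations of the last conv layer and
    the gradient of the target score w.r.t. those activations.
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))  # (C,) channel importance
    cam = np.maximum((weights[:, None, None] * features).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

rng = np.random.default_rng(0)
feats = rng.random((256, 14, 14))
grads = rng.standard_normal((256, 14, 14))
heatmap = grad_cam(feats, grads)
print(heatmap.shape, float(heatmap.max()))  # (14, 14) 1.0
```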

Deepfake Detection: A Comparative Analysis

  • paper_url: http://arxiv.org/abs/2308.03471
  • repo_url: https://github.com/akshay-atam/Deepfake-Detection-Using-Machine-Learning
  • paper_authors: Sohail Ahmed Khan, Duc-Tien Dang-Nguyen
  • for: This paper aims to provide insights into the effectiveness of different deep learning architectures, training strategies, and deepfake detection benchmarks for developing more accurate and reliable deepfake detection systems.
  • methods: The paper evaluates eight supervised deep learning architectures and two transformer-based models pre-trained using self-supervised strategies on four benchmarks, including intra-dataset and inter-dataset evaluations, to examine the best performing models, generalisation capabilities, and impact of augmentations.
  • results: The paper presents a comprehensive comparative analysis of supervised and self-supervised models for deepfake detection, identifying the best performing models, their generalisation capabilities, and the impact of augmentations, providing insights into the effectiveness of different deep learning architectures, training strategies, and deepfake detection benchmarks.
    Abstract This paper present a comprehensive comparative analysis of supervised and self-supervised models for deepfake detection. We evaluate eight supervised deep learning architectures and two transformer-based models pre-trained using self-supervised strategies (DINO, CLIP) on four benchmarks (FakeAVCeleb, CelebDF-V2, DFDC, and FaceForensics++). Our analysis includes intra-dataset and inter-dataset evaluations, examining the best performing models, generalisation capabilities, and impact of augmentations. We also investigate the trade-off between model size and performance. Our main goal is to provide insights into the effectiveness of different deep learning architectures (transformers, CNNs), training strategies (supervised, self-supervised), and deepfake detection benchmarks. These insights can help guide the development of more accurate and reliable deepfake detection systems, which are crucial in mitigating the harmful impact of deepfakes on individuals and society.

RoadScan: A Novel and Robust Transfer Learning Framework for Autonomous Pothole Detection in Roads

  • paper_url: http://arxiv.org/abs/2308.03467
  • repo_url: None
  • paper_authors: Guruprasad Parasnis, Anmol Chokshi, Kailas Devadkar
  • for: This work presents a pothole detection method based on deep learning and image processing to address potholes on roads, which pose significant risks to road users.
  • methods: The approach, RoadScan, uses a VGG16 model for feature extraction together with a custom Siamese network trained with a triplet loss (the loss is sketched after the abstract below).
  • results: The method detects potholes accurately, reaching 96.12% accuracy, a 3.89% EER, and a 0.988 AUROC, which is highly competitive with other state-of-the-art works.
    Abstract This research paper presents a novel approach to pothole detection using Deep Learning and Image Processing techniques. The proposed system leverages the VGG16 model for feature extraction and utilizes a custom Siamese network with triplet loss, referred to as RoadScan. The system aims to address the critical issue of potholes on roads, which pose significant risks to road users. Accidents due to potholes on the roads have led to numerous accidents. Although it is necessary to completely remove potholes, it is a time-consuming process. Hence, a general road user should be able to detect potholes from a safe distance in order to avoid damage. Existing methods for pothole detection heavily rely on object detection algorithms which tend to have a high chance of failure owing to the similarity in structures and textures of a road and a pothole. Additionally, these systems utilize millions of parameters thereby making the model difficult to use in small-scale applications for the general citizen. By analyzing diverse image processing methods and various high-performing networks, the proposed model achieves remarkable performance in accurately detecting potholes. Evaluation metrics such as accuracy, EER, precision, recall, and AUROC validate the effectiveness of the system. Additionally, the proposed model demonstrates computational efficiency and cost-effectiveness by utilizing fewer parameters and data for training. The research highlights the importance of technology in the transportation sector and its potential to enhance road safety and convenience. The network proposed in this model performs with a 96.12 % accuracy, 3.89 % EER, and a 0.988 AUROC value, which is highly competitive with other state-of-the-art works.
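    Code sketch: the abstract pairs a VGG16 feature extractor with a Siamese network trained under triplet loss; the loss itself is standard and is sketched below on NumPy embeddings. The margin value and embedding size are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), averaged over the batch.

    anchor/positive share a class (e.g. pothole), negative comes from the
    other class (plain road), so the embedding pulls matches together and
    pushes mismatches at least `margin` apart.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

rng = np.random.default_rng(0)
a = rng.standard_normal((16, 128))
p = a + 0.05 * rng.standard_normal((16, 128))  # near the anchors
n = rng.standard_normal((16, 128))             # unrelated samples
print(round(float(triplet_loss(a, p, n)), 3))  # 0.0: these triplets are already satisfied
```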

DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis

  • paper_url: http://arxiv.org/abs/2308.03463
  • repo_url: https://github.com/alibaba/EasyNLP
  • paper_authors: Zhongjie Duan, Lizhou You, Chengyu Wang, Cen Chen, Ziheng Wu, Weining Qian, Jun Huang
  • for: This paper aims to adapt image synthesis models to video synthesis, improving the quality of generated videos.
  • methods: The proposed approach, DiffSynth, consists of two key components: a latent in-iteration deflickering framework and a patch blending video deflickering algorithm.
  • results: Experiments show that DiffSynth effectively avoids flicker and performs well across diverse video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoration, and 3D rendering.
    Abstract In recent years, diffusion models have emerged as the most powerful approach in image synthesis. However, applying these models directly to video synthesis presents challenges, as it often leads to noticeable flickering contents. Although recently proposed zero-shot methods can alleviate flicker to some extent, we still struggle to generate coherent videos. In this paper, we propose DiffSynth, a novel approach that aims to convert image synthesis pipelines to video synthesis pipelines. DiffSynth consists of two key components: a latent in-iteration deflickering framework and a video deflickering algorithm. The latent in-iteration deflickering framework applies video deflickering to the latent space of diffusion models, effectively preventing flicker accumulation in intermediate steps. Additionally, we propose a video deflickering algorithm, named patch blending algorithm, that remaps objects in different frames and blends them together to enhance video consistency. One of the notable advantages of DiffSynth is its general applicability to various video synthesis tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering. In the task of text-guided video stylization, we make it possible to synthesize high-quality videos without cherry-picking. The experimental results demonstrate the effectiveness of DiffSynth. All videos can be viewed on our project page. Source codes will also be released.

Cross-Silo Prototypical Calibration for Federated Learning with Non-IID Data

  • paper_url: http://arxiv.org/abs/2308.03457
  • repo_url: https://github.com/qizhuang-qz/FedCSPC
  • paper_authors: Zhuang Qi, Lei Meng, Zitan Chen, Han Hu, Hui Lin, Xiangxu Meng
  • for: This paper aims to improve the performance of federated learning by addressing the issue of dataset biases, such as heterogeneous data distributions and missing classes, through a cross-silo prototypical calibration method called FedCSPC.
  • methods: The FedCSPC method uses a Data Prototypical Modeling (DPM) module to learn data patterns via clustering, and a cross-silo prototypical calibration (CSPC) module to improve the robustness of the calibration. The CSPC module projects cross-source features into a consistent space while maintaining clear decision boundaries.
  • results: The paper shows that FedCSPC outperforms state-of-the-art methods in learning consistent features across different data sources of the same class, leading to better performance. The results are demonstrated through experiments on four datasets, including an ablation study, in-depth analysis, and a case study.
    Abstract Federated Learning aims to learn a global model on the server side that generalizes to all clients in a privacy-preserving manner, by leveraging the local models from different clients. Existing solutions focus on either regularizing the objective functions among clients or improving the aggregation mechanism for the improved model generalization capability. However, their performance is typically limited by the dataset biases, such as the heterogeneous data distributions and the missing classes. To address this issue, this paper presents a cross-silo prototypical calibration method (FedCSPC), which takes additional prototype information from the clients to learn a unified feature space on the server side. Specifically, FedCSPC first employs the Data Prototypical Modeling (DPM) module to learn data patterns via clustering to aid calibration. Subsequently, the cross-silo prototypical calibration (CSPC) module develops an augmented contrastive learning method to improve the robustness of the calibration, which can effectively project cross-source features into a consistent space while maintaining clear decision boundaries. Moreover, the CSPC module's ease of implementation and plug-and-play characteristics make it even more remarkable. Experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study, and the results verified that FedCSPC is capable of learning the consistent features across different data sources of the same class under the guidance of calibrated model, which leads to better performance than the state-of-the-art methods. The source codes have been released at https://github.com/qizhuang-qz/FedCSPC.
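    Code sketch: to illustrate the prototype idea behind FedCSPC, the sketch below computes per-class mean embeddings on each client and aggregates them on the server, which is one generic reading of sharing "prototype information from the clients"; the released FedCSPC code additionally learns data patterns via clustering and a contrastive calibration, which are not shown here. All shapes and class counts are illustrative.

```python
import numpy as np

def client_prototypes(features, labels, num_classes):
    """Per-class mean embeddings on one client; absent classes are simply skipped."""
    protos = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
    return protos

def aggregate_prototypes(all_client_protos, num_classes):
    """Server-side: average the prototypes each class received from any client."""
    agg = {}
    for c in range(num_classes):
        contrib = [p[c] for p in all_client_protos if c in p]
        if contrib:
            agg[c] = np.mean(contrib, axis=0)
    return agg

rng = np.random.default_rng(0)
clients = []
for _ in range(3):                          # 3 silos with non-IID label subsets
    labels = rng.integers(0, 5, size=100)   # some of the 10 classes never appear
    feats = rng.standard_normal((100, 32)) + labels[:, None]
    clients.append(client_prototypes(feats, labels, num_classes=10))
server_protos = aggregate_prototypes(clients, num_classes=10)
print(sorted(server_protos.keys()))         # only classes seen by some client
```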

Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising

  • paper_url: http://arxiv.org/abs/2308.03448
  • repo_url: https://github.com/srameo/led
  • paper_authors: Xin Jin, Jia-Wen Xiao, Ling-Hao Han, Chunle Guo, Ruixun Zhang, Xialei Liu, Chongyi Li
  • for: Improving RAW image denoising in extremely low-light environments without the laborious calibration and retraining that existing pipelines require.
  • methods: A calibration-free, self-adaptive pipeline that adapts to a target camera with only a few paired shots and brief fine-tuning, regardless of digital gain or camera sensor.
  • results: Compared with calibration-based methods, the approach achieves superior denoising across different digital gains and cameras using only a few data pairs (2 pairs per additional digital gain, 6 pairs in total) and 0.5% of the iterations.
    Abstract Calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods suffer from several main deficiencies: 1) the calibration procedure is laborious and time-consuming, 2) denoisers for different cameras are difficult to transfer, and 3) the discrepancy between synthetic noise and real noise is enlarged by high digital gain. To overcome the above shortcomings, we propose a calibration-free pipeline for Lighting Every Drakness (LED), regardless of the digital gain or camera sensor. Instead of calibrating the noise parameters and training repeatedly, our method could adapt to a target camera only with few-shot paired data and fine-tuning. In addition, well-designed structural modification during both stages alleviates the domain gap between synthetic and real noise without any extra computational cost. With 2 pairs for each additional digital gain (in total 6 pairs) and 0.5% iterations, our method achieves superior performance over other calibration-based methods. Our code is available at https://github.com/Srameo/LED .

GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images

  • paper_url: http://arxiv.org/abs/2308.03413
  • repo_url: None
  • paper_authors: Tianxiang Ma, Bingchuan Li, Qian He, Jing Dong, Tieniu Tan
  • for: This paper proposes a Geometry-aware Facial Expression Translation (GaFET) framework based on parametric 3D facial representations that stably decouples expression.
  • methods: The framework includes a Multi-level Feature Aligned Transformer that complements non-geometric facial detail features while addressing spatial feature alignment, and a StyleGAN-based De-expression model that reduces the difficulty of training GaFET on unpaired in-the-wild images.
  • results: Extensive qualitative and quantitative experiments show higher-quality and more accurate facial expression transfer than previous methods, with no need for videos or annotated training data, and applicability to various poses and complex textures.
    Abstract While current face animation methods can manipulate expressions individually, they suffer from several limitations. The expressions manipulated by some motion-based facial reenactment models are crude. Other ideas modeled with facial action units cannot generalize to arbitrary expressions not covered by annotations. In this paper, we introduce a novel Geometry-aware Facial Expression Translation (GaFET) framework, which is based on parametric 3D facial representations and can stably decoupled expression. Among them, a Multi-level Feature Aligned Transformer is proposed to complement non-geometric facial detail features while addressing the alignment challenge of spatial features. Further, we design a De-expression model based on StyleGAN, in order to reduce the learning difficulty of GaFET in unpaired "in-the-wild" images. Extensive qualitative and quantitative experiments demonstrate that we achieve higher-quality and more accurate facial expression transfer results compared to state-of-the-art methods, and demonstrate applicability of various poses and complex textures. Besides, videos or annotated training data are omitted, making our method easier to use and generalize.

A Horse with no Labels: Self-Supervised Horse Pose Estimation from Unlabelled Images and Synthetic Prior

  • paper_url: http://arxiv.org/abs/2308.03411
  • repo_url: None
  • paper_authors: Jose Sosa, David Hogg
  • for: Training deep learning methods for estimating animal pose.
  • methods: A self-supervised approach that only requires unlabelled images and a small prior set of synthetic 2D poses.
  • results: Accurate 3D and 2D animal poses can be learned from unlabelled images and a small set of synthetic 2D poses alone; given the minimal requirements and the abundance of unlabelled data, the method could easily be deployed to other animals.
    Abstract Obtaining labelled data to train deep learning methods for estimating animal pose is challenging. Recently, synthetic data has been widely used for pose estimation tasks, but most methods still rely on supervised learning paradigms utilising synthetic images and labels. Can training be fully unsupervised? Is a tiny synthetic dataset sufficient? What are the minimum assumptions that we could make for estimating animal pose? Our proposal addresses these questions through a simple yet effective self-supervised method that only assumes the availability of unlabelled images and a small set of synthetic 2D poses. We completely remove the need for any 3D or 2D pose annotations (or complex 3D animal models), and surprisingly our approach can still learn accurate 3D and 2D poses simultaneously. We train our method with unlabelled images of horses mainly collected for YouTube videos and a prior consisting of 2D synthetic poses. The latter is three times smaller than the number of images needed for training. We test our method on a challenging set of horse images and evaluate the predicted 3D and 2D poses. We demonstrate that it is possible to learn accurate animal poses even with as few assumptions as unlabelled images and a small set of 2D poses generated from synthetic data. Given the minimum requirements and the abundance of unlabelled data, our method could be easily deployed to different animals.

DiT: Efficient Vision Transformers with Dynamic Token Routing

  • paper_url: http://arxiv.org/abs/2308.03409
  • repo_url: https://github.com/maycbj/dit
  • paper_authors: Yuchen Ma, Zhengcong Fei, Junshi Huang
  • for: ImageNet classification, object detection, instance segmentation, and semantic segmentation
  • methods: Data-dependent token routing strategy for Dynamic Vision Transformer (DiT) with differentiable routing gates for multi-path feature propagation, and budget constraints for routing gate and early-stopping of feature extraction.
  • results: Superior performance and favorable complexity/accuracy trade-offs compared to many State-of-the-Art (SoTA) methods on various vision tasks, with the DiT-B5 achieving 84.8% top-1 Acc on ImageNet with 10.3 GFLOPs, which is 1.0% higher than the SoTA method with similar computational complexity.
    Abstract Recently, the tokens of images share the same static data flow in many dense networks. However, challenges arise from the variance among the objects in images, such as large variations in the spatial scale and difficulties of recognition for visual entities. In this paper, we propose a data-dependent token routing strategy to elaborate the routing paths of image tokens for Dynamic Vision Transformer, dubbed DiT. The proposed framework generates a data-dependent path per token, adapting to the object scales and visual discrimination of tokens. In feed-forward, the differentiable routing gates are designed to select the scaling paths and feature transformation paths for image tokens, leading to multi-path feature propagation. In this way, the impact of object scales and visual discrimination of image representation can be carefully tuned. Moreover, the computational cost can be further reduced by giving budget constraints to the routing gate and early-stopping of feature extraction. In experiments, our DiT achieves superior performance and favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation. Particularly, the DiT-B5 obtains 84.8\% top-1 Acc on ImageNet with 10.3 GFLOPs, which is 1.0\% higher than that of the SoTA method with similar computational complexity. These extensive results demonstrate that DiT can serve as versatile backbones for various vision tasks.
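    Code sketch: DiT's differentiable routing gates decide, per token, which scaling/transformation path to take. The abstract does not give the gate's exact parameterization, so the sketch below uses a Gumbel-softmax relaxation, a common way to make such discrete path choices differentiable; the two-path setup, gate projection, and temperature are assumptions rather than the released DiT code.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of a categorical path choice (soft routing weights)."""
    rng = rng if rng is not None else np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

def route_tokens(tokens, gate_w, path_fns, tau=1.0, rng=None):
    """Each token mixes the outputs of candidate paths with its own routing weights.

    tokens  : (n, d);  gate_w : (d, num_paths) gate projection
    path_fns: list of per-path transforms (e.g. identity vs. a cheaper branch)
    """
    weights = gumbel_softmax(tokens @ gate_w, tau, rng)        # (n, num_paths)
    outputs = np.stack([f(tokens) for f in path_fns], axis=1)  # (n, num_paths, d)
    return (weights[:, :, None] * outputs).sum(axis=1)         # (n, d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))
gate_w = rng.standard_normal((64, 2)) * 0.1
paths = [lambda x: x,          # keep the token as-is
         lambda x: 0.5 * x]    # stand-in for a coarser, cheaper path
print(route_tokens(tokens, gate_w, paths, rng=rng).shape)  # (196, 64)
```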

Spatially Varying Nanophotonic Neural Networks

  • paper_url: http://arxiv.org/abs/2308.03407
  • repo_url: None
  • paper_authors: Kaixuan Wei, Xiao Li, Johannes Froech, Praneeth Chakravarthula, James Whitehead, Ethan Tseng, Arka Majumdar, Felix Heide
  • for: This paper aims to improve the performance of optical neural networks for image recognition tasks, with the goal of bringing optical neural networks into the modern deep learning era.
  • methods: The paper introduces a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques, and experiments with a flat meta-optical system that includes an array of nanophotonic structures to induce angle-dependent responses.
  • results: The paper achieves a blind test classification accuracy of 73.80% on the CIFAR-10 dataset with a nanophotonic neural network, outperforming the first modern digital neural network (AlexNet) with 57M parameters and bringing optical neural networks into the modern deep learning era.
    Abstract The explosive growth of computation and energy cost of artificial intelligence has spurred strong interests in new computing modalities as potential alternatives to conventional electronic processors. Photonic processors that execute operations using photons instead of electrons, have promised to enable optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by the underlying network designs, have achieved image recognition accuracy much lower than state-of-the-art electronic neural networks. In this work, we close this gap by introducing a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We experimentally instantiate the network with a flat meta-optical system that encompasses an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend with approximately 2K parameters we demonstrate a nanophotonic neural network reaches 73.80\% blind test classification accuracy on CIFAR-10 dataset, and, as such, the first time, an optical neural network outperforms the first modern digital neural network -- AlexNet (72.64\%) with 57M parameters, bringing optical neural network into modern deep learning era.
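    Code sketch: the "large-kernel spatially-varying convolution learned via low-dimensional reparameterization" can be pictured as each output location owning its own kernel, with all kernels expressed as mixtures of a small basis. The sketch below shows that factorization on a single-channel image; the basis size, kernel size, and mixing scheme are assumptions, not the paper's optical design.

```python
import numpy as np

def spatially_varying_conv(img, basis, coeffs):
    """Per-pixel kernels built from a low-dimensional kernel basis.

    img    : (H, W) single-channel image
    basis  : (R, k, k) R basis kernels (the low-dimensional reparameterization)
    coeffs : (H, W, R) per-location mixing coefficients
    """
    R, k, _ = basis.shape
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            kernel = np.tensordot(coeffs[i, j], basis, axes=1)  # (k, k) local kernel
            out[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32))
basis = rng.standard_normal((4, 7, 7)) / 49  # 4 large 7x7 basis kernels
coeffs = rng.random((32, 32, 4))             # would vary smoothly in practice
print(spatially_varying_conv(img, basis, coeffs).shape)  # (32, 32)
```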

Enhancing Nucleus Segmentation with HARU-Net: A Hybrid Attention Based Residual U-Blocks Network

  • paper_url: http://arxiv.org/abs/2308.03382
  • repo_url: None
  • paper_authors: Junzhou Chen, Qian Huang, Yulin Chen, Linyi Qian, Chengyuan Yu
  • for: This work aims to improve the accuracy and effectiveness of nucleus instance segmentation, addressing the limitations of existing methods on issues such as image quality and complex cell clustering.
  • methods: A dual-branch network with hybrid attention based residual U-blocks that simultaneously predicts target information and target contours, plus a post-processing step that combines the two to distinguish overlapping nuclei and generate the instance segmentation map. A Context Fusion block (CF-block) effectively extracts and merges contextual information within the network.
  • results: Extensive quantitative evaluations on multiple datasets demonstrate superior performance compared to state-of-the-art methods on the BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.
    Abstract Nucleus image segmentation is a crucial step in the analysis, pathological diagnosis, and classification, which heavily relies on the quality of nucleus segmentation. However, the complexity of issues such as variations in nucleus size, blurred nucleus contours, uneven staining, cell clustering, and overlapping cells poses significant challenges. Current methods for nucleus segmentation primarily rely on nuclear morphology or contour-based approaches. Nuclear morphology-based methods exhibit limited generalization ability and struggle to effectively predict irregular-shaped nuclei, while contour-based extraction methods face challenges in accurately segmenting overlapping nuclei. To address the aforementioned issues, we propose a dual-branch network using hybrid attention based residual U-blocks for nucleus instance segmentation. The network simultaneously predicts target information and target contours. Additionally, we introduce a post-processing method that combines the target information and target contours to distinguish overlapping nuclei and generate an instance segmentation image. Within the network, we propose a context fusion block (CF-block) that effectively extracts and merges contextual information from the network. Extensive quantitative evaluations are conducted to assess the performance of our method. Experimental results demonstrate the superior performance of the proposed method compared to state-of-the-art approaches on the BNS, MoNuSeg, CoNSeg, and CPM-17 datasets.
    摘要 核心像素分割是生物学分析、诊断和分类中一个关键步骤,但是这个步骤受到核心像素质量的限制。然而,核心像素的变化、模糊、不均匀染料、细胞堆叠和重叠细胞等问题带来了挑战。现有的核心像素分割方法主要基于核心形态或边缘检测方法。核心形态基本方法具有局限性,难以预测不规则形状的核心,而边缘检测方法在重叠细胞上受到检测的挑战。为了解决以上问题,我们提议一种基于双分支网络的核心实例分割方法。该方法同时预测目标信息和目标边界。此外,我们还提出了一种兼容处理方法,通过将目标信息和目标边界结合起来,以解决重叠细胞的问题。在网络中,我们提出了一个上下文融合块(CF-块),可以有效地抽取和融合网络中的上下文信息。我们对方法的性能进行了广泛的量化评估。实验结果表明,我们提出的方法在BNS、MoNuSeg、CoNSeg和CPM-17等数据集上的性能明显超过了现有方法。

Bilevel Generative Learning for Low-Light Vision

  • paper_url: http://arxiv.org/abs/2308.03381
  • repo_url: https://github.com/yingchi1998/bgl
  • paper_authors: Yingchi Liu, Zhu Liu, Long Ma, Jinyuan Liu, Xin Fan, Zhongxuan Luo, Risheng Liu
  • For: The paper is written for constructing deep learning schemes for Low-Light Vision (LLV) tasks.* Methods: The paper proposes a generic low-light vision solution by introducing a generative block to convert data from the RAW to the RGB domain, and establishes a bilevel model to precisely characterize the latent correspondence between the generative procedure and the vision task.* Results: The paper demonstrates the superiority of the proposed approach on three representative low-light vision tasks, namely enhancement, detection, and segmentation, and shows that the generative blocks have a strong generalization ability in other low-light vision tasks.Here is the information in Simplified Chinese text:* For: 这篇论文是为了构建深度学习方案来解决低光环境视觉任务。* Methods: 论文提出了一种通用的低光环境视觉解决方案,利用生成块将数据从RAW转换到RGB频谱上,并建立了一个碎谱模型来准确地描述数据生成过程和视觉任务之间的隐藏关系。* Results: 论文在三个表示低光环境视觉任务的示例任务上,即提升、检测和分割任务上,展现了提案的方法的超越性,并证明了生成块在其他低光环境视觉任务中具有强大的普适性。
    Abstract Recently, there has been a growing interest in constructing deep learning schemes for Low-Light Vision (LLV). Existing techniques primarily focus on designing task-specific and data-dependent vision models on the standard RGB domain, which inherently contain latent data associations. In this study, we propose a generic low-light vision solution by introducing a generative block to convert data from the RAW to the RGB domain. This novel approach connects diverse vision problems by explicitly depicting data generation, which is the first in the field. To precisely characterize the latent correspondence between the generative procedure and the vision task, we establish a bilevel model with the parameters of the generative block defined as the upper level and the parameters of the vision task defined as the lower level. We further develop two types of learning strategies targeting different goals, namely low cost and high accuracy, to acquire a new bilevel generative learning paradigm. The generative blocks embrace a strong generalization ability in other low-light vision tasks through the bilevel optimization on enhancement tasks. Extensive experimental evaluations on three representative low-light vision tasks, namely enhancement, detection, and segmentation, fully demonstrate the superiority of our proposed approach. The code will be available at https://github.com/Yingchi1998/BGL.
    摘要 近些年来,低光环境视觉(LLV)领域内有一个增长的兴趣,现有技术主要集中在设计任务特定和数据依赖的视觉模型上标准RGB频谱上,这些模型内置了隐藏的数据关系。在这种研究中,我们提出了一种通用的低光环境解决方案,通过引入生成块将数据从RAW频谱转换到RGB频谱。这种新的approach连接了多种视觉问题,并且显式地描述了数据生成过程,这是领域内首次。为准确地描述生成过程和视觉任务之间的隐藏关系,我们建立了一个二级模型,其中生成块的参数定义为上层级,而视觉任务的参数定义为下层级。我们还开发了两种不同目标,即低成本和高精度的学习策略,以获得一种新的二级生成学习 парадиг。生成块具有强大的通用能力在其他低光环境任务上,经过二级优化的增强任务上。我们在三个代表性的低光环境任务上,即增强、检测和 segmentation 上进行了广泛的实验评估,并证明了我们提出的方法的超越性。代码将在https://github.com/Yingchi1998/BGL中提供。

VR-based body tracking to stimulate musculoskeletal training

  • paper_url: http://arxiv.org/abs/2308.03375
  • repo_url: None
  • paper_authors: M. Neidhardt, S. Gerlach, F. N. Schmidt, I. A. K. Fiedler, S. Grube, B. Busse, A. Schlaefer
  • For: This study develops a HoloLens 2-based virtual downhill skiing training application so that elderly and impaired persons can train autonomously with individualized intensity and automatic progress evaluation.
  • Methods: HoloLens 2 motion data is used to control and predict body movement and joint angles during musculoskeletal training. Ten healthy volunteers were recorded with external tracking cameras, and the authors systematically analyze whether whole-body movement can be derived from the HoloLens 2 motion data alone.
  • Results: HoloLens 2 motion data correlates highly with the external tracking of upper-body movement and lower-limb joint angles. No participant reported motion sickness, and all were able to quickly interact with and control their movement during skiing.
    Abstract Training helps to maintain and improve sufficient muscle function, body control, and body coordination. These are important to reduce the risk of fracture incidents caused by falls, especially for the elderly or people recovering from injury. Virtual reality training can offer a cost-effective and individualized training experience. We present an application for the HoloLens 2 to enable musculoskeletal training for elderly and impaired persons to allow for autonomous training and automatic progress evaluation. We designed a virtual downhill skiing scenario that is controlled by body movement to stimulate balance and body control. By adapting the parameters of the ski slope, we can tailor the intensity of the training to individual users. In this work, we evaluate whether the movement data of the HoloLens 2 alone is sufficient to control and predict body movement and joint angles during musculoskeletal training. We record the movements of 10 healthy volunteers with external tracking cameras and track a set of body and joint angles of the participant during training. We estimate correlation coefficients and systematically analyze whether whole body movement can be derived from the movement data of the HoloLens 2. No participant reports movement sickness effects and all were able to quickly interact and control their movement during skiing. Our results show a high correlation between HoloLens 2 movement data and the external tracking of the upper body movement and joint angles of the lower limbs.
    摘要 训练可以保持和改善足够的肌肉功能、身体控制和身体协调。这些因素对降低因为落下而导致骨折的风险非常重要,特别是老年人或恢复后的人。虚拟现实训练可以提供成本效益和个性化的训练经验。我们在HoloLens 2上提出了一个应用程序,用于帮助老年人和残疾人进行肌骨征识训练,以便在自主训练和自动进度评估之间进行折衔。我们设计了一个虚拟下山滑雪场景,通过身体运动控制来刺激平衡和身体协调。通过调整雪坡参数,我们可以根据用户的个性进行定制训练的Intensity。在这项工作中,我们评估了HoloLens 2运动数据是否充分控制和预测身体运动和关节角度 durante 肌骨征识训练。我们通过外部跟踪相机记录参与者的运动,并跟踪参与者的身体运动和关节角度。我们计算了相关系数,系统地分析了整体运动是否可以从HoloLens 2运动数据中提取出来。所有参与者都没有报告运动药效,并且所有参与者快速交互和控制他们的运动 durante 滑雪。我们的结果显示,HoloLens 2运动数据与外部跟踪的上半身运动和关节角度之间存在高相关性。
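
The evaluation described above comes down to correlating HoloLens-derived joint-angle signals with an external tracking reference. The snippet below is a minimal NumPy sketch of that correlation step on synthetic stand-in data; the variable names and the data are hypothetical, not the study's recordings.

```python
import numpy as np

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Correlation between a HoloLens-derived signal and the external
    tracking reference for one joint angle over time."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical per-frame knee flexion angles from the two systems.
hololens_angle = np.random.rand(1000)
reference_angle = hololens_angle + 0.05 * np.random.randn(1000)
print(f"r = {pearson_r(hololens_angle, reference_angle):.3f}")
```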

Heterogeneous Forgetting Compensation for Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.03374
  • repo_url: https://github.com/jiahuadong/hfc
  • paper_authors: Jiahua Dong, Wenqi Liang, Yang Cong, Gan Sun
  • for: 这篇论文的目的是解决累累忘记挑战,并且处理忘记不均的问题。
  • methods: 这篇论文提出了一个名为“多元忘记补偿”(Heterogeneous Forgetting Compensation,HFC)的新模型,它可以解决忘记不均的问题,并且从表征和对应方面进行补偿。
  • results: 实验结果显示,HFC模型能够有效地解决累累忘记挑战,并且在不同的数据集上获得了良好的性能。
    Abstract Class-incremental learning (CIL) has achieved remarkable successes in learning new classes consecutively while overcoming catastrophic forgetting on old categories. However, most existing CIL methods unreasonably assume that all old categories have the same forgetting pace, and neglect negative influence of forgetting heterogeneity among different old classes on forgetting compensation. To surmount the above challenges, we develop a novel Heterogeneous Forgetting Compensation (HFC) model, which can resolve heterogeneous forgetting of easy-to-forget and hard-to-forget old categories from both representation and gradient aspects. Specifically, we design a task-semantic aggregation block to alleviate heterogeneous forgetting from representation aspect. It aggregates local category information within each task to learn task-shared global representations. Moreover, we develop two novel plug-and-play losses: a gradient-balanced forgetting compensation loss and a gradient-balanced relation distillation loss to alleviate forgetting from gradient aspect. They consider gradient-balanced compensation to rectify forgetting heterogeneity of old categories and heterogeneous relation consistency. Experiments on several representative datasets illustrate effectiveness of our HFC model. The code is available at https://github.com/JiahuaDong/HFC.
    摘要 CLASS-INCREMENTAL LEARNING (CIL) 已经取得了不可忽略的成功,可以顺序学习新的类型,同时解决旧类型的恐怖忘记。然而,大多数现有的 CIL 方法不合理地假设所有的旧类型忘记速率相同,并忽略了旧类型忘记不同程度的负面影响。为超越这些挑战,我们开发了一种新的多类忘记补偿模型(HFC),可以解决旧类型的多类忘记问题。 Specifically, we design a task-semantic aggregation block to alleviate heterogeneous forgetting from representation aspect. It aggregates local category information within each task to learn task-shared global representations. Moreover, we develop two novel plug-and-play losses: a gradient-balanced forgetting compensation loss and a gradient-balanced relation distillation loss to alleviate forgetting from gradient aspect. They consider gradient-balanced compensation to rectify forgetting heterogeneity of old categories and heterogeneous relation consistency.实验结果表明,我们的 HFC 模型具有效果。代码可以在 上下载。

Dual Aggregation Transformer for Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2308.03364
  • repo_url: https://github.com/zhengchen1999/dat
  • paper_authors: Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, Fisher Yu
  • for: 这个论文主要针对图像超解像(SR)问题,旨在提出一种基于Transformer网络的新型图像SR模型,以提高图像 Representation 能力。
  • methods: 该模型叫做 dual aggregation transformer(DAT),它在不同维度上进行自我集成,包括间隔块和内部块两种方式。具体来说,我们在连续的Transformer块中 alternate 应用空间和通道维度上的自我集成。此外,我们还提出了适应交互模块(AIM)和空间网络(SGFN)来实现内部块特征聚合。
  • results: 我们的DAT模型在多个实验中表现出色,超过了当前的方法。代码和模型可以在https://github.com/zhengchen1999/DAT 上下载。
    Abstract Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT.
    摘要 “传统抽象 transformer 在低级视觉任务中,如像高清化(SR)中得到了很大的推广。这些网络使用自我对齐在不同的维度,包括空间和通道维度,并实现了非常出色的表现。这给我们启发了融合这两种维度的想法,我们提出了一个新的 transformer 模型,即双总化 transformer(DAT),用于图像 SR。我们的 DAT 在内部和外部两种方式进行特征聚合,即在不同的维度进行双重总化。具体来说,我们在连续的 transformer 层中交替应用空间和通道自我对齐。这种交替策略使得 DAT 能够捕捉全域上下文,并实现内部对齐的特征聚合。此外,我们还提出了适应互动模组(AIM)和空间闸道对应网络(SGFN),以实现内部对齐的特征聚合。AIM 对应了两个自我对齐机制,而 SGFN 则引入了额外的非线性空间信息。实验结果显示,我们的 DAT 超过了目前的方法。代码和模型可以在 GitHub 上获取:https://github.com/zhengchen1999/DAT。”
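
Below is a minimal PyTorch sketch of the alternating spatial/channel self-attention idea described above. The module layout, the transposed-attention formulation of channel attention, and the scaling are illustrative assumptions rather than the authors' DAT implementation (which additionally includes AIM and SGFN).

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Standard token (spatial) self-attention over N = H*W positions."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, N, C)
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + h

class ChannelSelfAttention(nn.Module):
    """Transposed attention: channels attend to each other, aggregating
    global spatial statistics per channel."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C)
        h = self.norm(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)  # each (B, N, C)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (B, C, N)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ v).transpose(1, 2)        # back to (B, N, C)
        return x + self.proj(out)

class AlternatingBlock(nn.Module):
    """Consecutive blocks alternate the attention dimension."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = SpatialSelfAttention(dim)
        self.channel = ChannelSelfAttention(dim)

    def forward(self, x):
        return self.channel(self.spatial(x))

block = AlternatingBlock(64)
y = block(torch.randn(2, 196, 64))
```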

Distortion-aware Transformer in 360° Salient Object Detection

  • paper_url: http://arxiv.org/abs/2308.03359
  • repo_url: None
  • paper_authors: Yinjie Zhao, Lichen Zhao, Qian Yu, Jing Zhang, Lu Sheng, Dong Xu
  • for: addressing the distortion problem in 360{\deg} data projection for feature extraction and task development
  • methods: using a Transformer-based model called DATFormer with two distortion-adaptive modules and a learnable relation matrix for positional embedding
  • results: outperforming existing 2D SOD and 360 SOD methods on three public datasets
    Abstract With the emergence of VR and AR, 360{\deg} data attracts increasing attention from the computer vision and multimedia communities. Typically, 360{\deg} data is projected into 2D ERP (equirectangular projection) images for feature extraction. However, existing methods cannot handle the distortions that result from the projection, hindering the development of 360-data-based tasks. Therefore, in this paper, we propose a Transformer-based model called DATFormer to address the distortion problem. We tackle this issue from two perspectives. Firstly, we introduce two distortion-adaptive modules. The first is a Distortion Mapping Module, which guides the model to pre-adapt to distorted features globally. The second module is a Distortion-Adaptive Attention Block that reduces local distortions on multi-scale features. Secondly, to exploit the unique characteristics of 360{\deg} data, we present a learnable relation matrix and use it as part of the positional embedding to further improve performance. Extensive experiments are conducted on three public datasets, and the results show that our model outperforms existing 2D SOD (salient object detection) and 360 SOD methods.

Energy-Guided Diffusion Model for CBCT-to-CT Synthesis

  • paper_url: http://arxiv.org/abs/2308.03354
  • repo_url: None
  • paper_authors: Linjie Fu, Xia Li, Xiuding Cai, Dong Miao, Yu Yao, Yali Shen
  • for: 提高CBCT图像质量和Hounsfield单位准确性,以便更好地计算辐射剂量和精确地定位组织结构。
  • methods: 基于能量导向分散模型(EGDiff),从CBCT图像生成Synthetic CT(sCT)。
  • results: 对胸肿囊数据集进行实验,EGDiff方法可以生成高精度、高视觉质量的sCT图像,与State-of-the-art无监督合成方法相比,EGDiff方法表现出色。
    Abstract Cone Beam CT (CBCT) plays a crucial role in Adaptive Radiation Therapy (ART) by accurately providing radiation treatment when organ anatomy changes occur. However, CBCT images suffer from scatter noise and artifacts, making relying solely on CBCT for precise dose calculation and accurate tissue localization challenging. Therefore, there is a need to improve CBCT image quality and Hounsfield Unit (HU) accuracy while preserving anatomical structures. To enhance the role and application value of CBCT in ART, we propose an energy-guided diffusion model (EGDiff) and conduct experiments on a chest tumor dataset to generate synthetic CT (sCT) from CBCT. The experimental results demonstrate impressive performance with an average absolute error of 26.87$\pm$6.14 HU, a structural similarity index measurement of 0.850$\pm$0.03, a peak signal-to-noise ratio of the sCT of 19.83$\pm$1.39 dB, and a normalized cross-correlation of the sCT of 0.874$\pm$0.04. These results indicate that our method outperforms state-of-the-art unsupervised synthesis methods in accuracy and visual quality, producing superior sCT images.
    摘要 cone beam CT (CBCT) 在 adaptive radiation therapy (ART) 中发挥重要作用,准确地提供辐射治疗当器官结构变化时。然而,CBCT图像受到散射噪和artefacts的影响,使凭借CBCT alone 精度计算和正确地本地化难以准确。因此,我们需要提高 CBCT 图像质量和Hounsfield单元(HU)准确性,保持器官结构。为了提高 CBCT 在 ART 中的应用价值,我们提议一种能量引导扩散模型(EGDiff),并在胸腔肿瘤数据集上进行实验,将 CBCT 转换成 synthetic CT(sCT)。实验结果表明,我们的方法可以达到 impressive 性能,其中平均绝对错误为26.87±6.14 HU,结构相似度指数为0.850±0.03,峰信号噪声比(PSNR)为19.83±1.39 dB,同步协方差为0.874±0.04。这些结果表明,我们的方法在准确性和视觉质量方面都有所提高,生成出Superior sCT 图像。

Explicifying Neural Implicit Fields for Efficient Dynamic Human Avatar Modeling via a Neural Explicit Surface

  • paper_url: http://arxiv.org/abs/2308.05112
  • repo_url: None
  • paper_authors: Ruiqi Zhang, Jie Chen, Qiang Wang
  • for: 这 paper 旨在提出一种方法,用于高效地模型动态人体。
  • methods: 该 paper 使用 Neural Explicit Surface (NES) 技术来Explicify implicit neural fields,提高了计算和存储效率。
  • results: 实验表明,NES 能够与前一代3D方法相比,具有类似的性能,同时提高渲染速度和减少存储开销。
    Abstract This paper proposes a technique for efficiently modeling dynamic humans by explicifying the implicit neural fields via a Neural Explicit Surface (NES). Implicit neural fields have advantages over traditional explicit representations in modeling dynamic 3D content from sparse observations and effectively representing complex geometries and appearances. Implicit neural fields defined in 3D space, however, are expensive to render due to the need for dense sampling during volumetric rendering. Moreover, their memory efficiency can be further optimized when modeling sparse 3D space. To overcome these issues, the paper proposes utilizing Neural Explicit Surface (NES) to explicitly represent implicit neural fields, facilitating memory and computational efficiency. To achieve this, the paper creates a fully differentiable conversion between the implicit neural fields and the explicit rendering interface of NES, leveraging the strengths of both implicit and explicit approaches. This conversion enables effective training of the hybrid representation using implicit methods and efficient rendering by integrating the explicit rendering interface with a newly proposed rasterization-based neural renderer that only incurs a texture color query once for the initial ray interaction with the explicit surface, resulting in improved inference efficiency. NES describes dynamic human geometries with pose-dependent neural implicit surface deformation fields and their dynamic neural textures both in 2D space, which is a more memory-efficient alternative to traditional 3D methods, reducing redundancy and computational load. The comprehensive experiments show that NES performs similarly to previous 3D approaches, with greatly improved rendering speed and reduced memory cost.

Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation

  • paper_url: http://arxiv.org/abs/2308.03348
  • repo_url: None
  • paper_authors: Xingxing Yang, Jie Chen, Zaifeng Yang
  • for: Near-infrared (NIR) image spectrum translation, a challenging problem with many promising applications.
  • methods: A cooperative learning paradigm that colorizes NIR images in parallel with a proxy grayscale colorization task by exploring latent cross-domain priors (latent spectrum context priors and task domain priors).
  • results: Produces high-quality spectrum translation outputs with diverse colors and rich textures, outperforming state-of-the-art counterparts by 3.95 dB and 4.66 dB PSNR on the NIR and grayscale colorization tasks, respectively.
    Abstract Near-infrared (NIR) image spectrum translation is a challenging problem with many promising applications. Existing methods struggle with the mapping ambiguity between the NIR and the RGB domains, and generalize poorly due to the limitations of models' learning capabilities and the unavailability of sufficient NIR-RGB image pairs for training. To address these challenges, we propose a cooperative learning paradigm that colorizes NIR images in parallel with another proxy grayscale colorization task by exploring latent cross-domain priors (i.e., latent spectrum context priors and task domain priors), dubbed CoColor. The complementary statistical and semantic spectrum information from these two task domains -- in the forms of pre-trained colorization networks -- are brought in as task domain priors. A bilateral domain translation module is subsequently designed, in which intermittent NIR images are generated from grayscale and colorized in parallel with authentic NIR images; and vice versa for the grayscale images. These intermittent transformations act as latent spectrum context priors for efficient domain knowledge exchange. We progressively fine-tune and fuse these modules with a series of pixel-level and feature-level consistency constraints. Experiments show that our proposed cooperative learning framework produces satisfactory spectrum translation outputs with diverse colors and rich textures, and outperforms state-of-the-art counterparts by 3.95dB and 4.66dB in terms of PNSR for the NIR and grayscale colorization tasks, respectively.
    摘要 near-infrared(NIR)图像 спектр翻译是一个具有挑战性的问题,有很多有前途的应用。现有方法在映射NIR和RGBDomains之间存在困难,并且因模型学习能力的限制和缺乏充足的NIR-RGB图像对 для训练而导致泛化不佳。为了解决这些挑战,我们提议一种合作学习 парадиг,通过利用潜在的跨频域约束(即潜在pectrumContext约束和任务频域约束),来同时进行NIR图像的colorization。这两个任务频域的统计和semantic spectrum信息都被引入作为任务频域约束。随后,我们设计了一种bilateral频域翻译模块,其中NIR图像中的黑白图像在干扰NIR图像的同时,也在平行进行了颜色化。这些干扰变换作为潜在pectrumContext约束,以便有效地进行频域知识交换。我们逐步细化和融合这些模块,并使用像素级和特征级一致性约束。实验结果表明,我们提议的合作学习框架可以生成高质量的spectrum翻译输出,具有多样性和丰富的Texture,并且比前方的counterpart高3.95dB和4.66dB在NIR和黑白图像色化任务中的PNSR指标上。

A Hybrid CNN-Transformer Architecture with Frequency Domain Contrastive Learning for Image Deraining

  • paper_url: http://arxiv.org/abs/2308.03340
  • repo_url: None
  • paper_authors: Cheng Wang, Wei Li
  • for: restore degraded images affected by rain streaks
  • methods: image deraining
  • results: not specified in the abstract.
    Abstract Image deraining is a challenging task that involves restoring degraded images affected by rain streaks.
    摘要 图像抑雨是一项具有挑战性的任务,涉及到修复受到雨斑影响的图像。

AFN: Adaptive Fusion Normalization via Encoder-Decoder Framework

  • paper_url: http://arxiv.org/abs/2308.03321
  • repo_url: https://github.com/huanranchen/ASRNorm
  • paper_authors: Zikai Zhou, Huanran Chen
  • for: 这篇论文主要为了提出一种统一的normalization函数,以解决现有normalization方法的缺点。
  • methods: 该论文提出了一种新的normalization函数 named Adaptive Fusion Normalization(AFN),它可以结合所有normalization方法,并消除它们的缺点。
  • results: 经过实验,AFN函数在领域总结和图像分类任务中表现出色,超过了现有的normalization方法。
    Abstract The success of deep learning is inseparable from normalization layers. Researchers have proposed various normalization functions, and each of them has both advantages and disadvantages. In response, efforts have been made to design a unified normalization function that combines all normalization procedures and mitigates their weaknesses. We also proposed a new normalization function called Adaptive Fusion Normalization. Through experiments, we demonstrate AFN outperforms the previous normalization techniques in domain generalization and image classification tasks.
    摘要 深度学习的成功与normalization层密不可分。研究人员提出了多种normalization函数,每种都有优点和缺点。为此,研究者努力设计一个统一的normalization函数,汇集所有normalization过程并减少它们的缺点。我们也提出了一种新的normalization函数,称为Adaptive Fusion Normalization(AFN)。经过实验,我们证明AFN在领域泛化和图像分类任务中的表现超越了以往的normalization技术。
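
The abstract does not spell out the AFN formulation, so the sketch below only illustrates the general "fuse several normalizers with learned weights" idea (closer in spirit to switchable normalization); the encoder-decoder framework of AFN itself is not reproduced, and all module choices here are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusionNorm2d(nn.Module):
    """Hedged sketch: run BatchNorm, InstanceNorm and a LayerNorm-like
    GroupNorm in parallel and blend them with learned softmax weights,
    followed by a shared affine transform. Illustrative only."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        self.gn = nn.GroupNorm(1, channels, affine=False)   # normalizes over (C, H, W)
        self.weights = nn.Parameter(torch.zeros(3))
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):                                   # x: (B, C, H, W)
        w = torch.softmax(self.weights, dim=0)
        y = w[0] * self.bn(x) + w[1] * self.inorm(x) + w[2] * self.gn(x)
        return self.gamma * y + self.beta

norm = AdaptiveFusionNorm2d(32)
y = norm(torch.randn(4, 32, 16, 16))
```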

FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

  • paper_url: http://arxiv.org/abs/2308.03290
  • repo_url: None
  • paper_authors: Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li
  • for: 这个论文目的是为了提出一个一击式混合精度搜寻方法,以实现高品质且低成本的深度神经网络(DNNs)模型。
  • methods: 这个方法使用了混合精度搜寻,以找到高品质且低成本的DNNs模型。它首先使用了数位支持的改进,然后使用了一个新的搜寻方法,以实现在数位和浮点数字之间进行搜寻。
  • results: 这个方法可以实现高品质且低成本的DNNs模型,并且比之前的方法更好。对于ResNet-18和ResNet-50模型,这个方法可以提高ImageNet准确度 by 1.31%和0.90%分别,同时保持相同的模型成本。此外,这个方法还可以对MobileNetV2进行改进,提高其准确度 by up to 0.98%分。最后,这个方法还可以同时搜寻一个混合精度和神经网络架构的共同搜寻空间,提高ImageNet准确度 by 2.69%分。
    Abstract Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With the improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed-precision quantization methods have performed a post-training quantization search, which compromises on accuracy, or a differentiable quantization search, which leads to high memory usage from branching. Therefore, we propose the first one-shot mixed-precision quantization search that eliminates the need for retraining in both integer and low-precision floating point models. We evaluate our floating-point and integer quantization search (FLIQS) on multiple convolutional networks and vision transformer models to discover Pareto-optimal models. Our approach discovers models that improve upon uniform precision, manual mixed-precision, and recent integer quantization search methods. With the proposed integer quantization search, we increase the accuracy of ResNet-18 on ImageNet by 1.31% points and ResNet-50 by 0.90% points with equivalent model cost over previous methods. Additionally, for the first time, we explore a novel mixed-precision floating-point search and improve MobileNetV2 by up to 0.98% points compared to prior state-of-the-art FP8 models. Finally, we extend FLIQS to simultaneously search a joint quantization and neural architecture space and improve the ImageNet accuracy by 2.69% points with similar model cost on a MobileNetV2 search space.
    摘要 归纳化技术已成为现代深度神经网络(DNN)的主流压缩技术,以提高模型大小、计算需求和能耗。随着现有硬件的数字支持的提升,混合精度归纳化已成为实现高质量结果的低成本模型的必要手段。现有的混合精度归纳化方法通常会在训练后进行归纳化搜索,这会妥协准确性,或者使用可导的归纳化搜索,这会导致高 память使用率。因此,我们提出了首个一步混合精度归纳化搜索,无需重新训练,并在整数和低精度浮点数模型中实现高质量结果。我们在多个卷积网络和视transformer模型上进行了评估,并发现了Pareto优质量模型。我们的方法在浮点数和整数归纳化搜索中提高了ImageNet中ResNet-18和ResNet-50模型的准确率,相比之前的方法,增加了1.31%点和0.90%点。此外,我们首次探索了一种新的混合精度浮点数搜索,并在MobileNetV2上提高了0.98%点,相比之前的FP8模型。最后,我们将FLIQS扩展到同时搜索归纳化和神经网络体系空间,并在MobileNetV2上提高了ImageNet准确率2.69%点,与相同的模型成本相似。
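
As a rough illustration of the quantities a mixed-precision search trades off, the sketch below shows generic uniform fake-quantization at a chosen bit-width and a toy per-layer cost model. This is not the FLIQS search algorithm; the cost model and quantizer details are assumptions.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake quantization to `bits`; a generic building
    block for simulating low-precision inference, not FLIQS itself."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def model_cost_bits(model: torch.nn.Module, bit_assignment: dict) -> int:
    """Toy cost model: total weight bits under a per-layer bit assignment,
    the kind of quantity a mixed-precision search trades off against accuracy."""
    return sum(p.numel() * bit_assignment.get(name, 8)
               for name, p in model.named_parameters())

layer = torch.nn.Linear(16, 16)
w4 = fake_quantize(layer.weight.data, bits=4)
print(model_cost_bits(layer, {"weight": 4, "bias": 8}))
```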

Multi-Label Self-Supervised Learning with Scene Images

  • paper_url: http://arxiv.org/abs/2308.03286
  • repo_url: None
  • paper_authors: Ke Zhu, Minghao Fu, Jianxin Wu
  • for: 本研究旨在提出一种简单且高效的自监学习(SSL)方法,用于Scene图像。
  • methods: 本方法将Scene/多标签图像SSL简化为多标签分类问题,通过比较输入图像的嵌入与两个字典中的嵌入进行多个二进制 pseudo-标签的分配,然后使用二进制权值函数进行优化。
  • results: 实验显示,提出的多标签自监学习(MLS)方法可以学习高质量的图像表示,在MS-COCO数据集上实现了分类、检测和分割标准 bencmarks 的最佳结果,同时与现有方法相比,MLS 更简单,易于部署和进一步探索。
    Abstract Self-supervised learning (SSL) methods targeting scene images have seen a rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that instead of hinging on these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem, which greatly simplifies the learning framework. Specifically, multiple binary pseudo-labels are assigned for each input image by comparing its embeddings with those in two dictionaries, and the network is optimized using the binary cross entropy loss. The proposed method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that clearly the pseudo-labels by MLS can automatically find semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high quality representations on MS-COCO and achieves state-of-the-art results on classification, detection and segmentation benchmarks. At the same time, MLS is much simpler than existing methods, making it easier to deploy and for further exploration.
    摘要 自动学习(SSL)方法targeting场景图像在最近几年内得到了快速发展,这些方法主要基于 either 专门的密集匹配机制或者昂贵的无监督物体发现模块。这篇论文表明,相比于依靠这些艰辛的操作,高质量图像表示可以通过对场景/多标签图像SSL进行简单的多标签分类问题来学习。特别是,每个输入图像都将多个 binary pseudo-标签赋给,通过对其嵌入与两个词典中的嵌入进行比较,并使用二分类 entropy 损失函数进行优化。这种方法被称为多标签自动学习(MLS)。视觉化Qualitatively 显示,MLS 可以自动找到不同图像中的semantic 相似 pseudo-正例对,以便进行对比学习。MLS 在 MS-COCO 上学习高质量表示,并在分类、检测和 segmentation benchmark 上 achieve 状态的最佳结果。同时,MLS 比现有方法更加简单,更容易部署和进一步探索。
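
Below is a hedged PyTorch sketch of the multi-label-classification view of SSL described above: binary pseudo-labels are derived by comparing one view's embedding against a dictionary, and the other view is trained with binary cross-entropy. The two-dictionary setup, the exact labeling rule, the temperature, and how the dictionaries are built in MLS may all differ; everything here is illustrative.

```python
import torch
import torch.nn.functional as F

def multi_label_ssl_loss(z_a, z_b, dictionary, top_k=5):
    """Derive binary pseudo-labels for view A from its top-k nearest
    dictionary entries, then ask view B's similarity scores to reproduce
    them via binary cross-entropy (a sketch, not the authors' recipe)."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    d = F.normalize(dictionary, dim=-1)

    sim_a = z_a @ d.t()                        # (B, K) similarities, view A
    sim_b = z_b @ d.t()                        # (B, K) similarities, view B

    # Binary pseudo-labels: 1 for the top-k most similar dictionary entries.
    topk = sim_a.topk(top_k, dim=-1).indices
    labels = torch.zeros_like(sim_a).scatter_(1, topk, 1.0)

    return F.binary_cross_entropy_with_logits(sim_b / 0.1, labels.detach())

# Usage: z_a, z_b are embeddings of two augmented views; `dictionary`
# could be a momentum queue of past embeddings (an assumption).
B, D, K = 32, 128, 4096
loss = multi_label_ssl_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(K, D))
```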

Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2308.03282
  • repo_url: https://github.com/myukzzz/eicr
  • paper_authors: Yukuan Min, Aming Wu, Cheng Deng
  • for: 解决Scene Graph Generation(SGG)任务中的类别不均和上下文不均问题,以提高SGG模型的性能。
  • methods: 提出了一种基于环境不变的Curiculum学习(EICR)方法,可以与现有的SGG方法结合使用,以解决类别不均和上下文不均问题。
  • results: 对VG和GQA数据集进行了广泛的实验,结果显示,EICR框架可以作为SGG模型的通用策略,并取得了显著的改善。
    Abstract The scene graph generation (SGG) task is designed to identify the predicates based on the subject-object pairs.However,existing datasets generally include two imbalance cases: one is the class imbalance from the predicted predicates and another is the context imbalance from the given subject-object pairs, which presents significant challenges for SGG. Most existing methods focus on the imbalance of the predicted predicate while ignoring the imbalance of the subject-object pairs, which could not achieve satisfactory results. To address the two imbalance cases, we propose a novel Environment Invariant Curriculum Relation learning (EICR) method, which can be applied in a plug-and-play fashion to existing SGG methods. Concretely, to remove the imbalance of the subject-object pairs, we first construct different distribution environments for the subject-object pairs and learn a model invariant to the environment changes. Then, we construct a class-balanced curriculum learning strategy to balance the different environments to remove the predicate imbalance. Comprehensive experiments conducted on VG and GQA datasets demonstrate that our EICR framework can be taken as a general strategy for various SGG models, and achieve significant improvements.
    摘要 scene graph generation (SGG) 任务的设计是根据主语-谓语对 Identify predicate。然而,现有数据集通常存在两种不均衡情况:一是预测 predicate 的类别不均衡,另一是给定主语-谓语对的上下文不均衡,这两种不均衡情况都会对 SGG 带来很大的挑战。大多数现有方法主要关注预测 predicate 的不均衡,而忽略主语-谓语对的不均衡,这会导致不能达到满意的结果。为了解决这两种不均衡情况,我们提出了一种新的 Environment Invariant Curriculum Relation 学习方法(EICR),它可以与现有的 SGG 方法相结合使用。具体来说,为了消除主语-谓语对的不均衡,我们首先构建了不同的分布环境 для主语-谓语对,然后学习一个环境不变的模型。接着,我们构建了一种类别均衡的学习策略,以平衡不同的环境,从而消除预测 predicate 的不均衡。经过了在 VG 和 GQA 数据集上的广泛实验,我们的 EICR 框架可以作为多种 SGG 模型的通用策略,并实现了显著的提升。

Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing

  • paper_url: http://arxiv.org/abs/2308.03280
  • repo_url: https://github.com/zju3dv/Mirror-NeRF
  • paper_authors: Junyi Zeng, Chong Bao, Rui Chen, Zilong Dong, Guofeng Zhang, Hujun Bao, Zhaopeng Cui
  • for: 该论文旨在解决NeRF无法正确描述镜子上的反射问题,提出一种基于镜子反射概率的神经渲染框架,以支持各种场景操作和镜子反射的渲染。
  • methods: 该方法基于镜子反射概率的 introduce 镜子反射概率并使用Whitted Ray Tracing的光传输模型跟踪光线,以及一些促进学习过程的技术。
  • results: 实验和比较表明,该方法在 synthetic 和实际数据集上具有明显的优势,能够准确描述镜子上的反射和多视图相互关联的反射。
    Abstract Recently, Neural Radiance Fields (NeRF) has exhibited significant success in novel view synthesis, surface reconstruction, etc. However, since no physical reflection is considered in its rendering pipeline, NeRF mistakes the reflection in the mirror as a separate virtual scene, leading to the inaccurate reconstruction of the mirror and multi-view inconsistent reflections in the mirror. In this paper, we present a novel neural rendering framework, named Mirror-NeRF, which is able to learn accurate geometry and reflection of the mirror and support various scene manipulation applications with mirrors, such as adding new objects or mirrors into the scene and synthesizing the reflections of these new objects in mirrors, controlling mirror roughness, etc. To achieve this goal, we propose a unified radiance field by introducing the reflection probability and tracing rays following the light transport model of Whitted Ray Tracing, and also develop several techniques to facilitate the learning process. Experiments and comparisons on both synthetic and real datasets demonstrate the superiority of our method. The code and supplementary material are available on the project webpage: https://zju3dv.github.io/Mirror-NeRF/.
    摘要 最近,神经辐射场(NeRF)在新视图合成、表面重建等领域表现出了显著的成功。然而,由于NeRF的渲染管线中没有考虑物理反射,因此NeRF会错误地将镜子中的反射视为独立的虚拟场景,导致镜子和多视图不一致的反射。在这篇论文中,我们提出了一种新的神经渲染框架,名为镜子-NeRF,它能够学习镜子上的准确 геометрии和反射。我们还提出了多种技术来促进学习过程,包括引入反射概率和根据Whitted雨筒跟踪模型跟踪光线的方法。实验和比较表明,我们的方法在 sintetic和实际数据集上具有明显的优势。代码和补充材料可以在项目网站(https://zju3dv.github.io/Mirror-NeRF/)上获取。
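
The Whitted-style step the abstract refers to is the classic mirror bounce of a ray about the surface normal. A minimal sketch of that reflection is shown below, together with one plausible way of blending the traced reflection by a learned reflection probability; the blending rule is an assumption, not the paper's exact formulation.

```python
import torch

def reflect(d: torch.Tensor, n: torch.Tensor) -> torch.Tensor:
    """Whitted-style mirror reflection of ray directions d about unit
    normals n: r = d - 2 (d . n) n. Shapes: (..., 3)."""
    return d - 2.0 * (d * n).sum(dim=-1, keepdim=True) * n

def blend_with_reflection(radiance, reflected_radiance, reflection_prob):
    """Illustrative composition: mix surface radiance with the radiance
    traced along the reflected ray, weighted by a per-point reflection
    probability (assumed blending rule)."""
    p = reflection_prob.clamp(0.0, 1.0)
    return (1.0 - p) * radiance + p * reflected_radiance
```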

Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations

  • paper_url: http://arxiv.org/abs/2308.03276
  • repo_url: https://github.com/apperception-db/spatialyze
  • paper_authors: Chanwut Kittivorawong, Yongming Ge, Yousef Helal, Alvin Cheung
  • for: 这篇论文是关于如何提供高效的地ospatial视频数据管理系统,以便用户可以更好地交互地ospatial视频数据。
  • methods: 该论文提出了一个新的框架 named Spatialyze,该框架使用了域专语言, allowing users to construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm。
  • results: 实验结果表明,使用Spatialyze可以提高执行效率,比例最高可以达5.3倍,同时维持97.1%的准确率。
    Abstract Videos that are shot using commodity hardware such as phones and surveillance cameras record various metadata such as time and location. We encounter such geospatial videos on a daily basis and such videos have been growing in volume significantly. Yet, we do not have data management systems that allow users to interact with such data effectively. In this paper, we describe Spatialyze, a new framework for end-to-end querying of geospatial videos. Spatialyze comes with a domain-specific language where users can construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm. Internally, Spatialyze leverages the declarative nature of such workflows, the temporal-spatial metadata stored with videos, and physical behavior of real-world objects to optimize the execution of workflows. Our results using real-world videos and workflows show that Spatialyze can reduce execution time by up to 5.3x, while maintaining up to 97.1% accuracy compared to unoptimized execution.
    摘要 视频 Recorded using common hardware such as phones and surveillance cameras 包含时间和地点 metadata。我们每天都会遇到这类地ospatial videos,但我们没有有效地处理这些数据的数据管理系统。 在本文中,我们介绍了 Spatialyze,一个新的框架 для地ospatial videos 的终端查询。Spatialyze 提供了一个域特定语言, allowing users to construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm。内部,Spatialyze 利用了声明性的 workflows,视频中的时间-空间 metadata 和实际物体的物理行为来优化 workflows 的执行。我们使用实际视频和 workflows 进行测试,结果显示,Spatialyze 可以提高执行时间 Speed 到 5.3x,保持高于 97.1% 的准确率 compared to unoptimized execution。

Feature-Suppressed Contrast for Self-Supervised Food Pre-training

  • paper_url: http://arxiv.org/abs/2308.03272
  • repo_url: None
  • paper_authors: Xinda Liu, Yaohui Zhu, Linhu Liu, Jiang Tian, Lili Wang
  • for: This paper develops a self-supervised learning method for food image recognition, aiming to reduce human labeling expenses and improve the efficiency of food image analysis.
  • methods: The proposed method, called Feature Suppressed Contrast (FeaSC), leverages contrastive self-supervised learning on unlabelled food images. To address the problem of similar informative contents in the two views, it uses a response-aware scheme to localize salient features in an unsupervised manner, reducing the mutual information between the views.
  • results: FeaSC consistently improves the classification accuracy of BYOL and SimSiam by 1.70% - 6.69% on four publicly available food recognition datasets, and achieves superior results on downstream segmentation tasks.
    Abstract Most previous approaches for analyzing food images have relied on extensively annotated datasets, resulting in significant human labeling expenses due to the varied and intricate nature of such images. Inspired by the effectiveness of contrastive self-supervised methods in utilizing unlabelled data, weiqing explore leveraging these techniques on unlabelled food images. In contrastive self-supervised methods, two views are randomly generated from an image by data augmentations. However, regarding food images, the two views tend to contain similar informative contents, causing large mutual information, which impedes the efficacy of contrastive self-supervised learning. To address this problem, we propose Feature Suppressed Contrast (FeaSC) to reduce mutual information between views. As the similar contents of the two views are salient or highly responsive in the feature map, the proposed FeaSC uses a response-aware scheme to localize salient features in an unsupervised manner. By suppressing some salient features in one view while leaving another contrast view unchanged, the mutual information between the two views is reduced, thereby enhancing the effectiveness of contrast learning for self-supervised food pre-training. As a plug-and-play module, the proposed method consistently improves BYOL and SimSiam by 1.70\% $\sim$ 6.69\% classification accuracy on four publicly available food recognition datasets. Superior results have also been achieved on downstream segmentation tasks, demonstrating the effectiveness of the proposed method.
    摘要 以前的食物图像分析方法大多依赖大量人工标注的数据集,由于食物图像性质复杂多变,人工标注成本高昂。受对比式自监督方法利用无标注数据有效性的启发,我们探索在无标注食物图像上使用这类技术。在对比式自监督方法中,两个视图通过数据增强从同一张图像随机生成;但对食物图像而言,这两个视图往往包含相似的有用信息,导致互信息过大,从而削弱了对比自监督学习的效果。为解决这个问题,我们提出了特征压缩对比(FeaSC),以减少视图之间的互信息。由于两个视图中相似的内容在特征图上表现为显著或高度响应,FeaSC 采用响应感知方案以无监督方式定位显著特征。通过压缩一个视图中的部分显著特征而保持另一个对比视图不变,两个视图之间的互信息得以减少,从而提升了自监督食物预训练中对比学习的效果。作为即插即用模块,所提方法在四个公开的食物识别数据集上将 BYOL 和 SimSiam 的分类精度稳定提升 1.70% ~ 6.69%,并在下游分割任务中取得更优结果,证明了所提方法的有效性。
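
Below is a minimal PyTorch sketch of the response-aware suppression idea: zero out the most responsive spatial locations of one view's feature map so that the two contrastive views share less mutual information. The response measure, the suppression ratio, and where in the network this is applied are assumptions, not the paper's exact recipe.

```python
import torch

def suppress_salient(features: torch.Tensor, drop_ratio: float = 0.3) -> torch.Tensor:
    """Zero out the most responsive spatial positions of a (B, C, H, W)
    feature map; the other contrastive view is left unchanged."""
    b, c, h, w = features.shape
    response = features.abs().mean(dim=1).flatten(1)         # (B, H*W) saliency proxy
    k = int(drop_ratio * h * w)
    top = response.topk(k, dim=-1).indices                   # most salient positions
    mask = torch.ones_like(response).scatter_(1, top, 0.0)   # 0 where suppressed
    return features * mask.view(b, 1, h, w)

out = suppress_salient(torch.randn(8, 256, 14, 14))
```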

A Benchmark for Chinese-English Scene Text Image Super-resolution

  • paper_url: http://arxiv.org/abs/2308.03262
  • repo_url: https://github.com/mjq11302010044/real-ce
  • paper_authors: Jianqi Ma, Zhetong Liang, Wangmeng Xiang, Xi Yang, Lei Zhang
  • for: 本研究旨在提高中文Scene Text Image Super-resolution(STISR)的质量,以恢复优质的高分辨率(HR)Scene文本图像,并且能够保持中文文本的拼写正确性和可读性。
  • methods: 我们提出了一种新的Edge-aware学习方法,该方法在图像和特征领域提供了结构权重,以有效地重建中文字符的紧凑结构。
  • results: 我们在提出的Real-CE数据集上进行了实验,并评估了现有的STISR模型,包括使用我们的Edge-aware损失和不使用。实验结果显示,我们的Edge-aware方法能够提高STISR模型的性能,并且能够保持中文文本的拼写正确性和可读性。
    Abstract Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have relatively simple character structures, while little work has been done on the more challenging Chinese texts with diverse and complex character structures. In this paper, we propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with the emphasis on restoring structurally complex Chinese characters. The benchmark provides 1,935/783 real-world LR-HR text image pairs~(contains 33,789 text lines in total) for training/testing in 2$\times$ and 4$\times$ zooming modes, complemented by detailed annotations, including detection boxes and text transcripts. Moreover, we design an edge-aware learning method, which provides structural supervision in image and feature domains, to effectively reconstruct the dense structures of Chinese characters. We conduct experiments on the proposed Real-CE benchmark and evaluate the existing STISR models with and without our edge-aware loss. The benchmark, including data and source code, is available at https://github.com/mjq11302010044/Real-CE.
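
The paper's edge-aware learning supervises character structure in both image and feature domains; the sketch below shows only a simple image-domain variant under assumptions: an HR edge map (Sobel) re-weights an L1 reconstruction loss so the dense strokes of Chinese characters are penalized more heavily.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img: torch.Tensor) -> torch.Tensor:
    """Per-pixel edge magnitude of a (B, 3, H, W) image via grayscale Sobel filters."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return (gx ** 2 + gy ** 2).sqrt()

def edge_weighted_l1(sr: torch.Tensor, hr: torch.Tensor, edge_weight: float = 2.0) -> torch.Tensor:
    """Hedged sketch: up-weight reconstruction error on HR edges. The
    paper's actual edge-aware loss (image + feature domains) may differ."""
    w = 1.0 + edge_weight * sobel_edges(hr).detach()
    return (w * (sr - hr).abs()).mean()

loss = edge_weighted_l1(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```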

APBench: A Unified Benchmark for Availability Poisoning Attacks and Defenses

  • paper_url: http://arxiv.org/abs/2308.03258
  • repo_url: https://github.com/lafeat/apbench
  • paper_authors: Tianrui Qin, Xitong Gao, Juanjuan Zhao, Kejiang Ye, Cheng-Zhong Xu
  • for: To evaluate the effectiveness of availability poisoning attacks and defenses and to provide a unified benchmark for comparing them.
  • methods: The benchmark, APBench, covers 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 conventional data augmentation techniques, evaluated across different poisoning ratios, datasets, and model architectures.
  • results: The results reveal the inadequacy of existing availability poisoning attacks in safeguarding individual privacy; APBench is open source and available to the community.
    Abstract The efficacy of availability poisoning, a method of poisoning data by injecting imperceptible perturbations to prevent its use in model training, has been a hot subject of investigation. Previous research suggested that it was difficult to effectively counteract such poisoning attacks. However, the introduction of various defense methods has challenged this notion. Due to the rapid progress in this field, the performance of different novel methods cannot be accurately validated due to variations in experimental setups. To further evaluate the attack and defense capabilities of these poisoning methods, we have developed a benchmark -- APBench for assessing the efficacy of adversarial poisoning. APBench consists of 9 state-of-the-art availability poisoning attacks, 8 defense algorithms, and 4 conventional data augmentation techniques. We also have set up experiments with varying different poisoning ratios, and evaluated the attacks on multiple datasets and their transferability across model architectures. We further conducted a comprehensive evaluation of 2 additional attacks specifically targeting unsupervised models. Our results reveal the glaring inadequacy of existing attacks in safeguarding individual privacy. APBench is open source and available to the deep learning community: https://github.com/lafeat/apbench.
    摘要 “数据可用性毒化”的效果,一种在模型训练中注入不可见的干扰以防止数据使用,已成为研究热点。前一些研究表明,对这种攻击难以有效防御。然而,新的防御技术的出现挑战了这一观点。由于这个领域的快速进步,不同的新方法的性能无法准确验证因为实验设置的变化。为了进一步评估攻击和防御毒化方法的能力,我们开发了一个标准套件——APBench,用于评估毒化攻击的效果。APBench包括9种当前最佳的可用性毒化攻击,8种防御算法,以及4种常见的数据增强技术。我们还在不同的毒化比率下进行了实验,并对多个数据集和模型架构进行了评估。此外,我们还进行了对2种专门针对无监督模型的攻击的全面评估。我们的结果显示现有的攻击方法对个人隐私无法提供充分的保护。APBench是开源的,可以在GitHub上获取:https://github.com/lafeat/apbench。”

Learning a Graph Neural Network with Cross Modality Interaction for Image Fusion

  • paper_url: http://arxiv.org/abs/2308.03256
  • repo_url: https://github.com/lok-18/ignet
  • paper_authors: Jiawei Li, Jiansheng Chen, Jinyuan Liu, Huimin Ma
  • for: 本研究旨在提出一种基于交互式图 neural network(GNN)的诸多模式图像融合方法,以提高图像融合的精度和效果。
  • methods: 本方法首先使用多尺度提取器来获得 shallow 特征,然后将这些特征作为必要的输入建立图структуре。接着,图交互模块将抽取的中间特征在不同模式之间进行交互,以实现跨模式和semantic学习。此外,我们还提出了领导节点来改进同模式信息传播。最后,我们将所有图结构特征合并以获得融合结果。
  • results: 我们在多个数据集(TNO、MFNet和M3FD)上进行了广泛的实验,结果表明,我们的IGNet方法可以生成视觉吸引人的融合图像,同时在检测和分割任务中平均获得2.59% mAP@.5和7.77% mIoU高于相关的状态当前方法。
    Abstract Infrared and visible image fusion has gradually proved to be a vital fork in the field of multi-modality imaging technologies. In recent developments, researchers not only focus on the quality of fused images but also evaluate their performance in downstream tasks. Nevertheless, the majority of methods seldom put their eyes on the mutual learning from different modalities, resulting in fused images lacking significant details and textures. To overcome this issue, we propose an interactive graph neural network (GNN)-based architecture between cross modality for fusion, called IGNet. Specifically, we first apply a multi-scale extractor to achieve shallow features, which are employed as the necessary input to build graph structures. Then, the graph interaction module can construct the extracted intermediate features of the infrared/visible branch into graph structures. Meanwhile, the graph structures of two branches interact for cross-modality and semantic learning, so that fused images can maintain the important feature expressions and enhance the performance of downstream tasks. Besides, the proposed leader nodes can improve information propagation in the same modality. Finally, we merge all graph features to get the fusion result. Extensive experiments on different datasets (TNO, MFNet and M3FD) demonstrate that our IGNet can generate visually appealing fused images while scoring averagely 2.59% mAP@.5 and 7.77% mIoU higher in detection and segmentation than the compared state-of-the-art methods. The source code of the proposed IGNet can be available at https://github.com/lok-18/IGNet.
    摘要 infrared和可见图像融合逐渐成为多Modal imaging技术中的重要分支。在最近的发展中,研究人员不仅关注融合图像的质量,还评估其在下游任务中的表现。然而,大多数方法很少关注不同模式之间的相互学习,导致融合图像缺乏重要的特征和тексту。为解决这问题,我们提出了一种交互式图 neural network(GNN)基于树结构的架构,called IGNet。具体来说,我们首先应用多级提取器来获得 shallow 特征,这些特征被用作构建图像结构的必要输入。然后,图像交互模块可以将抽象分支中的中间特征构建成图像结构。同时,两个分支的图像结构之间进行交互性学习,以便在不同模式之间增强融合图像的表现。此外,我们还提出了领导节点,以提高同一个模式中的信息传播。最后,我们将所有的图像特征合并到一起,以获得融合结果。我们在不同的数据集(TNO、MFNet和M3FD)进行了广泛的实验,结果表明,我们的IGNet可以生成有趣的融合图像,同时与比较的状态前方法相比,其在检测和分类任务中的性能提高了2.59% mAP@.5和7.77% mIoU。源代码可以在https://github.com/lok-18/IGNet中下载。

Local Consensus Enhanced Siamese Network with Reciprocal Loss for Two-view Correspondence Learning

  • paper_url: http://arxiv.org/abs/2308.03217
  • repo_url: None
  • paper_authors: Linbo Wang, Jing Wu, Xianyong Fang, Zhengyi Liu, Chenjie Cao, Yanwei Fu
  • for: 提高两视匹配学习的性能
  • methods: 提出了一个Local Feature Consensus(LFC)插件块,对现有模型的特征进行增强,以及一种对抗式损失函数,使用反向映射的信息进行监督网络训练
  • results: 实验表明,基于MSA-Net的两个提议可以提高匹配性能,达到了参考数据集上的状态速度表现
    Abstract Recent studies of two-view correspondence learning usually establish an end-to-end network to jointly predict correspondence reliability and relative pose. We improve such a framework from two aspects. First, we propose a Local Feature Consensus (LFC) plugin block to augment the features of existing models. Given a correspondence feature, the block augments its neighboring features with mutual neighborhood consensus and aggregates them to produce an enhanced feature. As inliers obey a uniform cross-view transformation and share more consistent learned features than outliers, feature consensus strengthens inlier correlation and suppresses outlier distraction, which makes output features more discriminative for classifying inliers/outliers. Second, existing approaches supervise network training with the ground truth correspondences and essential matrix projecting one image to the other for an input image pair, without considering the information from the reverse mapping. We extend existing models to a Siamese network with a reciprocal loss that exploits the supervision of mutual projection, which considerably promotes the matching performance without introducing additional model parameters. Building upon MSA-Net, we implement the two proposals and experimentally achieve state-of-the-art performance on benchmark datasets.
    摘要 (Simplified Chinese translation)现在的研究通常是两视匹配学习的结果,通常是一个端到端网络来同时预测匹配可靠性和相对pose。我们从两个方面提高了这种框架:第一,我们提议一个Local Feature Consensus(LFC)插件块来增强现有模型的特征。给一个匹配特征,这个块将其周围的特征通过相互邻居一致来增强,并将其汇聚到生成一个加强特征。由于匹配点遵循同一个跨视图变换,并且在学习过程中分享更一致的特征,因此特征一致性增强了匹配点的相互关系,降低了干扰器的影响,使输出特征更有力度地分类匹配/干扰。第二,现有的方法通常通过真实对应的地址来训练网络,而不考虑反向映射的信息。我们将现有模型扩展为一个SIAMESE网络,使用对偶损失来利用对偶映射的超级vision,从而明显提高匹配性能,而不需要添加更多的模型参数。基于MSA-Net,我们实现了这两个提议,并在测试数据集上实现了状态机器人的性能。

Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP)

  • paper_url: http://arxiv.org/abs/2308.03203
  • repo_url: None
  • paper_authors: Youssef Sultan, Yongqiang Wang, James Scanlon, Lisa D’lima
  • for: 这个研究旨在为 HuBMAP 项目提供细致的微血管结构分割方法,以创建详细的人体细胞地图。
  • methods: 该研究使用了基础 FastAI U-Net 模型,并对其进行了改进,包括使用不同的后端架构、深度模型和特征峰网络。
  • results: 研究对不同方法进行了严谨的评估,并发现了各种改进方法的性能。这种研究提供了未来研究领域的有价值透彻。
    Abstract Image segmentation serves as a critical tool across a range of applications, encompassing autonomous driving's pedestrian detection and pre-operative tumor delineation in the medical sector. Among these applications, we focus on the National Institutes of Health's (NIH) Human BioMolecular Atlas Program (HuBMAP), a significant initiative aimed at creating detailed cellular maps of the human body. In this study, we concentrate on segmenting various microvascular structures in human kidneys, utilizing 2D Periodic Acid-Schiff (PAS)-stained histology images. Our methodology begins with a foundational FastAI U-Net model, upon which we investigate alternative backbone architectures, delve into deeper models, and experiment with Feature Pyramid Networks. We rigorously evaluate these varied approaches by benchmarking their performance against our baseline U-Net model. This study thus offers a comprehensive exploration of cutting-edge segmentation techniques, providing valuable insights for future research in the field.
    摘要 图像分割是许多应用中的一项关键工具,包括自动驾驶中的行人检测和医疗领域中的术前肿瘤勾画。在这些应用中,我们重点关注美国国立卫生研究院(NIH)的人体生物分子图谱计划(HuBMAP),这是一项旨在创建人体细胞图谱的重要计划。在这项研究中,我们专注于人类肾脏中微血管结构的分割,使用 2D Periodic Acid-Schiff(PAS)染色的组织学图像。我们的方法从基础的 FastAI U-Net 模型出发,进而考察替代的骨干网络结构、更深的模型以及 Feature Pyramid Networks。我们通过与基准 U-Net 模型的性能对比,严格评估这些不同的方法。这项研究因此提供了一次对前沿分割技术的全面探索,为该领域的未来研究提供有价值的见解。
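
For reference, a baseline fastai U-Net like the one this study starts from can be set up in a few lines. The dataset paths, label function, and class codes below are hypothetical placeholders, not the HuBMAP data layout.

```python
from fastai.vision.all import *

path = Path("data/kidney_pas")                         # hypothetical layout
codes = ["background", "microvasculature"]             # hypothetical classes

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path / "images"),
    label_func=lambda f: path / "masks" / f.name,      # mask named like image (assumption)
    codes=codes,
    item_tfms=Resize(512),
)

# Baseline U-Net with a ResNet-34 backbone; swapping the backbone or adding
# an FPN-style decoder corresponds to the variants explored in the paper.
learn = unet_learner(dls, resnet34, metrics=Dice())
learn.fine_tune(5)
```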

Syn-Mediverse: A Multimodal Synthetic Dataset for Intelligent Scene Understanding of Healthcare Facilities

  • paper_url: http://arxiv.org/abs/2308.03193
  • repo_url: None
  • paper_authors: Rohit Mohan, José Arce, Sassan Mokhtar, Daniele Cattaneo, Abhinav Valada
  • For: This paper provides a large-scale multimodal synthetic dataset for studying scene understanding of healthcare facilities.
  • Methods: The dataset is generated with a simulated industry-standard optical tracking camera and contains over 48,000 images with more than 1.5M annotations spanning five scene understanding tasks.
  • Results: The paper provides an extensive baseline evaluation across the tasks, along with a public online evaluation benchmark to advance research on scene understanding of healthcare facilities.
    Abstract Safety and efficiency are paramount in healthcare facilities where the lives of patients are at stake. Despite the adoption of robots to assist medical staff in challenging tasks such as complex surgeries, human expertise is still indispensable. The next generation of autonomous healthcare robots hinges on their capacity to perceive and understand their complex and frenetic environments. While deep learning models are increasingly used for this purpose, they require extensive annotated training data which is impractical to obtain in real-world healthcare settings. To bridge this gap, we present Syn-Mediverse, the first hyper-realistic multimodal synthetic dataset of diverse healthcare facilities. Syn-Mediverse contains over \num{48000} images from a simulated industry-standard optical tracking camera and provides more than 1.5M annotations spanning five different scene understanding tasks including depth estimation, object detection, semantic segmentation, instance segmentation, and panoptic segmentation. We demonstrate the complexity of our dataset by evaluating the performance on a broad range of state-of-the-art baselines for each task. To further advance research on scene understanding of healthcare facilities, along with the public dataset we provide an online evaluation benchmark available at \url{http://syn-mediverse.cs.uni-freiburg.de}
    摘要 安全和效率在医疗设施中是非常重要,因为患者的生命正在归附。虽然已经采用了机器人来协助医疗人员完成复杂的手术等任务,但人类专业仍然是不可或缺的。下一代自动化医疗机器人的发展取决于它们能够在复杂和紧张的医疗环境中进行感知和理解。然而,深度学习模型在这种目的上面习用的数据是实际医疗设施中获得的困难。为了bridging这个差距,我们介绍了Syn-Mediverse,首个 Hyper-Realistic 多模态人工数据集。Syn-Mediverse包含了 более48000张来自 simulated 行业标准光学跟踪相机的图像,以及1500000多个注释,涵盖了五个不同的场景理解任务,包括深度估计、物体检测、semantic segmentation、instance segmentation和panoptic segmentation。我们通过评估一系列国际顶峰模型的性能来证明Syn-Mediverse的复杂性。为了进一步推动医疗设施场景理解的研究,我们同时提供了在 line 4 提到的在线评估平台,可以在http://syn-mediverse.cs.uni-freiburg.de 上获取。

Understanding Biometric Entropy and Iris Capacity: Avoiding Identity Collisions on National Scales

  • paper_url: http://arxiv.org/abs/2308.03189
  • repo_url: None
  • paper_authors: John Daugman
  • for: 研究了基于眼睛图像的唯一身份识别方法,特别是人口规模下的唯一标识。
  • methods: 使用眼睛图像的生物特征,如眼睛肤色、眼睛形状等,进行唯一身份识别。
  • results: 研究发现,使用眼睛图像可以实现高精度的唯一身份识别,并且可以处理大规模人口数据。具体来说,在US NIST(国家标准技术研究所)试验中,使用眼睛图像进行1.2亿次比较,并没有发现任何身份冲突现象。此外,研究还发现,使用两个眼睛图像的生物特征可以保证全球唯一身份识别。
    Abstract The numbers of persons who can be enrolled by their iris patterns with no identity collisions is studied in relation to the biometric entropy extracted, and the decision operating threshold. The population size at which identity collision becomes likelier than not, given those variables, defines iris "capacity." The general solution to this combinatorial problem is derived, in analogy with the well-known "birthday problem." Its application to unique biometric identification on national population scales is shown, referencing empirical data from US NIST (National Institute of Standards and Technology) trials involving 1.2 trillion (1.2 x 10^(12) ) iris comparisons. The entropy of a given person's two iris patterns suffices for global identity uniqueness.
    摘要 本文研究在给定所提取的生物特征熵和判决操作阈值的条件下,可以通过虹膜图案注册而不发生身份冲突的人数。在这些变量给定时,身份冲突发生概率超过一半的人口规模即定义了虹膜的"容量"。文中给出了这一组合问题的一般解,其与著名的"生日问题"类似;并以美国 NIST(国家标准技术研究所)涉及 1.2 万亿(1.2 x 10^(12))次虹膜比对的实验数据为参照,展示了该结果在国家级人口规模唯一生物特征识别中的应用。一个人双眼虹膜图案的熵足以保证全球范围内的身份唯一性。
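
The "capacity" notion above follows the generalized birthday problem; a small Python sketch of that calculation is below. The effective number of distinguishable iris codes used here is purely illustrative and is not a figure from the paper.

```python
import math

def collision_probability(n_people: float, effective_states: float) -> float:
    """Approximate probability that at least two of n_people share an iris
    code, assuming codes are uniform over `effective_states` distinguishable
    states (generalized birthday problem, valid for n << sqrt(states))."""
    return 1.0 - math.exp(-n_people * (n_people - 1) / (2.0 * effective_states))

def capacity(effective_states: float) -> float:
    """Population size at which a collision becomes likelier than not:
    n ~ sqrt(2 ln 2 * M)."""
    return math.sqrt(2.0 * math.log(2.0) * effective_states)

# Illustrative only: 2**200 effective states stands in for the entropy
# extracted at a given operating threshold (not a figure from the paper).
M = 2.0 ** 200
print(f"capacity ~ {capacity(M):.3e} enrollees")
print(f"P(collision) for 8e9 people ~ {collision_probability(8e9, M):.3e}")
```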

Photorealistic and Identity-Preserving Image-Based Emotion Manipulation with Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.03183
  • repo_url: None
  • paper_authors: Ioannis Pikoulis, Panagiotis P. Filntisis, Petros Maragos
  • for: investigate the emotion manipulation capabilities of diffusion models with “in-the-wild” images
  • methods: Latent Diffusion models and text-driven manipulation with CLIP latents
  • results: superior image quality and realism, with competitive emotion-translation performance relative to GAN-based counterparts.
    Abstract In this paper, we investigate the emotion manipulation capabilities of diffusion models with "in-the-wild" images, a rather unexplored application area relative to the vast and rapidly growing literature for image-to-image translation tasks. Our proposed method encapsulates several pieces of prior work, with the most important being Latent Diffusion models and text-driven manipulation with CLIP latents. We conduct extensive qualitative and quantitative evaluations on AffectNet, demonstrating the superiority of our approach in terms of image quality and realism, while achieving competitive results relative to emotion translation compared to a variety of GAN-based counterparts. Code is released as a publicly available repo.
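For readers unfamiliar with text-driven manipulation on top of a latent diffusion model, the sketch below shows a generic image-to-image edit with a text prompt using the Hugging Face diffusers library. It is not the authors' pipeline (which additionally leverages CLIP latents); the model checkpoint, prompt, and strength value are illustrative assumptions.

```python
# Minimal text-driven image edit with a latent diffusion img2img pipeline.
# NOT the paper's method; checkpoint, prompt, and parameters are assumptions.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

face = Image.open("face.jpg").convert("RGB").resize((512, 512))

# A low strength preserves most of the source structure (helping identity),
# while the text prompt steers the expressed emotion.
edited = pipe(
    prompt="a photo of the same person with a happy expression",
    image=face,
    strength=0.4,
    guidance_scale=7.5,
).images[0]
edited.save("face_happy.jpg")
```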

Boosting Few-shot 3D Point Cloud Segmentation via Query-Guided Enhancement

  • paper_url: http://arxiv.org/abs/2308.03177
  • repo_url: None
  • paper_authors: Zhenhua Ning, Zhuotao Tian, Guangming Lu, Wenjie Pei
  • for: Improve the ability of point cloud few-shot segmentation (PC-FSS) models to adapt to novel categories.
  • methods: Propose a query-guided enhancement that adapts support background prototypes to the query context (removing extraneous cues) and holistically rectifies support prototypes under the guidance of query features to close the semantic gap.
  • results: Experiments on S3DIS and ScanNet show significant improvements while maintaining high efficiency.
    Abstract Although extensive research has been conducted on 3D point cloud segmentation, effectively adapting generic models to novel categories remains a formidable challenge. This paper proposes a novel approach to improve point cloud few-shot segmentation (PC-FSS) models. Unlike existing PC-FSS methods that directly utilize categorical information from support prototypes to recognize novel classes in query samples, our method identifies two critical aspects that substantially enhance model performance by reducing contextual gaps between support prototypes and query features. Specifically, we (1) adapt support background prototypes to match query context while removing extraneous cues that may obscure foreground and background in query samples, and (2) holistically rectify support prototypes under the guidance of query features to emulate the latter having no semantic gap to the query targets. Our proposed designs are agnostic to the feature extractor, rendering them readily applicable to any prototype-based methods. The experimental results on S3DIS and ScanNet demonstrate notable practical benefits, as our approach achieves significant improvements while still maintaining high efficiency. The code for our approach is available at https://github.com/AaronNZH/Boosting-Few-shot-3D-Point-Cloud-Segmentation-via-Query-Guided-Enhancement
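The following is a conceptual sketch of prototype-based few-shot point cloud segmentation with a simple query-guided prototype update, intended only to illustrate the idea of reducing the support/query context gap; the shapes, rectification rule, and hyperparameters are assumptions, not the authors' implementation.

```python
# Conceptual sketch of query-guided prototype rectification (assumptions throughout).
import torch
import torch.nn.functional as F

def masked_average(feat, mask):
    """Average point features (N, C) over a boolean mask to get a (C,) prototype."""
    return (feat * mask.unsqueeze(-1)).sum(0) / mask.sum().clamp(min=1)

def query_guided_rectify(proto, query_feat, topk=64):
    """Pull a support prototype toward its most similar query features,
    reducing the support/query context gap before classification."""
    sim = F.cosine_similarity(query_feat, proto.unsqueeze(0), dim=-1)   # (Nq,)
    idx = sim.topk(min(topk, query_feat.shape[0])).indices
    return 0.5 * proto + 0.5 * query_feat[idx].mean(0)

# Toy features: 2048 support / query points with 96-dim embeddings.
support_feat = torch.randn(2048, 96)
support_fg = torch.rand(2048) > 0.7            # foreground mask of the novel class
query_feat = torch.randn(2048, 96)

fg_proto = query_guided_rectify(masked_average(support_feat, support_fg), query_feat)
bg_proto = query_guided_rectify(masked_average(support_feat, ~support_fg), query_feat)

protos = torch.stack([bg_proto, fg_proto])                               # (2, C)
logits = F.cosine_similarity(query_feat.unsqueeze(1), protos.unsqueeze(0), dim=-1)
pred = logits.argmax(dim=1)                     # 0 = background, 1 = foreground per query point
```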

FireFly: A Synthetic Dataset for Ember Detection in Wildfire

  • paper_url: http://arxiv.org/abs/2308.03164
  • repo_url: https://github.com/ergowho/firefly2.0
  • paper_authors: Yue Hu, Xinan Ye, Yifei Liu, Souvik Kundu, Gourav Datta, Srikar Mutnuri, Namo Asavisanu, Nora Ayanian, Konstantinos Psounis, Peter Beerel
  • for: Provide a synthetic dataset for ember detection in wildfires, addressing the current lack of ember-specific training resources.
  • methods: Create the "FireFly" dataset with Unreal Engine 4 (UE4), provide a tool for automated generation of labeled synthetic data with adjustable parameters covering diverse environmental conditions, and leverage a trained model for a semi-automatic labeling process on real-life ember frames.
  • results: Models trained with FireFly achieve up to an 8.57% improvement in mean Average Precision (mAP) in real-world wildfire scenarios compared with models trained exclusively on a small real dataset; four popular object detection models were evaluated on 19,273 generated frames.
    Abstract This paper presents "FireFly", a synthetic dataset for ember detection created using Unreal Engine 4 (UE4), designed to overcome the current lack of ember-specific training resources. To create the dataset, we present a tool that allows the automated generation of the synthetic labeled dataset with adjustable parameters, enabling data diversity from various environmental conditions, making the dataset both diverse and customizable based on user requirements. We generated a total of 19,273 frames that have been used to evaluate FireFly on four popular object detection models. Further to minimize human intervention, we leveraged a trained model to create a semi-automatic labeling process for real-life ember frames. Moreover, we demonstrated an up to 8.57% improvement in mean Average Precision (mAP) in real-world wildfire scenarios compared to models trained exclusively on a small real dataset.
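The semi-automatic labeling step described above can be illustrated by running an already-trained detector over real frames and keeping high-confidence boxes as pseudo-labels for human review. The sketch below uses a COCO-pretrained torchvision detector as a stand-in; the model, score threshold, and paths are assumptions rather than the authors' actual tooling.

```python
# Semi-automatic pseudo-labeling sketch (model, threshold, and paths are assumed).
import json
from pathlib import Path

import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

pseudo_labels = {}
for frame in sorted(Path("real_ember_frames").glob("*.jpg")):
    img = convert_image_dtype(read_image(str(frame)), torch.float)   # uint8 -> float in [0, 1]
    with torch.no_grad():
        out = model([img])[0]
    keep = out["scores"] > 0.7                                       # confidence threshold (assumed)
    pseudo_labels[frame.name] = {
        "boxes": out["boxes"][keep].tolist(),
        "labels": out["labels"][keep].tolist(),
    }

# Dump the candidate labels for a human annotator to verify and correct.
Path("pseudo_labels.json").write_text(json.dumps(pseudo_labels, indent=2))
```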

CGBA: Curvature-aware Geometric Black-box Attack

  • paper_url: http://arxiv.org/abs/2308.03163
  • repo_url: https://github.com/farhamdur/cgba
  • paper_authors: Md Farhamdur Reza, Ali Rahmati, Tianfu Wu, Huaiyu Dai
  • for: Propose a query-efficient decision-based black-box attack that crafts high-quality adversarial examples.
  • methods: Perform a curvature-aware geometric boundary search along a semicircular path on a restricted 2D plane, ensuring a boundary point is found regardless of the boundary curvature; a variant (CGBA-H) and an improved initial boundary-point selection are developed for targeted attacks.
  • results: Efficiently generates adversarial examples against well-known classifiers on ImageNet and CIFAR10, outperforming state-of-the-art non-targeted attacks (CGBA) and targeted attacks (CGBA-H).
    Abstract Decision-based black-box attacks often necessitate a large number of queries to craft an adversarial example. Moreover, decision-based attacks based on querying boundary points in the estimated normal vector direction often suffer from inefficiency and convergence issues. In this paper, we propose a novel query-efficient curvature-aware geometric decision-based black-box attack (CGBA) that conducts boundary search along a semicircular path on a restricted 2D plane to ensure finding a boundary point successfully irrespective of the boundary curvature. While the proposed CGBA attack can work effectively for an arbitrary decision boundary, it is particularly efficient in exploiting the low curvature to craft high-quality adversarial examples, which is widely seen and experimentally verified in commonly used classifiers under non-targeted attacks. In contrast, the decision boundaries often exhibit higher curvature under targeted attacks. Thus, we develop a new query-efficient variant, CGBA-H, that is adapted for the targeted attack. In addition, we further design an algorithm to obtain a better initial boundary point at the expense of some extra queries, which considerably enhances the performance of the targeted attack. Extensive experiments are conducted to evaluate the performance of our proposed methods against some well-known classifiers on the ImageNet and CIFAR10 datasets, demonstrating the superiority of CGBA and CGBA-H over state-of-the-art non-targeted and targeted attacks, respectively. The source code is available at https://github.com/Farhamdur/CGBA.
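As background for the decision-based setting CGBA operates in, the toy sketch below locates a point near the decision boundary by bisecting between a benign input and an adversarial starting point while observing only top-1 decisions. This is generic boundary bisection, not the semicircular-path search proposed in the paper; the classifier and inputs are synthetic stand-ins.

```python
# Toy decision-based boundary bisection (not the CGBA algorithm itself).
import numpy as np

def top1_decision(x: np.ndarray) -> int:
    """Stand-in black-box classifier: only the predicted label is observable."""
    return int(x.sum() > 0.0)   # toy decision rule (assumption)

def bisect_boundary(x_benign, x_adv, src_label, n_steps=30):
    """Binary search for a point near the decision boundary between a
    correctly classified input and an adversarial one."""
    lo, hi = x_benign.copy(), x_adv.copy()
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if top1_decision(mid) == src_label:
            lo = mid            # still on the benign side; move toward adversarial
        else:
            hi = mid            # already misclassified; tighten toward benign
    return hi                   # closest misclassified point found

rng = np.random.default_rng(0)
x = -np.abs(rng.normal(size=32))         # benign input, classified as label 0
x_adv = np.abs(rng.normal(size=32))      # adversarial starting point, classified as 1
boundary_pt = bisect_boundary(x, x_adv, src_label=top1_decision(x))
print(np.linalg.norm(boundary_pt - x))   # perturbation norm at the located boundary point
```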