cs.CV - 2023-08-01

ELFNet: Evidential Local-global Fusion for Stereo Matching

  • paper_url: http://arxiv.org/abs/2308.00728
  • repo_url: https://github.com/jimmy19991222/elfnet
  • paper_authors: Jieming Lou, Weide Liu, Zhuo Chen, Fayao Liu, Jun Cheng
  • for: This paper proposes an Evidential Local-global Fusion (ELF) framework that brings uncertainty estimation and confidence-aware fusion to stereo matching.
  • methods: Instead of predicting the disparity map alone, the model estimates an evidential disparity that accounts for both aleatoric and epistemic uncertainties. With the normal inverse-Gamma distribution as a bridge, the framework performs intra evidential fusion of multi-level predictions and inter evidential fusion between cost-volume-based and transformer-based stereo matching (see the sketch at the end of this entry).
  • results: Experiments show that the framework exploits multi-view information effectively and achieves state-of-the-art overall accuracy and cross-domain generalization. Code is available at https://github.com/jimmy19991222/ELFNet.
    Abstract Although existing stereo matching models have achieved continuous improvement, they often face issues related to trustworthiness due to the absence of uncertainty estimation. Additionally, effectively leveraging multi-scale and multi-view knowledge of stereo pairs remains unexplored. In this paper, we introduce the Evidential Local-global Fusion (ELF) framework for stereo matching, which endows both uncertainty estimation and confidence-aware fusion with trustworthy heads. Instead of predicting the disparity map alone, our model estimates an evidential-based disparity considering both aleatoric and epistemic uncertainties. With the normal inverse-Gamma distribution as a bridge, the proposed framework realizes intra evidential fusion of multi-level predictions and inter evidential fusion between cost-volume-based and transformer-based stereo matching. Extensive experimental results show that the proposed framework exploits multi-view information effectively and achieves state-of-the-art overall performance both on accuracy and cross-domain generalization. The codes are available at https://github.com/jimmy19991222/ELFNet.
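For readers unfamiliar with evidential regression, the sketch below shows how a normal inverse-Gamma (NIG) head yields a disparity estimate together with aleatoric and epistemic uncertainty, following the standard deep evidential regression formulation; the variable names are illustrative and the paper's intra/inter fusion rules are not reproduced here.

```python
import torch

def nig_uncertainties(gamma, nu, alpha, beta):
    """Point estimate and uncertainties from Normal Inverse-Gamma parameters.

    gamma is the predicted disparity (mean); nu > 0, alpha > 1, beta > 0 are
    evidential parameters, typically produced through softplus activations.
    """
    disparity = gamma                           # point prediction
    aleatoric = beta / (alpha - 1.0)            # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))     # variance of the mean, Var[mu]
    return disparity, aleatoric, epistemic

# toy example with dummy per-pixel parameters
gamma = torch.tensor([[12.3]])
nu, alpha, beta = torch.tensor([[2.0]]), torch.tensor([[3.0]]), torch.tensor([[1.5]])
print(nig_uncertainties(gamma, nu, alpha, beta))
```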

NeRT: Implicit Neural Representations for General Unsupervised Turbulence Mitigation

  • paper_url: http://arxiv.org/abs/2308.00622
  • repo_url: None
  • paper_authors: Weiyun Jiang, Vivek Boominathan, Ashok Veeraraghavan
  • for: General unsupervised mitigation of atmospheric and water turbulence.
  • methods: Leverages implicit neural representations together with a physically correct tilt-then-blur turbulence model to reconstruct the clean, undistorted image from only dozens of distorted inputs (see the sketch at the end of this entry).
  • results: Outperforms the state of the art, removes uncontrolled turbulence from real-world scenes, and achieves a 48x speedup when applied to continuously captured video sequences.
    Abstract The atmospheric and water turbulence mitigation problems have emerged as challenging inverse problems in computer vision and optics communities over the years. However, current methods either rely heavily on the quality of the training dataset or fail to generalize over various scenarios, such as static scenes, dynamic scenes, and text reconstructions. We propose a general implicit neural representation for unsupervised atmospheric and water turbulence mitigation (NeRT). NeRT leverages the implicit neural representations and the physically correct tilt-then-blur turbulence model to reconstruct the clean, undistorted image, given only dozens of distorted input images. Moreover, we show that NeRT outperforms the state-of-the-art through various qualitative and quantitative evaluations of atmospheric and water turbulence datasets. Furthermore, we demonstrate the ability of NeRT to eliminate uncontrolled turbulence from real-world environments. Lastly, we incorporate NeRT into continuously captured video sequences and demonstrate $48 \times$ speedup.
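The tilt-then-blur ordering referenced in the abstract can be sketched as follows, assuming per-frame tilt fields and a fixed depthwise blur kernel; both are illustrative stand-ins for the components NeRT actually estimates.

```python
import torch
import torch.nn.functional as F

def tilt_then_blur(clean, tilt, blur_kernel):
    """Apply a geometric tilt (per-pixel warp) first, then a blur.

    clean:       (1, C, H, W) current clean-image estimate
    tilt:        (1, H, W, 2) per-pixel offsets in normalized [-1, 1] coordinates
    blur_kernel: (C, 1, k, k) depthwise blur kernel
    """
    _, c, h, w = clean.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)               # identity sampling grid
    tilted = F.grid_sample(clean, grid + tilt, align_corners=True)  # tilt first
    pad = blur_kernel.shape[-1] // 2
    return F.conv2d(tilted, blur_kernel, padding=pad, groups=c)     # then blur
```

In an unsupervised setup of this kind, the rendered output would be compared against each distorted observation and the clean-image representation optimized by gradient descent.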

Adaptive Semantic Consistency for Cross-domain Few-shot Classification

  • paper_url: http://arxiv.org/abs/2308.00727
  • repo_url: None
  • paper_authors: Hengchu Lu, Yuanjie Shao, Xiang Wang, Changxin Gao
  • for: Tackles cross-domain few-shot classification (CD-FSC): recognizing novel target classes from only a few samples under a domain shift between source and target domains.
  • methods: Proposes a simple plug-and-play Adaptive Semantic Consistency (ASC) framework. During finetuning it reuses the source images from the pretraining phase, assigns adaptive weights that emphasize samples similar to the target domain, and applies a semantic consistency regularization between the features produced by the source model and the target model, preserving transferable source knowledge and reducing overfitting to the scarce target data (see the sketch at the end of this entry).
  • results: Experiments on multiple benchmarks show that ASC improves cross-domain robustness and yields consistent gains over the baselines.
    Abstract Cross-domain few-shot classification (CD-FSC) aims to identify novel target classes with a few samples, assuming that there exists a domain shift between source and target domains. Existing state-of-the-art practices typically pre-train on source domain and then finetune on the few-shot target data to yield task-adaptive representations. Despite promising progress, these methods are prone to overfitting the limited target distribution since data-scarcity and ignore the transferable knowledge learned in the source domain. To alleviate this problem, we propose a simple plug-and-play Adaptive Semantic Consistency (ASC) framework, which improves cross-domain robustness by preserving source transfer capability during the finetuning stage. Concretely, we reuse the source images in the pretraining phase and design an adaptive weight assignment strategy to highlight the samples similar to target domain, aiming to aggregate informative target-related knowledge from source domain. Subsequently, a semantic consistency regularization is applied to constrain the consistency between the semantic features of the source images output by the source model and target model. In this way, the proposed ASC enables explicit transfer of source domain knowledge to prevent the model from overfitting the target domain. Extensive experiments on multiple benchmarks demonstrate the effectiveness of the proposed ASC, and ASC provides consistent improvements over the baselines. The source code will be released.
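A minimal sketch of the two ingredients described above, adaptive weighting of reused source images and a consistency term between the frozen source model's features and the finetuned model's features; the specific weighting scheme (softmax over cosine similarity to a target prototype) is an illustrative assumption rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_semantic_consistency(src_feats_frozen, src_feats_finetuned, target_proto, tau=0.1):
    """Weighted feature-consistency regularizer on re-used source images.

    src_feats_frozen:    (N, D) features of source images from the frozen source model
    src_feats_finetuned: (N, D) features of the same images from the model being finetuned
    target_proto:        (D,)   mean feature of the few-shot target samples
    """
    sim = F.cosine_similarity(src_feats_frozen, target_proto.unsqueeze(0), dim=1)
    weights = F.softmax(sim / tau, dim=0)                      # emphasize target-like source images
    per_sample = ((src_feats_finetuned - src_feats_frozen) ** 2).mean(dim=1)
    return (weights * per_sample).sum()
```

During finetuning this regularizer would be added to the few-shot classification loss on the target support set.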

Explainable Cost-Sensitive Deep Neural Networks for Brain Tumor Detection from Brain MRI Images considering Data Imbalance

  • paper_url: http://arxiv.org/abs/2308.00608
  • repo_url: https://github.com/shahariar-shibli/explainable-cost-sensitive-deep-neural-networks-for-brain-tumor-detection-from-brain-mri-images
  • paper_authors: Md Tanvir Rouf Shawon, G. M. Shahariar Shibli, Farzad Ahmed, Sajib Kumar Saha Joy
  • for: This study uses CNN, ResNet50, InceptionV3, EfficientNetB0 and NASNetMobile models to detect brain tumors efficiently, reducing the time needed for manual report review and providing an automated brain tumor classification system.
  • methods: Proposes an automated pipeline comprising five models (CNN, ResNet50, InceptionV3, EfficientNetB0 and NASNetMobile), carefully fine-tuned and evaluated on a balanced dataset; a cost-sensitive neural network variant is proposed for imbalanced data (see the sketch at the end of this entry).
  • results: The fine-tuned InceptionV3 model reaches 99.33% accuracy on the balanced dataset, and explainable AI methods are incorporated to visualize the model's latent behavior and open up its black-box decisions. On the imbalanced dataset, the cost-sensitive networks achieve almost 4% higher accuracy than the conventional models, with CS-InceptionV3 reaching 92.31% accuracy and CS-CNN a recall of 1.00.
    Abstract This paper presents a research study on the use of Convolutional Neural Network (CNN), ResNet50, InceptionV3, EfficientNetB0 and NASNetMobile models to efficiently detect brain tumors in order to reduce the time required for manual review of the report and create an automated system for classifying brain tumors. An automated pipeline is proposed, which encompasses five models: CNN, ResNet50, InceptionV3, EfficientNetB0 and NASNetMobile. The performance of the proposed architecture is evaluated on a balanced dataset and found to yield an accuracy of 99.33% for fine-tuned InceptionV3 model. Furthermore, Explainable AI approaches are incorporated to visualize the model's latent behavior in order to understand its black box behavior. To further optimize the training process, a cost-sensitive neural network approach has been proposed in order to work with imbalanced datasets which has achieved almost 4% more accuracy than the conventional models used in our experiments. The cost-sensitive InceptionV3 (CS-InceptionV3) and CNN (CS-CNN) show a promising accuracy of 92.31% and a recall value of 1.00 respectively on an imbalanced dataset. The proposed models have shown great potential in improving tumor detection accuracy and must be further developed for application in practical solutions. We have provided the datasets and made our implementations publicly available at - https://github.com/shahariar-shibli/Explainable-Cost-Sensitive-Deep-Neural-Networks-for-Brain-Tumor-Detection-from-Brain-MRI-Images
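The cost-sensitive variant essentially makes mistakes on the under-represented class more expensive. A minimal sketch using inverse-frequency class weights with a standard cross-entropy loss is shown below; the class counts and exact cost assignment are illustrative, not those of the paper.

```python
import torch
import torch.nn as nn

# illustrative class counts from an imbalanced training set, e.g. [no tumor, tumor]
class_counts = torch.tensor([980.0, 120.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)   # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)   # misclassifying the rare class costs more

logits = torch.randn(8, 2)                        # model outputs for a batch
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```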

MonoNext: A 3D Monocular Object Detection with ConvNext

  • paper_url: http://arxiv.org/abs/2308.00596
  • repo_url: None
  • paper_authors: Marcelo Eduardo Pederiva, José Mario De Martino, Alessandro Zimmer
  • for: This paper aims to improve the accuracy and efficiency of Monocular 3D Object Detection models for autonomous driving perception tasks.
  • methods: The proposed method, called MonoNext, uses a spatial grid to map objects in the scene and employs the ConvNext network. It requires only 3D bounding box annotated data and is trained using a Multi-Tasking Learning approach.
  • results: In experiments with the KITTI dataset, MonoNext achieved high precision and competitive performance comparable with state-of-the-art approaches. With additional training data, MonoNext surpassed its initial performance and achieved even higher accuracy.
    Abstract Autonomous driving perception tasks rely heavily on cameras as the primary sensor for Object Detection, Semantic Segmentation, Instance Segmentation, and Object Tracking. However, RGB images captured by cameras lack depth information, which poses a significant challenge in 3D detection tasks. To supplement this missing data, mapping sensors such as LIDAR and RADAR are used for accurate 3D Object Detection. Despite their significant accuracy, the multi-sensor models are expensive and require a high computational demand. In contrast, Monocular 3D Object Detection models are becoming increasingly popular, offering a faster, cheaper, and easier-to-implement solution for 3D detections. This paper introduces a different Multi-Tasking Learning approach called MonoNext that utilizes a spatial grid to map objects in the scene. MonoNext employs a straightforward approach based on the ConvNext network and requires only 3D bounding box annotated data. In our experiments with the KITTI dataset, MonoNext achieved high precision and competitive performance comparable with state-of-the-art approaches. Furthermore, by adding more training data, MonoNext surpassed itself and achieved higher accuracies.

Latent-Shift: Gradient of Entropy Helps Neural Codecs

  • paper_url: http://arxiv.org/abs/2308.00725
  • repo_url: None
  • paper_authors: Muhammet Balcilar, Bharath Bhushan Damodaran, Karam Naser, Franck Galpin, Pierre Hellier
  • for: Improves the rate-distortion efficiency of learned end-to-end image/video codecs.
  • methods: Shows theoretically that the gradient of entropy, which is available at the decoder side, is correlated with the gradient of the reconstruction error (which is not), and exploits this decoder-side gradient during decoding (see the sketch at the end of this entry).
  • results: Experiments show that the gradient can be used with various compression methods, yielding 1-2% rate savings at the same quality; the method is orthogonal to other improvements and brings independent rate savings.
    Abstract End-to-end image/video codecs are getting competitive compared to traditional compression techniques that have been developed through decades of manual engineering efforts. These trainable codecs have many advantages over traditional techniques, such as easy adaptation to perceptual distortion metrics and high performance on specific domains thanks to their learning ability. However, state-of-the-art neural codecs do not take advantage of the existence of the gradient of entropy at the decoding device. In this paper, we theoretically show that the gradient of entropy (available at the decoder side) is correlated with the gradient of the reconstruction error (which is not available at the decoder side). We then demonstrate experimentally that this gradient can be used on various compression methods, leading to a $1-2\%$ rate savings for the same quality. Our method is orthogonal to other improvements and brings independent rate savings.
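A minimal sketch of how a decoder might use the gradient of the rate (negative log-likelihood under the entropy model) to nudge the decoded latent; the `entropy_model` callable and the step size are hypothetical placeholders, and the paper should be consulted for how the correlation with the reconstruction-error gradient is actually exploited.

```python
import torch

def shift_latent(latent, entropy_model, step=1e-2):
    """Refine a decoded latent with the decoder-side gradient of entropy.

    latent:        decoded latent tensor
    entropy_model: callable returning element-wise likelihoods p(latent)
    """
    latent = latent.clone().requires_grad_(True)
    rate = -torch.log2(entropy_model(latent) + 1e-9).sum()   # bits under the entropy model
    grad = torch.autograd.grad(rate, latent)[0]               # available at the decoder
    return (latent - step * grad).detach()                    # small shift before synthesis
```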

Visibility Enhancement for Low-light Hazy Scenarios

  • paper_url: http://arxiv.org/abs/2308.00591
  • repo_url: None
  • paper_authors: Chaoqun Zhuang, Yunfei Liu, Sijia Wen, Feng Lu
  • for: Enhancing visibility in low-light hazy images.
  • methods: Proposes two key techniques: a cross-consistency dehazing-enhancement framework, and a physically based simulation that generates a low-light hazy dataset with ground truth using the proposed imaging model (see the sketch at the end of this entry).
  • results: Extensive experiments show that the method outperforms state-of-the-art solutions on several metrics, including SSIM (9.19%) and PSNR (5.03%); a user study on real images confirms its effectiveness and necessity for human visual perception.
    Abstract Low-light hazy scenes commonly appear at dusk and early morning. The visual enhancement for low-light hazy images is an ill-posed problem. Even though numerous methods have been proposed for image dehazing and low-light enhancement respectively, simply integrating them cannot deliver pleasing results for this particular task. In this paper, we present a novel method to enhance visibility for low-light hazy scenarios. To handle this challenging task, we propose two key techniques, namely cross-consistency dehazing-enhancement framework and physically based simulation for low-light hazy dataset. Specifically, the framework is designed for enhancing visibility of the input image via fully utilizing the clues from different sub-tasks. The simulation is designed for generating the dataset with ground-truths by the proposed low-light hazy imaging model. The extensive experimental results show that the proposed method outperforms the SOTA solutions on different metrics including SSIM (9.19%) and PSNR(5.03%). In addition, we conduct a user study on real images to demonstrate the effectiveness and necessity of the proposed method by human visual perception.
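The physically based simulation can be illustrated with the standard atmospheric scattering model I = J·t + A·(1−t) combined with a global illumination drop and sensor noise; this generic approximation is only a stand-in for the paper's low-light hazy imaging model.

```python
import numpy as np

def simulate_low_light_haze(clean, transmission, airlight=0.8, dim=0.25, noise_std=0.02):
    """Synthesize a low-light hazy image from a clean image with values in [0, 1].

    clean:        (H, W, 3) ground-truth image
    transmission: (H, W) haze transmission map t in (0, 1]
    """
    t = transmission[..., None]
    hazy = clean * t + airlight * (1.0 - t)         # atmospheric scattering model
    low_light = dim * hazy                           # global illumination drop
    noisy = low_light + np.random.normal(0.0, noise_std, clean.shape)
    return np.clip(noisy, 0.0, 1.0)
```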

Relation-Aware Distribution Representation Network for Person Clustering with Multiple Modalities

  • paper_url: http://arxiv.org/abs/2308.00588
  • repo_url: None
  • paper_authors: Kaijian Liu, Shixiang Tang, Ziyue Li, Zhishuai Li, Lei Bai, Feng Zhu, Rui Zhao
  • for: Proposes a Relation-Aware Distribution representation Network (RAD-Net) for person clustering with multi-modal clues (faces, bodies, voices), a key step for movie parsing and identity-based movie editing.
  • methods: Uses a graph-based method to construct a distribution representation for each clue, encoding its relations to all clues from all modalities, and refines these representations progressively with a cyclic update policy (see the sketch at the end of this entry).
  • results: Achieves improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and the VoxCeleb2 multi-view clustering dataset, respectively.
    Abstract Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks, such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features into a joint feature space. However, multi-modal clue features are usually rather weakly correlated due to the semantic gap from the modality-specific uniqueness. As a result, these methods are not suitable for person clustering. In this paper, we propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues. The distribution representation of a clue is a vector consisting of the relation between this clue and all other clues from all modalities, thus being modality agnostic and good for person clustering. Accordingly, we introduce a graph-based method to construct distribution representation and employ a cyclic update policy to refine distribution representation progressively. Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and VoxCeleb2 multi-view clustering dataset, respectively. Codes will be released publicly upon acceptance.
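The distribution representation of a clue is described as a vector of its relations to all other clues across modalities. A minimal sketch built from cosine similarities is given below; the paper's graph construction and cyclic refinement are not reproduced.

```python
import torch
import torch.nn.functional as F

def distribution_representation(clue_feats):
    """Represent each clue by its relation to every clue from all modalities.

    clue_feats: (N, D) features of all clues (faces, bodies, voices) in a scene.
    Returns an (N, N) matrix whose i-th row is the distribution representation of clue i.
    """
    feats = F.normalize(clue_feats, dim=1)
    relations = feats @ feats.t()            # pairwise cosine similarities
    return F.softmax(relations, dim=1)       # row-normalized, modality-agnostic representation
```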

PVG: Progressive Vision Graph for Vision Recognition

  • paper_url: http://arxiv.org/abs/2308.00574
  • repo_url: None
  • paper_authors: Jiafu Wu, Jian Li, Jiangning Zhang, Boshen Zhang, Mingmin Chi, Yabiao Wang, Chengjie Wang
  • for: Proposes a Progressive Vision Graph (PVG) architecture that better captures irregular objects in vision recognition tasks.
  • methods: The architecture has three main components: 1) Progressively Separated Graph Construction (PSGC), which introduces second-order similarity by gradually increasing the channels of the global graph branch and decreasing those of the local branch as layers deepen; 2) a neighbor-information aggregation and update module based on max pooling and mathematical expectation (MaxE); 3) a Graph error Linear Unit (GraphLU) that enhances low-value information in a relaxed form, reducing the compression of image detail and alleviating over-smoothing.
  • results: Extensive experiments on mainstream benchmarks show PVG's advantage over state-of-the-art methods: PVG-S reaches 83.0% top-1 accuracy on ImageNet-1K, +0.9 over the GNN-based ViG-S with 18.5% fewer parameters, while the largest PVG-B reaches 84.2%, +0.5 over ViG-B. PVG-S also gains +1.3 box AP and +0.4 mask AP over ViG-S on COCO.
    Abstract Convolution-based and Transformer-based vision backbone networks process images into the grid or sequence structures, respectively, which are inflexible for capturing irregular objects. Though Vision GNN (ViG) adopts graph-level features for complex images, it has some issues, such as inaccurate neighbor node selection, expensive node information aggregation calculation, and over-smoothing in the deep layers. To address the above problems, we propose a Progressive Vision Graph (PVG) architecture for vision recognition task. Compared with previous works, PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC) to introduce second-order similarity by gradually increasing the channel of the global graph branch and decreasing the channel of local branch as the layer deepens; 2) Neighbor nodes information aggregation and update module by using Max pooling and mathematical Expectation (MaxE) to aggregate rich neighbor information; 3) Graph error Linear Unit (GraphLU) to enhance low-value information in a relaxed form to reduce the compression of image detail information for alleviating the over-smoothing. Extensive experiments on mainstream benchmarks demonstrate the superiority of PVG over state-of-the-art methods, e.g., our PVG-S obtains 83.0% Top-1 accuracy on ImageNet-1K that surpasses GNN-based ViG-S by +0.9 with the parameters reduced by 18.5%, while the largest PVG-B obtains 84.2% that has +0.5 improvement than ViG-B. Furthermore, our PVG-S obtains +1.3 box AP and +0.4 mask AP gains than ViG-S on COCO dataset.

Detecting Cloud Presence in Satellite Images Using the RGB-based CLIP Vision-Language Model

  • paper_url: http://arxiv.org/abs/2308.00541
  • repo_url: None
  • paper_authors: Mikolaj Czerkawski, Robert Atkinson, Christos Tachtatzis
  • for: Uses the pre-trained RGB-based CLIP vision-language model to detect cloud presence in satellite images.
  • methods: Evaluates several ways of applying CLIP to cloud presence detection, including purely zero-shot operation with text prompts and several fine-tuning approaches (see the sketch at the end of this entry).
  • results: CLIP achieves non-trivial cloud detection performance across datasets and sensor types (Sentinel-2 and Landsat-8), generalizing across sensing modalities and bands; a low-cost fine-tuning stage strongly increases the true negative rate.
    Abstract This work explores capabilities of the pre-trained CLIP vision-language model to identify satellite images affected by clouds. Several approaches to using the model to perform cloud presence detection are proposed and evaluated, including a purely zero-shot operation with text prompts and several fine-tuning approaches. Furthermore, the transferability of the methods across different datasets and sensor types (Sentinel-2 and Landsat-8) is tested. The results show that CLIP can achieve non-trivial performance on the cloud presence detection task, with an apparent capability to generalise across sensing modalities and sensing bands. It is also found that a low-cost fine-tuning stage leads to a strong increase in true negative rate. The results demonstrate that the representations learned by the CLIP model can be useful for satellite image processing tasks involving clouds.
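A minimal sketch of the purely zero-shot setting using the OpenAI clip package; the prompt wording and the image path are assumptions, and the paper also evaluates several fine-tuning variants.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = clip.tokenize(["a satellite image with clouds",
                         "a satellite image with no clouds"]).to(device)
image = preprocess(Image.open("tile_rgb.png")).unsqueeze(0).to(device)  # RGB composite of a tile

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

is_cloudy = bool(probs[0, 0] > probs[0, 1])
```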

Visual attention information can be traced on cortical response but not on the retina: evidence from electrophysiological mouse data using natural images as stimuli

  • paper_url: http://arxiv.org/abs/2308.00526
  • repo_url: None
  • paper_authors: Nikos Melanitis, Konstantina Nikita
  • for: Investigates the biological basis of visual attention through a computational approach.
  • methods: Analyzes retinal and cortical electrophysiological data from mouse; the visual stimuli are natural images depicting real-world scenes.
  • results: In primary visual cortex (V1), a subset of around 10% of the neurons responds differently to salient versus non-salient visual regions, whereas no trace of visual attention information is found in the retinal response, suggesting the retina remains naive to attention while the cortical response is modulated to carry it.
    Abstract Visual attention forms the basis of understanding the visual world. In this work we follow a computational approach to investigate the biological basis of visual attention. We analyze retinal and cortical electrophysiological data from mouse. Visual Stimuli are Natural Images depicting real world scenes. Our results show that in primary visual cortex (V1), a subset of around $10\%$ of the neurons responds differently to salient versus non-salient visual regions. Visual attention information was not traced in retinal response. It appears that the retina remains naive concerning visual attention; cortical response gets modulated to interpret visual attention information. Experimental animal studies may be designed to further explore the biological basis of visual attention we traced in this study. In applied and translational science, our study contributes to the design of improved visual prostheses systems -- systems that create artificial visual percepts to visually impaired individuals by electronic implants placed on either the retina or the cortex.

NormKD: Normalized Logits for Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2308.00520
  • repo_url: https://github.com/gizi1/NormKD
  • paper_authors: Zhihao Chi, Tu Zheng, Hengjia Li, Zheng Yang, Boxi Wu, Binbin Lin, Deng Cai
  • for: Improves logit-based knowledge distillation, focusing on the temperature hyper-parameter used to soften the logits.
  • methods: Proposes Normalized Knowledge Distillation (NormKD), which customizes the temperature for each sample according to the characteristics of its logit distribution, so that the knowledge carried by each sample is transferred adequately (see the sketch at the end of this entry).
  • results: NormKD outperforms vanilla KD on CIFAR-100 and ImageNet for image classification with almost no extra computation or storage cost, and it can be applied to other logit-based methods to reach performance close to or better than feature-based methods.
    Abstract Logit based knowledge distillation gets less attention in recent years since feature based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, which is a crucial hyper-parameter to soften the logit outputs. For most of the previous works, it was set as a fixed value for the entire distillation procedure. However, as the logits from different samples are distributed quite variously, it is not feasible to soften all of them to an equal degree by just a single temperature, which may make the previous work transfer the knowledge of each sample inadequately. In this paper, we restudy the hyper-parameter temperature and figure out its incapability to distill the knowledge from each sample sufficiently when it is a single value. To address this issue, we propose Normalized Knowledge Distillation (NormKD), with the purpose of customizing the temperature for each sample according to the characteristic of the sample's logit distribution. Compared to the vanilla KD, NormKD barely has extra computation or storage cost but performs significantly better on CIFAR-100 and ImageNet for image classification. Furthermore, NormKD can be easily applied to the other logit based methods and achieve better performance which can be closer to or even better than the feature based method.
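A minimal sketch of per-sample temperatures derived from the spread of each sample's logits; taking the temperature proportional to the logit standard deviation is an illustrative reading of the normalization, so the paper and repository should be consulted for the exact formulation.

```python
import torch
import torch.nn.functional as F

def normkd_loss(student_logits, teacher_logits, base_temp=2.0, eps=1e-6):
    """Logit distillation with a sample-specific temperature from the logit spread."""
    t_teacher = teacher_logits.std(dim=1, keepdim=True) * base_temp + eps   # (B, 1)
    t_student = student_logits.std(dim=1, keepdim=True) * base_temp + eps
    p_teacher = F.softmax(teacher_logits / t_teacher, dim=1)
    log_p_student = F.log_softmax(student_logits / t_student, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```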

Markerless human pose estimation for biomedical applications: a survey

  • paper_url: http://arxiv.org/abs/2308.00519
  • repo_url: None
  • paper_authors: Andrea Avogaro, Federico Cunico, Bodo Rosenhahn, Francesco Setti
  • for: Provides an overview of markerless human pose estimation (HPE) for biomedical applications and assesses its potential.
  • methods: Surveys 25 HPE approaches and more than 40 studies applying HPE to motor development assessment, neuromuscular rehabilitation, and gait & posture analysis, examining which features of these approaches matter for biomedical use and which areas already employ them.
  • results: Markerless HPE shows great potential for extending diagnosis and rehabilitation outside hospitals and clinics, toward the paradigm of remote medical care.
    Abstract Markerless Human Pose Estimation (HPE) proved its potential to support decision making and assessment in many fields of application. HPE is often preferred to traditional marker-based Motion Capture systems due to the ease of setup, portability, and affordable cost of the technology. However, the exploitation of HPE in biomedical applications is still under investigation. This review aims to provide an overview of current biomedical applications of HPE. In this paper, we examine the main features of HPE approaches and discuss whether or not those features are of interest to biomedical applications. We also identify those areas where HPE is already in use and present peculiarities and trends followed by researchers and practitioners. We include here 25 approaches to HPE and more than 40 studies of HPE applied to motor development assessment, neuromuscular rehabilitation, and gait & posture analysis. We conclude that markerless HPE offers great potential for extending diagnosis and rehabilitation outside hospitals and clinics, toward the paradigm of remote medical care.

Relational Contrastive Learning for Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2308.00508
  • repo_url: https://github.com/thundervvv/rclstr
  • paper_authors: Jinglei Zhang, Tiancheng Lin, Yi Xu, Kai Chen, Rui Zhang
  • for: Improves self-supervised scene text recognition by interpreting the contextual priors of words as relations between textual primitives, which provide effective self-supervised labels for representation learning.
  • methods: Proposes RCLSTR (Relational Contrastive Learning for Scene Text Recognition), a unified framework that enriches textual relations via rearrangement, hierarchy and interaction to avoid over-fitting to the finite relations present in the dataset.
  • results: Experiments on representation quality show that RCLSTR outperforms state-of-the-art self-supervised STR methods.
    Abstract Context-aware methods achieved great success in supervised scene text recognition via incorporating semantic priors from words. We argue that such prior contextual information can be interpreted as the relations of textual primitives due to the heterogeneous text and background, which can provide effective self-supervised labels for representation learning. However, textual relations are restricted to the finite size of dataset due to lexical dependencies, which causes the problem of over-fitting and compromises representation robustness. To this end, we propose to enrich the textual relations via rearrangement, hierarchy and interaction, and design a unified framework called RCLSTR: Relational Contrastive Learning for Scene Text Recognition. Based on causality, we theoretically explain that three modules suppress the bias caused by the contextual prior and thus guarantee representation robustness. Experiments on representation quality show that our method outperforms state-of-the-art self-supervised STR methods. Code is available at https://github.com/ThunderVVV/RCLSTR.

Improved Prognostic Prediction of Pancreatic Cancer Using Multi-Phase CT by Integrating Neural Distance and Texture-Aware Transformer

  • paper_url: http://arxiv.org/abs/2308.00507
  • repo_url: None
  • paper_authors: Hexin Dong, Jiawen Yao, Yuxing Tang, Mingze Yuan, Yingda Xia, Jian Zhou, Hong Lu, Jingren Zhou, Bin Dong, Le Lu, Li Zhang, Zaiyi Liu, Yu Shi, Ling Zhang
  • for: Improves prognostic (overall survival) prediction for patients with pancreatic ductal adenocarcinoma (PDAC) from multi-phase CT.
  • methods: Proposes a learnable neural distance that precisely describes the relationship between the tumor and nearby important vessels in CT images and uses it as a major prognostic feature; dynamic tumor-related texture features are extracted from multi-phase contrast-enhanced CT by fusing local and global features with CNN and transformer modules.
  • results: Evaluated on a multi-center (n=4) dataset of 1,070 PDAC patients; statistical analysis confirms its clinical effectiveness on an external test set of three centers. The derived risk marker is the strongest preoperative predictor of overall survival and could be combined with established clinical factors to select higher-risk patients who might benefit from neoadjuvant therapy.
    Abstract Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal cancer in which the tumor-vascular involvement greatly affects the resectability and, thus, overall survival of patients. However, current prognostic prediction methods fail to explicitly and accurately investigate relationships between the tumor and nearby important vessels. This paper proposes a novel learnable neural distance that describes the precise relationship between the tumor and vessels in CT images of different patients, adopting it as a major feature for prognosis prediction. Besides, different from existing models that used CNNs or LSTMs to exploit tumor enhancement patterns on dynamic contrast-enhanced CT imaging, we improved the extraction of dynamic tumor-related texture features in multi-phase contrast-enhanced CT by fusing local and global features using CNN and transformer modules, further enhancing the features extracted across multi-phase CT images. We extensively evaluated and compared the proposed method with existing methods in the multi-center (n=4) dataset with 1,070 patients with PDAC, and statistical analysis confirmed its clinical effectiveness in the external test set consisting of three centers. The developed risk marker was the strongest predictor of overall survival among preoperative factors and it has the potential to be combined with established clinical factors to select patients at higher risk who might benefit from neoadjuvant therapy.

An L2-Normalized Spatial Attention Network For Accurate And Fast Classification Of Brain Tumors In 2D T1-Weighted CE-MRI Images

  • paper_url: http://arxiv.org/abs/2308.00491
  • repo_url: https://github.com/juliadietlmeier/mri_image_classification
  • paper_authors: Grace Billingsley, Julia Dietlmeier, Vivek Narayanaswamy, Andreas Spanias, Noel E. OConnor
  • for: Develops an accurate and fast classification network for identifying brain tumors in MRI images.
  • methods: Introduces an l2-normalized spatial attention mechanism that acts as a regularizer against overfitting during training (see the sketch at the end of this entry).
  • results: On a challenging 2D T1-weighted CE-MRI dataset with three tumor types (meningioma, glioma and pituitary), the model outperforms all lightweight methods investigated in accuracy; adding the attention to a baseline network yields a 1.79 percentage-point gain, and an ensemble with a pretrained VGG16 gives even higher accuracy at the cost of execution speed.
    Abstract We propose an accurate and fast classification network for classification of brain tumors in MRI images that outperforms all lightweight methods investigated in terms of accuracy. We test our model on a challenging 2D T1-weighted CE-MRI dataset containing three types of brain tumors: Meningioma, Glioma and Pituitary. We introduce an l2-normalized spatial attention mechanism that acts as a regularizer against overfitting during training. We compare our results against the state-of-the-art on this dataset and show that by integrating l2-normalized spatial attention into a baseline network we achieve a performance gain of 1.79 percentage points. Even better accuracy can be attained by combining our model in an ensemble with the pretrained VGG16 at the expense of execution speed. Our code is publicly available at https://github.com/juliadietlmeier/MRI_image_classification
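A minimal sketch of a spatial attention block whose attention map is l2-normalized before reweighting the features; the exact form and placement used in the paper may differ, and the repository linked above is authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2SpatialAttention(nn.Module):
    """Spatial attention with an l2-normalized attention map (illustrative form)."""

    def __init__(self, channels):
        super().__init__()
        self.to_map = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        attn = self.to_map(x)                              # (B, 1, H, W)
        b = attn.shape[0]
        attn = F.normalize(attn.view(b, -1), p=2, dim=1).view_as(attn)  # l2-normalize over space
        return x * attn                                    # spatially reweight the features
```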

DINO-CXR: A self supervised method based on vision transformer for chest X-ray classification

  • paper_url: http://arxiv.org/abs/2308.00475
  • repo_url: None
  • paper_authors: Mohammadreza Shakouri, Fatemeh Iranmanesh, Mahdi Eftekhari
  • for: Develops a self-supervised chest X-ray classification method to cope with the limited availability of labeled data in medical image analysis.
  • methods: Adapts the self-supervised DINO method with a vision transformer backbone for chest X-ray classification (DINO-CXR).
  • results: The method is effective for both pneumonia and COVID-19 detection, outperforming state-of-the-art methods in accuracy and achieving comparable AUC and F-1 scores while requiring significantly less labeled data.
    Abstract The limited availability of labeled chest X-ray datasets is a significant bottleneck in the development of medical imaging methods. Self-supervised learning (SSL) can mitigate this problem by training models on unlabeled data. Furthermore, self-supervised pretraining has yielded promising results in visual recognition of natural images but has not been given much consideration in medical image analysis. In this work, we propose a self-supervised method, DINO-CXR, which is a novel adaptation of a self-supervised method, DINO, based on a vision transformer for chest X-ray classification. A comparative analysis is performed to show the effectiveness of the proposed method for both pneumonia and COVID-19 detection. Through a quantitative analysis, it is also shown that the proposed method outperforms state-of-the-art methods in terms of accuracy and achieves comparable results in terms of AUC and F-1 score while requiring significantly less labeled data.

Is Last Layer Re-Training Truly Sufficient for Robustness to Spurious Correlations?

  • paper_url: http://arxiv.org/abs/2308.00473
  • repo_url: None
  • paper_authors: Phuong Quynh Le, Jörg Schlötterer, Christin Seifert
  • for: Examines whether retraining only the last layer is sufficient to make classifiers robust to spurious correlations, i.e., to improve accuracy on the worst sample groups.
  • methods: Studies Deep Feature Reweighting (DFR), which retrains only the last layer of a classification model on a small group-balanced dataset, and tests its applicability to realistic data in the medical domain (see the sketch at the end of this entry).
  • results: DFR can improve worst-group accuracy, but the analysis shows it remains susceptible to spurious correlations.
    Abstract Models trained with empirical risk minimization (ERM) are known to learn to rely on spurious features, i.e., their prediction is based on undesired auxiliary features which are strongly correlated with class labels but lack causal reasoning. This behavior particularly degrades accuracy in groups of samples of the correlated class that are missing the spurious feature or samples of the opposite class but with the spurious feature present. The recently proposed Deep Feature Reweighting (DFR) method improves accuracy of these worst groups. Based on the main argument that ERM models can learn core features sufficiently well, DFR only needs to retrain the last layer of the classification model with a small group-balanced data set. In this work, we examine the applicability of DFR to realistic data in the medical domain. Furthermore, we investigate the reasoning behind the effectiveness of last-layer retraining and show that even though DFR has the potential to improve the accuracy of the worst group, it remains susceptible to spurious correlations.
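Deep Feature Reweighting keeps the ERM-trained backbone frozen and refits only the classifier head on a small group-balanced subset. A minimal sketch with scikit-learn, assuming precomputed backbone embeddings and known group labels, is shown below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dfr_last_layer(embeddings, labels, groups, per_group=100, seed=0):
    """Retrain only the classifier head on a group-balanced subset of embeddings."""
    rng = np.random.default_rng(seed)
    keep = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        keep.append(rng.choice(idx, size=min(per_group, len(idx)), replace=False))
    keep = np.concatenate(keep)
    head = LogisticRegression(C=1.0, max_iter=1000)   # regularization strength is a tunable choice
    head.fit(embeddings[keep], labels[keep])
    return head
```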

A Deep Learning Approach for Virtual Contrast Enhancement in Contrast Enhanced Spectral Mammography

  • paper_url: http://arxiv.org/abs/2308.00471
  • repo_url: None
  • paper_authors: Aurora Rofena, Valerio Guarrasi, Marina Sarli, Claudia Lucia Piccolo, Matteo Sammarra, Bruno Beomonte Zobel, Paolo Soda
  • for: Improves the safety of contrast-enhanced spectral mammography (CESM) by using deep generative models for virtual contrast enhancement, avoiding the iodinated contrast medium and reducing the radiation dose.
  • methods: Uses deep generative models, an autoencoder and two generative adversarial networks (Pix2Pix and CycleGAN), to synthesize the recombined (contrast-enhanced) image solely from the low-energy image.
  • results: On a new CESM dataset of 1,138 images, made publicly available with this work, CycleGAN is the most promising network for generating synthetic recombined images, as confirmed by quantitative analysis and radiologists' assessments.
    Abstract Contrast Enhanced Spectral Mammography (CESM) is a dual-energy mammographic imaging technique that first needs intravenously administration of an iodinated contrast medium; then, it collects both a low-energy image, comparable to standard mammography, and a high-energy image. The two scans are then combined to get a recombined image showing contrast enhancement. Despite CESM diagnostic advantages for breast cancer diagnosis, the use of contrast medium can cause side effects, and CESM also beams patients with a higher radiation dose compared to standard mammography. To address these limitations this work proposes to use deep generative models for virtual contrast enhancement on CESM, aiming to make the CESM contrast-free as well as to reduce the radiation dose. Our deep networks, consisting of an autoencoder and two Generative Adversarial Networks, the Pix2Pix, and the CycleGAN, generate synthetic recombined images solely from low-energy images. We perform an extensive quantitative and qualitative analysis of the model's performance, also exploiting radiologists' assessments, on a novel CESM dataset that includes 1138 images that, as a further contribution of this work, we make publicly available. The results show that CycleGAN is the most promising deep network to generate synthetic recombined images, highlighting the potential of artificial intelligence techniques for virtual contrast enhancement in this field.

Center Contrastive Loss for Metric Learning

  • paper_url: http://arxiv.org/abs/2308.00458
  • repo_url: None
  • paper_authors: Bolun Cai, Pengfei Xiong, Shangxuan Tian
  • for: Improves the discriminative power and accuracy of image embeddings in metric learning.
  • methods: Proposes a Center Contrastive Loss that maintains a class-wise center bank and contrasts query data points against the category centers, reducing intra-class variation and enhancing inter-class differences; the center bank is updated in real time, so no elaborate sample mining is required (see the sketch at the end of this entry).
  • results: A standard ResNet50 trained with the proposed loss achieves state-of-the-art performance and faster convergence, as shown in Figure 1 of the paper.
    Abstract Contrastive learning is a major studied topic in metric learning. However, sampling effective contrastive pairs remains a challenge due to factors such as limited batch size, imbalanced data distribution, and the risk of overfitting. In this paper, we propose a novel metric learning function called Center Contrastive Loss, which maintains a class-wise center bank and compares the category centers with the query data points using a contrastive loss. The center bank is updated in real-time to boost model convergence without the need for well-designed sample mining. The category centers are well-optimized classification proxies to re-balance the supervisory signal of each class. Furthermore, the proposed loss combines the advantages of both contrastive and classification methods by reducing intra-class variations and enhancing inter-class differences to improve the discriminative power of embeddings. Our experimental results, as shown in Figure 1, demonstrate that a standard network (ResNet50) trained with our loss achieves state-of-the-art performance and faster convergence.
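A minimal sketch of a class-wise center bank contrasted against query embeddings with a softmax objective and an exponential-moving-average center update; the temperature and update rule are illustrative choices, not necessarily those of the paper.

```python
import torch
import torch.nn.functional as F

class CenterContrast:
    """Class-wise center bank contrasted against query embeddings (illustrative)."""

    def __init__(self, num_classes, dim, tau=0.1, momentum=0.9):
        self.centers = F.normalize(torch.randn(num_classes, dim), dim=1)
        self.tau, self.m = tau, momentum

    def loss(self, queries, labels):
        q = F.normalize(queries, dim=1)
        logits = q @ self.centers.t() / self.tau        # compare each query with every class center
        return F.cross_entropy(logits, labels)

    @torch.no_grad()
    def update(self, queries, labels):
        q = F.normalize(queries, dim=1)
        for c in labels.unique():
            batch_mean = q[labels == c].mean(dim=0)
            self.centers[c] = F.normalize(self.m * self.centers[c] + (1 - self.m) * batch_mean, dim=0)
```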

ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data

  • paper_url: http://arxiv.org/abs/2308.00454
  • repo_url: https://github.com/ruiqirichard/eegeyenet-vit
  • paper_authors: Ruiqi Yang, Eric Modesitt
  • for: Applies a hybrid Vision Transformer (ViT), pretrained on ImageNet, to an electroencephalogram (EEG) regression task.
  • methods: A hybrid ViT pretrained on ImageNet is fine-tuned on EEG data.
  • results: The model shows a notable performance increase over other models, including an identical ViT architecture trained without the ImageNet weights, challenging the traditional understanding of model generalization and suggesting that transformers pretrained on seemingly unrelated image data can provide useful priors for EEG prediction.
    Abstract In this study, we demonstrate the application of a hybrid Vision Transformer (ViT) model, pretrained on ImageNet, on an electroencephalogram (EEG) regression task. Despite being originally trained for image classification tasks, when fine-tuned on EEG data, this model shows a notable increase in performance compared to other models, including an identical architecture ViT trained without the ImageNet weights. This discovery challenges the traditional understanding of model generalization, suggesting that Transformer models pretrained on seemingly unrelated image data can provide valuable priors for EEG regression tasks with an appropriate fine-tuning pipeline. The success of this approach suggests that the features extracted by ViT models in the context of visual tasks can be readily transformed for the purpose of EEG predictive modeling. We recommend utilizing this methodology not only in neuroscience and related fields, but generally for any task where data collection is limited by practical, financial, or ethical constraints. Our results illuminate the potential of pretrained models on tasks that are clearly distinct from their original purpose.

A Majority Invariant Approach to Patch Robustness Certification for Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.00452
  • repo_url: https://github.com/kio-cs/majorcert
  • paper_authors: Qilin Zhou, Zhengyuan Wei, Haipeng Wang, W. K. Chan
  • for: Certifies whether any patch within a given bound on a sample can manipulate a deep learning model into predicting a different label.
  • methods: MajorCert first finds all label sets manipulatable by the same patch region on the same sample across the underlying classifiers, then enumerates their combinations element-wise, and finally checks whether the majority vote over all these combinations remains invariant (see the sketch at the end of this entry).
  • results: The majority-invariant check allows certifying samples that existing techniques cannot certify because they fail to meet strict bars at the classifier or patch-region level.
    Abstract Patch robustness certification ensures no patch within a given bound on a sample can manipulate a deep learning model to predict a different label. However, existing techniques cannot certify samples that cannot meet their strict bars at the classifier or patch region levels. This paper proposes MajorCert. MajorCert firstly finds all possible label sets manipulatable by the same patch region on the same sample across the underlying classifiers, then enumerates their combinations element-wise, and finally checks whether the majority invariant of all these combinations is intact to certify samples.
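A minimal sketch of the certification check described above: given, for each underlying classifier, the set of labels a patch in the fixed region could induce on the sample, enumerate the element-wise combinations and verify that every combination still yields the correct label by strict majority vote (the strict-majority criterion here is an illustrative simplification).

```python
from collections import Counter
from itertools import product

def majority_invariant(label_sets, true_label):
    """Certify a sample if every combination of manipulable labels keeps the majority intact.

    label_sets: one set of achievable labels per underlying classifier for the same patch region
    """
    for combo in product(*label_sets):                    # element-wise combinations
        winner, count = Counter(combo).most_common(1)[0]
        if winner != true_label or count <= len(combo) // 2:
            return False                                  # some patch could flip the decision
    return True

# example: three classifiers, a patch can flip only the second one to label 1
print(majority_invariant([{0}, {0, 1}, {0}], true_label=0))   # True
```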

Physics-Driven Spectrum-Consistent Federated Learning for Palmprint Verification

  • paper_url: http://arxiv.org/abs/2308.00451
  • repo_url: https://github.com/zi-yuanyang/psfed-palm
  • paper_authors: Ziyuan Yang, Andrew Beng Jin Teoh, Bob Zhang, Lu Leng, Yi Zhang
  • for: Proposes a physics-driven, spectrum-consistent federated learning method (PSFed-Palm) to improve the accuracy and privacy of palmprint verification.
  • methods: Clients are first partitioned into short- and long-spectrum groups according to the wavelength range of their local spectrum images. Anchor models for each group constrain the optimization direction of the associated local models, a spectrum-consistent loss aligns model parameters and feature representations with the corresponding anchor model, and local models are further constrained to stay consistent with the global model to prevent model drift (see the sketch at the end of this entry).
  • results: Extensive experiments validate the effectiveness of PSFed-Palm, which performs compellingly despite only a limited amount of training data.
    Abstract Palmprint as biometrics has gained increasing attention recently due to its discriminative ability and robustness. However, existing methods mainly improve palmprint verification within one spectrum, which is challenging to verify across different spectrums. Additionally, in distributed server-client-based deployment, palmprint verification systems predominantly necessitate clients to transmit private data for model training on the centralized server, thereby engendering privacy apprehensions. To alleviate the above issues, in this paper, we propose a physics-driven spectrum-consistent federated learning method for palmprint verification, dubbed as PSFed-Palm. PSFed-Palm draws upon the inherent physical properties of distinct wavelength spectrums, wherein images acquired under similar wavelengths display heightened resemblances. Our approach first partitions clients into short- and long-spectrum groups according to the wavelength range of their local spectrum images. Subsequently, we introduce anchor models for short- and long-spectrum, which constrain the optimization directions of local models associated with long- and short-spectrum images. Specifically, a spectrum-consistent loss that enforces the model parameters and feature representation to align with their corresponding anchor models is designed. Finally, we impose constraints on the local models to ensure their consistency with the global model, effectively preventing model drift. This measure guarantees spectrum consistency while protecting data privacy, as there is no need to share local data. Extensive experiments are conducted to validate the efficacy of our proposed PSFed-Palm approach. The proposed PSFed-Palm demonstrates compelling performance despite only a limited number of training data. The codes will be released at https://github.com/Zi-YuanYang/PSFed-Palm.
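A minimal sketch of the spectrum-consistent regularizer described above: the client's model parameters and its feature representation are pulled toward those of its spectrum anchor model; the weighting coefficients are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def spectrum_consistent_loss(local_model, anchor_model, local_feat, anchor_feat,
                             w_param=1e-3, w_feat=1.0):
    """Align a client model with its short- or long-spectrum anchor model."""
    param_term = sum(((p - a.detach()) ** 2).sum()
                     for p, a in zip(local_model.parameters(), anchor_model.parameters()))
    feat_term = F.mse_loss(local_feat, anchor_feat.detach())
    return w_param * param_term + w_feat * feat_term
```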

FLatten Transformer: Vision Transformer using Focused Linear Attention

  • paper_url: http://arxiv.org/abs/2308.00442
  • repo_url: https://github.com/leaplabthu/flatten-transformer
  • paper_authors: Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang
  • for: 提高视觉任务中 Transformer 模型的效率和表达力
  • methods: 提出一种新的 Focused Linear Attention 模块:通过从聚焦能力和特征多样性两个角度分析 Linear Attention 性能下降的原因,提出一种简单而有效的 mapping function,以及一种高效的秩恢复模块,在不增加计算复杂度的情况下提高 Linear Attention 的表达力
  • results: 在多个 benchmark 上实现了高效和表达力的 Linear Attention 模型,并且可以应用于多种进阶的视觉 Transformer 模型
    Abstract The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear complexity by approximating the Softmax operation through carefully designed mapping functions. However, current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead from the mapping functions. In this paper, we propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness. Specifically, we first analyze the factors contributing to the performance degradation of linear attention from two perspectives: the focus ability and feature diversity. To overcome these limitations, we introduce a simple yet effective mapping function and an efficient rank restoration module to enhance the expressiveness of self-attention while maintaining low computation complexity. Extensive experiments show that our linear attention module is applicable to a variety of advanced vision Transformers, and achieves consistently improved performances on multiple benchmarks. Code is available at https://github.com/LeapLabTHU/FLatten-Transformer.
    摘要 自注意力的二次计算复杂度一直是将 Transformer 模型应用于视觉任务的持续挑战。相比之下,线性注意力通过精心设计的映射函数近似 Softmax 运算,以线性复杂度提供了高效得多的替代方案。然而,现有的线性注意力方法要么性能显著下降,要么因映射函数引入额外计算开销。本文提出一种新的 Focused Linear Attention 模块,以同时实现高效率与强表达力。具体来说,我们首先从聚焦能力和特征多样性两个角度分析线性注意力性能下降的原因;为克服这些限制,我们引入一种简单而有效的映射函数和一个高效的秩恢复模块,在保持低计算复杂度的同时增强自注意力的表达力。大量实验表明,该线性注意力模块可应用于多种先进的视觉 Transformer,并在多个基准上取得持续提升。代码见 https://github.com/LeapLabTHU/FLatten-Transformer。
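
As a quick illustration of the linear-attention idea this entry builds on, here is a minimal sketch, not the paper's actual module: a plain ReLU feature map stands in for the proposed focused mapping function, the rank restoration module is omitted, and tensor shapes are assumptions.

```python
import torch

def linear_attention_sketch(q, k, v, eps=1e-6):
    """Toy linear attention: O(N) in token count via the associativity trick.

    q, k, v: (batch, heads, tokens, dim). ReLU is a placeholder kernel map;
    the paper's focused mapping and rank restoration are not reproduced.
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                    # K^T V, computed once
    norm = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, norm)
```

Because Q(K^T V) replaces (QK^T)V, the cost grows linearly with the number of tokens instead of quadratically.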

Multiscale Global and Regional Feature Learning Using Co-Tuplet Loss for Offline Handwritten Signature Verification

  • paper_url: http://arxiv.org/abs/2308.00428
  • repo_url: https://github.com/ashleyfhh/hansig
  • paper_authors: Fu-Hsien Huang, Hsin-Min Lu
  • for: 手写签名验证方法的开发,尤其是面对法律和金融机构的认可。
  • methods: 我们提出了一种多尺度全局与区域特征学习网络(MGRNet),并结合新的 co-tuplet 度量学习损失函数,以实现手写签名验证系统的自动化。MGRNet 可以同时捕捉签名笔画的整体信息和真伪签名之间的细微局部差异,从而提高验证精度。
  • results: 我们在四个不同语言的基准数据集上进行了实验,结果表明我们的方法与现有最先进方法相比表现出色。
    Abstract Handwritten signature verification is a significant biometric verification method widely acknowledged by legal and financial institutions. However, the development of automatic signature verification systems poses challenges due to inter-writer similarity, intra-writer variations, and the limited number of signature samples. To address these challenges, we propose a multiscale global and regional feature learning network (MGRNet) with the co-tuplet loss, a new metric learning loss, for offline handwritten signature verification. MGRNet jointly learns global and regional information from various spatial scales and integrates it to generate discriminative features. Consequently, it can capture overall signature stroke information while detecting detailed local differences between genuine and skilled-forged signatures. To enhance the discriminative capability of our network further, we propose the co-tuplet loss, which simultaneously considers multiple positive and negative examples to learn distance metrics. By dealing with inter-writer similarity and intra-writer variations and focusing on informative examples, the co-tuplet loss addresses the limitations of typical metric learning losses. Additionally, we develop HanSig, a large-scale Chinese signature dataset, to facilitate the development of robust systems for this script. The dataset is available at https://github.com/ashleyfhh/HanSig. Experimental results on four benchmark datasets in different languages demonstrate the promising performance of our method in comparison to state-of-the-art approaches.
    摘要 手写签名验证是一种广泛被法律和金融机构承认的生物特征认证方法。然而,自动化签名验证系统的开发面临多种挑战,包括不同书写者之间的相似性、同一书写者内部的变化以及签名样本数量有限。为了解决这些挑战,我们提出了一种多尺度全局与区域特征学习网络(MGRNet),并使用了一种新的度量学习损失函数——co-tuplet损失。MGRNet可以同时学习多个空间尺度上的全局和区域信息,并将其融合以生成判别性特征。因此,它可以捕捉签名笔画的整体信息,同时检测真实签名与高仿签名之间的细微局部差异。为了进一步提高网络的判别能力,我们提出了co-tuplet损失,该损失函数同时考虑多个正例和负例来学习距离度量。通过处理书写者之间的相似性和书写者内部的变化,并聚焦于信息量大的样本,co-tuplet损失克服了传统度量学习损失函数的局限性。此外,我们还构建了一个大规模的中文签名数据集——HanSig,以促进针对该文字的鲁棒系统开发。HanSig数据集可以在GitHub上下载,链接为https://github.com/ashleyfhh/HanSig。在四个不同语言的标准数据集上的实验结果表明,我们的方法与现有最先进方法相比表现出色。
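
The co-tuplet loss considers multiple positives and negatives for each anchor at once. The sketch below is an illustrative multi-positive/multi-negative margin loss in that spirit, not the paper's exact formulation; the shapes and margin value are assumptions.

```python
import torch
import torch.nn.functional as F

def tuplet_style_loss(anchor, positives, negatives, margin=0.5):
    """Illustrative multi-positive / multi-negative metric-learning loss.

    anchor:    (d,)   embedding of one genuine signature
    positives: (P, d) other genuine signatures of the same writer
    negatives: (N, d) skilled forgeries or other writers' signatures
    """
    d_pos = torch.cdist(anchor.unsqueeze(0), positives).squeeze(0)   # (P,)
    d_neg = torch.cdist(anchor.unsqueeze(0), negatives).squeeze(0)   # (N,)
    # Penalize every (positive, negative) pair whose distance gap violates the margin.
    violations = F.relu(d_pos.unsqueeze(1) - d_neg.unsqueeze(0) + margin)  # (P, N)
    return violations.mean()
```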

Learning to Generate Training Datasets for Robust Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.02535
  • repo_url: None
  • paper_authors: Marwane Hariat, Olivier Laurent, Rémi Kazmierczak, Shihao Zhang, Andrei Bursuc, Angela Yao, Gianni Franchi
  • for: 提高 semantic segmentation 技术的Robustness,尤其在安全关键应用中。
  • methods: 利用 label-to-image 生成器与 image-to-label 分割模型之间的协同作用,设计并训练一种新的鲁棒条件生成对抗网络 Robusta,用于生成真实且合理的扰动或异常图像。
  • results: 对 proposed 生成模型进行了深入研究,评估下游分割网络的性能和可靠性,并示出该方法可以在实际干扰和数据分布变化中提高 semantic segmentation 技术的Robustness。
    Abstract Semantic segmentation techniques have shown significant progress in recent years, but their robustness to real-world perturbations and data samples not seen during training remains a challenge, particularly in safety-critical applications. In this paper, we propose a novel approach to improve the robustness of semantic segmentation techniques by leveraging the synergy between label-to-image generators and image-to-label segmentation models. Specifically, we design and train Robusta, a novel robust conditional generative adversarial network to generate realistic and plausible perturbed or outlier images that can be used to train reliable segmentation models. We conduct in-depth studies of the proposed generative model, assess the performance and robustness of the downstream segmentation network, and demonstrate that our approach can significantly enhance the robustness of semantic segmentation techniques in the face of real-world perturbations, distribution shifts, and out-of-distribution samples. Our results suggest that this approach could be valuable in safety-critical applications, where the reliability of semantic segmentation techniques is of utmost importance and comes with a limited computational budget in inference. We will release our code shortly.
    摘要 Semantic segmentation 技术在最近几年内取得了显著的进步,但其对真实世界扰动以及训练时未见过的数据样本的鲁棒性仍然是一个挑战,特别是在安全关键应用中。在这篇论文中,我们提出一种新方法,通过利用标签到图像生成器和图像到标签分割模型之间的协同作用,来提高 semantic segmentation 技术的鲁棒性。我们设计并训练了 Robusta,一种新的鲁棒条件生成对抗网络,以生成真实且合理的扰动或异常图像,用于训练可靠的分割模型。我们对该生成模型进行了深入研究,评估了下游分割网络的性能和鲁棒性,并证明了我们的方法可以在真实世界扰动、分布偏移和分布外样本下显著提高 semantic segmentation 技术的鲁棒性。我们的结果表明,这种方法在推理计算预算有限、且分割可靠性至关重要的安全关键应用中具有价值。我们即将发布代码。

Space Debris: Are Deep Learning-based Image Enhancements part of the Solution?

  • paper_url: http://arxiv.org/abs/2308.00408
  • repo_url: None
  • paper_authors: Michele Jamrozik, Vincent Gaudillière, Mohamed Adel Musallam, Djamila Aouada
  • for: 这个研究旨在解决以单目相机在可见光谱下拍摄太空垃圾影像时最常见的限制与图像伪影问题。
  • methods: 本研究使用深度神经网络(DNN)解决方案,包括一个混合的UNet-ResNet34深度学习架构,并将其预训于ImageNet dataset上。
  • results: 根据视觉比较,本研究所开发的UNet模型能够成功地修正太空影像中的图像退化问题,并且值得进一步研究以降低其计算复杂度。
    Abstract The volume of space debris currently orbiting the Earth is reaching an unsustainable level at an accelerated pace. The detection, tracking, identification, and differentiation between orbit-defined, registered spacecraft, and rogue/inactive space ``objects'', is critical to asset protection. The primary objective of this work is to investigate the validity of Deep Neural Network (DNN) solutions to overcome the limitations and image artefacts most prevalent when captured with monocular cameras in the visible light spectrum. In this work, a hybrid UNet-ResNet34 Deep Learning (DL) architecture pre-trained on the ImageNet dataset, is developed. Image degradations addressed include blurring, exposure issues, poor contrast, and noise. The shortage of space-generated data suitable for supervised DL is also addressed. A visual comparison between the URes34P model developed in this work and the existing state of the art in deep learning image enhancement methods, relevant to images captured in space, is presented. Based upon visual inspection, it is determined that our UNet model is capable of correcting for space-related image degradations and merits further investigation to reduce its computational complexity.
    摘要 目前地球轨道上的太空垃圾数量正以加速的势头逼近不可持续的水平。检测、跟踪、识别并区分轨道上已注册的航天器与失效或不受控的太空"物体",对于资产保护至关重要。本工作的主要目标是检验深度神经网络(DNN)方案能否克服单目相机在可见光谱下采集图像时最常见的限制和图像伪影。在这项工作中,我们开发了一种混合 UNet-ResNet34 深度学习(DL)架构,并在 ImageNet 数据集上进行预训练。所处理的图像退化包括模糊、曝光问题、低对比度和噪声。我们还针对缺乏适合监督深度学习的空间数据这一问题进行了处理。随后,我们将本工作开发的 URes34P 模型与现有针对空间图像的先进深度学习图像增强方法进行了视觉比较。根据视觉检查,我们的 UNet 模型能够修正与空间相关的图像退化,值得进一步研究以降低其计算复杂度。

Metrics to Quantify Global Consistency in Synthetic Medical Images

  • paper_url: http://arxiv.org/abs/2308.00402
  • repo_url: None
  • paper_authors: Daniel Scholz, Benedikt Wiestler, Daniel Rueckert, Martin J. Menten
  • for: 本研究旨在提供一种能够量化生成图像的全局一致性的方法,以便在医学图像处理中进行数据增强或多Modalities图像翻译。
  • methods: 本研究使用了supervised neural networks来预测和比较图像中的特征属性,以量化图像的全局一致性。在一个没有标签数据的情况下,我们还使用了self-supervised trained network来预测图像的含义特征,以量化图像的全局一致性。
  • results: 我们的结果表明,可以使用supervised neural networks来预测图像中的特征属性,以量化图像的全局一致性。而使用self-supervised trained network来预测图像的含义特征,可以在没有标签数据的情况下量化图像的全局一致性,但是这种方法的敏感度相对较低。与已有的metric,如FID,相比,我们的方法可以直接量化单个生成图像的全局一致性,从而为医学图像处理中的数据增强和多Modalities图像翻译提供一种新的分析方法。
    Abstract Image synthesis is increasingly being adopted in medical image processing, for example for data augmentation or inter-modality image translation. In these critical applications, the generated images must fulfill a high standard of biological correctness. A particular requirement for these images is global consistency, i.e an image being overall coherent and structured so that all parts of the image fit together in a realistic and meaningful way. Yet, established image quality metrics do not explicitly quantify this property of synthetic images. In this work, we introduce two metrics that can measure the global consistency of synthetic images on a per-image basis. To measure the global consistency, we presume that a realistic image exhibits consistent properties, e.g., a person's body fat in a whole-body MRI, throughout the depicted object or scene. Hence, we quantify global consistency by predicting and comparing explicit attributes of images on patches using supervised trained neural networks. Next, we adapt this strategy to an unlabeled setting by measuring the similarity of implicit image features predicted by a self-supervised trained network. Our results demonstrate that predicting explicit attributes of synthetic images on patches can distinguish globally consistent from inconsistent images. Implicit representations of images are less sensitive to assess global consistency but are still serviceable when labeled data is unavailable. Compared to established metrics, such as the FID, our method can explicitly measure global consistency on a per-image basis, enabling a dedicated analysis of the biological plausibility of single synthetic images.
    摘要 图像合成技术在医学图像处理中的应用日益普及,例如用于数据增强或跨模态图像翻译。在这些关键应用中,生成的图像必须满足高标准的生物学正确性。其中一个特别的要求是全局一致性(global consistency),即图像整体连贯且结构合理,使图像的各个部分以真实而有意义的方式相互协调。然而,现有的图像质量指标并不能直接量化合成图像的这一属性。在本工作中,我们引入了两种可以在单张图像层面测量合成图像全局一致性的指标。为了测量全局一致性,我们假设一张真实的图像会在所描绘的物体或场景中展现一致的属性,例如全身MRI中的体脂。因此,我们通过使用有监督训练的神经网络在图像块上预测并比较图像的显式属性来量化全局一致性。接着,我们将该策略推广到无标签设置,通过测量自监督训练网络预测的隐式图像特征之间的相似性来衡量全局一致性。我们的结果表明,在图像块上预测合成图像的显式属性可以区分全局一致与不一致的图像;隐式表示对全局一致性的敏感度较低,但在缺乏标注数据时仍然可用。与FID等现有指标相比,我们的方法可以在单张图像层面显式地测量全局一致性,从而支持对单张合成图像生物合理性的专门分析。

VideoPro: A Visual Analytics Approach for Interactive Video Programming

  • paper_url: http://arxiv.org/abs/2308.00401
  • repo_url: None
  • paper_authors: Jianben He, Xingbo Wang, Kam Kwai Wong, Xijie Huang, Changjian Chen, Zixin Chen, Fengjie Wang, Min Zhu, Huamin Qu
  • for: 本研究旨在提供一种可靠的视觉分析方法,帮助在实际视频分析中建立supervised机器学习模型,从而提高模型的性能和可靠性。
  • methods: 本研究使用计算机视觉技术提取视频中的人类理解度的事件,并将这些事件作为标签函数模板进行 Labeling 操作。我们还提出了一种两阶段模板挖掘算法,用于挖掘这些事件的顺序模式,以便更好地支持数据标签。
  • results: 我们通过两个案例研究和专家采访,证明了我们的方法的效率和可靠性。我们的方法可以帮助建立大规模、可靠的视频数据标签,从而提高模型的性能和可靠性。
    Abstract Constructing supervised machine learning models for real-world video analysis require substantial labeled data, which is costly to acquire due to scarce domain expertise and laborious manual inspection. While data programming shows promise in generating labeled data at scale with user-defined labeling functions, the high dimensional and complex temporal information in videos poses additional challenges for effectively composing and evaluating labeling functions. In this paper, we propose VideoPro, a visual analytics approach to support flexible and scalable video data programming for model steering with reduced human effort. We first extract human-understandable events from videos using computer vision techniques and treat them as atomic components of labeling functions. We further propose a two-stage template mining algorithm that characterizes the sequential patterns of these events to serve as labeling function templates for efficient data labeling. The visual interface of VideoPro facilitates multifaceted exploration, examination, and application of the labeling templates, allowing for effective programming of video data at scale. Moreover, users can monitor the impact of programming on model performance and make informed adjustments during the iterative programming process. We demonstrate the efficiency and effectiveness of our approach with two case studies and expert interviews.
    摘要 为真实世界视频分析构建有监督机器学习模型需要大量标注数据,而由于领域专业知识稀缺且人工检查费时费力,这些数据的获取成本很高。虽然数据编程有望通过用户定义的标注函数大规模生成标注数据,但视频中高维且复杂的时间信息给标注函数的有效构建与评估带来了额外挑战。本文提出了VideoPro,一种视觉分析方法,以较少的人工成本支持灵活、可扩展的视频数据编程,用于模型引导。我们首先利用计算机视觉技术从视频中提取人类可理解的事件,并将其作为标注函数的原子组件。我们还提出了一种两阶段模板挖掘算法,刻画这些事件的序列模式,作为标注函数模板以实现高效的数据标注。VideoPro的可视化界面支持对标注模板的多角度探索、检查和应用,从而实现大规模的视频数据编程。此外,用户可以在迭代编程过程中监控编程对模型性能的影响,并做出有依据的调整。我们通过两个案例研究和专家访谈验证了该方法的效率和有效性。
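
Data programming composes user-defined labeling functions over the extracted events. The toy sketch below only conveys that idea: the event names, label set, and majority-vote aggregation are hypothetical, whereas VideoPro mines sequential templates and aggregates votes through its visual interface.

```python
from typing import List

ABSTAIN, NORMAL, SUSPICIOUS = -1, 0, 1

# Each labeling function inspects the human-readable events extracted from a
# video clip and either votes for a label or abstains.
def lf_running_then_fall(events: List[str]) -> int:
    return SUSPICIOUS if ("running" in events and "fall" in events) else ABSTAIN

def lf_only_walking(events: List[str]) -> int:
    return NORMAL if set(events) <= {"walking", "standing"} else ABSTAIN

def weak_label(events: List[str], lfs) -> int:
    votes = [lf(events) for lf in lfs if lf(events) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # simple majority vote

clip_events = ["walking", "running", "fall"]
print(weak_label(clip_events, [lf_running_then_fall, lf_only_walking]))  # -> 1
```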

DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.00398
  • repo_url: https://github.com/opendrivelab/driveadapter
  • paper_authors: Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, Hongyang Li
  • for: 这个研究的目的是探索直接将强大的教师模型(Teacher model)用于规划,让学生模型(Student model)更集中在感知部分。
  • methods: 这个研究使用了 adapter 和 feature alignment 目的函数,以将学生模型(perception)和教师模型(planning)之间的特征进行调整。此外,为了让学生模型更好地学习 teacher model 的需要的输入,还提出了一种动作导向的特征学习方法。
  • results: 研究发现,直接将学生模型学习 teacher model 的规划可以提高驾驶性能,但是需要处理大量的数据和特征调整。此外,为了保持安全性,还需要将专业规则注入到学习过程中。
    Abstract End-to-end autonomous driving aims to build a fully differentiable system that takes raw sensor data as inputs and directly outputs the planned trajectory or control signals of the ego vehicle. State-of-the-art methods usually follow the `Teacher-Student' paradigm. The Teacher model uses privileged information (ground-truth states of surrounding agents and map elements) to learn the driving strategy. The student model only has access to raw sensor data and conducts behavior cloning on the data collected by the teacher model. By eliminating the noise of the perception part during planning learning, state-of-the-art works could achieve better performance with significantly less data compared to those coupled ones. However, under the current Teacher-Student paradigm, the student model still needs to learn a planning head from scratch, which could be challenging due to the redundant and noisy nature of raw sensor inputs and the casual confusion issue of behavior cloning. In this work, we aim to explore the possibility of directly adopting the strong teacher model to conduct planning while letting the student model focus more on the perception part. We find that even equipped with a SOTA perception model, directly letting the student model learn the required inputs of the teacher model leads to poor driving performance, which comes from the large distribution gap between predicted privileged inputs and the ground-truth. To this end, we propose DriveAdapter, which employs adapters with the feature alignment objective function between the student (perception) and teacher (planning) modules. Additionally, since the pure learning-based teacher model itself is imperfect and occasionally breaks safety rules, we propose a method of action-guided feature learning with a mask for those imperfect teacher features to further inject the priors of hand-crafted rules into the learning process.
    摘要 端到端自动驾驶的目标是建立一个完全可微的系统,以原始传感器数据为输入,直接输出自车的规划轨迹或控制信号。最先进的方法通常采用"教师-学生"范式:教师模型使用特权信息(周围智能体和地图元素的真实状态)学习驾驶策略;学生模型只能获取原始传感器数据,并对教师模型收集的数据进行行为克隆。通过在规划学习中消除感知部分的噪声,这类方法能以显著更少的数据取得更好的表现。然而,在当前的教师-学生范式下,学生模型仍需从零开始学习规划头,而原始传感器输入冗余且含噪,加上行为克隆中的因果混淆问题,使这一过程颇具挑战。在本工作中,我们探索直接采用强大的教师模型进行规划、让学生模型更专注于感知部分的可能性。我们发现,即使配备最先进的感知模型,直接让学生模型学习教师模型所需的输入也会导致较差的驾驶性能,原因在于预测的特权输入与真实值之间存在较大的分布差距。为此,我们提出了 DriveAdapter,它在学生(感知)与教师(规划)模块之间引入适配器,并采用特征对齐目标函数。此外,由于纯学习的教师模型本身并不完美,偶尔会违反安全规则,我们还提出了一种动作引导的特征学习方法,对这些不完美的教师特征施加掩码,以进一步将手工规则的先验注入学习过程。
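
A rough sketch of the two ingredients described above, under assumed tensor shapes: an adapter that maps student perception features toward the inputs the frozen teacher planner expects, and a feature-alignment loss whose mask drops positions where the purely learned teacher is judged unreliable by hand-crafted rules. This is not the released DriveAdapter code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Maps student perception features to the teacher planner's input space."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, dim_out), nn.ReLU(),
                                 nn.Linear(dim_out, dim_out))

    def forward(self, x):
        return self.net(x)

def masked_alignment_loss(student_feat, teacher_feat, rule_mask):
    """Feature alignment with an action-guided mask: positions where the
    teacher is judged unreliable (rule_mask == 0) are excluded."""
    diff = (student_feat - teacher_feat) ** 2
    return (diff * rule_mask).sum() / rule_mask.sum().clamp(min=1.0)
```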

On the Generation of a Synthetic Event-Based Vision Dataset for Navigation and Landing

  • paper_url: http://arxiv.org/abs/2308.00394
  • repo_url: https://gitlab.com/europeanspaceagency/trajectory-to-events
  • paper_authors: Loïc J. Azzalini, Emmanuel Blazquez, Alexander Hadjiivanov, Gabriele Meoni, Dario Izzo
  • for: 这篇论文是为了研究Event-based Camera在导航和降落应用中的可能性而写的。
  • methods: 这篇论文使用了Planet和Asteroid Natural Scene Generation Utility生成优化的下降轨迹,并使用了Event-based Camera emulator将图像序列转换为事件流。
  • results: 这篇论文通过生成500条轨迹,包括事件流和运动场数据,成功地构建了一个真实的Event-based视觉数据集。
    Abstract An event-based camera outputs an event whenever a change in scene brightness of a preset magnitude is detected at a particular pixel location in the sensor plane. The resulting sparse and asynchronous output coupled with the high dynamic range and temporal resolution of this novel camera motivate the study of event-based cameras for navigation and landing applications. However, the lack of real-world and synthetic datasets to support this line of research has limited its consideration for onboard use. This paper presents a methodology and a software pipeline for generating event-based vision datasets from optimal landing trajectories during the approach of a target body. We construct sequences of photorealistic images of the lunar surface with the Planet and Asteroid Natural Scene Generation Utility at different viewpoints along a set of optimal descent trajectories obtained by varying the boundary conditions. The generated image sequences are then converted into event streams by means of an event-based camera emulator. We demonstrate that the pipeline can generate realistic event-based representations of surface features by constructing a dataset of 500 trajectories, complete with event streams and motion field ground truth data. We anticipate that novel event-based vision datasets can be generated using this pipeline to support various spacecraft pose reconstruction problems given events as input, and we hope that the proposed methodology would attract the attention of researchers working at the intersection of neuromorphic vision and guidance navigation and control.
    摘要 事件相机会在传感器平面的某个像素位置检测到场景亮度发生超过预设幅度的变化时输出一个事件。这种稀疏且异步的输出,加上这类新型相机的高动态范围和高时间分辨率,使得研究将事件相机用于导航与着陆应用颇具吸引力。然而,由于缺乏支持这类研究的真实世界和合成数据集,其在星载应用中的考虑受到了限制。本文提出了一种方法和软件管道,用于从逼近目标天体时的最优着陆轨迹生成基于事件的视觉数据集。我们使用 Planet and Asteroid Natural Scene Generation Utility,沿着通过改变边界条件得到的一组最优下降轨迹,在不同视点生成逼真的月球表面图像序列,再通过事件相机仿真器将图像序列转换为事件流。我们构建了包含500条轨迹、事件流和运动场真值数据的数据集,证明该管道能够生成表面特征的真实事件表示。我们预计可以使用该管道生成新的基于事件的视觉数据集,以支持以事件为输入的各种航天器位姿重建问题,并希望该方法能吸引神经形态视觉与制导导航控制交叉领域研究人员的关注。
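
The conversion from rendered image sequences to event streams can be illustrated with a naive per-pixel emulator; the actual pipeline uses a dedicated event-camera emulator with finer temporal interpolation and noise models, so the threshold value and data layout here are assumptions.

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.2, eps=1e-6):
    """Naive event-camera emulator: emit an event whenever the log-intensity
    at a pixel has changed by more than `threshold` since its last event.
    Returns a list of (t, x, y, polarity) tuples."""
    ref = np.log(frames[0].astype(np.float64) + eps)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_img = np.log(frame.astype(np.float64) + eps)
        diff = log_img - ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((t, x, y, polarity))
            ref[y, x] = log_img[y, x]   # update the per-pixel reference after firing
    return events
```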

Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning

  • paper_url: http://arxiv.org/abs/2308.02533
  • repo_url: https://github.com/microsoft/robustlearn
  • paper_authors: Kaijie Zhu, Jindong Wang, Xixu Hu, Xing Xie, Ge Yang
  • for: 本研究旨在提高深度神经网络的鲁棒性和泛化能力,同时维持鲁棒性。
  • methods: 本文提出了一种新的方法 called Robustness Critical Fine-Tuning (RiFT),通过利用鲁棒性训练后模型中的剩余容量来提高泛化能力。
  • results: 实验结果表明,RiFT 可以在 ResNet18、ResNet34 和 WideResNet34-10 模型上将泛化能力和分布外鲁棒性提高约 1.5%,同时保持甚至略微提升对抗鲁棒性。Code 可以在 https://github.com/microsoft/robustlearn 下载。
    Abstract Deep neural networks are susceptible to adversarial examples, posing a significant security risk in critical applications. Adversarial Training (AT) is a well-established technique to enhance adversarial robustness, but it often comes at the cost of decreased generalization ability. This paper proposes Robustness Critical Fine-Tuning (RiFT), a novel approach to enhance generalization without compromising adversarial robustness. The core idea of RiFT is to exploit the redundant capacity for robustness by fine-tuning the adversarially trained model on its non-robust-critical module. To do so, we introduce module robust criticality (MRC), a measure that evaluates the significance of a given module to model robustness under worst-case weight perturbations. Using this measure, we identify the module with the lowest MRC value as the non-robust-critical module and fine-tune its weights to obtain fine-tuned weights. Subsequently, we linearly interpolate between the adversarially trained weights and fine-tuned weights to derive the optimal fine-tuned model weights. We demonstrate the efficacy of RiFT on ResNet18, ResNet34, and WideResNet34-10 models trained on CIFAR10, CIFAR100, and Tiny-ImageNet datasets. Our experiments show that \method can significantly improve both generalization and out-of-distribution robustness by around 1.5% while maintaining or even slightly enhancing adversarial robustness. Code is available at https://github.com/microsoft/robustlearn.
    摘要 深度神经网络容易受到攻击性例子的威胁,这对于一些关键应用来说是一个重要的安全隐患。对抗训练(AT)是一种广泛使用的技术来增强对抗性,但是它经常会导致泛化能力下降。这篇论文提出了一种新的方法,即稳定性敏感细化(RiFT),可以增强泛化能力而不需要牺牲对抗性。RiFT的核心思想是利用神经网络对抗训练后的非稳定模块的剩余容量来提高泛化能力。为此,我们引入模块稳定性优先级(MRC),这是一种评估神经网络模块对对抗性的影响的指标。通过这个指标,我们可以确定神经网络中最低MRC值的模块为非稳定模块,并对其权重进行细化。然后,我们使用这些细化后的权重和对抗训练后的权重进行线性插值,以 derivate最佳细化模型权重。我们在ResNet18、ResNet34和WideResNet34-10模型上进行了CIFAR10、CIFAR100和Tiny-ImageNet数据集的实验,结果表明,\method可以提高泛化能力和对抗性的表现,同时保持或甚至提高对抗性。代码可以在https://github.com/microsoft/robustlearn中找到。
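
To make the idea concrete, here is a rough proxy for module robust criticality (MRC) plus the final weight interpolation. The single-batch setting, the sign-gradient perturbation used as the "worst case", and module selection by name prefix are all assumptions; the paper's exact MRC definition and fine-tuning procedure differ.

```python
import copy
import torch

def module_robust_criticality_proxy(model, module_name, loss_fn, batch, eps=0.05):
    """Perturb one module's weights along the gradient sign (an approximate
    worst case within an eps-ball) and report how much the loss grows.
    A small increase suggests a non-robust-critical module."""
    model.zero_grad()
    x, y = batch
    loss_fn(model(x), y).backward()
    perturbed = copy.deepcopy(model)
    for (name, p_src), (_, p_dst) in zip(model.named_parameters(),
                                         perturbed.named_parameters()):
        if name.startswith(module_name) and p_src.grad is not None:
            p_dst.data += eps * p_src.grad.sign()
    with torch.no_grad():
        base = loss_fn(model(x), y).item()
        pert = loss_fn(perturbed(x), y).item()
    return pert - base

def interpolate_weights(adv_model, finetuned_model, alpha=0.5):
    """Linear interpolation between adversarially trained and fine-tuned weights."""
    merged = copy.deepcopy(adv_model)
    for p_m, p_a, p_f in zip(merged.parameters(), adv_model.parameters(),
                             finetuned_model.parameters()):
        p_m.data = (1 - alpha) * p_a.data + alpha * p_f.data
    return merged
```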

Deep Image Harmonization with Learnable Augmentation

  • paper_url: http://arxiv.org/abs/2308.00376
  • repo_url: https://github.com/bcmi/syconet-adaptive-image-harmonization
  • paper_authors: Li Niu, Junyan Cao, Wenyan Cong, Liqing Zhang
  • for: 用于调整图像composite中的前景外观,使整个图像具有和谐性。
  • methods: 使用learnable augmentation技术,通过学习Color transformation来生成更多的合理的synthetic composite image,以提高图像和谐化性的性能。
  • results: 对小规模dataset进行了广泛的实验,并达到了更好的和谐化性性能。code可以在https://github.com/bcmi/SycoNet-Adaptive-Image-Harmonization中下载。
    Abstract The goal of image harmonization is adjusting the foreground appearance in a composite image to make the whole image harmonious. To construct paired training images, existing datasets adopt different ways to adjust the illumination statistics of foregrounds of real images to produce synthetic composite images. However, different datasets have considerable domain gap and the performances on small-scale datasets are limited by insufficient training data. In this work, we explore learnable augmentation to enrich the illumination diversity of small-scale datasets for better harmonization performance. In particular, our designed SYthetic COmposite Network (SycoNet) takes in a real image with foreground mask and a random vector to learn suitable color transformation, which is applied to the foreground of this real image to produce a synthetic composite image. Comprehensive experiments demonstrate the effectiveness of our proposed learnable augmentation for image harmonization. The code of SycoNet is released at https://github.com/bcmi/SycoNet-Adaptive-Image-Harmonization.
    摘要 图像协调(image harmonization)的目标是调整合成图像中前景的外观,使整个图像看起来和谐。为了构建成对的训练图像,现有的数据集采用不同的方法来调整真实图像前景的光照统计信息,从而生成合成的组合图像。然而,不同数据集之间存在较大的领域差异,而小规模数据集的性能也受限于训练数据不足。在本工作中,我们探索可学习的数据增强方法,以丰富小规模数据集的光照多样性,从而提高图像协调性能。具体来说,我们设计的 Synthetic COmposite Network(SycoNet)接受一张带前景掩码的真实图像和一个随机向量,学习合适的颜色变换,并将该变换应用于这张真实图像的前景,以生成合成的组合图像。大量实验表明,我们提出的可学习增强方法能够有效提升图像协调性能。SycoNet 的代码发布在 https://github.com/bcmi/SycoNet-Adaptive-Image-Harmonization。
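
A minimal sketch of the learnable-augmentation idea: a small network predicts a per-channel affine colour transform from the real image, its foreground mask, and a random vector, and applies it to the foreground only. The architecture and the affine parameterization are assumptions; SycoNet's actual design is richer.

```python
import torch
import torch.nn as nn

class ColorTransformNet(nn.Module):
    """Predicts a per-channel affine colour transform and applies it to the
    foreground, producing a synthetic composite for harmonization training."""
    def __init__(self, z_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64 + z_dim, 6)   # scale + shift per RGB channel

    def forward(self, image, mask, z):
        feat = self.encoder(torch.cat([image, mask], dim=1))
        params = self.head(torch.cat([feat, z], dim=1))
        scale = 1.0 + params[:, :3].view(-1, 3, 1, 1)
        shift = params[:, 3:].view(-1, 3, 1, 1)
        foreground = (image * scale + shift).clamp(0, 1)
        return mask * foreground + (1 - mask) * image   # transformed fg + original bg
```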

MRQ:Support Multiple Quantization Schemes through Model Re-Quantization

  • paper_url: http://arxiv.org/abs/2308.01867
  • repo_url: None
  • paper_authors: Manasa Manohara, Sankalp Dayal, Tariq Afzal, Rahul Bakshi, Kahkuen Fu
  • for: 这篇论文旨在解决深度学习模型在边缘设备上部署的问题,尤其是在使用定点(fixed-point)硬件时。
  • methods: 这篇论文使用了一种新的模型量化方法,称为MRQ(模型重量化),它可以将现有的量化模型转换为适合不同量化需求的模型。
  • results: 论文表明了一个MobileNetV2 QAT模型可以从现有的量化模型中快速地重新量化为不同的量化需求,并且可以在NNA上部署到Echo Show设备中。
    Abstract Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging due to complex model quantization and conversion. Existing model quantization frameworks like Tensorflow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] supports only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep learning models cannot be easily quantized for diverse fixed-point hardwares, mainly due to slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms the models to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and provides support for multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms including weight correction and rounding error folding. We have demonstrated that MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric+power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage this concept of re-quantization for model quantization and models obtained from the re-quantization process have been successfully deployed on NNA in the Echo Show devices.
    摘要 尽管各种硬件加速器(如NPU、TPU、DPU)层出不穷,但在采用定点硬件的边缘设备上部署深度学习模型仍然充满挑战,主要原因在于复杂的模型量化与转换。现有的模型量化框架,如TensorFlow QAT [1]、TFLite PTQ [2] 和 Qualcomm AIMET [3],仅支持有限的量化方案(例如 TF1.x QAT [4] 中只有非对称的逐张量量化)。因此,由于各类定点硬件的量化要求略有差异,深度学习模型难以被轻松地量化以适配这些硬件。在本文中,我们提出了一种新的模型量化方法,称为MRQ(模型再量化),它以现有的量化模型为起点,快速将其转换为满足不同量化要求的模型(例如非对称转对称、非二次幂缩放转二次幂缩放)。再量化比从头量化简单得多,因为它避免了代价高昂的重新训练,并能同时支持多种量化方案。为了最小化再量化误差,我们开发了一组新的再量化算法,包括权重校正和舍入误差折叠。我们证明了MobileNetV2 QAT模型 [7] 可以被快速再量化为两种不同的量化方案(即对称、对称+二次幂缩放),精度损失小于0.64个单位。我们相信本工作是首次将再量化这一概念用于模型量化,并且通过再量化过程得到的模型已成功部署到Echo Show设备的NNA上。
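
To illustrate re-quantization (here: asymmetric per-tensor int8 to a symmetric scheme with a power-of-two scale) without retraining, a minimal numeric sketch follows; the weight-correction and rounding-error-folding algorithms the abstract mentions are omitted, and the variable names are assumptions.

```python
import numpy as np

def requantize_sym_pow2(q_weights, scale, zero_point, num_bits=8):
    """Re-quantize an asymmetric per-tensor int8 tensor into a symmetric
    scheme whose scale is a power of two."""
    w = (q_weights.astype(np.float64) - zero_point) * scale     # dequantize
    qmax = 2 ** (num_bits - 1) - 1
    raw_scale = max(np.max(np.abs(w)) / qmax, 1e-12)
    new_scale = 2.0 ** np.ceil(np.log2(raw_scale))              # power-of-2 scale
    q_new = np.clip(np.round(w / new_scale), -qmax - 1, qmax).astype(np.int8)
    return q_new, new_scale
```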

Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation

  • paper_url: http://arxiv.org/abs/2308.00356
  • repo_url: https://github.com/bcmi/image-harmonization-dataset-ccharmony
  • paper_authors: Li Niu, Linfeng Tan, Xinhao Tao, Junyan Cao, Fengjun Guo, Teng Long, Liqing Zhang
  • for: 将图像融合到一起,使背景和前景光照协调一致。
  • methods: 使用全局信息引导前景特征转换,并将前景-背景关系从真实图像传播到复合图像中。
  • results: 比较前方法具有竞争性,并提供了一个名为ccHarmony的数据集,用于评估图像协调方法。
    Abstract Given a composite image, image harmonization aims to adjust the foreground illumination to be consistent with background. Previous methods have explored transforming foreground features to achieve competitive performance. In this work, we show that using global information to guide foreground feature transformation could achieve significant improvement. Besides, we propose to transfer the foreground-background relation from real images to composite images, which can provide intermediate supervision for the transformed encoder features. Additionally, considering the drawbacks of existing harmonization datasets, we also contribute a ccHarmony dataset which simulates the natural illumination variation. Extensive experiments on iHarmony4 and our contributed dataset demonstrate the superiority of our method. Our ccHarmony dataset is released at https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony.
    摘要 给定一张合成图像,图像协调的目标是调整前景光照,使其与背景保持一致。以往的方法通过变换前景特征来获得有竞争力的性能。在本工作中,我们表明利用全局信息来引导前景特征变换可以带来显著提升。此外,我们提出将前景-背景关系从真实图像迁移到合成图像,从而为变换后的编码器特征提供中间监督。考虑到现有协调数据集的不足,我们还贡献了一个模拟自然光照变化的 ccHarmony 数据集。在 iHarmony4 和我们所贡献数据集上的大量实验证明了方法的优越性。ccHarmony 数据集发布于 https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony。

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

  • paper_url: http://arxiv.org/abs/2308.00353
  • repo_url: None
  • paper_authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi
  • for: 本研究旨在实现开放世界Scene理解,即在未经标注的3D场景中找到和识别未经见过的3D对象类型。
  • methods: 我们提出使用预训练的视力语(VL)基础模型来生成多视图图像的描述文本,从而建立3D形状和 semantic-rich描述文本之间的明确关系。此外,我们还设计了层次点caption相关方法,以学习semantic-aware嵌入,并使用不supervised学习来解决开放世界设置中的局部化挑战。
  • results: 我们在3Dsemantic、instance和panoptic分割任务上进行了广泛的实验,覆盖了室内和室外场景三个 dataset。我们的方法与基线方法相比,显著提高了semantic分割(例如,34.5%$\sim$65.3%)、instance分割(例如,21.8%$\sim$54.0%)和panoptic分割(例如,14.7%$\sim$43.3%)的性能。
    Abstract Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. 14.7%$\sim$43.3%). Code will be available.
    摘要 开放世界实例级场景理解的目标是定位并识别未出现在标注数据中的新类别3D对象。这一任务颇具挑战性,因为模型既要定位新的3D对象,又要推断其语义类别。2D开放世界感知近期取得进展的一个关键因素是互联网上覆盖广泛词汇概念的大规模图像-文本对。然而,由于3D-文本对的稀缺,这一成功很难在3D场景中复制。为解决这一挑战,我们提出利用预训练的视觉-语言(VL)基础模型,它们编码了来自图像-文本对的丰富知识,用于为3D场景的多视图图像生成描述文本。这使得我们可以在3D形状与语义丰富的描述文本之间建立显式关联。此外,为了增强面向物体级分类的细粒度视觉-语义表示学习,我们设计了层次化的点-描述文本关联方法,以学习利用3D点与多视图图像之间几何关系的语义感知嵌入。另外,为解决开放世界设置下新类别的定位挑战,我们开发了去偏实例定位方法,即利用实例级伪监督在无标注数据上训练对象分组模块。这显著提高了实例分组的泛化能力,从而能够准确定位新类别对象。我们在三个数据集上对室内和室外场景的3D语义、实例和全景分割任务进行了广泛实验。我们的方法在语义分割(例如 34.5%~65.3%)、实例分割(例如 21.8%~54.0%)和全景分割(例如 14.7%~43.3%)上都以显著优势超过基线方法。代码将会发布。

Fine-Grained Sports, Yoga, and Dance Postures Recognition: A Benchmark Analysis

  • paper_url: http://arxiv.org/abs/2308.00323
  • repo_url: None
  • paper_authors: Asish Bera, Mita Nasipuri, Ondrej Krejcar, Debotosh Bhattacharjee
  • for: 这个论文是为了解决人体姿态估计问题,具体是在运动、运动和舞蹈(SYD)姿态方面。
  • methods: 这篇论文使用了深度卷积神经网络(CNN)和一种名为patch-based attention(PbA)机制,以提高人体姿态估计的性能。
  • results: 在Yoga-82 dataset上,提出的SYD-Net模型达到了状态监测人体姿态的最佳性能,并在其他 dataset 上也表现出了很好的性能。
    Abstract Human body-pose estimation is a complex problem in computer vision. Recent research interests have been widened specifically on the Sports, Yoga, and Dance (SYD) postures for maintaining health conditions. The SYD pose categories are regarded as a fine-grained image classification task due to the complex movement of body parts. Deep Convolutional Neural Networks (CNNs) have attained significantly improved performance in solving various human body-pose estimation problems. Though decent progress has been achieved in yoga postures recognition using deep learning techniques, fine-grained sports, and dance recognition necessitates ample research attention. However, no benchmark public image dataset with sufficient inter-class and intra-class variations is available yet to address sports and dance postures classification. To solve this limitation, we have proposed two image datasets, one for 102 sport categories and another for 12 dance styles. Two public datasets, Yoga-82 which contains 82 classes and Yoga-107 represents 107 classes are collected for yoga postures. These four SYD datasets are experimented with the proposed deep model, SYD-Net, which integrates a patch-based attention (PbA) mechanism on top of standard backbone CNNs. The PbA module leverages the self-attention mechanism that learns contextual information from a set of uniform and multi-scale patches and emphasizes discriminative features to understand the semantic correlation among patches. Moreover, random erasing data augmentation is applied to improve performance. The proposed SYD-Net has achieved state-of-the-art accuracy on Yoga-82 using five base CNNs. SYD-Net's accuracy on other datasets is remarkable, implying its efficiency. Our Sports-102 and Dance-12 datasets are publicly available at https://sites.google.com/view/syd-net/home.
    摘要 人体姿态估计是计算机视觉中一个复杂的问题。近来的研究兴趣扩展到了体育、瑜伽和舞蹈(SYD)姿态,以帮助维持健康状况。由于身体部位运动复杂,SYD姿态类别被视为一种细粒度图像分类任务。深度卷积神经网络(CNN)在解决各类人体姿态估计问题上取得了显著提升的表现。虽然基于深度学习的瑜伽姿态识别已取得不错的进展,但细粒度的体育和舞蹈姿态识别仍需要大量的研究投入。然而,目前还没有一个具有足够类间与类内差异的公共基准图像数据集来支持体育和舞蹈姿态分类。为弥补这一空缺,我们提出了两个图像数据集:一个包含102个体育类别,另一个包含12种舞蹈风格。针对瑜伽姿态,我们还收集了两个公共数据集:包含82个类别的Yoga-82和包含107个类别的Yoga-107。我们在这四个SYD数据集上对所提出的深度模型SYD-Net进行了实验。SYD-Net在标准骨干CNN之上集成了基于图像块的注意力(PbA)机制,该机制利用自注意力从一组均匀和多尺度的图像块中学习上下文信息,并强调判别性特征,以理解图像块之间的语义关联。此外,我们还应用了随机擦除数据增强来提升性能。SYD-Net基于五种骨干CNN在Yoga-82上达到了最先进的准确率,在其他数据集上的表现同样出色,表明了其有效性。我们的Sports-102和Dance-12数据集已公开,可在 https://sites.google.com/view/syd-net/home 获取。

Zero-Shot Learning by Harnessing Adversarial Samples

  • paper_url: http://arxiv.org/abs/2308.00313
  • repo_url: https://github.com/uqzhichen/haszsl
  • paper_authors: Zhi Chen, Pengfei Zhang, Jingjing Li, Sen Wang, Zi Huang
  • for: 这个研究旨在解决零基础学习(Zero-Shot Learning,ZSL)中的 semantic distortion 问题,提高模型的通用能力。
  • methods: 我们提出了一种基于对抗样本(Adversarial Samples)的 ZSL 方法,通过对抗训练来提高模型的对抗性和可靠性。
  • results: 我们通过了三个知名的零基础学习评估数据集的实验,证明了我们的对抗样本方法在 ZSL 和 Generalized Zero-Shot Learning(GZSL) scenario 中的效果。
    Abstract Zero-Shot Learning (ZSL) aims to recognize unseen classes by generalizing the knowledge, i.e., visual and semantic relationships, obtained from seen classes, where image augmentation techniques are commonly applied to improve the generalization ability of a model. However, this approach can also cause adverse effects on ZSL since the conventional augmentation techniques that solely depend on single-label supervision is not able to maintain semantic information and result in the semantic distortion issue consequently. In other words, image argumentation may falsify the semantic (e.g., attribute) information of an image. To take the advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). HAS advances ZSL through adversarial training which takes into account three crucial aspects: (1) robust generation by enforcing augmentations to be similar to negative classes, while maintaining correct labels, (2) reliable generation by introducing a latent space constraint to avert significant deviations from the original data manifold, and (3) diverse generation by incorporating attribute-based perturbation by adjusting images according to each semantic attribute's localization. Through comprehensive experiments on three prominent zero-shot benchmark datasets, we demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios. Our source code is available at https://github.com/uqzhichen/HASZSL.
    摘要 零样本学习(ZSL)旨在通过泛化从已见类别中获得的知识(即视觉与语义关系)来识别未见类别,其中常使用图像增强技术来提升模型的泛化能力。然而,这种做法也可能给ZSL带来负面影响:仅依赖单标签监督的常规增强技术无法保持语义信息,从而导致语义失真问题。换言之,图像增强可能会篡改图像的语义(如属性)信息。为了在利用图像增强的同时缓解语义失真问题,我们提出了一种利用对抗样本的新型ZSL方法(HAS)。HAS通过对抗训练推进ZSL,并考虑三个关键方面:(1)鲁棒生成,强制增强样本在保持正确标签的同时与负类别相似;(2)可靠生成,引入潜在空间约束,避免样本明显偏离原始数据流形;(3)多样生成,根据每个语义属性的定位对图像进行基于属性的扰动。我们在三个知名的零样本基准数据集上进行了全面实验,验证了对抗样本方法在ZSL和广义零样本学习(GZSL)场景下的有效性。源代码见 https://github.com/uqzhichen/HASZSL。

GradOrth: A Simple yet Efficient Out-of-Distribution Detection with Orthogonal Projection of Gradients

  • paper_url: http://arxiv.org/abs/2308.00310
  • repo_url: None
  • paper_authors: Sima Behpour, Thang Doan, Xin Li, Wenbin He, Liang Gou, Liu Ren
  • For: 这个研究旨在提高机器学习模型在实际应用中的安全部署,通过检测机器学习模型中的外部数据(Out-of-distribution,OOD)。* Methods: 这篇研究提出了一个名为GradOrth的新方法,基于实际数据中重要的特征向量进行OOD检测。具体来说,这个方法 computed the norm of gradient projection on the subspaces considered important for the in-distribution data,以检测数据是否为OOD。* Results: 这个方法可以实现高效的OOD检测,比起目前的现有方法,可以降低false positive rate(FPR)的平均值,具体而言,可以降低FPR95的值高达8%。
    Abstract Detecting out-of-distribution (OOD) data is crucial for ensuring the safe deployment of machine learning models in real-world applications. However, existing OOD detection approaches primarily rely on the feature maps or the full gradient space information to derive OOD scores neglecting the role of most important parameters of the pre-trained network over in-distribution (ID) data. In this study, we propose a novel approach called GradOrth to facilitate OOD detection based on one intriguing observation that the important features to identify OOD data lie in the lower-rank subspace of in-distribution (ID) data. In particular, we identify OOD data by computing the norm of gradient projection on the subspaces considered important for the in-distribution data. A large orthogonal projection value (i.e. a small projection value) indicates the sample as OOD as it captures a weak correlation of the ID data. This simple yet effective method exhibits outstanding performance, showcasing a notable reduction in the average false positive rate at a 95% true positive rate (FPR95) of up to 8% when compared to the current state-of-the-art methods.
    摘要 检测分布外(OOD)数据是机器学习模型在实际应用中安全部署的关键。然而,现有的OOD检测方法主要基于特征图或整个梯度空间信息来计算OOD分数,忽略了预训练网络中对分布内(ID)数据最重要的参数的作用。在这项研究中,我们提出了一种名为GradOrth的新方法,基于一个有趣的观察进行OOD检测:识别OOD数据所需的重要特征位于ID数据的低秩子空间中。具体来说,我们通过计算梯度在被认为对ID数据重要的子空间上的投影范数来识别OOD数据。若正交投影值较大(即在该子空间上的投影值较小),则表明该样本与ID数据的相关性较弱,可判定为OOD。这种简单而有效的方法表现出色,与当前最先进的方法相比,可将95%真阳性率下的平均假阳性率(FPR95)降低最多8%。
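
The scoring rule can be sketched as follows; which layer's gradient is used, the pseudo-label choice, and the `model.fc` head name are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def important_subspace(id_grads, energy=0.99):
    """Columns spanning the dominant subspace of flattened per-sample gradients
    collected on in-distribution data (rows of `id_grads`, shape (N, P))."""
    u, s, _ = torch.linalg.svd(id_grads.t(), full_matrices=False)   # (P, r)
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    k = int((cum < energy).sum().item()) + 1
    return u[:, :k]                                                  # (P, k)

def gradorth_style_score(x, model, basis):
    """Norm of the last-layer gradient projected onto the ID-important
    subspace; a small value suggests the sample is out-of-distribution."""
    logits = model(x.unsqueeze(0))
    pseudo = logits.argmax(dim=1)                 # no true label at test time
    loss = F.cross_entropy(logits, pseudo)
    grad = torch.autograd.grad(loss, model.fc.weight)[0].flatten()
    projection = basis @ (basis.t() @ grad)
    return projection.norm().item()
```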

Domain Adaptation based on Human Feedback for Enhancing Generative Model Denoising Abilities

  • paper_url: http://arxiv.org/abs/2308.00307
  • repo_url: None
  • paper_authors: Hyun-Cheol Park, Sung Ho Kang
  • for: 这个论文的目的是如何使用人类反馈来提高生成模型的质量。
  • methods: 这个论文利用来自另一个域的人类反馈来微调生成模型,以提升其在不同域上的去噪表现。
  • results: 研究人员通过使用人类反馈来微调生成模型,提升了模型在不同域上的去噪表现。In more detail, the paper proposes a method for fine-tuning a generator trained on one domain using human feedback from another domain, in order to enhance the denoising capabilities of the generator in different domains. The method involves training a reward model to predict human feedback, and then using the reward model to fine-tune the generator on the different domain. The approach is shown to be effective in improving the quality of the denoised images.
    Abstract How can we apply human feedback into generative model? As answer of this question, in this paper, we show the method applied on denoising problem and domain adaptation using human feedback. Deep generative models have demonstrated impressive results in image denoising. However, current image denoising models often produce inappropriate results when applied to domains different from the ones they were trained on. If there are `Good' and `Bad' result for unseen data, how to raise up quality of `Bad' result. Most methods use an approach based on generalization of model. However, these methods require target image for training or adapting unseen domain. In this paper, to adapting domain, we deal with non-target image for unseen domain, and improve specific failed image. To address this, we propose a method for fine-tuning inappropriate results generated in a different domain by utilizing human feedback. First, we train a generator to denoise images using only the noisy MNIST digit '0' images. The denoising generator trained on the source domain leads to unintended results when applied to target domain images. To achieve domain adaptation, we construct a noise-image denoising generated image data set and train a reward model predict human feedback. Finally, we fine-tune the generator on the different domain using the reward model with auxiliary loss function, aiming to transfer denoising capabilities to target domain. Our approach demonstrates the potential to efficiently fine-tune a generator trained on one domain using human feedback from another domain, thereby enhancing denoising abilities in different domains.
    摘要 我们如何将人类反馈应用到生成模型中?作为对这一问题的回答,本文展示了利用人类反馈解决去噪问题和域适应的方法。深度生成模型在图像去噪方面表现出色,然而现有的图像去噪模型在应用于与训练域不同的域时往往产生不恰当的结果。如果对未见数据存在"好"与"坏"两类结果,如何提升"坏"结果的质量?大多数方法依赖于模型的泛化能力,但这些方法需要目标图像来训练或适应未见域。在本文中,为了实现域适应,我们在不使用目标域图像的情况下处理未见域,并改进具体的失败样例。为此,我们提出一种利用人类反馈对另一域中生成的不恰当结果进行微调的方法。首先,我们仅使用带噪的MNIST数字"0"图像训练一个去噪生成器;在源域上训练的去噪生成器应用于目标域图像时会产生非预期的结果。为了实现域适应,我们构建了一个由噪声图像及其去噪生成结果组成的数据集,并训练一个奖励模型来预测人类反馈。最后,我们利用该奖励模型并结合辅助损失函数,在不同域上微调生成器,旨在将去噪能力迁移到目标域。我们的方法表明,可以利用来自另一域的人类反馈高效地微调在某一域上训练的生成器,从而增强其在不同域上的去噪能力。

Diffusion Model for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2308.00303
  • repo_url: None
  • paper_authors: Zhennan Chen, Rongrong Gao, Tian-Zhu Xiang, Fan Lin
  • for: 这个论文的目的是提出一个基于扩散的隐形物检测方法(diffCOD),用于检测高度相似背景的物品。
  • methods: 这个方法利用去噪扩散模型强大的去噪能力,将伪装物体分割任务视为一个从噪声掩码到物体掩码的去噪扩散过程。具体来说,物体掩码从真实标注逐步扩散到随机分布,而所设计的模型学习逆转这一加噪过程。
  • results: 在四个广泛使用的隐形物检测 benchmark 数据集上进行了广泛的实验,结果显示,提出的方法与现有的 11 种方法相比,尤其在隐形物标识中的细节 texture 检测方面表现出色。
    Abstract Camouflaged object detection is a challenging task that aims to identify objects that are highly similar to their background. Due to the powerful noise-to-image denoising capability of denoising diffusion models, in this paper, we propose a diffusion-based framework for camouflaged object detection, termed diffCOD, a new framework that considers the camouflaged object segmentation task as a denoising diffusion process from noisy masks to object masks. Specifically, the object mask diffuses from the ground-truth masks to a random distribution, and the designed model learns to reverse this noising process. To strengthen the denoising learning, the input image prior is encoded and integrated into the denoising diffusion model to guide the diffusion process. Furthermore, we design an injection attention module (IAM) to interact conditional semantic features extracted from the image with the diffusion noise embedding via the cross-attention mechanism to enhance denoising learning. Extensive experiments on four widely used COD benchmark datasets demonstrate that the proposed method achieves favorable performance compared to the existing 11 state-of-the-art methods, especially in the detailed texture segmentation of camouflaged objects. Our code will be made publicly available at: https://github.com/ZNan-Chen/diffCOD.
    摘要 伪装物体检测是一项具有挑战性的任务,其目标是识别与背景高度相似的物体。鉴于去噪扩散模型强大的从噪声到图像的去噪能力,本文提出一种基于扩散的伪装物体检测框架,称为diffCOD。该框架将伪装物体分割任务视为一个从噪声掩码到物体掩码的去噪扩散过程:物体掩码从真实标注逐步扩散到随机分布,所设计的模型学习逆转这一加噪过程。为了强化去噪学习,我们将输入图像先验编码并融入去噪扩散模型,以引导扩散过程。此外,我们设计了注入注意力模块(IAM),通过交叉注意力机制使从图像中提取的条件语义特征与扩散噪声嵌入进行交互,从而增强去噪学习。在四个广泛使用的COD基准数据集上的大量实验表明,所提方法相比现有的11种最先进方法取得了具有竞争力的表现,尤其是在伪装物体的细节纹理分割方面。我们的代码将公开于:https://github.com/ZNan-Chen/diffCOD。
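
A single DDPM-style training step conveys how the mask can be treated as the diffusion target while the image conditions the denoiser. The `model(noisy_mask, t, image_feat)` signature, the noise schedule, and the conditioning pathway are assumptions, and the injection attention module is not shown.

```python
import torch
import torch.nn.functional as F

def diffusion_mask_training_step(model, image_feat, mask, alphas_cumprod):
    """One denoising-diffusion step: noise the ground-truth object mask to a
    random timestep and train the model to predict that noise, conditioned on
    features of the camouflaged-object image."""
    b = mask.size(0)
    t = torch.randint(0, alphas_cumprod.numel(), (b,), device=mask.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(mask)
    noisy_mask = a_bar.sqrt() * mask + (1 - a_bar).sqrt() * noise
    pred_noise = model(noisy_mask, t, image_feat)    # conditional denoiser (assumed API)
    return F.mse_loss(pred_noise, noise)
```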

Online Prototype Learning for Online Continual Learning

  • paper_url: http://arxiv.org/abs/2308.00301
  • repo_url: https://github.com/weilllllls/onpro
  • paper_authors: Yujie Wei, Jiaxin Ye, Zhizhong Huang, Junping Zhang, Hongming Shan
  • for: 本研究探讨了在单过滤流量中不断学习的问题,以适应新数据并减轻悬崖式忘却。
  • methods: 本研究采用存储一小部分旧数据的重放方法,并提出了一种新的自适应原型反馈机制,以解决在线学习模型因捷径学习而泛化不佳的问题。
  • results: 实验结果表明, comparing with现状态 искусственный智能方法,本研究的方法在广泛使用的标准 benchmark 数据集上达到了更高的性能。
    Abstract Online continual learning (CL) studies the problem of learning continuously from a single-pass data stream while adapting to new data and mitigating catastrophic forgetting. Recently, by storing a small subset of old data, replay-based methods have shown promising performance. Unlike previous methods that focus on sample storage or knowledge distillation against catastrophic forgetting, this paper aims to understand why the online learning models fail to generalize well from a new perspective of shortcut learning. We identify shortcut learning as the key limiting factor for online CL, where the learned features may be biased, not generalizable to new tasks, and may have an adverse impact on knowledge distillation. To tackle this issue, we present the online prototype learning (OnPro) framework for online CL. First, we propose online prototype equilibrium to learn representative features against shortcut learning and discriminative features to avoid class confusion, ultimately achieving an equilibrium status that separates all seen classes well while learning new classes. Second, with the feedback of online prototypes, we devise a novel adaptive prototypical feedback mechanism to sense the classes that are easily misclassified and then enhance their boundaries. Extensive experimental results on widely-used benchmark datasets demonstrate the superior performance of OnPro over the state-of-the-art baseline methods. Source code is available at https://github.com/weilllllls/OnPro.
    摘要 在线持续学习(CL)研究如何从单次遍历的数据流中不断学习,同时适应新数据并缓解灾难性遗忘。最近,通过存储一小部分旧数据的重放方法展现出了可观的性能。不同于以往聚焦于样本存储或用知识蒸馏对抗灾难性遗忘的方法,本文从捷径学习这一新视角出发,探究在线学习模型为何难以很好地泛化。我们发现捷径学习是在线CL的关键限制因素:所学特征可能存在偏差、难以泛化到新任务,并可能对知识蒸馏产生不利影响。为了解决这个问题,我们提出在线原型学习(OnPro)框架。首先,我们提出在线原型均衡,以学习对抗捷径学习的代表性特征和避免类别混淆的判别性特征,最终达到在学习新类别的同时将所有已见类别良好分离的均衡状态。其次,借助在线原型的反馈,我们设计了一种新的自适应原型反馈机制,以感知容易被误分类的类别并强化其分类边界。在广泛使用的基准数据集上的大量实验表明,OnPro的性能显著优于最先进的基线方法。源代码可在 https://github.com/weilllllls/OnPro 获取。
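
A bare-bones sketch of keeping per-class prototypes online and classifying by cosine similarity to them; OnPro's prototype-equilibrium loss and adaptive prototypical feedback are not reproduced, and the momentum update is an assumption.

```python
import torch
import torch.nn.functional as F

class OnlinePrototypes:
    """Maintain one running prototype per seen class and classify a feature by
    its cosine similarity to the prototypes, updated on the single-pass stream."""
    def __init__(self, momentum=0.9):
        self.protos = {}          # class id -> (feat_dim,) tensor
        self.m = momentum

    def update(self, feats, labels):
        for f, y in zip(feats, labels.tolist()):
            f = F.normalize(f, dim=0).detach()
            if y not in self.protos:
                self.protos[y] = f.clone()
            else:
                self.protos[y] = self.m * self.protos[y] + (1 - self.m) * f

    def logits(self, feats):
        classes = sorted(self.protos)
        proto = torch.stack([F.normalize(self.protos[c], dim=0) for c in classes])
        feats = F.normalize(feats, dim=1)
        return feats @ proto.t(), classes     # cosine-similarity logits, class order
```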

Fundus-Enhanced Disease-Aware Distillation Model for Retinal Disease Classification from OCT Images

  • paper_url: http://arxiv.org/abs/2308.00291
  • repo_url: https://github.com/xmed-lab/fddm
  • paper_authors: Lehan Wang, Weihang Dai, Mei Jin, Chubin Ou, Xiaomeng Li
  • for: 本研究旨在提出一种基于多模态学习的眼病识别方法,以提高现有的单模态学习方法的效果。
  • methods: 我们提出了一种眼底增强的疾病感知蒸馏模型,通过在训练过程中使用非配对的眼底图像来增强OCT模型的表现。我们还提出了类原型匹配和类相似性对齐方法,以便在不同模态之间传递疾病相关信息。
  • results: 实验表明,我们的方法优于单模态、多模态以及现有的蒸馏方法,在视网膜疾病分类中获得更高的准确率。
    Abstract Optical Coherence Tomography (OCT) is a novel and effective screening tool for ophthalmic examination. Since collecting OCT images is relatively more expensive than fundus photographs, existing methods use multi-modal learning to complement limited OCT data with additional context from fundus images. However, the multi-modal framework requires eye-paired datasets of both modalities, which is impractical for clinical use. To address this problem, we propose a novel fundus-enhanced disease-aware distillation model (FDDM), for retinal disease classification from OCT images. Our framework enhances the OCT model during training by utilizing unpaired fundus images and does not require the use of fundus images during testing, which greatly improves the practicality and efficiency of our method for clinical use. Specifically, we propose a novel class prototype matching to distill disease-related information from the fundus model to the OCT model and a novel class similarity alignment to enforce consistency between disease distribution of both modalities. Experimental results show that our proposed approach outperforms single-modal, multi-modal, and state-of-the-art distillation methods for retinal disease classification. Code is available at https://github.com/xmed-lab/FDDM.
    摘要 光学相干断层扫描(OCT)是一种新颖且有效的眼科筛查工具。由于采集OCT图像的成本高于眼底照片,现有方法使用多模态学习,以眼底图像提供的额外信息来补充有限的OCT数据。然而,多模态框架需要两种模态成对的眼部数据集,这在临床应用中并不现实。为解决这一问题,我们提出了一种新的眼底增强疾病感知蒸馏模型(FDDM),用于从OCT图像中进行视网膜疾病分类。我们的框架在训练时利用非配对的眼底图像来增强OCT模型,而在测试时无需使用眼底图像,这大大提高了方法在临床应用中的实用性和效率。具体来说,我们提出了一种新的类原型匹配方法,将疾病相关信息从眼底模型蒸馏到OCT模型,以及一种新的类相似性对齐方法,以约束两种模态之间疾病分布的一致性。实验结果表明,我们提出的方法在视网膜疾病分类上优于单模态、多模态以及最先进的蒸馏方法。代码可在 https://github.com/xmed-lab/FDDM 获取。
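
One way to picture the class-prototype matching described above is a simple loss that pulls OCT features toward pre-computed per-disease prototypes from the fundus teacher. The prototype construction and the companion class-similarity alignment term follow the paper only loosely, so treat this as an assumed sketch rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def prototype_matching_loss(oct_feats, labels, fundus_protos):
    """Pull each OCT feature toward the disease prototype distilled from the
    fundus teacher for its class.

    oct_feats:     (B, d) features from the OCT student
    labels:        (B,)   disease labels
    fundus_protos: (C, d) one pre-computed prototype per disease class
    """
    target = fundus_protos[labels]                              # (B, d)
    return 1.0 - F.cosine_similarity(oct_feats, target, dim=1).mean()
```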

A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.00287
  • repo_url: None
  • paper_authors: Minghao Chen, Zepeng Gao, Shuai Zhao, Qibo Qiu, Wenxiao Wang, Binbin Lin, Xiaofei He
  • for: 本研究旨在无需目标验证数据的情况下,发展一个无监督领域适应(Unsupervised Domain Adaptation,UDA)评估指标,以评估将模型转移到目标领域时的品质。
  • methods: 本研究使用的方法包括基于对应信息的评估指标,以及一个新的多层感知(MLP)分类器,并与数据增强技术相结合,实现一个名为增强对应稳定度(Augmentation Consistency Metric,ACM)的新的UDA评估指标。
  • results: 本研究透过实验证明了先前的实验设定存在缺陷,并通过大规模实验验证了我们提出的评估指标的有效性。此外,我们还使用了我们的评估指标自动搜寻最佳参数集,在四个常用的参数集上实现了超越手动参数集的性能。
    Abstract Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels. However, these methods necessitate a labeled target validation set for hyper-parameter tuning and model selection. In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels. We begin with the metric based on mutual information of the model prediction. Through empirical analysis, we identify three prevalent issues with this metric: 1) It does not account for the source structure. 2) It can be easily attacked. 3) It fails to detect negative transfer caused by the over-alignment of source and target features. To address the first two issues, we incorporate source accuracy into the metric and employ a new MLP classifier that is held out during training, significantly improving the result. To tackle the final issue, we integrate this enhanced metric with data augmentation, resulting in a novel unsupervised UDA metric called the Augmentation Consistency Metric (ACM). Additionally, we empirically demonstrate the shortcomings of previous experiment settings and conduct large-scale experiments to validate the effectiveness of our proposed metric. Furthermore, we employ our metric to automatically search for the optimal hyper-parameter set, achieving superior performance compared to manually tuned sets across four common benchmarks. Codes will be available soon.
    摘要 无监督领域适应(UDA)方法可以在没有标签的情况下将模型迁移到目标域。然而,这些方法通常需要一个带标签的目标验证集来进行超参数调整和模型选择。在这篇论文中,我们的目标是找到一个无需目标验证标签即可评估迁移模型质量的评估指标。我们从基于模型预测互信息的指标出发,通过实验分析发现了该指标的三个常见问题:1)它不考虑源域结构;2)它容易被攻击;3)它无法检测由源域与目标域特征过度对齐导致的负迁移。为了解决前两个问题,我们将源域准确率纳入该指标,并引入一个在训练期间保留不用的新的多层感知机(MLP)分类器,显著改善了结果。为了解决最后一个问题,我们将这个增强后的指标与数据增强相结合,得到了一种新的无监督UDA度量,称为增强一致性度量(ACM)。此外,我们通过实验揭示了先前实验设置的缺陷,并进行了大规模实验来验证所提指标的有效性。我们还利用该指标自动搜索最优超参数组合,在四个常见基准上取得了超越手动调参的性能。代码即将发布。
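
A toy version of an unsupervised model-selection score in the spirit of ACM: agreement between predictions on clean and augmented target images, combined with source accuracy. The exact ACM definition, the held-out MLP classifier, and this particular way of combining the two terms are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def augmentation_consistency_score(model, target_images, augment, src_images, src_labels):
    """Unsupervised model-selection score: prediction agreement between a target
    image and its augmented view, weighted by source accuracy. Higher is better;
    no target labels are needed."""
    p_clean = F.softmax(model(target_images), dim=1)
    p_aug = F.softmax(model(augment(target_images)), dim=1)
    consistency = (p_clean.argmax(1) == p_aug.argmax(1)).float().mean()
    src_acc = (model(src_images).argmax(1) == src_labels).float().mean()
    return (consistency * src_acc).item()
```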

Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction

  • paper_url: http://arxiv.org/abs/2308.00279
  • repo_url: https://github.com/woriazzc/robust-pu
  • paper_authors: Zhangchi Zhu, Lu Wang, Pu Zhao, Chao Du, Wei Zhang, Hang Dong, Bo Qiao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
  • For: The paper focuses on developing a robust positive-unlabeled (PU) learning method to improve the accuracy and stability of learning with positive and unlabeled data.
  • Methods: The proposed method utilizes a novel "hardness" measure to distinguish unlabeled samples with a high chance of being negative from unlabeled samples with large label noise. An iterative training strategy then fine-tunes the selection of negative samples during training so that more "easy" samples are included in the early stage.
  • Results: Extensive experimental validation over a wide range of learning tasks shows that the proposed approach effectively improves the accuracy and stability of learning with positive and unlabeled data.
  • for: 本文针对正样本与无标签样本的学习问题提出了一种鲁棒的正-无标签(PU)学习方法,以提高学习的精度和稳定性。
  • methods: 所提方法利用一种新的"困难度"度量,区分无标签样本中大概率为负的样本与标签噪声较大的样本;随后采用迭代训练策略,在训练过程中逐步调整负样本的选择,使训练早期包含更多"容易"的样本。
  • results: 在多种学习任务上的大量实验表明,该方法能有效提高正-无标签数据学习的精度和稳定性。
    Abstract Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in literature and has attracted much attention in recent years. One common approach in PU learning is to sample a set of pseudo-negatives from the unlabeled data using ad-hoc thresholds so that conventional supervised methods can be applied with both positive and negative samples. Owing to the label uncertainty among the unlabeled data, errors of misclassifying unlabeled positive samples as negative samples inevitably appear and may even accumulate during the training processes. Those errors often lead to performance degradation and model instability. To mitigate the impact of label uncertainty and improve the robustness of learning with positive and unlabeled data, we propose a new robust PU learning method with a training strategy motivated by the nature of human learning: easy cases should be learned first. Similar intuition has been utilized in curriculum learning to only use easier cases in the early stage of training before introducing more complex cases. Specifically, we utilize a novel ``hardness'' measure to distinguish unlabeled samples with a high chance of being negative from unlabeled samples with large label noise. An iterative training strategy is then implemented to fine-tune the selection of negative samples during the training process in an iterative manner to include more ``easy'' samples in the early stage of training. Extensive experimental validations over a wide range of learning tasks show that this approach can effectively improve the accuracy and stability of learning with positive and unlabeled data. Our code is available at https://github.com/woriazzc/Robust-PU
    摘要 从正样本和无标签数据中学习在文献中被称为正-无标签(PU)学习,近年来受到广泛关注。一种常见的 PU 学习方法是利用经验阈值从无标签数据中选取一组伪负样本,从而可以同时使用正负样本应用常规的监督学习方法。由于无标签数据中存在标签不确定性,将无标签的正样本误判为负样本的错误不可避免,并且可能在训练过程中不断累积,导致性能下降和模型不稳定。为了减轻标签不确定性的影响、提高正-无标签数据学习的鲁棒性,我们提出了一种新的鲁棒 PU 学习方法,其训练策略受人类学习规律的启发:先学习容易的样本。类似的思想也被课程学习采用,即在训练初期只使用较容易的样本,之后再引入更复杂的样本。具体而言,我们利用一种新的"困难度"度量来区分大概率为负的无标签样本与标签噪声较大的无标签样本,并通过迭代训练策略在训练过程中逐步细化负样本的选择,使训练早期包含更多"容易"的样本。在多种学习任务上的大量实验验证表明,该方法能有效提高正-无标签数据学习的精度和稳定性。我们的代码可在 https://github.com/woriazzc/Robust-PU 获取。
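The "easy cases first" schedule can be sketched with a simple score-quantile rule. The hardness proxy (the model's current positive score on each unlabeled sample) and the linear schedule below are assumptions made for illustration rather than the paper's exact hardness measure.

```python
import numpy as np

def select_pseudo_negatives(pos_scores_unlabeled, epoch, max_epochs, base_quantile=0.3):
    """Treat the model's positive score as a hardness proxy: early epochs keep
    only the lowest-scoring ("easy") unlabeled samples as pseudo-negatives, and
    the admitted fraction grows linearly toward 1 as training progresses."""
    frac = min(1.0, base_quantile + (1.0 - base_quantile) * epoch / max_epochs)
    threshold = np.quantile(pos_scores_unlabeled, frac)
    return np.where(pos_scores_unlabeled <= threshold)[0]   # indices used as negatives
```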

Benchmarking Ultra-High-Definition Image Reflection Removal

  • paper_url: http://arxiv.org/abs/2308.00265
  • repo_url: https://github.com/liar-zzy/benchmarking-ultra-high-definition-single-image-reflection-removal
  • paper_authors: Zhenyuan Zhang, Zhenbo Song, Kaihao Zhang, Wenhan Luo, Zhaoxin Fan, Jianfeng Lu
  • for: 本研究旨在解决单张图像反射去除(SIRR)问题,特别是针对超高清(UHD)图像。
  • methods: 本研究对六种最先进的 SIRR 方法进行了评估,并详细分析了这些方法在 UHD 图像上的表现。此外,本研究还提出了一种基于 Transformer 架构的反射去除方法(RRFormer),该方法包括三个模块:预处理嵌入模块、自注意力特征提取模块和多尺度空间特征提取模块。
  • results: 实验证明,RRFormer 在非 UHD 数据集和我们提出的 UHDRR 数据集上均达到了领先的表现。此外,本研究还公开了代码和数据集,以便进一步探索 UHD SIRR 领域。
    Abstract Deep learning based methods have achieved significant success in the task of single image reflection removal (SIRR). However, the majority of these methods are focused on High-Definition/Standard-Definition (HD/SD) images, while ignoring higher resolution images such as Ultra-High-Definition (UHD) images. With the increasing prevalence of UHD images captured by modern devices, in this paper, we aim to address the problem of UHD SIRR. Specifically, we first synthesize two large-scale UHD datasets, UHDRR4K and UHDRR8K. The UHDRR4K dataset consists of $2,999$ and $168$ quadruplets of images for training and testing respectively, and the UHDRR8K dataset contains $1,014$ and $105$ quadruplets. To the best of our knowledge, these two datasets are the first largest-scale UHD datasets for SIRR. Then, we conduct a comprehensive evaluation of six state-of-the-art SIRR methods using the proposed datasets. Based on the results, we provide detailed discussions regarding the strengths and limitations of these methods when applied to UHD images. Finally, we present a transformer-based architecture named RRFormer for reflection removal. RRFormer comprises three modules, namely the Prepossessing Embedding Module, Self-attention Feature Extraction Module, and Multi-scale Spatial Feature Extraction Module. These modules extract hypercolumn features, global and partial attention features, and multi-scale spatial features, respectively. To ensure effective training, we utilize three terms in our loss function: pixel loss, feature loss, and adversarial loss. We demonstrate through experimental results that RRFormer achieves state-of-the-art performance on both the non-UHD dataset and our proposed UHDRR datasets. The code and datasets are publicly available at https://github.com/Liar-zzy/Benchmarking-Ultra-High-Definition-Single-Image-Reflection-Removal.
    摘要 深度学习基于方法在单个图像反射去除(SIRR)任务中已经取得了显著的成功。然而,大多数这些方法都是关注高清晰度/标准清晰度(HD/SD)图像,而忽略更高的分辨率图像,如超高清晰度(UHD)图像。随着现代设备拍摄的UHD图像的流行,在这篇论文中,我们想要解决UHD SIRR问题。specifically,我们首先合成了两个大规模UHD datasets,即UHDRR4K和UHDRR8K。UHDRR4K dataset包含2999个和168个图像对用于训练和测试,而UHDRR8K dataset包含1014个和105个图像对。我们知道,这两个dataset是现有最大规模的UHD datasets for SIRR。然后,我们进行了六种state-of-the-art SIRR方法的全面评估,使用我们提出的dataset。基于结果,我们提供了详细的讨论,探讨这些方法在UHD图像上的优缺点。最后,我们提出了一种基于转换器的架构,名为RRFormer,用于反射去除。RRFormer包括三个模块:预处理嵌入模块、自注意特征提取模块和多尺度空间特征提取模块。这些模块分别提取了嵌入特征、全局和部分注意特征以及多尺度空间特征。为确保有效训练,我们使用了三个损失函数:像素损失、特征损失和对抗损失。我们通过实验结果表明,RRFormer在我们提出的UHDRR datasets以及非UHD dataset上达到了状态之最的性能。代码和数据集可以在https://github.com/Liar-zzy/Benchmarking-Ultra-High-Definition-Single-Image-Reflection-Removal上获取。
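The three-term training objective (pixel, feature, adversarial) can be written as a weighted sum; the weights and the generator-side adversarial form below are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def reflection_removal_loss(pred, target, feat_pred, feat_target, disc_logits_fake,
                            w_pix=1.0, w_feat=0.1, w_adv=0.01):
    """Weighted sum of pixel, feature (perceptual) and adversarial terms;
    the weights here are placeholders, not values reported by the paper."""
    pixel = F.l1_loss(pred, target)
    feature = F.l1_loss(feat_pred, feat_target)          # e.g. features of pred vs. target
    adversarial = F.binary_cross_entropy_with_logits(    # generator wants D to say "real"
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return w_pix * pixel + w_feat * feature + w_adv * adversarial
```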

The Algonauts Project 2023 Challenge: UARK-UAlbany Team Solution

  • paper_url: http://arxiv.org/abs/2308.00262
  • repo_url: https://github.com/uark-cviu/algonauts2023
  • paper_authors: Xuan-Bac Nguyen, Xudong Liu, Xin Li, Khoa Luu
  • for: 这个研究是为了参加Algonauts Project 2023 Challenge,目标是使用计算模型预测参与者在观看复杂自然视觉场景时的脑响应。
  • methods: 这个研究使用了一种两步训练方法来构建一个图像基于的脑编码器,包括在所有参与者的数据上进行首先训练,然后对每个参与者进行细化训练,使用不同的损失函数和目标来引入多样性。
  • results: 这个研究的结果是一个由多个独特的编码器组成的ensemble,可以准确预测脑响应。代码可以在https://github.com/uark-cviu/Algonauts2023上获取。
    Abstract This work presents our solutions to the Algonauts Project 2023 Challenge. The primary objective of the challenge revolves around employing computational models to anticipate brain responses captured during participants' observation of intricate natural visual scenes. The goal is to predict brain responses across the entire visual brain, as it is the region where the most reliable responses to images have been observed. We constructed an image-based brain encoder through a two-step training process to tackle this challenge. Initially, we created a pretrained encoder using data from all subjects. Next, we proceeded to fine-tune individual subjects. Each step employed different training strategies, such as different loss functions and objectives, to introduce diversity. Ultimately, our solution constitutes an ensemble of multiple unique encoders. The code is available at https://github.com/uark-cviu/Algonauts2023
    摘要 这个工作介绍了我们对Algonauts Project 2023 Challenge的解决方案。挑战的主要目标是通过计算模型预测参与者观看复杂自然视觉场景时的脑响应。目标是预测整个视觉脑区的脑响应,因为这里是最可靠的图像响应的地方。我们使用了两步训练过程构建了基于图像的脑编码器。首先,我们创建了所有参与者的预训练编码器。然后,我们进行了个性化训练,每个步骤使用了不同的loss函数和目标,以引入多样性。最终,我们的解决方案是一个ensemble的多个唯一编码器。代码可以在https://github.com/uark-cviu/Algonauts2023中找到。

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

  • paper_url: http://arxiv.org/abs/2308.00261
  • repo_url: https://github.com/open-mmlab/mmpretrain
  • paper_authors: Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin
  • for: This paper aims to improve the performance of Masked Image Modeling (MIM) methods, which are used for tasks such as fine-tuning, linear probing, and semantic segmentation.
  • methods: The proposed method utilizes low-level features from shallow layers to aid pixel reconstruction, and incorporates multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT).
  • results: The proposed method achieves non-trivial improvements across various downstream tasks, including 1.2% improvement in fine-tuning, 2.8% improvement in linear probing, and 2.6% improvement in semantic segmentation, when applied to a smaller model (e.g., ViT-S).
  • for: 这篇论文目标是提高Masked Image Modeling(MIM)方法的性能,这些方法用于Tasks如 fine-tuning、linear probing 和 semantic segmentation。
  • methods: 提议的方法利用 shallow layers 的低级别特征来帮助像素重建,并利用 isotropic 架构 like standard Vision Transformer(ViT)的多级特征融合。
  • results: 所提方法在多种下游任务上均带来了显著的改进,包括微调提升 1.2%、线性探测提升 2.8%、语义分割提升 2.6%(应用于较小的模型如 ViT-S 时)。
    Abstract There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2\% on fine-tuning, 2.8\% on linear probing, and 2.6\% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain.
    摘要 掩码图像建模(MIM)近年来取得了显著进展。现有的 MIM 方法按重建目标大致可分为两类:基于像素的方法和基于 tokenizer 的方法。前者流程更简单、计算成本更低,但已知会偏向高频细节。本文通过一系列实验研究证实了基于像素的 MIM 的这一局限,并提出一种新方法,显式利用浅层的低级特征来辅助像素重建。将这一设计纳入我们的基础方法 MAE 后,我们减少了基于像素的 MIM 中被浪费的建模能力,改善了其收敛性,并在多种下游任务上取得了可观的提升。据我们所知,我们是首个针对标准 Vision Transformer(ViT)这类各向同性架构系统研究多级特征融合的工作。值得注意的是,当应用于较小的模型(如 ViT-S)时,我们的方法带来了显著的性能提升,例如微调提升 1.2%、线性探测提升 2.8%、语义分割提升 2.6%。代码和模型可在 https://github.com/open-mmlab/mmpretrain 获取。
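A minimal sketch of the core idea is given below: tokens from a shallow encoder layer are fused with the final tokens before pixel reconstruction, so low-level detail no longer has to be carried by the deepest layers. The module name, additive fusion, and linear decoder are assumptions, not the released implementation.

```python
import torch.nn as nn

class MultiLevelFusionHead(nn.Module):
    """Fuse shallow-layer tokens with final encoder tokens before predicting
    the raw pixels of each masked patch (MAE-style reconstruction target)."""
    def __init__(self, dim: int, patch_pixels: int):
        super().__init__()
        self.proj_shallow = nn.Linear(dim, dim)
        self.proj_deep = nn.Linear(dim, dim)
        self.decoder = nn.Linear(dim, patch_pixels)   # per-patch pixel prediction

    def forward(self, shallow_tokens, deep_tokens):
        fused = self.proj_shallow(shallow_tokens) + self.proj_deep(deep_tokens)
        return self.decoder(fused)
```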

Unleashing the Power of Self-Supervised Image Denoising: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2308.00247
  • repo_url: None
  • paper_authors: Dan Zhang, Fangfang Zhou, Yuanzhou Wei, Xiao Yang, Yuan Gu
  • for: 本文旨在提供一份全面的自监督图像去噪方法综述,帮助研究者和实践者更好地了解该领域的最新进展。
  • methods: 本文将自监督图像去噪方法分为三类:通用方法、基于盲点网络(BSN)的方法和基于 Transformer 的方法,并对每类方法给出简要的理论分析及其实际应用。
  • results: 本文在多个数据集上进行了定量与定性实验,以经典算法为基准验证了这些方法的有效性,并给出了相应的对比分析。
    Abstract The advent of deep learning has brought a revolutionary transformation to image denoising techniques. However, the persistent challenge of acquiring noise-clean pairs for supervised methods in real-world scenarios remains formidable, necessitating the exploration of more practical self-supervised image denoising. This paper focuses on self-supervised image denoising methods that offer effective solutions to address this challenge. Our comprehensive review thoroughly analyzes the latest advancements in self-supervised image denoising approaches, categorizing them into three distinct classes: General methods, Blind Spot Network (BSN)-based methods, and Transformer-based methods. For each class, we provide a concise theoretical analysis along with their practical applications. To assess the effectiveness of these methods, we present both quantitative and qualitative experimental results on various datasets, utilizing classical algorithms as benchmarks. Additionally, we critically discuss the current limitations of these methods and propose promising directions for future research. By offering a detailed overview of recent developments in self-supervised image denoising, this review serves as an invaluable resource for researchers and practitioners in the field, facilitating a deeper understanding of this emerging domain and inspiring further advancements.
    摘要 深度学习的出现为图像去噪技术带来了革命性的变革。然而,在真实场景中获取"噪声-干净"图像对用于监督方法仍然十分困难,这促使人们探索更实用的自监督图像去噪。本文聚焦于能够有效应对这一挑战的自监督图像去噪方法,对其最新进展进行了全面梳理,并将其分为三类:通用方法、基于盲点网络(BSN)的方法和基于 Transformer 的方法。针对每一类方法,我们给出了简要的理论分析及其实际应用。为评估这些方法的效果,我们以经典算法为基准,在多个数据集上给出了定量与定性实验结果。此外,我们还批判性地讨论了这些方法当前的局限,并提出了有前景的未来研究方向。通过对自监督图像去噪最新进展的详细梳理,本综述可为该领域的研究者和实践者提供有价值的参考,帮助其更深入地理解这一新兴方向并推动后续进展。

Partitioned Saliency Ranking with Dense Pyramid Transformers

  • paper_url: http://arxiv.org/abs/2308.00236
  • repo_url: https://github.com/ssecv/psr
  • paper_authors: Chengxiao Sun, Yan Xu, Jialun Pei, Haopeng Fang, He Tang
  • for: 本研究旨在解决显著性排序(saliency ranking)中的主观性问题,提出了按分区排序(ranking by partition)的范式,可以缓解排序分数中固有的歧义。
  • methods: 本文提出了密集金字塔 Transformer(Dense Pyramid Transformer, DPT)模型,用于实现全局跨尺度交互,并使用按分区排序的范式来缓解排序分数的歧义。
  • results: 实验结果表明,我们的方法在多个基准数据集上优于所有现有方法,同时降低了计算成本。代码可以在 \url{https://github.com/ssecv/PSR} 上获取。
    Abstract In recent years, saliency ranking has emerged as a challenging task focusing on assessing the degree of saliency at instance-level. Being subjective, even humans struggle to identify the precise order of all salient instances. Previous approaches undertake the saliency ranking by directly sorting the rank scores of salient instances, which have not explicitly resolved the inherent ambiguities. To overcome this limitation, we propose the ranking by partition paradigm, which segments unordered salient instances into partitions and then ranks them based on the correlations among these partitions. The ranking by partition paradigm alleviates ranking ambiguities in a general sense, as it consistently improves the performance of other saliency ranking models. Additionally, we introduce the Dense Pyramid Transformer (DPT) to enable global cross-scale interactions, which significantly enhances feature interactions with reduced computational burden. Extensive experiments demonstrate that our approach outperforms all existing methods. The code for our method is available at \url{https://github.com/ssecv/PSR}.
    摘要 近年来,显著性排序作为一项具有挑战性的任务出现,旨在评估实例级别的显著程度。由于其主观性,即使是人类也难以确定所有显著实例的准确顺序。先前的方法直接对显著实例的排序分数进行排序,并未显式解决其中固有的歧义。为克服这一局限,我们提出了按分区排序的范式:先将无序的显著实例划分为若干分区,再根据分区之间的相关性对其进行排序。按分区排序在一般意义上缓解了排序歧义,并能稳定提升其他显著性排序模型的性能。此外,我们引入了密集金字塔 Transformer(DPT)以实现全局跨尺度交互,在降低计算负担的同时显著增强了特征交互。大量实验表明,我们的方法优于所有现有方法。代码可以在 \url{https://github.com/ssecv/PSR} 上获取。
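A toy illustration of ranking by partition: instead of sorting raw rank scores, instances are binned into ordered partitions and all members of a partition share one rank, which removes ambiguous orderings inside a partition. The equal-size binning below is an assumption; the paper ranks partitions using correlations among them.

```python
import numpy as np

def rank_by_partition(instance_scores: np.ndarray, num_partitions: int = 3) -> np.ndarray:
    """Assign each salient instance a partition-level rank (0 = most salient)."""
    order = np.argsort(-instance_scores)                 # most salient first
    splits = np.array_split(order, num_partitions)
    ranks = np.empty_like(instance_scores, dtype=int)
    for partition_rank, idx in enumerate(splits):
        ranks[idx] = partition_rank                      # whole partition shares one rank
    return ranks
```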

Using Scene and Semantic Features for Multi-modal Emotion Recognition

  • paper_url: http://arxiv.org/abs/2308.00228
  • repo_url: None
  • paper_authors: Zhifeng Wang, Ramesh Sankaranarayana
    for: 这个论文的目的是提出一种基于场景和 semantics 特征的多模态情绪识别方法,以提高情绪识别的准确性和稳定性。methods: 该方法使用了修改后的 EmbraceNet 来提取图像中的特征,并将身体特征和姿态特征同时学习。另外,该方法还使用了场景特征和 semantics 特征来支持情绪识别。results: 在 EMOTIC 数据集上进行测试,该方法实现了平均准确率为 40.39%,比之前的方法提高了5%。
    Abstract Automatic emotion recognition is a hot topic with a wide range of applications. Much work has been done in the area of automatic emotion recognition in recent years. The focus has been mainly on using the characteristics of a person such as speech, facial expression and pose for this purpose. However, the processing of scene and semantic features for emotion recognition has had limited exploration. In this paper, we propose to use combined scene and semantic features, along with personal features, for multi-modal emotion recognition. Scene features will describe the environment or context in which the target person is operating. The semantic feature can include objects that are present in the environment, as well as their attributes and relationships with the target person. In addition, we use a modified EmbraceNet to extract features from the images, which is trained to learn both the body and pose features simultaneously. By fusing both body and pose features, the EmbraceNet can improve the accuracy and robustness of the model, particularly when dealing with partially missing data. This is because having both body and pose features provides a more complete representation of the subject in the images, which can help the model to make more accurate predictions even when some parts of body are missing. We demonstrate the efficiency of our method on the benchmark EMOTIC dataset. We report an average precision of 40.39\% across the 26 emotion categories, which is a 5\% improvement over previous approaches.
    摘要 自动情感认识是一个热门的话题,具有广泛的应用领域。在过去的几年中,关于自动情感认识的研究得到了广泛的关注,主要是利用人类特征,如语音、脸部表达和姿势来实现。然而,Scene和semantic特征的处理对情感认识的研究尚未得到了充分的探索。在本文中,我们提议使用组合场景和semantic特征, along with个人特征, для多modal情感认识。场景特征将描述目标人在运行的环境或情况,semantic特征包括环境中的物品、Attribute和目标人之间的关系。此外,我们使用修改后的EmbraceNet来提取图像中的特征,该模型同时学习体部和姿势特征。通过融合体部和姿势特征,EmbraceNet可以提高模型的准确性和可靠性,特别是处理部分数据时。这是因为具有体部和姿势特征的描述可以帮助模型更好地预测,即使部分身体部分缺失。我们在EMOTIC数据集上进行了效果示例,并Report了26种情绪类别的平均准确率为40.39%,相比之前的方法提高5%。

Boundary Difference Over Union Loss For Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.00220
  • repo_url: https://github.com/sunfan-bvb/boundarydouloss
  • paper_authors: Fan Sun, Zhiming Luo, Shaozi Li
  • for: 针对医疗图像分割中的边界区域分割,提出了一种简单有效的损失函数Boundary DoU Loss。
  • methods: 该损失函数基于区域计算,不需要其他损失函数,训练稳定且易于实现。此外,还使用目标大小进行适应性调整对边界区域应用注意力。
  • results: 在ACDC和Synapse两个 dataset上,使用UNet、TransUNet和Swin-UNet进行实验,表明我们提出的损失函数有效地提高了边界区域分割的准确率。
    Abstract Medical image segmentation is crucial for clinical diagnosis. However, current losses for medical image segmentation mainly focus on overall segmentation results, with fewer losses proposed to guide boundary segmentation. Those that do exist often need to be used in combination with other losses and produce ineffective results. To address this issue, we have developed a simple and effective loss called the Boundary Difference over Union Loss (Boundary DoU Loss) to guide boundary region segmentation. It is obtained by calculating the ratio of the difference set of prediction and ground truth to the union of the difference set and the partial intersection set. Our loss only relies on region calculation, making it easy to implement and training stable without needing any additional losses. Additionally, we use the target size to adaptively adjust attention applied to the boundary regions. Experimental results using UNet, TransUNet, and Swin-UNet on two datasets (ACDC and Synapse) demonstrate the effectiveness of our proposed loss function. Code is available at https://github.com/sunfan-bvb/BoundaryDoULoss.
    摘要 医学图像分割对临床诊断至关重要。然而,目前用于医学图像分割的损失函数大多关注整体分割结果,针对边界分割的损失较少;已有的边界损失往往需要与其他损失联合使用,效果有限。为解决这一问题,我们提出了一种简单而有效的损失函数:边界差集比并集损失(Boundary DoU Loss),用于引导边界区域的分割。它通过计算预测与真实标注之间的差集,与"差集加部分交集"的并集之比得到。该损失仅依赖区域计算,易于实现、训练稳定,且无需任何额外损失。此外,我们还利用目标大小自适应地调整施加在边界区域上的关注。在 ACDC 和 Synapse 两个数据集上使用 UNet、TransUNet 和 Swin-UNet 的实验结果验证了所提损失函数的有效性。代码可以在 https://github.com/sunfan-bvb/BoundaryDoULoss 上找到。
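A rough sketch of a difference-over-union style loss with a size-adaptive weight is shown below. The `alpha` heuristic is an assumption standing in for the paper's target-size adjustment; the released repository should be consulted for the exact form.

```python
import torch

def boundary_dou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Symmetric difference of prediction and ground truth, divided by that
    difference plus a fraction of the intersection; alpha shrinks for large
    targets so big regions are not dominated by their interiors (heuristic)."""
    pred = pred.flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1) - inter
    diff = union - inter                                  # symmetric difference
    alpha = torch.clamp(1.0 - target.mean(dim=1), min=0.0, max=0.8)
    loss = diff / (diff + alpha * inter + eps)
    return loss.mean()
```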

Multi-goal Audio-visual Navigation using Sound Direction Map

  • paper_url: http://arxiv.org/abs/2308.00219
  • repo_url: None
  • paper_authors: Haru Kondoh, Asako Kanezaki
  • For: 本文研究多目标视听导航任务,即在已有的视听导航任务基础上将单一目标扩展为多个目标的情形。
  • Methods: 本文使用深度强化学习智能体进行导航,并在多种情形下开展实验,以分析多目标视听导航任务的难点。此外,本文还提出了一种名为声音方向地图(Sound Direction Map, SDM)的方法,能够以基于学习的方式、结合过往记忆动态地定位多个声源。
  • Results: 实验结果表明,无论目标数量多少,使用 SDM 方法都能显著提升多种基线方法的性能。
    Abstract Over the past few years, there has been a great deal of research on navigation tasks in indoor environments using deep reinforcement learning agents. Most of these tasks use only visual information in the form of first-person images to navigate to a single goal. More recently, tasks that simultaneously use visual and auditory information to navigate to the sound source and even navigation tasks with multiple goals instead of one have been proposed. However, there has been no proposal for a generalized navigation task combining these two types of tasks and using both visual and auditory information in a situation where multiple sound sources are goals. In this paper, we propose a new framework for this generalized task: multi-goal audio-visual navigation. We first define the task in detail, and then we investigate the difficulty of the multi-goal audio-visual navigation task relative to the current navigation tasks by conducting experiments in various situations. The research shows that multi-goal audio-visual navigation has the difficulty of the implicit need to separate the sources of sound. Next, to mitigate the difficulties in this new task, we propose a method named sound direction map (SDM), which dynamically localizes multiple sound sources in a learning-based manner while making use of past memories. Experimental results show that the use of SDM significantly improves the performance of multiple baseline methods, regardless of the number of goals.
    摘要 在过去几年,深度强化学习代理人在室内环境中完成导航任务得到了很多研究。大多数这些任务只使用视觉信息,即首人图像,导航到单个目标。然而,最近提出了同时使用视觉和听音信息导航到声源的任务,以及多个目标导航任务。然而,没有任何提案可以将这两种任务结合起来,并使用两种类型的信息在多个声源目标下进行导航。在这篇论文中,我们提出了一个新的框架:多目标音频视觉导航。我们首先定义了这个任务,然后通过在不同情况下进行实验来调查这个任务的困难程度。实验结果表明,多目标音频视觉导航任务存在隐式地分离声音来源的需求,这使得任务变得更加困难。然后,我们提出了一种名为声音方向地图(SDM)的方法,可以在学习基础上动态地Localize多个声音来源,并且利用过去的记忆。实验结果表明,使用SDM可以significantly improve多个基eline方法的表现,不管声音来源的数量如何。

Robust Single-view Cone-beam X-ray Pose Estimation with Neural Tuned Tomography (NeTT) and Masked Neural Radiance Fields (mNeRF)

  • paper_url: http://arxiv.org/abs/2308.00214
  • repo_url: None
  • paper_authors: Chaochao Zhou, Syed Hasib Akhter Faruqui, Abhinav Patel, Ramez N. Abdalla, Michael C. Hurley, Ali Shaibani, Matthew B. Potts, Babak S. Jahromi, Leon Cho, Sameer A. Ansari, Donald R. Cantrell
  • for: 本文旨在解决基于 X 射线投影的图像引导微创手术中的位姿估计问题。
  • methods: 本文提出了新的位姿估计方法,包括 DiffDRR、NeTT 和 mNeRF。这些方法利用 TensorFlow 中的自动微分,并通过高保真的 DRR 合成来提高位姿估计的精度。
  • results: 结果表明,NeTT 和 mNeRF 都能有效地进行位姿估计,两者的成功率均高于 93%。然而,NeTT 的计算成本远低于 mNeRF,在训练与位姿估计阶段都更具优势。此外,本文还表明针对单一受试者训练的 NeTT 可以泛化到其他受试者,实现高保真的 DRR 合成与稳健的位姿估计。因此,作者建议使用 NeTT 来实现基于透视投影的稳健位姿估计。
    Abstract Many tasks performed in image-guided, mini-invasive, medical procedures can be cast as pose estimation problems, where an X-ray projection is utilized to reach a target in 3D space. Expanding on recent advances in the differentiable rendering of optically reflective materials, we introduce new methods for pose estimation of radiolucent objects using X-ray projections, and we demonstrate the critical role of optimal view synthesis in performing this task. We first develop an algorithm (DiffDRR) that efficiently computes Digitally Reconstructed Radiographs (DRRs) and leverages automatic differentiation within TensorFlow. Pose estimation is performed by iterative gradient descent using a loss function that quantifies the similarity of the DRR synthesized from a randomly initialized pose and the true fluoroscopic image at the target pose. We propose two novel methods for high-fidelity view synthesis, Neural Tuned Tomography (NeTT) and masked Neural Radiance Fields (mNeRF). Both methods rely on classic Cone-Beam Computerized Tomography (CBCT); NeTT directly optimizes the CBCT densities, while the non-zero values of mNeRF are constrained by a 3D mask of the anatomic region segmented from CBCT. We demonstrate that both NeTT and mNeRF distinctly improve pose estimation within our framework. By defining a successful pose estimate to be a 3D angle error of less than 3 deg, we find that NeTT and mNeRF can achieve similar results, both with overall success rates more than 93%. However, the computational cost of NeTT is significantly lower than mNeRF in both training and pose estimation. Furthermore, we show that a NeTT trained for a single subject can generalize to synthesize high-fidelity DRRs and ensure robust pose estimations for all other subjects. Therefore, we suggest that NeTT is an attractive option for robust pose estimation using fluoroscopic projections.
    摘要 许多在图像导航、微创手术中进行的任务可以被看作为位置估计问题,其中使用X射线投影来达到3D空间中的目标。在不断提高数据渠道 Rendering 技术的基础之上,我们介绍了一种新的方法,即基于 X射线投影的对象位置估计。我们首先开发了一种名为 DiffDRR 的算法,它可以高效计算 Digitally Reconstructed Radiographs (DRRs),并在 TensorFlow 中使用自动导数。在 pose 估计中,我们使用一个损失函数,该函数衡量 DRR 从 randomly initialized pose 中生成的synthesized 和真实 fluoroscopic 图像在目标姿势下的相似性。我们提出了两种高精度视图合成方法,即 Neural Tuned Tomography (NeTT) 和 masked Neural Radiance Fields (mNeRF)。两种方法都基于 Cone-Beam Computerized Tomography (CBCT),NeTT 直接优化 CBCT 密度,而 mNeRF 的非零值被限制为 segmented 从 CBCT 中的3Dmask。我们发现 NeTT 和 mNeRF 都可以提高 pose 估计的准确性,其中 NeTT 的计算成本较低,并且可以在训练和 pose 估计中进行高效的计算。此外,我们发现 NeTT 可以在不同主体之间进行交互学习,并且可以在单个主体训练后对所有主体进行高精度的 DRR 生成和 pose 估计。因此,我们认为 NeTT 是一种可靠的选择 для基于 fluoroscopic 投影的 pose 估计。
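The iterative pose refinement described above can be sketched as gradient descent on an image-similarity loss between a differentiably rendered DRR and the observed fluoroscopic frame. `render_drr` is a placeholder for a DiffDRR-like differentiable renderer, and the cosine-similarity loss is an illustrative choice rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def estimate_pose(render_drr, target_image, init_pose, steps=200, lr=1e-2):
    """Refine a pose by rendering a DRR at the current estimate and descending
    on a similarity loss against the observed fluoroscopic image."""
    pose = init_pose.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        drr = render_drr(pose)                                   # differentiable renderer
        loss = 1.0 - F.cosine_similarity(drr.flatten(), target_image.flatten(), dim=0)
        loss.backward()
        optimizer.step()
    return pose.detach()
```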

Scene Separation & Data Selection: Temporal Segmentation Algorithm for Real-Time Video Stream Analysis

  • paper_url: http://arxiv.org/abs/2308.00210
  • repo_url: None
  • paper_authors: Yuelin Xin, Zihan Zhou, Yuxuan Xia
  • for: 视频流理解的实时解读
  • methods: 使用图像差分比较 Temporal segmentation算法
  • results: 实验结果达到90%以上的总准确率
    Abstract We present 2SDS (Scene Separation and Data Selection algorithm), a temporal segmentation algorithm used in real-time video stream interpretation. It complements CNN-based models to make use of temporal information in videos. 2SDS can detect the change between scenes in a video stream by com-paring the image difference between two frames. It separates a video into segments (scenes), and by combining itself with a CNN model, 2SDS can select the optimal result for each scene. In this paper, we will be discussing some basic methods and concepts behind 2SDS, as well as presenting some preliminary experiment results regarding 2SDS. During these experiments, 2SDS has achieved an overall accuracy of over 90%.
    摘要 我们现在介绍2SDS(Scene Separation and Data Selection算法),一种用于实时视频流理解的时间段分算法。它与基于CNN(卷积神经网络)模型结合使用,以利用视频中的时间信息。2SDS可以在视频流中检测图像差异,并将视频流分解成场景(scene)。通过与CNN模型结合使用,2SDS可以选择每个场景的优化结果。在这篇论文中,我们将讨论2SDS的一些基本方法和概念,以及2SDS的一些初步实验结果。在这些实验中,2SDS达到了超过90%的总准确率。
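A simplified stand-in for the frame-difference comparison used by 2SDS is shown below; the global mean-absolute-difference test and the threshold value are assumptions, not the algorithm's exact criterion.

```python
import numpy as np

def detect_scene_changes(frames, threshold=0.25):
    """Return the indices at which a new scene starts, based on the mean
    absolute pixel difference between consecutive frames (assumed uint8 arrays)."""
    boundaries = [0]
    for i in range(1, len(frames)):
        prev = frames[i - 1].astype(np.float32) / 255.0
        curr = frames[i].astype(np.float32) / 255.0
        if np.abs(curr - prev).mean() > threshold:
            boundaries.append(i)                          # scene boundary detected
    return boundaries
```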

CBCL-PR: A Cognitively Inspired Model for Class-Incremental Learning in Robotics

  • paper_url: http://arxiv.org/abs/2308.00199
  • repo_url: https://github.com/aliayub7/cbcl-pr
  • paper_authors: Ali Ayub, Alan R. Wagner
  • for: 该论文解决了基于少量数据的自适应学习和增量学习问题,即AI机器人需要在有限数据情况下不断学习和适应环境中。
  • methods: 该论文提出了一种基于 hippocampus 和 neocortex 理论的新框架,用于解决 Few-Shot class Incremental Learning(FSIL)问题。该框架表示物体类划分成多个集合,并将其存储在内存中。重复播放过去类划分中生成的数据,以避免卷积学习新类时忘记过去类。
  • results: 该论文在两个物体分类 dataset 上进行了评估,并取得了当今最佳性能(SOTA)。此外,在一个 robot 上也进行了评估,并证明了机器人可以在有限人工协助下不断学习分类大量家用物品。
    Abstract For most real-world applications, robots need to adapt and learn continually with limited data in their environments. In this paper, we consider the problem of Few-Shot class Incremental Learning (FSIL), in which an AI agent is required to learn incrementally from a few data samples without forgetting the data it has previously learned. To solve this problem, we present a novel framework inspired by theories of concept learning in the hippocampus and the neocortex. Our framework represents object classes in the form of sets of clusters and stores them in memory. The framework replays data generated by the clusters of the old classes, to avoid forgetting when learning new classes. Our approach is evaluated on two object classification datasets resulting in state-of-the-art (SOTA) performance for class-incremental learning and FSIL. We also evaluate our framework for FSIL on a robot demonstrating that the robot can continually learn to classify a large set of household objects with limited human assistance.
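A compact sketch of the cluster-plus-replay idea described in the bullets above: each class is summarized by a few feature-space centroids, and pseudo-samples drawn around those centroids are replayed when new classes arrive, so earlier classes are not forgotten. The k-means summarization and Gaussian replay noise are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

class ClusterMemory:
    """Store each class as a small set of feature-space cluster centroids and
    replay noisy samples around them during incremental training."""
    def __init__(self, clusters_per_class=5, noise_std=0.1):
        self.k = clusters_per_class
        self.noise_std = noise_std
        self.centroids = {}                               # class_id -> (k, d) array

    def add_class(self, class_id, features):
        k = min(self.k, len(features))
        self.centroids[class_id] = KMeans(n_clusters=k, n_init=10).fit(features).cluster_centers_

    def replay(self, samples_per_class=20):
        xs, ys = [], []
        for cid, centers in self.centroids.items():
            picks = centers[np.random.randint(len(centers), size=samples_per_class)]
            xs.append(picks + self.noise_std * np.random.randn(*picks.shape))
            ys.append(np.full(samples_per_class, cid))
        return np.concatenate(xs), np.concatenate(ys)
```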

C-DARL: Contrastive diffusion adversarial representation learning for label-free blood vessel segmentation

  • paper_url: http://arxiv.org/abs/2308.00193
  • repo_url: None
  • paper_authors: Boah Kim, Yujin Oh, Bradford J. Wood, Ronald M. Summers, Jong Chul Ye
  • For: The paper is written for the purpose of developing a self-supervised vessel segmentation method for medical imaging, which can help improve the accuracy and efficiency of vascular disease diagnosis and interventional planning.
  • Methods: The paper proposes a novel method called C-DARL, which combines a diffusion module and a generation module to learn the distribution of multi-domain blood vessel data. The model uses contrastive learning through a mask-based contrastive loss to generate more realistic vessel representations.
  • Results: The experimental results show that C-DARL achieves performance improvement over baseline methods with noise robustness, indicating the effectiveness of the proposed method for vessel segmentation in medical imaging.
    Abstract Blood vessel segmentation in medical imaging is one of the essential steps for vascular disease diagnosis and interventional planning in a broad spectrum of clinical scenarios in image-based medicine and interventional medicine. Unfortunately, manual annotation of the vessel masks is challenging and resource-intensive due to subtle branches and complex structures. To overcome this issue, this paper presents a self-supervised vessel segmentation method, dubbed the contrastive diffusion adversarial representation learning (C-DARL) model. Our model is composed of a diffusion module and a generation module that learns the distribution of multi-domain blood vessel data by generating synthetic vessel images from diffusion latent. Moreover, we employ contrastive learning through a mask-based contrastive loss so that the model can learn more realistic vessel representations. To validate the efficacy, C-DARL is trained using various vessel datasets, including coronary angiograms, abdominal digital subtraction angiograms, and retinal imaging. Experimental results confirm that our model achieves performance improvement over baseline methods with noise robustness, suggesting the effectiveness of C-DARL for vessel segmentation.
    摘要 医学影像中的血管分割是诊断血管疾病和介入规划的关键步骤,在基于影像的医学和介入医学的各种临床场景中都非常重要。然而,由于血管分支细微、结构复杂,手动标注血管掩码既困难又耗费资源。为了解决这个问题,本文提出了一种自监督的血管分割方法,名为对比扩散对抗表示学习(C-DARL)模型。我们的模型由扩散模块和生成模块组成,通过从扩散隐变量生成合成血管图像来学习多域血管数据的分布。此外,我们采用基于掩码的对比损失进行对比学习,使模型学习更加真实的血管表示。为验证效果,我们使用多种血管数据集训练了 C-DARL,包括冠状动脉造影、腹部数字减影血管造影和视网膜成像。实验结果表明,我们的模型相比基线方法取得了性能提升,并具有噪声鲁棒性,说明 C-DARL 是有效的血管分割方法。

Detecting the Anomalies in LiDAR Pointcloud

  • paper_url: http://arxiv.org/abs/2308.00187
  • repo_url: None
  • paper_authors: Chiyu Zhang, Ji Han, Yao Zou, Kexin Dong, Yujia Li, Junchun Ding, Xiaoling Han
  • for: 本研究旨在检测LiDAR数据中异常点云的存在,以提高自动驾驶系统的安全性。
  • methods: 本文提出了一种基于点云特征分析的异常点云检测方法,包括点云质量指标的开发,以评估点云中噪声水平。这种方法不需要标注或训练,因此可以快速执行和扩展。
  • results: 经过实验表明,该方法能够有效地检测LiDAR数据中异常点云,并可以适应不同的扫描机制和激光谱。
    Abstract LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR is generating anomalous pointcloud by analyzing the pointcloud characteristics. Specifically, we develop a pointcloud quality metric based on the LiDAR points' spatial and intensity distribution to characterize the noise level of the pointcloud, which relies on pure mathematical analysis and does not require any labeling or training as learning-based methods do. Therefore, the method is scalable and can be quickly deployed either online to improve the autonomy safety by monitoring anomalies in the LiDAR data or offline to perform in-depth study of the LiDAR behavior over large amount of data. The proposed approach is studied with extensive real public road data collected by LiDARs with different scanning mechanisms and laser spectrums, and is proven to be able to effectively handle various known and unknown sources of pointcloud anomaly.
    摘要 激光雷达(LiDAR)传感器在现代自动驾驶系统的感知栈中扮演着重要的角色。雨、雾、尘埃等不良天气条件,以及(偶发的)LiDAR 硬件故障,可能导致 LiDAR 生成具有异常模式的点云,例如散布的噪声点和不寻常的强度值。在这篇论文中,我们提出了一种通过分析点云特征来检测 LiDAR 是否生成异常点云的新方法。具体而言,我们基于 LiDAR 点云的空间分布和强度分布构建了点云质量度量,用于刻画点云的噪声水平。该方法完全基于数学分析,不需要任何标注或训练,因此易于扩展,既可以在线部署以监测 LiDAR 数据中的异常、提高自动驾驶的安全性,也可以离线用于对大量数据中 LiDAR 行为的深入研究。我们使用由不同扫描机制和激光波段的 LiDAR 采集的大量真实公开道路数据对该方法进行了研究,证明其能够有效处理各种已知与未知来源的点云异常。
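A heuristic sketch of a label-free pointcloud quality score combining spatial isolation and intensity outliers is given below; the specific statistics and their equal weighting are assumptions, not the paper's metric.

```python
import numpy as np
from scipy.spatial import cKDTree

def pointcloud_noise_score(points, intensity, radius=0.5, min_neighbors=3):
    """Estimate a noise level from pure statistics: the share of spatially
    isolated points plus the share of intensity outliers (no labels needed)."""
    tree = cKDTree(points)
    neighbor_counts = np.array(
        [len(tree.query_ball_point(p, radius)) - 1 for p in points])
    isolated_ratio = (neighbor_counts < min_neighbors).mean()
    z = (intensity - intensity.mean()) / (intensity.std() + 1e-8)
    intensity_outlier_ratio = (np.abs(z) > 3.0).mean()
    return 0.5 * isolated_ratio + 0.5 * intensity_outlier_ratio
```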

Towards Imbalanced Large Scale Multi-label Classification with Partially Annotated Labels

  • paper_url: http://arxiv.org/abs/2308.00166
  • repo_url: None
  • paper_authors: XIn Zhang, Yuqi Song, Fei Zuo, Xiaofeng Wang
  • For: 多个标签分类问题在日常生活中广泛存在,其中一个实例可以与多个类相关。这是一种指导学习方法,需要大量的标注数据。但是,标注数据可能是时间consuming且可能无法实现巨大的标注空间。另外,标签偏好可能限制多个标签分类器的性能,特别是缺失某些标签。因此,研究如何使用偏好的标签来训练神经网络是有意义的。* Methods: 我们引入 Pseudo-labeling 技术,允许常见的神经网络在部分标注设置下运行,无需额外复杂的结构。然后,我们提出了一种新的损失函数,利用现有数据集的统计信息来有效地缓解标签偏好问题。另外,我们设计了一种动态训练方案,以减少标注空间的维度,进一步缓解偏好。* Results: 我们在 COCO、NUS-WIDE、CUB 和 Open Images 等多个公共可用的多个标签数据集上进行了广泛的实验。结果表明,我们的方法比一些现状顶尖方法高效,而且在一些部分标注设置下,我们的方法甚至超过了使用全标注的方法。
    Abstract Multi-label classification is a widely encountered problem in daily life, where an instance can be associated with multiple classes. In theory, this is a supervised learning method that requires a large amount of labeling. However, annotating data is time-consuming and may be infeasible for huge labeling spaces. In addition, label imbalance can limit the performance of multi-label classifiers, especially when some labels are missing. Therefore, it is meaningful to study how to train neural networks using partial labels. In this work, we address the issue of label imbalance and investigate how to train classifiers using partial labels in large labeling spaces. First, we introduce the pseudo-labeling technique, which allows commonly adopted networks to be applied in partially labeled settings without the need for additional complex structures. Then, we propose a novel loss function that leverages statistical information from existing datasets to effectively alleviate the label imbalance problem. In addition, we design a dynamic training scheme to reduce the dimension of the labeling space and further mitigate the imbalance. Finally, we conduct extensive experiments on some publicly available multi-label datasets such as COCO, NUS-WIDE, CUB, and Open Images to demonstrate the effectiveness of the proposed approach. The results show that our approach outperforms several state-of-the-art methods, and surprisingly, in some partial labeling settings, our approach even exceeds the methods trained with full labels.
    摘要 多标签分类是日常生活中广泛存在的问题,其中一个实例可以与多个类别相关。理论上,这是一种需要大量标注的监督学习方法。然而,标注数据十分耗时,对于庞大的标签空间甚至可能无法完成。此外,标签不平衡会限制多标签分类器的性能,尤其是在部分标签缺失时。因此,研究如何利用部分标签训练神经网络具有重要意义。在本工作中,我们着手解决标签不平衡问题,并研究如何在大规模标签空间中利用部分标签训练分类器。首先,我们引入伪标签技术,使常用的网络无需额外复杂结构即可应用于部分标注场景。然后,我们提出一种新的损失函数,利用现有数据集中的统计信息,有效缓解标签不平衡问题。另外,我们设计了一种动态训练方案,以降低标签空间的维度,进一步减轻不平衡。最后,我们在 COCO、NUS-WIDE、CUB 和 Open Images 等公开多标签数据集上进行了大量实验。结果显示,我们的方法优于多种最先进的方法,并且出人意料的是,在某些部分标注设置下,我们的方法甚至超过了使用完整标签训练的方法。
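Pseudo-labeling for partially annotated multi-label training can be sketched as below: annotated entries use standard binary cross-entropy, while unannotated entries are filled in only when the current model is confident and ignored otherwise. The thresholds and masking scheme are illustrative assumptions; the paper's statistics-based loss for imbalance is not reproduced here.

```python
import torch
import torch.nn.functional as F

def partial_label_bce(logits, labels, observed_mask, pos_thr=0.9, neg_thr=0.1):
    """BCE over observed label entries plus confident pseudo-labels for the rest."""
    probs = torch.sigmoid(logits).detach()
    pseudo_pos = (probs > pos_thr) & ~observed_mask
    pseudo_neg = (probs < neg_thr) & ~observed_mask
    targets = labels.clone().float().clamp(0, 1)
    targets[pseudo_pos] = 1.0
    targets[pseudo_neg] = 0.0
    weight = (observed_mask | pseudo_pos | pseudo_neg).float()   # ignore the undecided
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weight)
```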

Multispectral Image Segmentation in Agriculture: A Comprehensive Study on Fusion Approaches

  • paper_url: http://arxiv.org/abs/2308.00159
  • repo_url: https://github.com/cybonic/misagriculture
  • paper_authors: Nuno Cunha, Tiago Barros, Mário Reis, Tiago Marta, Cristiano Premebida, Urbano J. Nunes
  • for: 本研究旨在探讨多spectral imaging在农业应用中的可支持,包括图像分割、农业监测、场 robotics 和含量估计等。
  • methods: 本研究使用融合方法提高图像分割过程中的精度,并比较了不同的融合方法,包括RGB和NDVI作为输入,用于检测农作物行进检测。
  • results: 实验表明,传统方法(例如边缘检测和阈值分割)在某些特定的农业应用中仍然保持竞争力;而在所考察的融合策略中,后期融合(late fusion)在不同的分割场景下表现出最高的鲁棒性和有效性。
    Abstract Multispectral imagery is frequently incorporated into agricultural tasks, providing valuable support for applications such as image segmentation, crop monitoring, field robotics, and yield estimation. From an image segmentation perspective, multispectral cameras can provide rich spectral information, helping with noise reduction and feature extraction. As such, this paper concentrates on the use of fusion approaches to enhance the segmentation process in agricultural applications. More specifically, in this work, we compare different fusion approaches by combining RGB and NDVI as inputs for crop row detection, which can be useful in autonomous robots operating in the field. The inputs are used individually as well as combined at different times of the process (early and late fusion) to perform classical and DL-based semantic segmentation. In this study, two agriculture-related datasets are subjected to analysis using both deep learning (DL)-based and classical segmentation methodologies. The experiments reveal that classical segmentation methods, utilizing techniques such as edge detection and thresholding, can effectively compete with DL-based algorithms, particularly in tasks requiring precise foreground-background separation. This suggests that traditional methods retain their efficacy in certain specialized applications within the agricultural domain. Moreover, among the fusion strategies examined, late fusion emerges as the most robust approach, demonstrating superiority in adaptability and effectiveness across varying segmentation scenarios. The dataset and code is available at https://github.com/Cybonic/MISAgriculture.git.
    摘要 多光谱影像在农业任务中被广泛应用,为图像分割、作物监测、田间机器人和产量估计等应用提供了有力支持。从图像分割的角度来看,多光谱相机可以提供丰富的光谱信息,有助于降噪和特征提取。因此,本文着重研究利用融合方法提升农业应用中的分割效果。具体而言,我们以 RGB 和 NDVI 作为输入进行作物行检测(这对在田间作业的自主机器人十分有用),比较了不同的融合方法:这些输入既可单独使用,也可在流程的不同阶段(早期融合与后期融合)进行组合,用于经典分割方法和基于深度学习(DL)的语义分割。本研究使用两个农业相关数据集,分别采用基于 DL 的方法和经典分割方法进行分析。实验表明,采用边缘检测和阈值分割等技术的经典分割方法能够与基于 DL 的算法有效竞争,尤其是在需要精确前景-背景分离的任务中。这说明传统方法在农业领域的某些特定应用中仍然有效。此外,在所考察的融合策略中,后期融合表现最为稳健,在不同的分割场景下均展现出更好的适应性和有效性。数据集和代码可在 https://github.com/Cybonic/MISAgriculture.git 获取。
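A minimal sketch of the late-fusion strategy that the study found most robust: RGB and NDVI are processed by separate branches and their per-pixel logits are averaged, whereas early fusion would instead concatenate the inputs into a single 4-channel branch. The branch architectures and equal weighting are assumptions.

```python
import torch.nn as nn

class LateFusionSegmenter(nn.Module):
    """Late fusion of two modality-specific segmentation branches."""
    def __init__(self, rgb_branch: nn.Module, ndvi_branch: nn.Module):
        super().__init__()
        self.rgb_branch = rgb_branch       # e.g. a segmentation net taking 3 channels
        self.ndvi_branch = ndvi_branch     # e.g. the same architecture taking 1 channel

    def forward(self, rgb, ndvi):
        # average the per-pixel class logits of both branches
        return 0.5 * (self.rgb_branch(rgb) + self.ndvi_branch(ndvi))
```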

Hierarchical Semi-Supervised Learning Framework for Surgical Gesture Segmentation and Recognition Based on Multi-Modality Data

  • paper_url: http://arxiv.org/abs/2308.02529
  • repo_url: None
  • paper_authors: Zhili Yuan, Jialin Lin, Dandan Zhang
  • for: 这份研究的目的是为了分析遗传外科手术的工作流程,特别是用于自动化遗传外科手术、评估外科技术等。
  • methods: 这篇研究使用了一个层次 semi-supervised learning 框架,使用多种数据(运动和视觉数据)进行手术动作排序。
  • results: 研究结果显示,使用这个方法可以在 JIGSAWS 数据库中得到平均 F1 分数为 0.623 的排序结果,以及识别率为 0.856。
    Abstract Segmenting and recognizing surgical operation trajectories into distinct, meaningful gestures is a critical preliminary step in surgical workflow analysis for robot-assisted surgery. This step is necessary for facilitating learning from demonstrations for autonomous robotic surgery, evaluating surgical skills, and so on. In this work, we develop a hierarchical semi-supervised learning framework for surgical gesture segmentation using multi-modality data (i.e. kinematics and vision data). More specifically, surgical tasks are initially segmented based on distance characteristics-based profiles and variance characteristics-based profiles constructed using kinematics data. Subsequently, a Transformer-based network with a pre-trained `ResNet-18' backbone is used to extract visual features from the surgical operation videos. By combining the potential segmentation points obtained from both modalities, we can determine the final segmentation points. Furthermore, gesture recognition can be implemented based on supervised learning. The proposed approach has been evaluated using data from the publicly available JIGSAWS database, including Suturing, Needle Passing, and Knot Tying tasks. The results reveal an average F1 score of 0.623 for segmentation and an accuracy of 0.856 for recognition.
    摘要 将手术操作轨迹划分并识别为独立且有意义的手术手势,是机器人辅助手术流程分析中的关键前置步骤,对于从示范中学习以实现自主机器人手术、评估手术技能等都十分必要。在本工作中,我们开发了一种用于手术手势划分的层次半监督学习框架,使用多模态数据(即运动学数据和视觉数据)。更具体地说,我们首先基于运动学数据构建的距离特征曲线和方差特征曲线对手术任务进行初步划分;随后,使用以预训练 ResNet-18 为骨干的 Transformer 网络从手术操作视频中提取视觉特征。将两种模态得到的候选划分点结合,即可确定最终的划分点。此外,手势识别可基于监督学习实现。我们的方法在公开的 JIGSAWS 数据库(包括缝合、穿针和打结任务)上进行了评估,结果表明划分的平均 F1 分数为 0.623,识别准确率为 0.856。

Federated Learning for Data and Model Heterogeneity in Medical Imaging

  • paper_url: http://arxiv.org/abs/2308.00155
  • repo_url: None
  • paper_authors: Hussain Ahmad Madni, Rao Muhammad Umer, Gian Luca Foresti
  • for: 该论文旨在解决 Federated Learning (FL) 中的数据和模型不一致问题,以提高 FL 的效率。
  • methods: 该论文提出了一种方法 named MDH-FL (Exploiting Model and Data Heterogeneity in FL),通过知识传承和对称损失来解决数据和模型不一致问题,以提高模型性能。
  • results: 实验结果表明,该方法在医疗数据集上比其他方法更有优势,可以更好地处理医疗机构中的数据和模型不一致问题。
    Abstract Federated Learning (FL) is an evolving machine learning method in which multiple clients participate in collaborative learning without sharing their data with each other and the central server. In real-world applications such as hospitals and industries, FL counters the challenges of data heterogeneity and model heterogeneity as an inevitable part of the collaborative training. More specifically, different organizations, such as hospitals, have their own private data and customized models for local training. To the best of our knowledge, the existing methods do not effectively address both problems of model heterogeneity and data heterogeneity in FL. In this paper, we exploit the data and model heterogeneity simultaneously, and propose a method, MDH-FL (Exploiting Model and Data Heterogeneity in FL) to solve such problems to enhance the efficiency of the global model in FL. We use knowledge distillation and a symmetric loss to minimize the heterogeneity and its impact on the model performance. Knowledge distillation is used to solve the problem of model heterogeneity, and symmetric loss tackles with the data and label heterogeneity. We evaluate our method on the medical datasets to conform the real-world scenario of hospitals, and compare with the existing methods. The experimental results demonstrate the superiority of the proposed approach over the other existing methods.
    摘要 federated 学习(FL)是一种在多个客户端合作学习而不是分享数据之间的演化机器学习方法。在现实世界中,如医院和产业中,FL 可以避免数据和模型不同性的挑战,作为合作训练的不可避免的一部分。更specifically,不同的组织,如医院,拥有自己的私有数据和自定义的本地模型进行本地训练。据我们所知,现有的方法不能有效地解决FL中的数据和模型不同性问题。在这篇论文中,我们利用数据和模型不同性同时,并提出了一种方法,MDH-FL(利用数据和模型不同性的FL),以解决这些问题,并提高FL的全球模型效率。我们使用知识塑造和对称损失来减少不同性的影响。知识塑造用于解决模型不同性问题,而对称损失用于解决数据和标签不同性问题。我们在医疗数据集上进行了实验,以验证这种方法在现实世界中的可行性,并与现有方法进行比较。实验结果表明,我们的方法在FL中表现出了superiority。

Controlling Geometric Abstraction and Texture for Artistic Images

  • paper_url: http://arxiv.org/abs/2308.00148
  • repo_url: https://github.com/MartinBuessemeyer/Artistic-Texture-Control
  • paper_authors: Martin Büßemeyer, Max Reimann, Benito Buchheim, Amir Semmo, Jürgen Döllner, Matthias Trapp
  • for: 这个论文是为了提出一种新的方法,用于对艺术图像中的几何抽象和文本URE的交互控制。
  • methods: 该方法使用了一种干预式的方法,将输入图像分解成形状和高频环境的参数表示,从而实现独立控制颜色和文本URE。这个表示中的每个参数控制了一系列可 diferenciable 的样式化过滤器的笔画属性。
  • results: 该方法可以实现多种艺术风格的编辑,包括全局和地方的形状和笔画属性的交互修改。此外,它还可以通过参考图像和文本提示来进行优化的文本风格传输,以及在参数空间中单独预测单个和任意样式参数的网络训练。
    Abstract We present a novel method for the interactive control of geometric abstraction and texture in artistic images. Previous example-based stylization methods often entangle shape, texture, and color, while generative methods for image synthesis generally either make assumptions about the input image, such as only allowing faces or do not offer precise editing controls. By contrast, our holistic approach spatially decomposes the input into shapes and a parametric representation of high-frequency details comprising the image's texture, thus enabling independent control of color and texture. Each parameter in this representation controls painterly attributes of a pipeline of differentiable stylization filters. The proposed decoupling of shape and texture enables various options for stylistic editing, including interactive global and local adjustments of shape, stroke, and painterly attributes such as surface relief and contours. Additionally, we demonstrate optimization-based texture style-transfer in the parametric space using reference images and text prompts, as well as the training of single- and arbitrary style parameter prediction networks for real-time texture decomposition.
    摘要 我们提出了一种在艺术图像中交互式控制几何抽象与纹理的新方法。先前基于示例的风格化方法往往将形状、纹理和颜色纠缠在一起,而用于图像合成的生成方法通常要么对输入图像做出假设(例如只允许人脸),要么无法提供精确的编辑控制。相比之下,我们的整体方法将输入图像在空间上分解为形状和一个描述高频细节(即图像纹理)的参数化表示,从而实现颜色与纹理的独立控制。该表示中的每个参数控制一条可微分风格化滤波器流水线中的绘画属性。形状与纹理的解耦带来了多种风格化编辑方式,包括对形状、笔触以及表面浮雕和轮廓等绘画属性的交互式全局和局部调整。此外,我们还展示了利用参考图像和文本提示在参数空间中进行基于优化的纹理风格迁移,以及训练单一风格和任意风格参数预测网络以实现实时纹理分解。

Ensemble Learning with Residual Transformer for Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2308.00128
  • repo_url: None
  • paper_authors: Lanhong Yao, Zheyuan Zhang, Ulas Bagci
  • for: 这个论文旨在提高脑癌分类的精度,因为现有的 U-Net 架构具有复杂的形状和 texture 的问题,以及 Labeling 的问题。
  • methods: 这个论文提出了一个新的网络架构,将 Transformers integrate 到了自适应的 U-Net 中,以获取3D 维度的 Volume 上下文,并且添加了一个 residual 连接,以避免资讯流动的干扰。
  • results: 在 BraTS 2021 数据集上(3D),我们的模型获得了87.6% 的 mean Dice 分数,比前一代方法高, demonstrating the potential of combining multiple architectures to optimize brain tumor segmentation.
    Abstract Brain tumor segmentation is an active research area due to the difficulty in delineating highly complex shaped and textured tumors as well as the failure of the commonly used U-Net architectures. The combination of different neural architectures is among the mainstream research recently, particularly the combination of U-Net with Transformers because of their innate attention mechanism and pixel-wise labeling. Different from previous efforts, this paper proposes a novel network architecture that integrates Transformers into a self-adaptive U-Net to draw out 3D volumetric contexts with reasonable computational costs. We further add a residual connection to prevent degradation in information flow and explore ensemble methods, as the evaluated models have edges on different cases and sub-regions. On the BraTS 2021 dataset (3D), our model achieves 87.6% mean Dice score and outperforms the state-of-the-art methods, demonstrating the potential for combining multiple architectures to optimize brain tumor segmentation.
    摘要 脑肿瘤分割是一个活跃的研究领域,因为形状和纹理高度复杂的肿瘤难以勾画,且常用的 U-Net 架构在此类任务上表现不佳。近年来,不同神经网络架构的组合成为主流研究方向,尤其是将 U-Net 与 Transformer 结合,因为后者具有天然的注意力机制和像素级标注能力。与以往工作不同,本文提出了一种新的网络架构,将 Transformer 融入自适应 U-Net 中,在合理的计算成本下提取 3D 体数据的上下文信息。我们还加入了残差连接以防止信息流退化,并探索了集成方法,因为所评估的各模型在不同病例和子区域上各有优势。在 BraTS 2021 数据集(3D)上,我们的模型取得了 87.6% 的平均 Dice 分数,超越了当前最先进的方法,表明组合多种架构有助于优化脑肿瘤分割。

DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.00122
  • repo_url: None
  • paper_authors: Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
  • for: solves the audio-visual sound source separation task through a generative manner
  • methods: leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noises, conditioned on both the audio mixture and the visual footage
  • results: outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
    Abstract We propose DAVIS, a Diffusion model-based Audio-VIusal Separation framework that solves the audio-visual sound source separation task through a generative manner. While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noises, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
    摘要 我们提出了DAVIS,一种基于扩散模型的音视频分离框架,用于通过生成方式解决音视频声源分离问题。而现有的探测方法,尽管在这个领域做出了很大的进步,但是它们在处理多种类别的声音分离时存在限制,因为它们只能通过压缩探测来捕捉复杂的数据分布。相比之下,DAVIS利用生成扩散模型和分离U-Net来生成来自高斯噪声的分离后的声音大小, conditioned on both the audio mixture和视频采集。它的生成目标使得DAVIS更适合在多种类别中实现高质量的声音分离。我们将DAVIS与现有的领先的探测音视频分离方法进行比较,并在领域专门的MUSIC dataset和开放的AVE dataset上进行测试,结果显示DAVIS在分离质量方面超过了其他方法,证明了我们的框架在解决音视频源分离问题中的优势。

Convolutional Occupancy Models for Dense Packing of Complex, Novel Objects

  • paper_url: http://arxiv.org/abs/2308.00091
  • repo_url: https://github.com/nikhilmishra000/fcon
  • paper_authors: Nikhil Mishra, Pieter Abbeel, Xi Chen, Maximilian Sieb
  • for: 这个论文主要针对了仓储和物流应用中的紧凑堆叠问题,既然现实中的堆叠性能受到3D物体几何学的观察困难所限制。
  • methods: 该论文提出了一种全数据预训练的深度学习模型F-CON,可以与现有的规划方法结合使用,以实现现实世界中的紧凑堆叠。同时,该论文还发布了一个可用于培训形态完成模型的实验数据集COB-3D-v2。
  • results: 该论文的实验结果表明,F-CON 模型比其他最先进的形状补全方法更高效,能够在真实环境中对复杂的、未见过的物体实现密集堆放。此外,该论文还为一个真实世界的抓取-放置系统配备了 F-CON,以验证其在实际环境中的表现。
    Abstract Dense packing in pick-and-place systems is an important feature in many warehouse and logistics applications. Prior work in this space has largely focused on planning algorithms in simulation, but real-world packing performance is often bottlenecked by the difficulty of perceiving 3D object geometry in highly occluded, partially observed scenes. In this work, we present a fully-convolutional shape completion model, F-CON, which can be easily combined with off-the-shelf planning methods for dense packing in the real world. We also release a simulated dataset, COB-3D-v2, that can be used to train shape completion models for real-word robotics applications, and use it to demonstrate that F-CON outperforms other state-of-the-art shape completion methods. Finally, we equip a real-world pick-and-place system with F-CON, and demonstrate dense packing of complex, unseen objects in cluttered scenes. Across multiple planning methods, F-CON enables substantially better dense packing than other shape completion methods.
    摘要 dense packing在选择和放置系统中是一项重要的特性,在多家仓储和物流应用中广泛使用。先前的工作主要集中在仿真中计划算法上,但实际的填充性常被受到3D物体几何体的识别困难所限制。在这种情况下,我们提出了一种全 convolutional 形态完成模型,F-CON,可以轻松地与现有的规划方法结合使用,以实现实际世界中的高密度填充。我们还发布了一个可用于实际 роботику应用的模拟数据集,COB-3D-v2,并使用其来证明F-CON在实际世界中的表现比其他状态最佳的形态完成方法更好。最后,我们将实际世界中的选择和放置系统 équip with F-CON,并在拥挤的场景中实现了复杂、未见的物体的高密度填充。不同于其他形态完成方法,F-CON在多种规划方法下实现了显著更好的高密度填充。

Visual Geo-localization with Self-supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.00090
  • repo_url: None
  • paper_authors: Jiuhong Xiao, Gao Zhu, Giuseppe Loianno
  • for: 提高Visual Geo-localization(VG)的性能和训练效率,使用Self-Supervised Learning(SSL)方法。
  • methods: integrate多种SSL方法(SimCLR、MoCov2、BYOL、SimSiam、Barlow Twins、VICReg),系统地分析不同训练策略和参数设置的影响。
  • results: 无需困扰的硬解释挖掘(HNM),方法可以与基准方法相比或甚至超越VG性能。
    Abstract Visual Geo-localization (VG) has emerged as a significant research area, aiming to identify geolocation based on visual features. Most VG approaches use learnable feature extractors for representation learning. Recently, Self-Supervised Learning (SSL) methods have also demonstrated comparable performance to supervised methods by using numerous unlabeled images for representation learning. In this work, we present a novel unified VG-SSL framework with the goal to enhance performance and training efficiency on a large VG dataset by SSL methods. Our work incorporates multiple SSL methods tailored for VG: SimCLR, MoCov2, BYOL, SimSiam, Barlow Twins, and VICReg. We systematically analyze the performance of different training strategies and study the optimal parameter settings for the adaptation of SSL methods for the VG task. The results demonstrate that our method, without the significant computation and memory usage associated with Hard Negative Mining (HNM), can match or even surpass the VG performance of the baseline that employs HNM. The code is available at https://github.com/arplaboratory/VG_SSL.
    摘要 视觉地理定位(Visual Geo-localization, VG)已成为一个重要的研究领域,旨在基于视觉特征识别地理位置。大多数 VG 方法使用可学习的特征提取器进行表征学习。最近,自监督学习(SSL)方法通过利用大量无标注图像进行表征学习,也展示了与监督方法相当的性能。在本工作中,我们提出了一个统一的 VG-SSL 框架,目标是借助 SSL 方法在大型 VG 数据集上提升性能和训练效率。我们在 VG 中采用了多种 SSL 方法,包括 SimCLR、MoCov2、BYOL、SimSiam、Barlow Twins 和 VICReg,系统地分析了不同训练策略的性能,并研究了将 SSL 方法适配到 VG 任务时的最优参数设置。结果表明,我们的方法无需困难负样本挖掘(HNM)带来的大量计算和内存开销,即可达到甚至超越采用 HNM 的基准方法的 VG 性能。代码可在 https://github.com/arplaboratory/VG_SSL 获取。

T-Fusion Net: A Novel Deep Neural Network Augmented with Multiple Localizations based Spatial Attention Mechanisms for Covid-19 Detection

  • paper_url: http://arxiv.org/abs/2308.00053
  • repo_url: None
  • paper_authors: Susmita Ghosh, Abhiroop Chatterjee
  • for: 提高图像分类任务的性能
  • methods: 提出了一种新的深度神经网络(名为 T-Fusion Net),该网络在多个本地化的基础上实现了多个尺度的自适应注意力。
  • results: 实验结果表明,提出的 T-Fusion Net 和其 ensemble 模型在 Covid-19 (SARS-CoV-2 CT 扫描)数据集上表现更好,与其他状态对照方法相比,达到了97.59% 和 98.4% 的准确率。
    Abstract In recent years, deep neural networks are yielding better performance in image classification tasks. However, the increasing complexity of datasets and the demand for improved performance necessitate the exploration of innovative techniques. The present work proposes a new deep neural network (called as, T-Fusion Net) that augments multiple localizations based spatial attention. This attention mechanism allows the network to focus on relevant image regions, improving its discriminative power. A homogeneous ensemble of the said network is further used to enhance image classification accuracy. For ensembling, the proposed approach considers multiple instances of individual T-Fusion Net. The model incorporates fuzzy max fusion to merge the outputs of individual nets. The fusion process is optimized through a carefully chosen parameter to strike a balance on the contributions of the individual models. Experimental evaluations on benchmark Covid-19 (SARS-CoV-2 CT scan) dataset demonstrate the effectiveness of the proposed T-Fusion Net as well as its ensemble. The proposed T-Fusion Net and the homogeneous ensemble model exhibit better performance, as compared to other state-of-the-art methods, achieving accuracy of 97.59% and 98.4%, respectively.
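    The abstract states that the ensemble's outputs are merged by fuzzy max fusion controlled by a single balance parameter, but it does not give the exact formula. The sketch below is only one plausible reading: it blends the element-wise maximum of the class-membership scores with their mean via a hypothetical parameter `alpha`.

```python
import numpy as np

def fuzzy_max_fusion(probs, alpha=0.7):
    """Fuse class-membership scores from an ensemble of networks (sketch).

    probs: array of shape (n_models, n_classes), softmax output per model.
    alpha: hypothetical balance parameter between max and mean fusion.
    """
    fused = alpha * probs.max(axis=0) + (1.0 - alpha) * probs.mean(axis=0)
    return fused / fused.sum()  # renormalize to a probability-like vector

# Example: three ensemble members voting on {COVID, non-COVID}
probs = np.array([[0.91, 0.09], [0.72, 0.28], [0.85, 0.15]])
print(fuzzy_max_fusion(probs).argmax())  # predicted class index
```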

Cross-Dataset Adaptation for Instrument Classification in Cataract Surgery Videos

  • paper_url: http://arxiv.org/abs/2308.04035
  • repo_url: https://github.com/jayparanjape/barlow-adaptor
  • paper_authors: Jay N. Paranjape, Shameema Sikder, Vishal M. Patel, S. Swaroop Vedula
  • for: Address the domain shift between cataract surgery datasets so that surgical instrument classification models generalize across datasets.
  • methods: A novel end-to-end Unsupervised Domain Adaptation (UDA) method, the Barlow Adaptor, which handles distribution shift without requiring any labels from the target domain, together with a new Barlow Feature Alignment Loss (BFAL) that aligns features across domains while reducing redundancy and the need for large batch sizes.
  • results: Extensive experiments on two cataract surgery datasets show that the proposed method outperforms state-of-the-art UDA methods by 6%.
    Abstract Surgical tool presence detection is an important part of the intra-operative and post-operative analysis of a surgery. State-of-the-art models, which perform this task well on a particular dataset, however, perform poorly when tested on another dataset. This occurs due to a significant domain shift between the datasets resulting from the use of different tools, sensors, data resolution etc. In this paper, we highlight this domain shift in the commonly performed cataract surgery and propose a novel end-to-end Unsupervised Domain Adaptation (UDA) method called the Barlow Adaptor that addresses the problem of distribution shift without requiring any labels from another domain. In addition, we introduce a novel loss called the Barlow Feature Alignment Loss (BFAL) which aligns features across different domains while reducing redundancy and the need for higher batch sizes, thus improving cross-dataset performance. The use of BFAL is a novel approach to address the challenge of domain shift in cataract surgery data. Extensive experiments are conducted on two cataract surgery datasets and it is shown that the proposed method outperforms the state-of-the-art UDA methods by 6%. The code can be found at https://github.com/JayParanjape/Barlow-Adaptor
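    The Barlow Feature Alignment Loss is described as aligning features across domains while reducing redundancy and the need for large batch sizes. Assuming it follows the Barlow Twins-style cross-correlation objective its name suggests, a sketch could look like the following; the function name and the `lambda_offdiag` weight are assumptions, not the released implementation.

```python
import torch

def barlow_feature_alignment_loss(f_src, f_tgt, lambda_offdiag=5e-3):
    """Barlow Twins-style alignment of features from two domains (sketch).

    f_src, f_tgt: (batch, dim) feature batches from source and target domains.
    Drives the cross-correlation matrix towards the identity so matched
    dimensions agree (alignment) and unmatched ones decorrelate
    (redundancy reduction).
    """
    b = f_src.shape[0]
    f_src = (f_src - f_src.mean(0)) / (f_src.std(0) + 1e-6)
    f_tgt = (f_tgt - f_tgt.mean(0)) / (f_tgt.std(0) + 1e-6)
    c = (f_src.T @ f_tgt) / b                       # (dim, dim) cross-correlation
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()               # pull matched dims towards 1
    off_diag = (c - torch.diag(diag)).pow(2).sum()  # push the rest towards 0
    return on_diag + lambda_offdiag * off_diag
```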

Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training

  • paper_url: http://arxiv.org/abs/2307.16896
  • repo_url: None
  • paper_authors: Jeya Maria Jose Valanarasu, Yucheng Tang, Dong Yang, Ziyue Xu, Can Zhao, Wenqi Li, Vishal M. Patel, Bennett Landman, Daguang Xu, Yufan He, Vishwesh Nath
  • for: Design an effective pre-training framework for 3D radiology images that learns features representative of local context.
  • methods: A new local masking strategy that masks across channel embeddings instead of tokens to improve local feature representation learning; a pre-training framework, Disruptive Autoencoders, that reconstructs the original image from disruptions created by combining local masking with low-level perturbations (noise, downsampling); and a cross-modal contrastive loss (CMCL) for pre-training multiple modalities (MRI and CT) in a single framework.
  • results: The pre-trained models achieve state-of-the-art performance across multiple downstream tasks; notably, the method tops the public test leaderboard of the BTCV multi-organ segmentation challenge.
    Abstract Harnessing the power of pre-training on large-scale datasets like ImageNet forms a fundamental building block for the progress of representation learning-driven solutions in computer vision. Medical images are inherently different from natural images as they are acquired in the form of many modalities (CT, MR, PET, Ultrasound etc.) and contain granulated information like tissue, lesion, organs etc. These characteristics of medical images require special attention towards learning features representative of local context. In this work, we focus on designing an effective pre-training framework for 3D radiology images. First, we propose a new masking strategy called local masking where the masking is performed across channel embeddings instead of tokens to improve the learning of local feature representations. We combine this with classical low-level perturbations like adding noise and downsampling to further enable low-level representation learning. To this end, we introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations. Additionally, we also devise a cross-modal contrastive loss (CMCL) to accommodate the pre-training of multiple modalities in a single framework. We curate a large-scale dataset to enable pre-training of 3D medical radiology images (MRI and CT). The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance. Notably, our proposed method tops the public test leaderboard of BTCV multi-organ segmentation challenge.
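    A rough sketch of the "disruption" step, assuming local masking zeroes a random subset of embedding channels per token while the raw volume receives Gaussian noise and a downsample-upsample pass; the ratios and the exact ordering of operations are assumptions, not the authors' pipeline.

```python
import torch
import torch.nn.functional as F

def disrupt(tokens, volume, mask_ratio=0.5, noise_std=0.1, down=2):
    """Create the disrupted input the autoencoder must undo (sketch).

    tokens: (batch, n_tokens, dim) patch embeddings of a 3D volume.
    volume: (batch, 1, D, H, W) raw image to be perturbed at a low level.
    """
    keep = torch.rand_like(tokens) > mask_ratio
    masked_tokens = tokens * keep                    # local (channel-wise) masking

    noisy = volume + noise_std * torch.randn_like(volume)
    low_res = F.interpolate(noisy, scale_factor=1.0 / down,
                            mode="trilinear", align_corners=False)
    degraded = F.interpolate(low_res, size=volume.shape[2:],
                             mode="trilinear", align_corners=False)
    return masked_tokens, degraded                   # targets remain the clean inputs
```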

Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy

  • paper_url: http://arxiv.org/abs/2307.16867
  • repo_url: https://github.com/jieshibo/petl-vit
  • paper_authors: Shibo Jie, Haoqing Wang, Zhi-Hong Deng
  • for: Make adapters for Parameter-Efficient Tuning even smaller, reducing the storage and transmission overhead of keeping a task-specific fine-tuned network per task.
  • methods: Builds on Adapter-based Parameter-Efficient Tuning (PET), which inserts lightweight adapters into frozen pre-trained vision models; observing that adapters converge to flat local minima and are therefore robust to noise in parameter space, the authors propose a computation-efficient quantization method that minimizes quantization error so adapters can be stored at very low numerical precision.
  • results: Extensive experiments show that low-precision adapters suffer minimal performance degradation; 1-bit adapters outperform all other PET methods on the VTAB-1K benchmark and few-shot FGVC tasks while requiring the smallest storage size.
    Abstract Current state-of-the-art results in computer vision depend in part on fine-tuning large pre-trained vision models. However, with the exponential growth of model sizes, the conventional full fine-tuning, which needs to store a individual network copy for each tasks, leads to increasingly huge storage and transmission overhead. Adapter-based Parameter-Efficient Tuning (PET) methods address this challenge by tuning lightweight adapters inserted into the frozen pre-trained models. In this paper, we investigate how to make adapters even more efficient, reaching a new minimum size required to store a task-specific fine-tuned network. Inspired by the observation that the parameters of adapters converge at flat local minima, we find that adapters are resistant to noise in parameter space, which means they are also resistant to low numerical precision. To train low-precision adapters, we propose a computational-efficient quantization method which minimizes the quantization error. Through extensive experiments, we find that low-precision adapters exhibit minimal performance degradation, and even 1-bit precision is sufficient for adapters. The experimental results demonstrate that 1-bit adapters outperform all other PET methods on both the VTAB-1K benchmark and few-shot FGVC tasks, while requiring the smallest storage size. Our findings show, for the first time, the significant potential of quantization techniques in PET, providing a general solution to enhance the parameter efficiency of adapter-based PET methods. Code: https://github.com/JieShibo/PETL-ViT
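    A minimal sketch of what 1-bit adapter storage can mean in practice: keep only the sign of each adapter weight plus a single floating-point scale per tensor (mean absolute value is a common choice in binarization). The paper's exact quantizer, which explicitly minimizes quantization error, may differ.

```python
import torch

def binarize_adapter(weight):
    """Quantize an adapter weight matrix to 1 bit per weight (sketch)."""
    scale = weight.abs().mean()          # one float per tensor
    signs = torch.sign(weight)           # +-1, storable as 1 bit per weight
    return signs, scale

def dequantize(signs, scale):
    return signs * scale

w = torch.randn(768, 64)                 # e.g. an adapter down-projection
signs, scale = binarize_adapter(w)
w_hat = dequantize(signs, scale)
print((w - w_hat).abs().mean())          # reconstruction error of 1-bit storage
```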

Universal Adversarial Defense in Remote Sensing Based on Pre-trained Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.16865
  • repo_url: https://github.com/EricYu97/UAD-RS
  • paper_authors: Weikang Yu, Yonghao Xu, Pedram Ghamisi
  • for: Propose a universal adversarial defense for remote sensing (RS) imagery (UAD-RS) that protects common DNNs against multiple unknown adversarial attacks.
  • methods: Generative diffusion models are first pre-trained on different RS datasets to learn generalized representations; a universal adversarial purification framework then uses their forward and reverse processes to purify perturbations, and an adaptive noise level selection (ANLS) mechanism picks the noise level whose purified output is closest to clean samples in terms of Frechet Inception Distance (FID).
  • results: On four heterogeneous RS datasets covering scene classification and semantic segmentation, UAD-RS outperforms state-of-the-art adversarial purification approaches against seven common adversarial perturbations, while needing only a single pre-trained diffusion model per dataset and thus far less re-training.
    Abstract Deep neural networks (DNNs) have achieved tremendous success in many remote sensing (RS) applications, in which DNNs are vulnerable to adversarial perturbations. Unfortunately, current adversarial defense approaches in RS studies usually suffer from performance fluctuation and unnecessary re-training costs due to the need for prior knowledge of the adversarial perturbations among RS data. To circumvent these challenges, we propose a universal adversarial defense approach in RS imagery (UAD-RS) using pre-trained diffusion models to defend the common DNNs against multiple unknown adversarial attacks. Specifically, the generative diffusion models are first pre-trained on different RS datasets to learn generalized representations in various data domains. After that, a universal adversarial purification framework is developed using the forward and reverse process of the pre-trained diffusion models to purify the perturbations from adversarial samples. Furthermore, an adaptive noise level selection (ANLS) mechanism is built to capture the optimal noise level of the diffusion model that can achieve the best purification results closest to the clean samples according to their Frechet Inception Distance (FID) in deep feature space. As a result, only a single pre-trained diffusion model is needed for the universal purification of adversarial samples on each dataset, which significantly alleviates the re-training efforts and maintains high performance without prior knowledge of the adversarial perturbations. Experiments on four heterogeneous RS datasets regarding scene classification and semantic segmentation verify that UAD-RS outperforms state-of-the-art adversarial purification approaches with a universal defense against seven commonly existing adversarial perturbations. Codes and the pre-trained models are available online (https://github.com/EricYu97/UAD-RS).
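    The adaptive noise level selection (ANLS) step can be pictured as a simple search over candidate diffusion noise levels, keeping the purified output whose FID to clean data is lowest. In the sketch below, `purify_with_diffusion` and `frechet_inception_distance` are hypothetical placeholders for the pre-trained diffusion purifier and an FID routine, and the candidate levels are arbitrary.

```python
def adaptive_noise_level_selection(adv_batch, clean_reference,
                                   purify_with_diffusion,
                                   frechet_inception_distance,
                                   candidate_levels=(50, 100, 150, 200)):
    """Pick the diffusion noise level whose purified output is closest to
    clean data in deep feature space (FID), as the ANLS mechanism describes."""
    best_level, best_fid, best_out = None, float("inf"), None
    for t in candidate_levels:
        purified = purify_with_diffusion(adv_batch, noise_level=t)  # forward + reverse pass
        fid = frechet_inception_distance(purified, clean_reference)
        if fid < best_fid:
            best_level, best_fid, best_out = t, fid, purified
    return best_out, best_level
```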

MetaCAM: Ensemble-Based Class Activation Map

  • paper_url: http://arxiv.org/abs/2307.16863
  • repo_url: None
  • paper_authors: Emily Kaczmarek, Olivier X. Miguel, Alexa C. Bowie, Robin Ducharme, Alysha L. J. Dingwall-Harvey, Steven Hawken, Christine M. Armour, Mark C. Walker, Kevin Dick
  • for: Provide an ensemble-based method that combines multiple existing Class Activation Map (CAM) methods into more trustworthy visual explanations of deep learning model predictions.
  • methods: MetaCAM, which forms a consensus over the top-k% most highly activated pixels across component CAMs; Cumulative Residual Effect (CRE) for summarizing large-scale ensemble experiments; and adaptive thresholding applied to individual CAMs.
  • results: MetaCAM outperforms individual CAMs and refines the most salient regions used for model predictions; in one example it reached a ROAD score of 0.393, compared with a range of -0.101 to 0.172 for 11 individual CAMs, demonstrating the value of ensembling CAMs together with adaptive thresholding.
    Abstract The need for clear, trustworthy explanations of deep learning model predictions is essential for high-criticality fields, such as medicine and biometric identification. Class Activation Maps (CAMs) are an increasingly popular category of visual explanation methods for Convolutional Neural Networks (CNNs). However, the performance of individual CAMs depends largely on experimental parameters such as the selected image, target class, and model. Here, we propose MetaCAM, an ensemble-based method for combining multiple existing CAM methods based on the consensus of the top-k% most highly activated pixels across component CAMs. We perform experiments to quantifiably determine the optimal combination of 11 CAMs for a given MetaCAM experiment. A new method denoted Cumulative Residual Effect (CRE) is proposed to summarize large-scale ensemble-based experiments. We also present adaptive thresholding and demonstrate how it can be applied to individual CAMs to improve their performance, measured using pixel perturbation method Remove and Debias (ROAD). Lastly, we show that MetaCAM outperforms existing CAMs and refines the most salient regions of images used for model predictions. In a specific example, MetaCAM improved ROAD performance to 0.393 compared to 11 individual CAMs with ranges from -0.101-0.172, demonstrating the importance of combining CAMs through an ensembling method and adaptive thresholding.
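    The core consensus idea, agreeing on the top-k% most highly activated pixels across component CAMs, can be sketched directly; the value of k and the assumption of min-max-normalized inputs are placeholders, since the paper determines the optimal CAM combination and settings experimentally.

```python
import numpy as np

def metacam_consensus(cams, top_k_percent=20):
    """Consensus map over the top-k% most activated pixels of each CAM (sketch).

    cams: array of shape (n_cams, H, W); each component CAM assumed in [0, 1].
    Returns the per-pixel fraction of component CAMs that rank the pixel in
    their top-k% activations.
    """
    n = cams.shape[0]
    hits = np.zeros(cams.shape[1:], dtype=np.float32)
    for cam in cams:
        thr = np.percentile(cam, 100 - top_k_percent)
        hits += (cam >= thr)
    return hits / n

# Example with three toy 4x4 CAMs:
cams = np.random.rand(3, 4, 4)
print(metacam_consensus(cams))
```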

Automated COVID-19 CT Image Classification using Multi-head Channel Attention in Deep CNN

  • paper_url: http://arxiv.org/abs/2308.00715
  • repo_url: None
  • paper_authors: Susmita Ghosh, Abhiroop Chatterjee
  • for: Propose a deep learning approach for automated COVID-19 CT scan classification with improved detection accuracy.
  • methods: A modified Xception model that incorporates a newly designed multi-head channel attention mechanism and weighted global average pooling to enhance feature extraction; the channel attention module selectively focuses on informative regions within each channel, letting the model learn discriminative features for COVID-19 detection.
  • results: Experiments on a widely used COVID-19 CT scan dataset reach 96.99% accuracy, outperforming other state-of-the-art techniques, and the approach offers a timely building block for AI-assisted medical image analysis during current and future pandemics.
    Abstract The rapid spread of COVID-19 has necessitated efficient and accurate diagnostic methods. Computed Tomography (CT) scan images have emerged as a valuable tool for detecting the disease. In this article, we present a novel deep learning approach for automated COVID-19 CT scan classification where a modified Xception model is proposed which incorporates a newly designed channel attention mechanism and weighted global average pooling to enhance feature extraction thereby improving classification accuracy. The channel attention module selectively focuses on informative regions within each channel, enabling the model to learn discriminative features for COVID-19 detection. Experiments on a widely used COVID-19 CT scan dataset demonstrate a very good accuracy of 96.99% and show its superiority to other state-of-the-art techniques. This research can contribute to the ongoing efforts in using artificial intelligence to combat current and future pandemics and can offer promising and timely solutions for efficient medical image analysis tasks.
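    As a rough picture of channel attention in this setting, the sketch below shows a single-head, SE-style gate that recalibrates feature channels after global average pooling; the paper's multi-head channel attention and weighted global average pooling inside the modified Xception may be organized differently.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention gate (illustrative sketch only)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pool
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # excite: per-channel reweighting

# att = ChannelAttention(728)                  # e.g. an Xception middle-flow width
# y = att(torch.randn(2, 728, 19, 19))
```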

Random Sub-Samples Generation for Self-Supervised Real Image Denoising

  • paper_url: http://arxiv.org/abs/2307.16825
  • repo_url: https://github.com/p1y2z3/sdap
  • paper_authors: Yizhong Pan, Xiao Liu, Xiangyu Liao, Yuanzhouhan Cao, Chao Ren
  • for: Improve the performance of self-supervised image denoising, in particular the blind spot network (BSN), through a novel framework called Sampling Difference As Perturbation (SDAP) that uses Random Sub-samples Generation (RSG) with a cyclic sample difference loss.
  • methods: A new self-supervised real image denoising framework, SDAP, based on RSG with a cyclic sample difference loss; the sampling difference acts as an appropriate perturbation added to the training images, which improves the performance of BSN.
  • results: The proposed SDAP framework significantly outperforms other state-of-the-art self-supervised denoising methods on real-world datasets.
    Abstract With sufficient paired training samples, the supervised deep learning methods have attracted much attention in image denoising because of their superior performance. However, it is still very challenging to widely utilize the supervised methods in real cases due to the lack of paired noisy-clean images. Meanwhile, most self-supervised denoising methods are ineffective as well when applied to the real-world denoising tasks because of their strict assumptions in applications. For example, as a typical method for self-supervised denoising, the original blind spot network (BSN) assumes that the noise is pixel-wise independent, which is much different from the real cases. To solve this problem, we propose a novel self-supervised real image denoising framework named Sampling Difference As Perturbation (SDAP) based on Random Sub-samples Generation (RSG) with a cyclic sample difference loss. Specifically, we dig deeper into the properties of BSN to make it more suitable for real noise. Surprisingly, we find that adding an appropriate perturbation to the training images can effectively improve the performance of BSN. Further, we propose that the sampling difference can be considered as perturbation to achieve better results. Finally we propose a new BSN framework in combination with our RSG strategy. The results show that it significantly outperforms other state-of-the-art self-supervised denoising methods on real-world datasets. The code is available at https://github.com/p1y2z3/SDAP.
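    One common way to realize random sub-samples generation is to split the noisy image into small cells and pick one random pixel per cell for each sub-sample. The sketch below follows that pattern; it is only an assumption about RSG's details and does not include the cyclic sample difference loss.

```python
import torch

def random_sub_samples(noisy, stride=2):
    """Generate a random sub-sample pair from one noisy image (sketch).

    noisy: (B, C, H, W) with H and W divisible by stride. Each sub-sample
    keeps one randomly chosen pixel per stride x stride cell.
    """
    b, c, h, w = noisy.shape
    cells = noisy.unfold(2, stride, stride).unfold(3, stride, stride)
    cells = cells.reshape(b, c, h // stride, w // stride, stride * stride)
    shape = (b, 1, h // stride, w // stride, 1)
    idx1 = torch.randint(stride * stride, shape, device=noisy.device)
    idx2 = torch.randint(stride * stride, shape, device=noisy.device)
    sub1 = torch.gather(cells, -1, idx1.expand(b, c, -1, -1, 1)).squeeze(-1)
    sub2 = torch.gather(cells, -1, idx2.expand(b, c, -1, -1, 1)).squeeze(-1)
    return sub1, sub2
```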

Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.16813
  • repo_url: None
  • paper_authors: Kun Yuan, Zishang Kong, Chuanchuan Zheng, Ming Sun, Xing Wen
  • for: No-reference Video Quality Assessment (VQA), i.e., predicting the perceptual quality of user-generated videos, which has drawn growing attention with the rapid development of streaming platforms such as Facebook, TikTok, and Kwai.
  • methods: A Visual Quality Transformer (VQT) that extracts quality-related sparse features more efficiently: a Sparse Temporal Attention (STA) samples keyframes by analyzing the temporal correlation between frames, reducing computational complexity from $O(T^2)$ to $O(T \log T)$, and a Multi-Pathway Temporal Network (MPTN) runs multiple STA modules with different degrees of sparsity in parallel to capture co-existing distortions in a video.
  • results: VQT outperforms many state-of-the-art methods on three public no-reference VQA datasets and beats widely adopted industrial algorithms (VMAF and AVQT) on four full-reference VQA datasets.
    Abstract Video Quality Assessment (VQA), which aims to predict the perceptual quality of a video, has attracted raising attention with the rapid development of streaming media technology, such as Facebook, TikTok, Kwai, and so on. Compared with other sequence-based visual tasks (\textit{e.g.,} action recognition), VQA faces two under-estimated challenges unresolved in User Generated Content (UGC) videos. \textit{First}, it is not rare that several frames containing serious distortions (\textit{e.g.,}blocking, blurriness), can determine the perceptual quality of the whole video, while other sequence-based tasks require more frames of equal importance for representations. \textit{Second}, the perceptual quality of a video exhibits a multi-distortion distribution, due to the differences in the duration and probability of occurrence for various distortions. In order to solve the above challenges, we propose \textit{Visual Quality Transformer (VQT)} to extract quality-related sparse features more efficiently. Methodologically, a Sparse Temporal Attention (STA) is proposed to sample keyframes by analyzing the temporal correlation between frames, which reduces the computational complexity from $O(T^2)$ to $O(T \log T)$. Structurally, a Multi-Pathway Temporal Network (MPTN) utilizes multiple STA modules with different degrees of sparsity in parallel, capturing co-existing distortions in a video. Experimentally, VQT demonstrates superior performance than many \textit{state-of-the-art} methods in three public no-reference VQA datasets. Furthermore, VQT shows better performance in four full-reference VQA datasets against widely-adopted industrial algorithms (\textit{i.e.,} VMAF and AVQT).
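    The abstract attributes the $O(T \log T)$ cost to sampling keyframes from temporal correlation; the exact sampling rule is not given. The sketch below only illustrates how a per-frame neighbor set of logarithmic size (here, power-of-two offsets) yields the stated complexity.

```python
def log_spaced_neighbors(t, num_frames):
    """Frames that query frame t attends to under a sparse pattern (sketch).

    Uses power-of-two offsets, giving each frame O(log T) neighbors and hence
    O(T log T) attention pairs overall; STA's correlation-based keyframe
    selection may pick a different set.
    """
    neighbors, offset = {t}, 1
    while offset < num_frames:
        for s in (t - offset, t + offset):
            if 0 <= s < num_frames:
                neighbors.add(s)
        offset *= 2
    return sorted(neighbors)

# For a 150-frame UGC clip, each frame attends to roughly 2*log2(150) ~ 16
# frames instead of all 150, so total pairs grow as T log T rather than T^2.
print(log_spaced_neighbors(75, 150))
```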

A comprehensive review of deep learning in lung cancer

  • paper_url: http://arxiv.org/abs/2308.02528
  • repo_url: None
  • paper_authors: Farzane Tajidini
  • for: Provide a historical perspective on cancer classification approaches, covering the cancer diagnosis process and the standard classification methods employed by clinicians.
  • methods: A review of deep learning techniques for lung cancer, set against the current clinical diagnostic workflow.
  • results: Current methods for cancer diagnosis are deemed ineffective, calling for new and more intelligent approaches.
    Abstract To provide the reader with a historical perspective on cancer classification approaches, we first discuss the fundamentals of the area of cancer diagnosis in this article, including the processes of cancer diagnosis and the standard classification methods employed by clinicians. Current methods for cancer diagnosis are deemed ineffective, calling for new and more intelligent approaches.

DPMix: Mixture of Depth and Point Cloud Video Experts for 4D Action Segmentation

  • paper_url: http://arxiv.org/abs/2307.16803
  • repo_url: None
  • paper_authors: Yue Zhang, Hehe Fan, Yi Yang, Mohan Kankanhalli
  • for: Tackle the egocentric action segmentation task on the Human-Object Interaction 4D (HOI4D) dataset.
  • methods: Convert point cloud videos into depth videos and ensemble point cloud video methods with traditional video understanding methods, which are better developed for temporal modeling, to improve 4D action segmentation.
  • results: The proposed method, DPMix (Mixture of Depth and Point cloud video experts), achieved first place in the 4D Action Segmentation Track of the HOI4D Challenge 2023.
    Abstract In this technical report, we present our findings from the research conducted on the Human-Object Interaction 4D (HOI4D) dataset for egocentric action segmentation task. As a relatively novel research area, point cloud video methods might not be good at temporal modeling, especially for long point cloud videos (\eg, 150 frames). In contrast, traditional video understanding methods have been well developed. Their effectiveness on temporal modeling has been widely verified on many large scale video datasets. Therefore, we convert point cloud videos into depth videos and employ traditional video modeling methods to improve 4D action segmentation. By ensembling depth and point cloud video methods, the accuracy is significantly improved. The proposed method, named Mixture of Depth and Point cloud video experts (DPMix), achieved the first place in the 4D Action Segmentation Track of the HOI4D Challenge 2023.
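    Converting a point cloud video into a depth video amounts to rendering each frame's points through a pinhole camera and keeping the nearest depth per pixel. The sketch below shows that projection with placeholder intrinsics; the actual conversion for HOI4D presumably uses the dataset's own camera parameters.

```python
import numpy as np

def points_to_depth(points, fx, fy, cx, cy, height, width):
    """Render one point cloud frame as a depth image (sketch).

    points: (N, 3) camera-frame coordinates (x, y, z); only z > 0 is kept.
    Keeps the closest depth when several points land on the same pixel.
    """
    depth = np.full((height, width), np.inf, dtype=np.float32)
    z_ok = points[:, 2] > 1e-6
    x, y, z = points[z_ok, 0], points[z_ok, 1], points[z_ok, 2]
    u = np.round(fx * x / z + cx).astype(int)
    v = np.round(fy * y / z + cy).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)   # z-buffer: keep nearest point
    depth[np.isinf(depth)] = 0.0                 # empty pixels -> 0
    return depth
```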

Framing image registration as a landmark detection problem for better representation of clinical relevance

  • paper_url: http://arxiv.org/abs/2308.01318
  • repo_url: None
  • paper_authors: Diana Waldmannstetter, Benedikt Wiestler, Julian Schwarting, Ivan Ezhov, Marie Metz, Spyridon Bakas, Bhakti Baheti, Satrajit Chakrabarty, Jan S. Kirschke, Rolf A. Heckemann, Marie Piraud, Florian Kofler, Bjoern H. Menze
  • for: Reframe image registration evaluation as a landmark detection problem so that it better represents clinical relevance.
  • methods: Derive landmark-specific detection thresholds from a sub-sample inter-rater analysis by computing hit rate curves over the error distribution, with the threshold set to median + delta * median absolute deviation.
  • results: The approach differentiates previously indistinguishable registration algorithms and enables assessing the clinical significance of improvements during algorithm development.
    Abstract Nowadays, registration methods are typically evaluated based on sub-resolution tracking error differences. In an effort to reinfuse this evaluation process with clinical relevance, we propose to reframe image registration as a landmark detection problem. Ideally, landmark-specific detection thresholds are derived from an inter-rater analysis. To approximate this costly process, we propose to compute hit rate curves based on the distribution of errors of a sub-sample inter-rater analysis. Therefore, we suggest deriving thresholds from the error distribution using the formula: median + delta * median absolute deviation. The method promises differentiation of previously indistinguishable registration algorithms and further enables assessing the clinical significance in algorithm development.
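    The proposed threshold and the resulting hit rate are simple to compute; a sketch with a placeholder `delta` follows.

```python
import numpy as np

def detection_threshold(inter_rater_errors, delta=1.5):
    """Landmark detection threshold from a sub-sample inter-rater analysis.

    Uses the paper's formula: median + delta * median absolute deviation.
    The delta value here is only a placeholder.
    """
    errors = np.asarray(inter_rater_errors)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return med + delta * mad

def hit_rate(registration_errors, threshold):
    """Fraction of landmarks whose registration error falls below the threshold."""
    return float((np.asarray(registration_errors) <= threshold).mean())

# Example: inter-rater errors (mm) define the threshold; two algorithms that
# look similar under mean error can then be separated by their hit rates.
raters = np.array([0.8, 1.1, 0.9, 1.4, 1.0, 2.2, 0.7])
thr = detection_threshold(raters)
print(thr, hit_rate([0.9, 1.3, 2.8, 1.1], thr))
```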