cs.CV - 2023-08-06

Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS Image Reconstruction

  • paper_url: http://arxiv.org/abs/2308.03807
  • repo_url: https://github.com/fanxiaohong/Nest-DGIL
  • paper_authors: Xiaohong Fan, Yin Yang, Ke Chen, Yujie Feng, Jianping Zhang
  • for: This paper proposes a deep geometric incremental learning framework for image reconstruction that effectively alleviates artifacts and guarantees the reconstruction of geometric texture details.
  • methods: The method uses a cascade geometric incremental learning module to compensate for missing texture information from different geometric spectral decomposition domains, together with second Nesterov proximal gradient optimization to improve convergence speed. A post-processing module inspired by the overlap-tile strategy removes block effects in patch-wise natural image reconstruction. All parameters in the model are learnable, and an adaptive initialization technique for the physical parameters keeps the model flexible and ensures smooth convergence.
  • results: Compared with existing state-of-the-art methods, the proposed method demonstrates superior reconstruction performance and avoids the risk of intermediate reconstruction results falling outside the geometric decomposition domains.
    Abstract Proximal gradient-based optimization is one of the most common strategies for solving image inverse problems and is easy to implement. However, these techniques often generate heavy artifacts in image reconstruction. One of the most popular refinement methods is to fine-tune the regularization parameter to alleviate such artifacts, but it may not always be sufficient or applicable due to increased computational costs. In this work, we propose a deep geometric incremental learning framework based on second Nesterov proximal gradient optimization. The proposed end-to-end network not only has powerful learning ability for high/low frequency image features, but also can theoretically guarantee that geometric texture details will be reconstructed from preliminary linear reconstruction. Furthermore, it can avoid the risk of intermediate reconstruction results falling outside the geometric decomposition domains and achieve fast convergence. Our reconstruction framework is decomposed into four modules: general linear reconstruction, cascade geometric incremental restoration, Nesterov acceleration, and post-processing. In the image restoration step, a cascade geometric incremental learning module is designed to compensate for the missing texture information from different geometric spectral decomposition domains. Inspired by the overlap-tile strategy, we also develop a post-processing module to remove the block effect in patch-wise natural image reconstruction. All parameters in the proposed model are learnable, and an adaptive initialization technique for the physical parameters is employed to make the model flexible and ensure smooth convergence. We compare the reconstruction performance of the proposed method with existing state-of-the-art methods to demonstrate its superiority. Our source codes are available at https://github.com/fanxiaohong/Nest-DGIL.
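The unrolled network is built around Nesterov-accelerated proximal gradient steps. As a rough, minimal sketch of that underlying optimization (not the learned network itself), the following NumPy snippet runs a FISTA-style Nesterov proximal gradient loop on a toy compressed-sensing problem; the soft-thresholding proximal operator, the random sensing matrix, and the step/sparsity parameters are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (a simple stand-in for the learned geometric prior)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def nesterov_proximal_gradient(y, A, lam=0.01, step=None, iters=100):
    """FISTA-style Nesterov-accelerated proximal gradient for min_x 0.5||Ax - y||^2 + lam*||x||_1."""
    m, n = A.shape
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the data-fidelity gradient
    x = np.zeros(n)
    z = x.copy()          # extrapolated (momentum) point
    t = 1.0
    for _ in range(iters):
        grad = A.T @ (A @ z - y)                          # gradient at the momentum point
        x_new = soft_threshold(z - step * grad, step * lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)     # Nesterov extrapolation
        x, t = x_new, t_new
    return x

# Tiny synthetic compressed-sensing example (~30% sampling of a sparse signal).
rng = np.random.default_rng(0)
n, m = 256, 77
x_true = np.zeros(n)
x_true[rng.choice(n, 10, replace=False)] = rng.normal(size=10)
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x_true
x_hat = nesterov_proximal_gradient(y, A, lam=0.01, iters=300)
print("relative reconstruction error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```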

PNN: From proximal algorithms to robust unfolded image denoising networks and Plug-and-Play methods

  • paper_url: http://arxiv.org/abs/2308.03139
  • repo_url: None
  • paper_authors: Hoang Trieu Vy Le, Audrey Repetti, Nelly Pustelnik
  • for: This paper proposes a neural network architecture grounded in proximal optimization theory for solving image restoration problems, in particular Gaussian denoising.
  • methods: The work combines iterative proximal algorithms with deep learning strategies to improve estimate quality. Specifically, it builds proximal neural networks (PNNs) by unrolling proximal algorithms, so the architecture can be adapted to any image restoration task that a proximal algorithm can solve.
  • results: The authors derive PNN architectures from both the dual-FB and the primal-dual Chambolle-Pock algorithms and show that they perform well on image restoration. They further propose several learning strategies and study their robustness (Lipschitz property) and denoising efficiency, and finally demonstrate the robustness of the PNNs when plugged into a forward-backward algorithm for an image deblurring problem.
    Abstract A common approach to solve inverse imaging problems relies on finding a maximum a posteriori (MAP) estimate of the original unknown image, by solving a minimization problem. In this context, iterative proximal algorithms are widely used, enabling the handling of non-smooth functions and linear operators. Recently, these algorithms have been paired with deep learning strategies, to further improve the estimate quality. In particular, proximal neural networks (PNNs) have been introduced, obtained by unrolling a proximal algorithm, as if finding a MAP estimate, but over a fixed number of iterations, with learned linear operators and parameters. As PNNs are based on optimization theory, they are very flexible, and can be adapted to any image restoration task, provided a proximal algorithm can solve it. They further have much lighter architectures than traditional networks. In this article we propose a unified framework to build PNNs for the Gaussian denoising task, based on both the dual-FB and the primal-dual Chambolle-Pock algorithms. We further show that accelerated inertial versions of these algorithms enable skip connections in the associated NN layers. We propose different learning strategies for our PNN framework, and investigate their robustness (Lipschitz property) and denoising efficiency. Finally, we assess the robustness of our PNNs when plugged into a forward-backward algorithm for an image deblurring problem.
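To make the unrolling idea concrete, here is a minimal PyTorch sketch of a PNN-style Gaussian denoiser built from a fixed number of dual forward-backward iterations with a learned analysis convolution; the layer count, channel width, and the simple l-infinity projection used as the proximal step are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualFBLayer(nn.Module):
    """One unrolled dual forward-backward iteration with a learned analysis operator L (a conv)."""
    def __init__(self, channels=32):
        super().__init__()
        self.L = nn.Conv2d(1, channels, 3, padding=1, bias=False)   # learned linear operator
        self.tau = nn.Parameter(torch.tensor(0.5))                   # learned step size
        self.lam = nn.Parameter(torch.tensor(0.1))                   # learned regularization level

    def Lt(self, u):
        # Adjoint of L, implemented as a transposed convolution sharing the same weights.
        return F.conv_transpose2d(u, self.L.weight, padding=1)

    def forward(self, u, y):
        # Dual gradient step followed by projection onto the l_inf ball of radius |lam|.
        u = u + self.tau * self.L(y - self.Lt(u))
        lam = self.lam.abs()
        return torch.max(torch.min(u, lam), -lam)

class PNNDenoiser(nn.Module):
    """A tiny proximal neural network: a fixed number of unrolled dual-FB iterations."""
    def __init__(self, n_layers=6, channels=32):
        super().__init__()
        self.layers = nn.ModuleList([DualFBLayer(channels) for _ in range(n_layers)])
        self.channels = channels

    def forward(self, y):
        u = torch.zeros(y.size(0), self.channels, y.size(2), y.size(3), device=y.device)
        for layer in self.layers:
            u = layer(u, y)
        return y - self.layers[-1].Lt(u)   # primal estimate recovered from the dual variable

noisy = torch.rand(1, 1, 64, 64)
print(PNNDenoiser()(noisy).shape)   # torch.Size([1, 1, 64, 64])
```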

E-CLIP: Towards Label-efficient Event-based Open-world Understanding by CLIP

  • paper_url: http://arxiv.org/abs/2308.03135
  • repo_url: None
  • paper_authors: Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, Lin Wang
  • for: Improving the performance of event-based recognition tasks.
  • methods: Proposes a new framework, E-CLIP, that unleashes the potential of CLIP on event data, compensating for the lack of large-scale event datasets and bridging the modality gap.
  • results: Achieves gains of +3.94% and +4.62% on the N-Caltech dataset in the fine-tuning and few-shot settings, respectively, outperforming existing methods.
    Abstract Contrastive Language-Image Pre-training (CLIP) has recently shown promising open-world and few-shot performance on 2D image-based recognition tasks. However, the transferred capability of CLIP to the novel event camera data still remains under-explored. In particular, due to the modality gap with the image-text data and the lack of large-scale datasets, achieving this goal is non-trivial and thus requires significant research innovation. In this paper, we propose E-CLIP, a novel and effective framework that unleashes the potential of CLIP for event-based recognition to compensate for the lack of large-scale event-based datasets. Our work addresses two crucial challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. To this end, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile generates event prompts to promote the modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance the E-CLIP's generalization ability across diverse datasets. With the proposed event encoder, text encoder, and original image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We conduct extensive experiments on two recognition benchmarks, and the results demonstrate that our E-CLIP outperforms existing methods by a large margin of +3.94% and +4.62% on the N-Caltech dataset in the fine-tuning and few-shot settings, respectively. Moreover, our E-CLIP can be flexibly extended to the event retrieval task using either text or image queries, showing plausible performance.
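The HTCA module jointly aligns image, text, and event embeddings. A minimal sketch of such a triple contrastive alignment, implemented as symmetric InfoNCE losses over the three modality pairs, is shown below; the temperature and embedding sizes are placeholders, and the actual module is hierarchical rather than this flat pairwise form.

```python
import torch
import torch.nn.functional as F

def pairwise_infonce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings matched by index."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def triple_contrastive_alignment(img_emb, txt_emb, evt_emb, temperature=0.07):
    """Jointly aligns the three modalities with pairwise contrastive terms."""
    return (pairwise_infonce(img_emb, txt_emb, temperature)
            + pairwise_infonce(img_emb, evt_emb, temperature)
            + pairwise_infonce(txt_emb, evt_emb, temperature))

# Toy batch: 8 matched (image, text, event) embeddings of dimension 512.
img, txt, evt = (torch.randn(8, 512) for _ in range(3))
print(triple_contrastive_alignment(img, txt, evt).item())
```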

NNVISR: Bring Neural Network Video Interpolation and Super Resolution into Video Processing Framework

  • paper_url: http://arxiv.org/abs/2308.03121
  • repo_url: https://github.com/tongyuantongyu/vs-nnvisr
  • paper_authors: Yuan Tong, Mengshun Hu, Zheng Wang
  • for: This paper presents a neural-network-based video enhancement tool for a variety of enhancement tasks, including denoising, super-resolution, interpolation, and spatio-temporal super-resolution.
  • methods: The tool accepts any neural network that enhances a group of frames and handles all other network-agnostic details during video processing.
  • results: Experiments show that NNVISR performs video enhancement tasks efficiently and delivers high image quality.
    Abstract We present NNVISR - an open-source filter plugin for the VapourSynth video processing framework, which facilitates the application of neural networks for various kinds of video enhancing tasks, including denoising, super resolution, interpolation, and spatio-temporal super-resolution. NNVISR fills the gap between video enhancement neural networks and video processing pipelines, by accepting any network that enhances a group of frames, and handling all other network agnostic details during video processing. NNVISR is publicly released at https://github.com/tongyuantongyu/vs-NNVISR.

SAAM: Stealthy Adversarial Attack on Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2308.03108
  • repo_url: None
  • paper_authors: Amira Guesmi, Muhammad Abdullah Hanif, Bassem Ouni, Muhammad Shafique
  • for: This work investigates the vulnerability of deep-learning-based monocular depth estimation (MDE) systems to adversarial patch attacks.
  • methods: A novel stealthy adversarial patch attack, SAAM, is proposed that makes MDE systems misestimate the depth of objects or causes objects to blend seamlessly into their surroundings.
  • results: Experiments show that the designed stealthy patch successfully causes DNN-based MDE to misestimate depth, achieving a 60% depth error over 99% of the affected region while maintaining a naturalistic appearance that is inconspicuous to human observers. The authors argue that this threat matters for MDE deployed on edge devices and hope it encourages research into more robust and adaptive defense mechanisms.
    Abstract In this paper, we investigate the vulnerability of MDE to adversarial patches. We propose a novel Stealthy Adversarial Attack on MDE (SAAM) that compromises MDE by either corrupting the estimated distance or causing an object to seamlessly blend into its surroundings. Our experiments demonstrate that the designed stealthy patch successfully causes a DNN-based MDE to misestimate the depth of objects. In fact, our proposed adversarial patch achieves a significant 60% depth error with a 99% ratio of the affected region. Importantly, despite its adversarial nature, the patch maintains a naturalistic appearance, making it inconspicuous to human observers. We believe that this work sheds light on the threat of adversarial attacks in the context of MDE on edge devices. We hope it raises awareness within the community about the potential real-life harm of such attacks and encourages further research into developing more robust and adaptive defense mechanisms.
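The attack boils down to optimizing a patch so the depth network reports a wrong distance while a smoothness term keeps the patch looking natural. The following PyTorch sketch shows that optimization loop against a stand-in depth model; the dummy network, patch location, target depth, and loss weights are all assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

# Stand-in monocular depth estimator (any differentiable MDE network could be plugged in here).
depth_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
for p in depth_model.parameters():
    p.requires_grad_(False)   # only the patch is optimized

def apply_patch(image, patch, top, left):
    """Paste the patch into the image at the given location (differentiable w.r.t. the patch)."""
    out = image.clone()
    out[:, :, top:top + patch.size(2), left:left + patch.size(3)] = torch.sigmoid(patch)
    return out

def total_variation(p):
    """Smoothness prior that keeps the optimized patch looking natural."""
    return (p[:, :, 1:, :] - p[:, :, :-1, :]).abs().mean() + (p[:, :, :, 1:] - p[:, :, :, :-1]).abs().mean()

patch = torch.randn(1, 3, 32, 32, requires_grad=True)      # adversarial patch (in logit space)
optimizer = torch.optim.Adam([patch], lr=0.05)
images = torch.rand(4, 3, 128, 128)                         # clean scenes (placeholder data)
target_depth = 50.0                                          # depth the attacker wants reported

for step in range(100):
    patched = apply_patch(images, patch, top=48, left=48)
    pred = depth_model(patched)
    region = pred[:, :, 48:80, 48:80]                        # depth predicted over the patched region
    loss = (region - target_depth).pow(2).mean() + 0.1 * total_variation(torch.sigmoid(patch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final attack loss:", loss.item())
```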

Incorporating Pre-training Data Matters in Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.03097
  • repo_url: None
  • paper_authors: Yinsong Xu, Aidong Men, Yang Liu, Qingchao Chen
  • for: This work studies unsupervised domain adaptation (UDA) and source-free UDA (SFUDA), specifically the relationship among ImageNet, the source domain, and the target domain, and how pre-training on ImageNet affects the target risk.
  • methods: A new framework, TriDA, preserves the semantic structure of the pre-training set (ImageNet) during fine-tuning to improve adaptation performance.
  • results: Experiments show that TriDA achieves state-of-the-art performance on multiple UDA and SFUDA benchmarks.
    Abstract Unsupervised domain adaptation (UDA) and Source-free UDA (SFUDA) methods formulate the problem involving two domains: source and target. They typically employ a standard training approach that begins with models pre-trained on large-scale datasets, e.g., ImageNet, while rarely discussing its effect. Recognizing this gap, we investigate the following research questions: (1) What is the correlation among ImageNet, the source, and the target domain? (2) How does pre-training on ImageNet influence the target risk? To answer the first question, we empirically observed an interesting Spontaneous Pulling (SP) Effect in fine-tuning where the discrepancies between any two of the three domains (ImageNet, Source, Target) decrease but at the cost of the impaired semantic structure of the pre-train domain. For the second question, we put forward a theory to explain SP and quantify that the target risk is bound by gradient disparities among the three domains. Our observations reveal a key limitation of existing methods: it hinders the adaptation performance if the semantic cluster structure of the pre-train dataset (i.e., ImageNet) is impaired. To address it, we incorporate ImageNet as the third domain and redefine the UDA/SFUDA as a three-player game. Specifically, inspired by the theory and empirical findings, we present a novel framework termed TriDA which additionally preserves the semantic structure of the pre-train dataset during fine-tuning. Experimental results demonstrate that it achieves state-of-the-art performance across various UDA and SFUDA benchmarks.
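A rough sketch of the three-player idea: alongside the usual source-classification and target-adaptation terms, an extra term keeps the fine-tuned features close to those of a frozen ImageNet-pre-trained copy, standing in for preserving the pre-train semantic structure. The toy backbone, the entropy-based target term, and the MSE preservation term below are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy

# Backbone assumed to be pre-trained on ImageNet (placeholder MLP here);
# a frozen copy serves as the "third domain" anchor.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
anchor = deepcopy(backbone).eval()
for p in anchor.parameters():
    p.requires_grad_(False)
classifier = nn.Linear(128, 10)

def trida_style_loss(x_src, y_src, x_tgt, alpha=1.0, beta=0.1):
    """Source classification + target entropy minimization + a term that keeps fine-tuned
    features close to the frozen pre-trained features (semantic-structure preservation)."""
    f_src, f_tgt = backbone(x_src), backbone(x_tgt)
    cls_loss = F.cross_entropy(classifier(f_src), y_src)                         # source supervision
    probs_tgt = F.softmax(classifier(f_tgt), dim=-1)
    ent_loss = -(probs_tgt * probs_tgt.clamp_min(1e-8).log()).sum(-1).mean()     # target adaptation
    with torch.no_grad():
        f_anchor = torch.cat([anchor(x_src), anchor(x_tgt)])
    preserve = F.mse_loss(torch.cat([f_src, f_tgt]), f_anchor)                    # pre-train structure term
    return cls_loss + alpha * ent_loss + beta * preserve

x_s, y_s = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_t = torch.rand(8, 3, 32, 32)
print(trida_style_loss(x_s, y_s, x_t).item())
```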

ECT: Fine-grained Edge Detection with Learned Cause Tokens

  • paper_url: http://arxiv.org/abs/2308.03092
  • repo_url: https://github.com/daniellli/ect
  • paper_authors: Shaocong Xu, Xiaoxue Chen, Yuhang Zheng, Guyue Zhou, Yurong Chen, Hongbin Zha, Hao Zhao
  • for: This work tackles fine-grained edge detection, i.e., predicting specific edges caused by reflectance, illumination, normal, and depth changes.
  • methods: A two-stage transformer-based network first predicts generic edges and then fine-grained edges, with the attention mechanism providing a global receptive field. Learnable cause tokens and an edge aggregation and alignment loss keep generic and fine-grained edges consistent.
  • results: Evaluated on the public benchmark BSDS-RIND and several newly derived benchmarks, the method achieves new state-of-the-art results. Code, data, and models are publicly available at https://github.com/Daniellli/ECT.git.
    Abstract In this study, we tackle the challenging fine-grained edge detection task, which refers to predicting specific edges caused by reflectance, illumination, normal, and depth changes, respectively. Prior methods exploit multi-scale convolutional networks, which are limited in three aspects: (1) Convolutions are local operators while identifying the cause of edge formation requires looking at far away pixels. (2) Priors specific to edge cause are fixed in prediction heads. (3) Using separate networks for generic and fine-grained edge detection, and the constraint between them may be violated. To address these three issues, we propose a two-stage transformer-based network sequentially predicting generic edges and fine-grained edges, which has a global receptive field thanks to the attention mechanism. The prior knowledge of edge causes is formulated as four learnable cause tokens in a cause-aware decoder design. Furthermore, to encourage the consistency between generic edges and fine-grained edges, an edge aggregation and alignment loss is exploited. We evaluate our method on the public benchmark BSDS-RIND and several newly derived benchmarks, and achieve new state-of-the-art results. Our code, data, and models are publicly available at https://github.com/Daniellli/ECT.git.
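A minimal sketch of the cause-aware decoder idea: four learnable cause tokens cross-attend to the patch features and are then correlated back with every patch to yield per-cause edge logits. The dimensions, head count, and the simple dot-product edge head are assumptions for illustration, not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class CauseAwareDecoder(nn.Module):
    """Four learnable cause tokens (reflectance, illumination, normal, depth) cross-attend to
    patch features and produce per-cause edge logits."""
    def __init__(self, dim=256, num_causes=4, num_heads=8):
        super().__init__()
        self.cause_tokens = nn.Parameter(torch.randn(num_causes, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.edge_head = nn.Linear(dim, dim)   # projects each refined cause token back to feature space

    def forward(self, patch_feats):
        # patch_feats: (B, N, dim) tokens from the transformer encoder.
        B, N, dim = patch_feats.shape
        q = self.cause_tokens.unsqueeze(0).expand(B, -1, -1)              # (B, 4, dim)
        refined, _ = self.cross_attn(q, patch_feats, patch_feats)         # cause tokens gather evidence
        # Per-patch, per-cause edge logits from the similarity between patches and refined cause tokens.
        logits = patch_feats @ self.edge_head(refined).transpose(1, 2)    # (B, N, 4)
        return logits

feats = torch.randn(2, 14 * 14, 256)
print(CauseAwareDecoder()(feats).shape)   # torch.Size([2, 196, 4])
```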

Study for Performance of MobileNetV1 and MobileNetV2 Based on Breast Cancer

  • paper_url: http://arxiv.org/abs/2308.03076
  • repo_url: None
  • paper_authors: Jiuqi Yan
  • for: The experiment compares how well MobileNetV1 and MobileNetV2 analyze breast cancer histopathology images.
  • methods: A histopathological image dataset downloaded from Kaggle is used to train MobileNetV1 and MobileNetV2 classifiers.
  • results: When processing this dataset, MobileNetV1 performs better, achieving higher validation accuracy, while MobileNetV2 shows overfitting during training.
    Abstract Artificial intelligence is constantly evolving and can provide effective help in all aspects of people's lives. The experiment mainly studies the use of artificial intelligence in the field of medicine. The purpose of this experiment was to compare which of the MobileNetV1 and MobileNetV2 models is better at classifying histopathological images of the breast downloaded from Kaggle. When a doctor examines a pathological image, errors in judgment can occur, and the observation speed is slow. Rational use of artificial intelligence can effectively reduce diagnostic errors in breast cancer assessment and speed up diagnosis. The dataset was downloaded from Kaggle and then normalized. The basic principle of the experiment is to let the neural network model learn from the downloaded dataset, find patterns, and judge on its own whether breast tissue is cancerous. In the dataset, benign and malignant tumor pictures have been labeled, of which 198,738 are benign and 78,786 are malignant. MobileNetV1 and MobileNetV2 are each trained on the dataset, the training and validation accuracies are obtained, and the curves are plotted. It can be observed that MobileNetV1 has better validation accuracy, while MobileNetV2 overfits during training. From the experimental results, it can be seen that in the case of processing this dataset, MobileNetV1 is much better than MobileNetV2.
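A minimal Keras sketch of the comparison described here, training MobileNetV1 and MobileNetV2 heads on a binary benign/malignant histopathology dataset; the directory name, split, epoch count, and preprocessing are assumptions, not the paper's exact protocol.

```python
import tensorflow as tf

def build_classifier(base_fn, input_shape=(224, 224, 3)):
    """Wraps a MobileNet backbone with a binary (benign vs. malignant) classification head."""
    base = base_fn(input_shape=input_shape, include_top=False, weights="imagenet", pooling="avg")
    inputs = tf.keras.Input(shape=input_shape)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(base(inputs))
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# The dataset is assumed to be organized as breast_histopathology/{benign,malignant}/*.png after download.
train_ds, val_ds = tf.keras.utils.image_dataset_from_directory(
    "breast_histopathology", label_mode="binary", image_size=(224, 224),
    validation_split=0.2, subset="both", seed=42, batch_size=32)
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))   # simple normalization
val_ds = val_ds.map(lambda x, y: (x / 255.0, y))

for name, base_fn in [("MobileNetV1", tf.keras.applications.MobileNet),
                      ("MobileNetV2", tf.keras.applications.MobileNetV2)]:
    model = build_classifier(base_fn)
    history = model.fit(train_ds, validation_data=val_ds, epochs=5, verbose=0)
    print(name, "best validation accuracy:", max(history.history["val_accuracy"]))
```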

M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

  • paper_url: http://arxiv.org/abs/2308.03063
  • repo_url: None
  • paper_authors: Hao Tang, Jun Liu, Shuanglin Yan, Rui Yan, Zechao Li, Jinhui Tang
  • for: The paper addresses few-shot fine-grained action recognition, specifically the challenges of capturing subtle action details and learning from limited data with high intra-class variance and inter-class similarity.
  • methods: The proposed M$^3$Net framework incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. It integrates instance-specific, category-specific, and task-specific perspectives through various matching functions to handle multi-scale spatio-temporal variations, and employs multi-task collaborative learning to enhance embedding generalizability.
  • results: Experimental results on three challenging benchmarks demonstrate the superiority of M$^3$Net in capturing fine-grained action details and achieving state-of-the-art performance for few-shot fine-grained action recognition.
    Abstract Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances. Despite the progress made in FS coarse-grained action recognition, current approaches encounter two challenges when dealing with the fine-grained action categories: the inability to capture subtle action details and the insufficiency of learning from limited data that exhibit high intra-class variance and inter-class similarity. To address these limitations, we propose M$^3$Net, a matching-based framework for FS-FG action recognition, which incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making across multiple viewpoints. Multi-view encoding captures rich contextual details from the intra-frame, intra-video, and intra-episode perspectives, generating customized higher-order embeddings for fine-grained data. Multi-view matching integrates various matching functions enabling flexible relation modeling within limited samples to handle multi-scale spatio-temporal variations by leveraging the instance-specific, category-specific, and task-specific perspectives. Multi-view fusion consists of matching-predictions fusion and matching-losses fusion over the above views, where the former promotes mutual complementarity and the latter enhances embedding generalizability by employing multi-task collaborative learning. Explainable visualizations and experimental results on three challenging benchmarks demonstrate the superiority of M$^3$Net in capturing fine-grained action details and achieving state-of-the-art performance for FS-FG action recognition.

InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild

  • paper_url: http://arxiv.org/abs/2308.03061
  • repo_url: None
  • paper_authors: Yanyan Shao, Qi Ye, Wenhan Luo, Kaihao Zhang, Jiming Chen
  • for: This work addresses identifying the objects that humans are interacting with, which is difficult under heavy occlusion, background clutter, and distracting objects.
  • methods: An interacting-object tracking method that leverages the spatio-temporal information of hand-object interaction; it requires no prior knowledge of the general objects to be tracked and can discover and track interacting objects across diverse interaction scenarios.
  • results: Comparative experiments show that the method clearly outperforms existing approaches in scenes with continuous interaction with different objects; on a video-level hand-object interaction dataset curated from 100DOH for testing and evaluation, it improves Average Precision (AP) by about 10%. Qualitative results further show that the method produces more continuous trajectories for interacting objects.
    Abstract Understanding human interaction with objects is an important research topic for embodied Artificial Intelligence and identifying the objects that humans are interacting with is a primary problem for interaction understanding. Existing methods rely on frame-based detectors to locate interacting objects. However, this approach is subjected to heavy occlusions, background clutter, and distracting objects. To address the limitations, in this paper, we propose to leverage spatio-temporal information of hand-object interaction to track interactive objects under these challenging cases. Without prior knowledge of the general objects to be tracked like object tracking problems, we first utilize the spatial relation between hands and objects to adaptively discover the interacting objects from the scene. Second, the consistency and continuity of the appearance of objects between successive frames are exploited to track the objects. With this tracking formulation, our method also benefits from training on large-scale general object-tracking datasets. We further curate a video-level hand-object interaction dataset for testing and evaluation from 100DOH. The quantitative results demonstrate that our proposed method outperforms the state-of-the-art methods. Specifically, in scenes with continuous interaction with different objects, we achieve an impressive improvement of about 10% as evaluated using the Average Precision (AP) metric. Our qualitative findings also illustrate that our method can produce more continuous trajectories for interacting objects.

TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2308.03060
  • repo_url: https://github.com/chaofengc/iqa-pytorch
  • paper_authors: Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin
  • for: This paper targets image quality assessment (IQA), aiming to improve its accuracy and efficiency.
  • methods: Inspired by the human visual system, the method uses multi-scale (global and local) features and follows a top-down approach, TOPIQ, in which high-level semantics guide low-level features toward semantically important local distortion regions, improving representational power.
  • results: The resulting coarse-to-fine network, CFANet, handles both Full-Reference (FR) and No-Reference (NR) IQA and achieves better or competitive performance compared with state-of-the-art vision-transformer-based methods on most public FR and NR benchmarks, while being much more efficient.
    Abstract Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks. Inspired by the characteristics of the human visual system, existing methods typically use a combination of global and local representations (i.e., multi-scale features) to achieve superior performance. However, most of them adopt simple linear fusion of multi-scale features, and neglect their possibly complex relationship and interaction. In contrast, humans typically first form a global impression to locate important regions and then focus on local details in those regions. We therefore propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions, named as TOPIQ. Our approach to IQA involves the design of a heuristic coarse-to-fine network (CFANet) that leverages multi-scale features and progressively propagates multi-level semantic information to low-level representations in a top-down manner. A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features guided by higher level features. This mechanism emphasizes active semantic regions for low-level distortions, thereby improving performance. CFANet can be used for both Full-Reference (FR) and No-Reference (NR) IQA. We use ResNet50 as its backbone and demonstrate that CFANet achieves better or competitive performance on most public FR and NR benchmarks compared with state-of-the-art methods based on vision transformers, while being much more efficient (with only ~13% of the FLOPS of the current best FR method). Codes are released at https://github.com/chaofengc/IQA-PyTorch.
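A minimal sketch of the cross-scale attention idea: low-level features form the queries, higher-level semantic features provide the keys and values, and the attended output refines the low-level map so semantically important regions are emphasized. The feature shapes and head count below loosely mimic ResNet50 stages but are otherwise assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Low-level features are re-weighted by attention computed against higher-level semantics,
    so that semantically important regions are emphasized before quality regression."""
    def __init__(self, low_dim, high_dim, dim=128, heads=4):
        super().__init__()
        self.q = nn.Conv2d(low_dim, dim, 1)        # queries from the low-level (fine) features
        self.kv = nn.Conv2d(high_dim, 2 * dim, 1)  # keys/values from the high-level (semantic) features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Conv2d(dim, low_dim, 1)

    def forward(self, low_feat, high_feat):
        B, _, H, W = low_feat.shape
        q = self.q(low_feat).flatten(2).transpose(1, 2)                  # (B, H*W, dim)
        k, v = self.kv(high_feat).chunk(2, dim=1)
        k = k.flatten(2).transpose(1, 2)                                  # (B, h*w, dim)
        v = v.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)                                       # semantics guide each pixel
        out = self.proj(out.transpose(1, 2).reshape(B, -1, H, W))
        return low_feat + out                                             # residual refinement

low = torch.randn(2, 64, 56, 56)     # fine features (e.g., an early backbone stage)
high = torch.randn(2, 512, 14, 14)   # semantic features (e.g., a late backbone stage)
print(CrossScaleAttention(64, 512)(low, high).shape)   # torch.Size([2, 64, 56, 56])
```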

Multi-scale Alternated Attention Transformer for Generalized Stereo Matching

  • paper_url: http://arxiv.org/abs/2308.03048
  • repo_url: None
  • paper_authors: Wei Miao, Hong Zhao, Tongjia Chen, Wei Huang, Changyan Xiao
  • for: To improve generalized stereo matching with a new, simple yet effective network structure.
  • methods: The Alternated Attention U-shaped Transformer (AAUformer) combines window self-attention with a multi-scale alternated attention backbone to strengthen pixel-level single-view features and improve matching accuracy.
  • results: The model achieves state-of-the-art results on the Scene Flow dataset, competitive fine-tuning performance on KITTI 2015, and outperforms several state-of-the-art works in cross-generalization experiments on synthetic and real-world datasets.
    Abstract Recent stereo matching networks achieve dramatic performance by introducing an epipolar line constraint to limit the matching range of the dual views. However, in complicated real-world scenarios, the feature information based on the intra-epipolar line alone is too weak to facilitate stereo matching. In this paper, we present a simple but highly effective network called the Alternated Attention U-shaped Transformer (AAUformer) to balance the impact of the epipolar line in the dual and single views respectively for excellent generalization performance. Compared to other models, our model has several main designs: 1) to better liberate the local semantic features of the single view at the pixel level, we introduce window self-attention to break the limits of intra-row self-attention and completely replace the convolutional network for denser features before cross-matching; 2) the multi-scale alternated attention backbone network was designed to extract invariant features in order to achieve a coarse-to-fine matching process for hard-to-discriminate regions. We performed a series of both comparative studies and ablation studies on several mainstream stereo matching datasets. The results demonstrate that our model achieves state-of-the-art on the Scene Flow dataset, and the fine-tuning performance is competitive on the KITTI 2015 dataset. In addition, for cross generalization experiments on synthetic and real-world datasets, our model outperforms several state-of-the-art works.

Prototypes-oriented Transductive Few-shot Learning with Conditional Transport

  • paper_url: http://arxiv.org/abs/2308.03047
  • repo_url: None
  • paper_authors: Long Tian, Jingyi Feng, Wenchao Chen, Xiaoqiang Chai, Liming Wang, Xiyang Liu, Bo Chen
  • for: To improve the performance of transductive few-shot learning (TFSL) models when the query set is class-imbalanced.
  • methods: A Conditional Transport (CT) based imbalanced TFSL model, PUTM, fully exploits the unbiased statistics of imbalanced query samples, employing forward and backward navigators as transport matrices to balance the per-class prior of the query samples between uniform and adaptive data-driven distributions.
  • results: Experiments on four standard benchmarks, miniImageNet, tieredImageNet, CUB, and CIFAR-FS, demonstrate the superiority of the model in class-imbalanced generalization.
    Abstract Transductive Few-Shot Learning (TFSL) has recently attracted increasing attention since it typically outperforms its inductive peer by leveraging statistics of query samples. However, previous TFSL methods usually encode uniform prior that all the classes within query samples are equally likely, which is biased in imbalanced TFSL and causes severe performance degradation. Given this pivotal issue, in this work, we propose a novel Conditional Transport (CT) based imbalanced TFSL model called Prototypes-oriented Unbiased Transfer Model (PUTM) to fully exploit unbiased statistics of imbalanced query samples, which employs forward and backward navigators as transport matrices to balance the prior of query samples per class between uniform and adaptive data-driven distributions. For efficiently transferring statistics learned by CT, we further derive a closed form solution to refine prototypes based on MAP given the learned navigators. The above two steps of discovering and transferring unbiased statistics follow an iterative manner, formulating our EM-based solver. Experimental results on four standard benchmarks including miniImageNet, tieredImageNet, CUB, and CIFAR-FS demonstrate superiority of our model in class-imbalanced generalization.
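A rough sketch of transport-style prototype refinement: a forward plan (queries over classes) and a backward plan (classes over queries) are mixed and used to re-estimate the prototypes from unlabeled, possibly imbalanced queries. The softmax navigators, mixing weight, and iteration count are illustrative assumptions, not the paper's exact conditional-transport formulation.

```python
import torch
import torch.nn.functional as F

def refine_prototypes(queries, prototypes, n_iters=10, rho=0.5, temperature=10.0):
    """Prototypes are iteratively updated from unlabeled query features using a pair of
    transport plans: a forward navigator (queries -> classes) and a backward navigator
    (classes -> queries), which together debias the per-class prior of the query set."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    for _ in range(n_iters):
        sim = temperature * q @ p.t()                     # (num_queries, num_classes)
        forward_plan = F.softmax(sim, dim=1)              # each query distributed over classes
        backward_plan = F.softmax(sim, dim=0)             # each class distributed over queries
        plan = rho * forward_plan + (1.0 - rho) * backward_plan
        weights = plan / plan.sum(dim=0, keepdim=True)    # normalize per class
        p = F.normalize(weights.t() @ q, dim=-1)          # MAP-style prototype update
    return p

# Toy 5-way task: class prototypes from support features, refined with 50 imbalanced queries.
support = F.normalize(torch.randn(5, 64), dim=-1)
queries = torch.randn(50, 64)
refined = refine_prototypes(queries, support)
predictions = (F.normalize(queries, dim=-1) @ refined.t()).argmax(dim=1)
print(predictions.shape)   # torch.Size([50])
```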

Learning Fine-Grained Features for Pixel-wise Video Correspondences

  • paper_url: http://arxiv.org/abs/2308.03040
  • repo_url: https://github.com/qianduoduolr/fgvc
  • paper_authors: Rui Li, Shenglong Zhou, Dong Liu
  • for: Learning fine-grained features for pixel-wise video correspondences to improve video analysis.
  • methods: Motivated by self-supervised feature learning and optical flow, the method learns from both labeled synthetic videos and unlabeled real-world videos, uses an adversarial learning scheme to improve generalization, and adopts a coarse-to-fine framework for computational efficiency.
  • results: Achieves state-of-the-art accuracy and efficiency on a series of correspondence-based tasks.
    Abstract Video analysis tasks rely heavily on identifying the pixels from different frames that correspond to the same visual target. To tackle this problem, recent studies have advocated feature learning methods that aim to learn distinctive representations to match the pixels, especially in a self-supervised fashion. Unfortunately, these methods have difficulties for tiny or even single-pixel visual targets. Pixel-wise video correspondences were traditionally related to optical flows, which however lead to deterministic correspondences and lack robustness on real-world videos. We address the problem of learning features for establishing pixel-wise correspondences. Motivated by optical flows as well as the self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos for learning fine-grained representations in a holistic framework. We adopt an adversarial learning scheme to enhance the generalization ability of the learned features. Moreover, we design a coarse-to-fine framework to pursue high computational efficiency. Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency.

FourLLIE: Boosting Low-Light Image Enhancement by Fourier Frequency Information

  • paper_url: http://arxiv.org/abs/2308.03033
  • repo_url: https://github.com/wangchx67/fourllie
  • paper_authors: Chenxi Wang, Hongjun Wu, Zhi Jin
  • for: This paper focuses on improving the lightness of low-light images using Fourier frequency information.
  • methods: The proposed FourLLIE network uses a two-stage approach, first estimating the amplitude transform map in the Fourier space and then introducing an SNR map to integrate global Fourier frequency and local spatial information.
  • results: FourLLIE outperforms existing state-of-the-art LLIE methods on four representative datasets while maintaining good model efficiency.
    Abstract Recently, Fourier frequency information has attracted much attention in Low-Light Image Enhancement (LLIE). Some researchers noticed that, in the Fourier space, the lightness degradation mainly exists in the amplitude component and the rest exists in the phase component. By incorporating both the Fourier frequency and the spatial information, these researchers proposed remarkable solutions for LLIE. In this work, we further explore the positive correlation between the magnitude of amplitude and the magnitude of lightness, which can be effectively leveraged to improve the lightness of low-light images in the Fourier space. Moreover, we find that the Fourier transform can extract the global information of the image, and does not introduce massive neural network parameters like Multi-Layer Perceptrons (MLPs) or Transformer. To this end, a two-stage Fourier-based LLIE network (FourLLIE) is proposed. In the first stage, we improve the lightness of low-light images by estimating the amplitude transform map in the Fourier space. In the second stage, we introduce the Signal-to-Noise-Ratio (SNR) map to provide the prior for integrating the global Fourier frequency and the local spatial information, which recovers image details in the spatial space. With this ingenious design, FourLLIE outperforms the existing state-of-the-art (SOTA) LLIE methods on four representative datasets while maintaining good model efficiency.
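The first stage rests on the observation that lightness mainly lives in the Fourier amplitude. A minimal PyTorch sketch of that idea is below: the image is transformed with an FFT, the amplitude is scaled while the phase is kept, and the result is transformed back; the scalar gain stands in for the amplitude transform map the network would predict.

```python
import torch

def brighten_in_fourier(img, amp_gain):
    """Stage-one idea: lightness mostly lives in the Fourier amplitude, so scaling the amplitude
    while keeping the phase brightens the image. `amp_gain` stands in for the amplitude
    transform map that the network would predict (it may be a scalar or a per-frequency map)."""
    freq = torch.fft.fft2(img)
    amplitude, phase = freq.abs(), freq.angle()
    amplitude = amplitude * amp_gain                          # enhance the amplitude component
    enhanced = torch.fft.ifft2(torch.polar(amplitude, phase)).real
    return enhanced.clamp(0.0, 1.0)

low_light = torch.rand(1, 3, 128, 128) * 0.2                  # a dim placeholder image
bright = brighten_in_fourier(low_light, amp_gain=3.0)
print(low_light.mean().item(), bright.mean().item())          # mean lightness increases
```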

Brighten-and-Colorize: A Decoupled Network for Customized Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2308.03029
  • repo_url: None
  • paper_authors: Chenxi Wang, Zhi Jin
  • for: To improve the perceptual quality of low-light images.
  • methods: A "brighten-and-colorize" network that brightens the lightness and colorizes the chrominance of low-light images, achieving enhancement with accurate color while allowing customized saturation and color styles based on user preferences.
  • results: Experimental results show that the proposed method achieves both state-of-the-art performance and user-friendly customization.
    Abstract Low-Light Image Enhancement (LLIE) aims to improve the perceptual quality of an image captured in low-light conditions. Generally, a low-light image can be divided into lightness and chrominance components. Recent advances in this area mainly focus on the refinement of the lightness, while ignoring the role of chrominance. It easily leads to chromatic aberration and, to some extent, limits the diverse applications of chrominance in customized LLIE. In this work, a "brighten-and-colorize" network (called BCNet), which introduces image colorization to LLIE, is proposed to address the above issues. BCNet can accomplish LLIE with accurate color and simultaneously enables customized enhancement with varying saturations and color styles based on user preferences. Specifically, BCNet regards LLIE as a multi-task learning problem: brightening and colorization. The brightening sub-task aligns with other conventional LLIE methods to get a well-lit lightness. The colorization sub-task is accomplished by regarding the chrominance of the low-light image as color guidance like the user-guide image colorization. Upon completion of model training, the color guidance (i.e., input low-light chrominance) can be simply manipulated by users to acquire customized results. This customized process is optional and, due to its decoupled nature, does not compromise the structural and detailed information of lightness. Extensive experiments on the commonly used LLIE datasets show that the proposed method achieves both State-Of-The-Art (SOTA) performance and user-friendly customization.
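A minimal sketch of the decoupling: split the image into lightness and chrominance (here with a standard YCbCr conversion), brighten the lightness channel, and scale the chrominance by a user-controlled saturation factor. The gamma curve stands in for the learned brightening sub-network, and the saturation knob for the user-guided colorization; both are assumptions for illustration.

```python
import torch

def rgb_to_ycbcr(img):
    """Standard (BT.601) RGB -> YCbCr split into lightness (Y) and chrominance (Cb, Cr)."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return torch.stack([y, cb, cr], dim=1)

def ycbcr_to_rgb(img):
    y, cb, cr = img[:, 0], img[:, 1], img[:, 2]
    r = y + 1.402 * (cr - 0.5)
    g = y - 0.344136 * (cb - 0.5) - 0.714136 * (cr - 0.5)
    b = y + 1.772 * (cb - 0.5)
    return torch.stack([r, g, b], dim=1).clamp(0.0, 1.0)

def brighten_and_colorize(rgb, gamma=0.45, saturation=1.3):
    """Decoupled enhancement: a gamma curve on lightness (the brightening sub-task, which a
    learned network would replace) and a user-controllable saturation scaling on chrominance
    (the customized colorization sub-task)."""
    ycbcr = rgb_to_ycbcr(rgb)
    y = ycbcr[:, 0:1].clamp_min(1e-6) ** gamma                # brighten lightness
    chroma = 0.5 + saturation * (ycbcr[:, 1:] - 0.5)          # adjust chrominance strength
    return ycbcr_to_rgb(torch.cat([y, chroma], dim=1))

dark = torch.rand(1, 3, 64, 64) * 0.2
print(brighten_and_colorize(dark).mean().item())
```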

Causal Disentanglement Hidden Markov Model for Fault Diagnosis

  • paper_url: http://arxiv.org/abs/2308.03027
  • repo_url: None
  • paper_authors: Rihao Chang, Yongtao Ma, Weizhi Nie, Jie Nie, An-an Liu
  • for: This work proposes a Causal Disentanglement Hidden Markov Model (CDHM) for fault diagnosis that better captures the characteristics of bearing fault mechanisms and predicts fault types more accurately.
  • methods: Using time-series data, the vibration signal is progressively disentangled into fault-relevant and fault-irrelevant factors; the ELBO is reformulated to optimize learning of the causal disentanglement Markov model, and unsupervised domain adaptation transfers the learned disentangled representations to other working environments.
  • results: Experiments on the CWRU and IMS datasets show that the proposed method better captures the characteristics of bearing fault mechanisms and achieves more accurate fault-type prediction.
    Abstract In modern industries, fault diagnosis has been widely applied with the goal of realizing predictive maintenance. The key issue for the fault diagnosis system is to extract representative characteristics of the fault signal and then accurately predict the fault type. In this paper, we propose a Causal Disentanglement Hidden Markov model (CDHM) to learn the causality in the bearing fault mechanism and thus, capture their characteristics to achieve a more robust representation. Specifically, we make full use of the time-series data and progressively disentangle the vibration signal into fault-relevant and fault-irrelevant factors. The ELBO is reformulated to optimize the learning of the causal disentanglement Markov model. Moreover, to expand the scope of the application, we adopt unsupervised domain adaptation to transfer the learned disentangled representations to other working environments. Experiments were conducted on the CWRU dataset and IMS dataset. Relevant results validate the superiority of the proposed method.

All-in-one Multi-degradation Image Restoration Network via Hierarchical Degradation Representation

  • paper_url: http://arxiv.org/abs/2308.03021
  • repo_url: None
  • paper_authors: Cheng Zhang, Yu Zhu, Qingsen Yan, Jinqiu Sun, Yanning Zhang
  • for: restore high-quality images from distorted ones, especially on mobile devices
  • methods: progressively construct a tree structure through clustering to learn degradation representation, and design a feature transform block (FTB) to align domains and refine features
  • results: demonstrate the effectiveness of the method and its advantages over state-of-the-art restoration methods through extensive experiments on multiple distorted datasets
    Abstract The aim of image restoration is to recover high-quality images from distorted ones. However, current methods usually focus on a single task (\emph{e.g.}, denoising, deblurring or super-resolution) which cannot address the needs of real-world multi-task processing, especially on mobile devices. Thus, developing an all-in-one method that can restore images from various unknown distortions is a significant challenge. Previous works have employed contrastive learning to learn the degradation representation from observed images, but this often leads to representation drift caused by deficient positive and negative pairs. To address this issue, we propose a novel All-in-one Multi-degradation Image Restoration Network (AMIRNet) that can effectively capture and utilize accurate degradation representation for image restoration. AMIRNet learns a degradation representation for unknown degraded images by progressively constructing a tree structure through clustering, without any prior knowledge of degradation information. This tree-structured representation explicitly reflects the consistency and discrepancy of various distortions, providing a specific clue for image restoration. To further enhance the performance of the image restoration network and overcome domain gaps caused by unknown distortions, we design a feature transform block (FTB) that aligns domains and refines features with the guidance of the degradation representation. We conduct extensive experiments on multiple distorted datasets, demonstrating the effectiveness of our method and its advantages over state-of-the-art restoration methods both qualitatively and quantitatively.

Recurrent Spike-based Image Restoration under General Illumination

  • paper_url: http://arxiv.org/abs/2308.03018
  • repo_url: https://github.com/bit-vision/rsir
  • paper_authors: Lin Zhu, Yunlong Zheng, Mengyue Geng, Lizhi Wang, Hua Huang
  • for: This work aims to broaden spike-array-based visual sensing, improving high-speed reconstruction and accuracy for vision tasks under general illumination.
  • methods: A Recurrent Spike-based Image Restoration (RSIR) network built on a physics-based spike noise model, consisting of an adaptive spike transformation module, a recurrent temporal feature fusion module, and a frequency-based spike denoising module.
  • results: Experiments show that the network restores clear images under different illumination conditions and processes the spike array recursively to make good use of spike temporal information.
    Abstract Spike camera is a new type of bio-inspired vision sensor that records light intensity in the form of a spike array with high temporal resolution (20,000 Hz). This new paradigm of vision sensor offers significant advantages for many vision tasks such as high speed image reconstruction. However, existing spike-based approaches typically assume that the scenes are with sufficient light intensity, which is usually unavailable in many real-world scenarios such as rainy days or dusk scenes. To unlock more spike-based application scenarios, we propose a Recurrent Spike-based Image Restoration (RSIR) network, which is the first work towards restoring clear images from spike arrays under general illumination. Specifically, to accurately describe the noise distribution under different illuminations, we build a physical-based spike noise model according to the sampling process of the spike camera. Based on the noise model, we design our RSIR network which consists of an adaptive spike transformation module, a recurrent temporal feature fusion module, and a frequency-based spike denoising module. Our RSIR can process the spike array in a recursive manner to ensure that the spike temporal information is well utilized. In the training process, we generate the simulated spike data based on our noise model to train our network. Extensive experiments on real-world datasets with different illuminations demonstrate the effectiveness of the proposed network. The code and dataset are released at https://github.com/BIT-Vision/RSIR.

Early Detection and Localization of Pancreatic Cancer by Label-Free Tumor Synthesis

  • paper_url: http://arxiv.org/abs/2308.03008
  • repo_url: https://github.com/mrgiovanni/synthetictumors
  • paper_authors: Bowen Li, Yu-Cheng Chou, Shuwen Sun, Hualin Qiao, Alan Yuille, Zongwei Zhou
  • for: To improve the 5-year survival rate of pancreatic cancer patients from 8.5% to 20% through earlier detection.
  • methods: Artificial intelligence (AI) models assist radiologists in detecting pancreatic cancer at an early stage; a label-free tumor synthesis method generates large numbers of small pancreatic tumors in healthy pancreases without manual annotation.
  • results: Experiments show that AI trained on synthetic tumors detects pancreatic tumors comparably to AI trained on real tumors, with a much higher detection rate for small tumors, and that synthetic tumors improve the generalizability of tumor detection and localization on CT scans from different hospitals.
    Abstract Early detection and localization of pancreatic cancer can increase the 5-year survival rate for patients from 8.5% to 20%. Artificial intelligence (AI) can potentially assist radiologists in detecting pancreatic tumors at an early stage. Training AI models require a vast number of annotated examples, but the availability of CT scans obtaining early-stage tumors is constrained. This is because early-stage tumors may not cause any symptoms, which can delay detection, and the tumors are relatively small and may be almost invisible to human eyes on CT scans. To address this issue, we develop a tumor synthesis method that can synthesize enormous examples of small pancreatic tumors in the healthy pancreas without the need for manual annotation. Our experiments demonstrate that the overall detection rate of pancreatic tumors, measured by Sensitivity and Specificity, achieved by AI trained on synthetic tumors is comparable to that of real tumors. More importantly, our method shows a much higher detection rate for small tumors. We further investigate the per-voxel segmentation performance of pancreatic tumors if AI is trained on a combination of CT scans with synthetic tumors and CT scans with annotated large tumors at an advanced stage. Finally, we show that synthetic tumors improve AI generalizability in tumor detection and localization when processing CT scans from different hospitals. Overall, our proposed tumor synthesis method has immense potential to improve the early detection of pancreatic cancer, leading to better patient outcomes.

High-Resolution Vision Transformers for Pixel-Level Identification of Structural Components and Damage

  • paper_url: http://arxiv.org/abs/2308.03006
  • repo_url: None
  • paper_authors: Kareem Eltouny, Seyedomid Sajedi, Xiao Liang
  • for: The paper aims to improve the efficiency and accuracy of visual inspection of civil structures using high-resolution images and deep learning techniques.
  • methods: The proposed framework uses a semantic segmentation network based on vision transformers and Laplacian pyramid scaling networks to parse high-resolution visual inspection images, retaining both local fine details and global contextual information while improving computational efficiency.
  • results: The framework was evaluated through comprehensive experiments on a dataset of bridge inspection report images, using multiple metrics for pixel-wise material detection; the results show it can efficiently process high-resolution visual data and accurately detect materials in the images.
    Abstract Visual inspection is predominantly used to evaluate the state of civil structures, but recent developments in unmanned aerial vehicles (UAVs) and artificial intelligence have increased the speed, safety, and reliability of the inspection process. In this study, we develop a semantic segmentation network based on vision transformers and Laplacian pyramids scaling networks for efficiently parsing high-resolution visual inspection images. The massive amounts of collected high-resolution images during inspections can slow down the investigation efforts. And while there have been extensive studies dedicated to the use of deep learning models for damage segmentation, processing high-resolution visual data can pose major computational difficulties. Traditionally, images are either uniformly downsampled or partitioned to cope with computational demands. However, the input is at risk of losing local fine details, such as thin cracks, or global contextual information. Inspired by super-resolution architectures, our vision transformer model learns to resize high-resolution images and masks to retain both the valuable local features and the global semantics without sacrificing computational efficiency. The proposed framework has been evaluated through comprehensive experiments on a dataset of bridge inspection report images using multiple metrics for pixel-wise materials detection.
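
A minimal example of a learned resizer in the spirit of "learning to resize high-resolution images" is sketched below. It is a generic residual-correction design with assumed layer sizes, not the paper's Laplacian pyramid scaling network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableResizer(nn.Module):
    # Resizes a high-resolution image to the working resolution while learning to
    # preserve thin structures (e.g. fine cracks) that plain bilinear downsampling
    # would wash out. A generic learned-resizer sketch, not the paper's module.
    def __init__(self, out_size, channels=3, width=16):
        super().__init__()
        self.out_size = out_size
        self.refine = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, x):
        base = F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)
        detail = F.interpolate(self.refine(x), size=self.out_size,
                               mode="bilinear", align_corners=False)
        return base + detail   # learned residual correction on top of a bilinear resize
```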

MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.03005
  • repo_url: https://github.com/xulianuwa/mctformer
  • paper_authors: Lian Xu, Mohammed Bennamoun, Farid Boussaid, Hamid Laga, Wanli Ouyang, Dan Xu
  • for: Improving the accuracy of weakly supervised semantic segmentation (WSSS) by generating accurate class-specific object localization maps to serve as pseudo labels.
  • methods: Building on the transformer architecture, a Multi-Class Token transformer learns multiple class tokens that interact with the patch tokens, capturing class-specific attention for class-discriminative object localization.
  • results: By learning multiple class tokens and exploiting the class-to-patch attentions (further refined with patch-to-patch affinity), the method generates high-quality class-discriminative localization maps; combined with Class Activation Mapping (CAM), it significantly improves WSSS performance on the PASCAL VOC 2012 and MS COCO 2014 datasets.
    Abstract This paper proposes a novel transformer-based framework that aims to enhance weakly supervised semantic segmentation (WSSS) by generating accurate class-specific object localization maps as pseudo labels. Building upon the observation that the attended regions of the one-class token in the standard vision transformer can contribute to a class-agnostic localization map, we explore the potential of the transformer model to capture class-specific attention for class-discriminative object localization by learning multiple class tokens. We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens. To achieve this, we devise a class-aware training strategy that establishes a one-to-one correspondence between the output class tokens and the ground-truth class labels. Moreover, a Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens, enabling the model to better capture the unique characteristics and properties of each class. As a result, class-discriminative object localization maps can be effectively generated by leveraging the class-to-patch attentions associated with different class tokens. To further refine these localization maps, we propose the utilization of patch-level pairwise affinity derived from the patch-to-patch transformer attention. Furthermore, the proposed framework seamlessly complements the Class Activation Mapping (CAM) method, resulting in significantly improved WSSS performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. These results underline the importance of the class token for WSSS.
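
Since the abstract derives localization maps from the class-to-patch attentions of multiple class tokens, a minimal sketch of that extraction step is shown below. Tensor shapes, head averaging, and min-max normalization are illustrative assumptions, and the paper's patch-affinity refinement and CAM fusion are omitted.

```python
import torch

def class_to_patch_maps(attn, num_classes, patch_hw):
    # attn: (B, heads, N, N) self-attention from a transformer block, where the first
    # `num_classes` tokens are class tokens and the remaining tokens are patch tokens.
    # Returns (B, num_classes, H, W) class-specific localization maps.
    H, W = patch_hw
    B = attn.shape[0]
    attn = attn.mean(dim=1)                                  # average over heads
    cls_to_patch = attn[:, :num_classes, num_classes:]       # (B, C, H*W)
    maps = cls_to_patch.reshape(B, num_classes, H, W)
    maps = maps - maps.amin(dim=(2, 3), keepdim=True)        # per-map min-max normalize
    maps = maps / (maps.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return maps
```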

Weakly supervised segmentation of intracranial aneurysms using a 3D focal modulation UNet

  • paper_url: http://arxiv.org/abs/2308.03001
  • repo_url: None
  • paper_authors: Amirhossein Rasoulian, Soorena Salari, Yiming Xiao
  • for: Proposing a weakly supervised, automatic segmentation technique for unruptured intracranial aneurysms (UIAs) to better support risk assessment and treatment of this cerebrovascular disorder.
  • methods: A novel 3D focal modulation UNet (FocalSegNet) combined with conditional random field (CRF) post-processing refines coarse labels from time-of-flight MRAs into UIA segmentations.
  • results: The method achieves accurate UIA segmentation and outperforms the state-of-the-art 3D UNet and Swin-UNETR, demonstrating the benefit of focal modulation for this task.
    Abstract Accurate identification and quantification of unruptured intracranial aneurysms (UIAs) are essential for the risk assessment and treatment decisions of this cerebrovascular disorder. Current assessment based on 2D manual measures of aneurysms on 3D magnetic resonance angiography (MRA) is sub-optimal and time-consuming. Automatic 3D measures can significantly benefit the clinical workflow and treatment outcomes. However, one major issue in medical image segmentation is the need for large well-annotated data, which can be expensive to obtain. Techniques that mitigate the requirement, such as weakly supervised learning with coarse labels are highly desirable. In this paper, we leverage coarse labels of UIAs from time-of-flight MRAs to obtain refined UIAs segmentation using a novel 3D focal modulation UNet, called FocalSegNet and conditional random field (CRF) postprocessing, with a Dice score of 0.68 and 95% Hausdorff distance of 0.95 mm. We evaluated the performance of the proposed algorithms against the state-of-the-art 3D UNet and Swin-UNETR, and demonstrated the superiority of the proposed FocalSegNet and the benefit of focal modulation for the task.
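
The building block named in the abstract, focal modulation, replaces self-attention with depthwise-convolutional context aggregation. Below is a simplified 3D focal modulation layer loosely following the original focal modulation design (Yang et al., 2022); the layer sizes and the surrounding FocalSegNet architecture are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FocalModulation3D(nn.Module):
    # Simplified 3D focal modulation block (after Yang et al., 2022), shown only to
    # illustrate the mechanism; not the paper's exact FocalSegNet block.
    def __init__(self, dim, focal_levels=2, focal_window=3):
        super().__init__()
        self.focal_levels = focal_levels
        # Single projection producing query, context, and per-level gates.
        self.f = nn.Conv3d(dim, 2 * dim + (focal_levels + 1), kernel_size=1)
        self.ctx_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(dim, dim, kernel_size=focal_window + 2 * l,
                          padding=(focal_window + 2 * l) // 2, groups=dim),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.h = nn.Conv3d(dim, dim, kernel_size=1)    # modulator projection
        self.proj = nn.Conv3d(dim, dim, kernel_size=1)

    def forward(self, x):                              # x: (B, C, D, H, W)
        q, ctx, gates = torch.split(
            self.f(x), [x.shape[1], x.shape[1], self.focal_levels + 1], dim=1)
        ctx_all = 0
        for l, layer in enumerate(self.ctx_layers):
            ctx = layer(ctx)                           # progressively larger receptive field
            ctx_all = ctx_all + ctx * gates[:, l:l + 1]
        ctx_global = ctx.mean(dim=(2, 3, 4), keepdim=True)   # global context as last level
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        return self.proj(q * self.h(ctx_all))          # modulate the query, then project
```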

StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning

  • paper_url: http://arxiv.org/abs/2308.03000
  • repo_url: https://github.com/liuxianyi/styleedl
  • paper_authors: Peiguang Jing, Xianyi Liu, Ji Wang, Yinwei Wei, Liqiang Nie, Yuting Su
  • for: Image emotion distribution learning
  • methods: Style-guided high-order attention network, GRAM-based stylistic representations, adversary-constrained high-order attention mechanism, stylistic graph convolutional network
  • results: Effective emotion distribution learning compared to state-of-the-art methods, demonstrated through extensive experiments on several benchmark datasets.
    Abstract Emotion distribution learning has gained increasing attention with the tendency to express emotions through images. As for emotion ambiguity arising from humans' subjectivity, substantial previous methods generally focused on learning appropriate representations from the holistic or significant part of images. However, they rarely consider establishing connections with the stylistic information although it can lead to a better understanding of images. In this paper, we propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL, which interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents. Specifically, we consider exploring the intra- and inter-layer correlations among GRAM-based stylistic representations, and meanwhile exploit an adversary-constrained high-order attention mechanism to capture potential interactions between subtle visual parts. In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations to benefit the final emotion distribution learning. Extensive experiments conducted on several benchmark datasets demonstrate the effectiveness of our proposed StyleEDL compared to state-of-the-art methods. The implementation is released at: https://github.com/liuxianyi/StyleEDL.
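
The "GRAM-based stylistic representations" mentioned above are the classic Gram matrices of CNN feature maps. A minimal sketch of computing them is given below; StyleEDL's intra-/inter-layer correlation modeling, high-order attention, and stylistic graph network are not covered.

```python
import torch

def gram_style_representation(feat):
    # feat: (B, C, H, W) feature map from some backbone layer.
    # The Gram matrix of channel activations captures stylistic statistics.
    B, C, H, W = feat.shape
    f = feat.view(B, C, H * W)
    return f @ f.transpose(1, 2) / (C * H * W)   # (B, C, C), normalized
```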

Novel Class Discovery for Long-tailed Recognition

  • paper_url: http://arxiv.org/abs/2308.02989
  • repo_url: https://github.com/kleinzcy/ncdlr
  • paper_authors: Zhang Chuyu, Xu Ruijie, He Xuming
  • for: Studying a more realistic novel class discovery problem in which the distributions of both novel and known classes are long-tailed.
  • methods: An adaptive self-labeling strategy based on an equiangular prototype representation of classes; high-quality pseudo labels for the novel classes are obtained by solving a relaxed optimal transport problem, which effectively mitigates class bias.
  • results: Extensive experiments on CIFAR100, ImageNet100, Herbarium19, and the large-scale iNaturalist18 datasets demonstrate the superiority of the method. Code is available at https://github.com/kleinzcy/NCDLR.
    Abstract While the novel class discovery has recently made great progress, existing methods typically focus on improving algorithms on class-balanced benchmarks. However, in real-world recognition tasks, the class distributions of their corresponding datasets are often imbalanced, which leads to serious performance degeneration of those methods. In this paper, we consider a more realistic setting for novel class discovery where the distributions of novel and known classes are long-tailed. One main challenge of this new problem is to discover imbalanced novel classes with the help of long-tailed known classes. To tackle this problem, we propose an adaptive self-labeling strategy based on an equiangular prototype representation of classes. Our method infers high-quality pseudo-labels for the novel classes by solving a relaxed optimal transport problem and effectively mitigates the class biases in learning the known and novel classes. We perform extensive experiments on CIFAR100, ImageNet100, Herbarium19 and large-scale iNaturalist18 datasets, and the results demonstrate the superiority of our method. Our code is available at https://github.com/kleinzcy/NCDLR.
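
Self-labeling via optimal transport is commonly implemented with Sinkhorn-Knopp iterations. The sketch below shows that generic, class-balanced variant (cf. SeLa/SwAV), assuming logits against class prototypes; the paper's relaxed, imbalance-aware formulation differs from this.

```python
import torch

def sinkhorn_pseudo_labels(logits, n_iters=3, eps=0.05):
    # logits: (B, K) similarities between samples and K class prototypes.
    # Returns (B, K) soft pseudo-label distributions with (roughly) balanced columns.
    Q = torch.exp((logits - logits.max()) / eps).t()   # (K, B), stabilized
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K        # rows ~ uniform over classes
        Q /= Q.sum(dim=0, keepdim=True); Q /= B        # columns ~ one unit per sample
    return (Q * B).t()                                 # per-sample soft pseudo labels
```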

Introducing Feature Attention Module on Convolutional Neural Network for Diabetic Retinopathy Detection

  • paper_url: http://arxiv.org/abs/2308.02985
  • repo_url: None
  • paper_authors: Susmita Ghosh, Abhiroop Chatterjee
  • for: The paper proposes a new methodology for more accurate detection of diabetic retinopathy (DR) using deep learning models.
  • methods: The proposed method integrates a feature attention module with a pretrained VGG19 convolutional neural network (CNN) to enhance the discriminative power of the CNN. The feature attention module selectively highlights salient features from images and fuses them with the original input, which improves the network’s ability to focus on relevant information.
  • results: The proposed method achieves an accuracy of 95.70% on the APTOS (Asia Pacific Tele-Ophthalmology Society) DR Dataset, which is higher than the accuracy achieved by other state-of-the-art approaches.
    Abstract Diabetic retinopathy (DR) is a leading cause of blindness among diabetic patients. Deep learning models have shown promising results in automating the detection of DR. In the present work, we propose a new methodology that integrates a feature attention module with a pretrained VGG19 convolutional neural network (CNN) for more accurate DR detection. Here, the pretrained net is fine-tuned with the proposed feature attention block. The proposed module aims to leverage the complementary information from various regions of fundus images to enhance the discriminative power of the CNN. The said feature attention module incorporates an attention mechanism which selectively highlights salient features from images and fuses them with the original input. The simultaneous learning of attention weights for the features and thereupon the combination of attention-modulated features within the feature attention block facilitates the network's ability to focus on relevant information while reducing the impact of noisy or irrelevant features. Performance of the proposed method has been evaluated on a widely used dataset for diabetic retinopathy classification e.g., the APTOS (Asia Pacific Tele-Ophthalmology Society) DR Dataset. Results are compared with/without attention module, as well as with other state-of-the-art approaches. Results confirm that the introduction of the fusion module (fusing of feature attention module with CNN) improves the accuracy of DR detection achieving an accuracy of 95.70%.
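
As a rough illustration of fusing an attention module with a pretrained VGG19 backbone, here is a generic squeeze-and-excitation-style block whose output is fused with the original features. The paper's actual feature attention design, placement, and hyperparameters are not specified here, so this should not be read as the authors' architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class FeatureAttention(nn.Module):
    # Generic channel-attention block: highlight salient features, fuse with the input.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)   # per-channel attention weights
        return x + x * w                             # fuse attended features with input

class DRClassifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.backbone = vgg19(weights="IMAGENET1K_V1").features   # pretrained VGG19
        self.attn = FeatureAttention(512)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512, num_classes))

    def forward(self, x):
        return self.head(self.attn(self.backbone(x)))
```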

Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.02983
  • repo_url: https://github.com/xcyao00/fod
  • paper_authors: Xincheng Yao, Ruoqi Li, Zefeng Qian, Yan Luo, Chongyang Zhang
  • for: Proposing a new anomaly detection framework, FOcus-the-Discrepancy (FOD), that simultaneously spots the patch-wise, intra-, and inter-image discrepancies of anomalies.
  • methods: A transformer whose self-attention maps are renovated into Intra-Inter-Correlation (I2Correlation): a two-branch structure first explicitly establishes intra- and inter-image correlations and then fuses the two branches to spotlight abnormal patterns. RBF-kernel-based target correlations serve as self-supervised learning targets, and an entropy constraint counters mode collapse during optimization.
  • results: Superior anomaly detection performance on three unsupervised real-world AD benchmarks. Code will be available at https://github.com/xcyao00/FOD.
    Abstract Humans recognize anomalies through two aspects: larger patch-wise representation discrepancies and weaker patch-to-normal-patch correlations. However, the previous AD methods didn't sufficiently combine the two complementary aspects to design AD models. To this end, we find that Transformer can ideally satisfy the two aspects as its great power in the unified modeling of patch-wise representations and patch-to-patch correlations. In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies. The major characteristic of our method is that we renovate the self-attention maps in transformers to Intra-Inter-Correlation (I2Correlation). The I2Correlation contains a two-branch structure to first explicitly establish intra- and inter-image correlations, and then fuses the features of two-branch to spotlight the abnormal patterns. To learn the intra- and inter-correlations adaptively, we propose the RBF-kernel-based target-correlations as learning targets for self-supervised learning. Besides, we introduce an entropy constraint strategy to solve the mode collapse issue in optimization and further amplify the normal-abnormal distinguishability. Extensive experiments on three unsupervised real-world AD benchmarks show the superior performance of our approach. Code will be available at https://github.com/xcyao00/FOD.
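
One plausible (assumed) way to build RBF-kernel-based target correlations and pair them with an entropy term is sketched below; the exact construction of the targets and the entropy constraint in FOD may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def rbf_target_correlation(feats, sigma=1.0):
    # feats: (N, D) patch features. Returns a row-normalized RBF affinity matrix that
    # can act as a self-supervised target for attention maps.
    d2 = torch.cdist(feats, feats).pow(2)
    K = torch.exp(-d2 / (2 * sigma ** 2))
    return K / K.sum(dim=-1, keepdim=True)

def correlation_loss(attn, target, eps=1e-8):
    # KL divergence between predicted attention rows and RBF targets, plus an entropy
    # bonus that discourages collapsed (overly peaked) attention distributions.
    kl = F.kl_div((attn + eps).log(), target, reduction="batchmean")
    entropy = -(attn * (attn + eps).log()).sum(dim=-1).mean()
    return kl - 0.1 * entropy
```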

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

  • paper_url: http://arxiv.org/abs/2308.02982
  • repo_url: https://github.com/mr-neko/jm3d
  • paper_authors: Haowei Wang, Jiji Tang, Jiayi Ji, Xiaoshuai Sun, Rongsheng Zhang, Yiwei Ma, Minda Zhao, Lincheng Li, zeng zhao, Tangjie Lv, Rongrong Ji
  • for: Proposing a multi-view joint-modality alignment approach that addresses the information degradation and insufficient synergy issues of existing 2D-to-3D transfer methods.
  • methods: Two components: a Structured Multimodal Organizer (SMO) that enriches the vision and language modalities with contiguous multi-view images and hierarchical text, and a Joint Multi-modal Alignment (JMA) that incorporates language knowledge into the visual modality to model the joint vision-language space.
  • results: The proposed JM3D achieves state-of-the-art zero-shot 3D classification on ModelNet40 and ScanObjectNN, outperforming ULIP by approximately 4.3% with PointMLP and improving top-1 accuracy by up to 6.5% with PointNet++ on ModelNet40.
    Abstract In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% on PointMLP and achieves an improvement of up to 6.5% accuracy on PointNet++ in top-1 accuracy for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/JM3D.

Robust estimation of exposure ratios in multi-exposure image stacks

  • paper_url: http://arxiv.org/abs/2308.02968
  • repo_url: https://github.com/gfxdisp/hdrutils
  • paper_authors: Param Hanji, Rafał K. Mantiuk
  • for: Merging multi-exposure image stacks into a high dynamic range (HDR) image while eliminating banding artifacts caused by inaccurate exposure times.
  • methods: Exposure ratios are estimated directly from the input images by formulating an optimization problem that minimizes the estimation error caused by camera noise; in the logarithmic domain the problem is solved efficiently with a linear solver, and collecting pixels from multiple spatial tiles makes the estimate robust to pixel misalignment caused by camera or object motion.
  • results: The proposed method eliminates banding artifacts in popular datasets and is essential for applications that require physically accurate reconstructions, such as measuring the modulation transfer function of a display. The code for the method is available.
    Abstract Merging multi-exposure image stacks into a high dynamic range (HDR) image requires knowledge of accurate exposure times. When exposure times are inaccurate, for example, when they are extracted from a camera's EXIF metadata, the reconstructed HDR images reveal banding artifacts at smooth gradients. To remedy this, we propose to estimate exposure ratios directly from the input images. We derive the exposure time estimation as an optimization problem, in which pixels are selected from pairs of exposures to minimize estimation error caused by camera noise. When pixel values are represented in the logarithmic domain, the problem can be solved efficiently using a linear solver. We demonstrate that the estimation can be easily made robust to pixel misalignment caused by camera or object motion by collecting pixels from multiple spatial tiles. The proposed automatic exposure estimation and alignment eliminates banding artifacts in popular datasets and is essential for applications that require physically accurate reconstructions, such as measuring the modulation transfer function of a display. The code for the method is available.
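
In the log domain, pixels that see the same radiance in two exposures differ exactly by the log exposure ratio, which is what makes a linear solve possible. The snippet below is a simplified, unweighted version of that idea; the paper's noise-weighted formulation and multi-tile pixel selection are not reproduced here.

```python
import numpy as np

def estimate_log_exposure_ratio(img_a, img_b, lo=0.05, hi=0.95):
    # img_a, img_b: two linear (demosaicked, black-level-corrected) exposures of the
    # same scene, assumed normalized to [0, 1]. Returns an estimate of log(t_b / t_a).
    a = img_a.astype(np.float64).ravel()
    b = img_b.astype(np.float64).ravel()
    valid = (a > lo) & (a < hi) & (b > lo) & (b < hi)   # skip clipped / very noisy pixels
    # For pixels seeing the same radiance, log(b) - log(a) = log(t_b / t_a).
    diffs = np.log(b[valid]) - np.log(a[valid])
    # Median instead of mean for robustness to misaligned or moving pixels.
    return float(np.median(diffs))

# usage: ratio = np.exp(estimate_log_exposure_ratio(short_exposure, long_exposure))
```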

Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.02963
  • repo_url: https://github.com/hanbyel0105/diff-hmr
  • paper_authors: Hanbyel Cho, Junmo Kim
  • for: Diff-HMR reconstructs a 3D human body mesh from a 2D image while accounting for the multiple plausible solutions of this inherently ambiguous task.
  • methods: A denoising diffusion process over SMPL parameters: during training, ground-truth parameters are diffused toward a random distribution and the model learns the reverse process; at inference, random SMPL parameters are progressively refined into parameters that align with the input image.
  • results: As a generative approach, Diff-HMR produces diverse results for the same input image as the input noise varies; validation experiments show that it models the inherent ambiguity of human mesh recovery in a probabilistic manner.
    Abstract This work focuses on the problem of reconstructing a 3D human body mesh from a given 2D image. Despite the inherent ambiguity of the task of human mesh recovery, most existing works have adopted a method of regressing a single output. In contrast, we propose a generative approach framework, called "Diffusion-based Human Mesh Recovery (Diff-HMR)" that takes advantage of the denoising diffusion process to account for multiple plausible outcomes. During the training phase, the SMPL parameters are diffused from ground-truth parameters to random distribution, and Diff-HMR learns the reverse process of this diffusion. In the inference phase, the model progressively refines the given random SMPL parameters into the corresponding parameters that align with the input image. Diff-HMR, being a generative approach, is capable of generating diverse results for the same input image as the input noise varies. We conduct validation experiments, and the results demonstrate that the proposed framework effectively models the inherent ambiguity of the task of human mesh recovery in a probabilistic manner. The code is available at https://github.com/hanbyel0105/Diff-HMR
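
To make the "diffuse SMPL parameters, learn the reverse process" idea concrete, here is a toy conditional denoiser and a DDPM-style training step over a flattened SMPL parameter vector. The network size, conditioning, and parameter layout are assumptions for illustration, not Diff-HMR's implementation.

```python
import torch
import torch.nn as nn

class SMPLDenoiser(nn.Module):
    # Minimal conditional denoiser over a flattened SMPL vector (e.g. 72 pose + 10
    # shape = 82 dims), conditioned on an image feature and a diffusion timestep.
    def __init__(self, param_dim=82, img_feat_dim=2048, hidden=512, n_timesteps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_timesteps, hidden)
        self.net = nn.Sequential(
            nn.Linear(param_dim + img_feat_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, noisy_params, img_feat, t):
        x = torch.cat([noisy_params, img_feat, self.t_embed(t)], dim=-1)
        return self.net(x)                       # predicted noise

def diffusion_training_step(model, params, img_feat, alphas_cumprod):
    # Standard DDPM objective: corrupt the ground-truth parameters at a random
    # timestep and train the model to predict the added noise.
    t = torch.randint(0, len(alphas_cumprod), (params.shape[0],), device=params.device)
    noise = torch.randn_like(params)
    a = alphas_cumprod[t].unsqueeze(-1)
    noisy = a.sqrt() * params + (1 - a).sqrt() * noise
    return ((model(noisy, img_feat, t) - noise) ** 2).mean()
```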

DermoSegDiff: A Boundary-aware Segmentation Diffusion Model for Skin Lesion Delineation

  • paper_url: http://arxiv.org/abs/2308.02959
  • repo_url: https://github.com/mindflow-institue/dermosegdiff
  • paper_authors: Afshin Bozorgpour, Yousef Sadegheih, Amirhossein Kazerouni, Reza Azad, Dorit Merhof
  • for: Early detection and accurate diagnosis of dermatological conditions through skin lesion segmentation.
  • methods: A denoising diffusion probabilistic model (DDPM) with a U-Net-based denoising network that integrates noise and semantic information; boundary information is incorporated during learning through a loss that prioritizes boundaries and gradually reduces the significance of other regions.
  • results: Superior effectiveness and generalization compared with CNN-, transformer-, and diffusion-based approaches on multiple skin segmentation datasets.
    Abstract Skin lesion segmentation plays a critical role in the early detection and accurate diagnosis of dermatological conditions. Denoising Diffusion Probabilistic Models (DDPMs) have recently gained attention for their exceptional image-generation capabilities. Building on these advancements, we propose DermoSegDiff, a novel framework for skin lesion segmentation that incorporates boundary information during the learning process. Our approach introduces a novel loss function that prioritizes the boundaries during training, gradually reducing the significance of other regions. We also introduce a novel U-Net-based denoising network that proficiently integrates noise and semantic information inside the network. Experimental results on multiple skin segmentation datasets demonstrate the superiority of DermoSegDiff over existing CNN, transformer, and diffusion-based approaches, showcasing its effectiveness and generalization in various scenarios. The implementation is publicly accessible on \href{https://github.com/mindflow-institue/dermosegdiff}{GitHub}
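
A common way to realize a loss that "prioritizes the boundaries" is to weight each pixel by its distance to the lesion boundary. The sketch below uses a distance-transform-based weight map as an assumed stand-in for the paper's novel loss, which is defined differently.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weight_map(mask, sigma=5.0):
    # mask: (H, W) binary ground-truth lesion mask (numpy). Weights peak at the lesion
    # boundary (~2) and decay toward 1 with distance from it.
    dist_in = distance_transform_edt(mask)        # distance to background, inside lesion
    dist_out = distance_transform_edt(1 - mask)   # distance to lesion, outside it
    dist = dist_in + dist_out                     # distance to the boundary everywhere
    return 1.0 + np.exp(-(dist ** 2) / (2 * sigma ** 2))

def weighted_bce(pred_logits, mask):
    # pred_logits, mask: (H, W) torch tensors on CPU for this sketch.
    w = torch.from_numpy(boundary_weight_map(mask.numpy())).float()
    return F.binary_cross_entropy_with_logits(pred_logits, mask.float(), weight=w)
```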

K-band: Self-supervised MRI Reconstruction via Stochastic Gradient Descent over K-space Subsets

  • paper_url: http://arxiv.org/abs/2308.02958
  • repo_url: https://github.com/mikgroup/k-band
  • paper_authors: Frederic Wang, Han Qi, Alfredo De Goyeneche, Reinhard Heckel, Michael Lustig, Efrat Shimron
  • for: Enabling deep learning MRI reconstruction models to be trained using only partial, limited-resolution k-space data.
  • methods: A new mathematical framework, k-band, trains DL models with stochastic gradient descent over k-space subsets: in each training iteration, gradients are computed from only a small portion of k-space (a limited-resolution "band") rather than the fully sampled data.
  • results: Numerical experiments on raw MRI data show that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art (SoTA) methods trained on high-resolution data, i.e., SoTA performance without high-resolution training data.
    Abstract Although deep learning (DL) methods are powerful for solving inverse problems, their reliance on high-quality training data is a major hurdle. This is significant in high-dimensional (dynamic/volumetric) magnetic resonance imaging (MRI), where acquisition of high-resolution fully sampled k-space data is impractical. We introduce a novel mathematical framework, dubbed k-band, that enables training DL models using only partial, limited-resolution k-space data. Specifically, we introduce training with stochastic gradient descent (SGD) over k-space subsets. In each training iteration, rather than using the fully sampled k-space for computing gradients, we use only a small k-space portion. This concept is compatible with different sampling strategies; here we demonstrate the method for k-space "bands", which have limited resolution in one dimension and can hence be acquired rapidly. We prove analytically that our method stochastically approximates the gradients computed in a fully-supervised setup, when two simple conditions are met: (i) the limited-resolution axis is chosen randomly-uniformly for every new scan, hence k-space is fully covered across the entire training set, and (ii) the loss function is weighed with a mask, derived here analytically, which facilitates accurate reconstruction of high-resolution details. Numerical experiments with raw MRI data indicate that k-band outperforms two other methods trained on limited-resolution data and performs comparably to state-of-the-art (SoTA) methods trained on high-resolution data. k-band hence obtains SoTA performance, with the advantage of training using only limited-resolution data. This work hence introduces a practical, easy-to-implement, self-supervised training framework, which involves fast acquisition and self-supervised reconstruction and offers theoretical guarantees.
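
A minimal sketch of the k-band training idea is given below: sample a limited-resolution k-space band whose orientation is chosen at random, and penalize the reconstruction only inside it. The band geometry is an assumption, and the analytically derived loss-weighting mask from the paper (condition (ii)) is not included.

```python
import torch

def random_band_mask(shape, band_frac=0.25):
    # shape: (H, W) k-space grid. Keeps a centered band with limited extent along a
    # randomly chosen axis, so that across many scans the whole of k-space is covered
    # (condition (i) in the abstract).
    H, W = shape
    mask = torch.zeros(H, W)
    if torch.rand(()) < 0.5:                       # limited resolution along H
        h = max(1, int(H * band_frac))
        mask[(H - h) // 2:(H + h) // 2, :] = 1.0
    else:                                          # limited resolution along W
        w = max(1, int(W * band_frac))
        mask[:, (W - w) // 2:(W + w) // 2] = 1.0
    return mask

def banded_loss(pred_image, ref_image, mask):
    # Compare the prediction to the reference only inside the acquired k-space band.
    pred_k = torch.fft.fftshift(torch.fft.fft2(pred_image))
    ref_k = torch.fft.fftshift(torch.fft.fft2(ref_image))
    return (mask * (pred_k - ref_k)).abs().mean()
```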

Multispectral Quantitative Phase Imaging Using a Diffractive Optical Network

  • paper_url: http://arxiv.org/abs/2308.02952
  • repo_url: None
  • paper_authors: Che-Yung Shen, Jingxi Li, Deniz Mengu, Aydogan Ozcan
  • for: All-optical, compact, and power-efficient multispectral quantitative phase imaging (QPI) of transparent specimens, for applications in biology, materials science, and engineering.
  • methods: Spatially engineered diffractive layers, optimized through deep learning, encode the phase profile of the input object at a predetermined set of wavelengths into spatial intensity variations at the output plane, enabling multispectral QPI with a monochrome focal plane array.
  • results: Numerical simulations demonstrate diffractive multispectral processors that simultaneously perform quantitative phase imaging at 9 and 16 target spectral bands in the visible spectrum, with uniform performance across all wavelength channels.
    Abstract As a label-free imaging technique, quantitative phase imaging (QPI) provides optical path length information of transparent specimens for various applications in biology, materials science, and engineering. Multispectral QPI measures quantitative phase information across multiple spectral bands, permitting the examination of wavelength-specific phase and dispersion characteristics of samples. Here, we present the design of a diffractive processor that can all-optically perform multispectral quantitative phase imaging of transparent phase-only objects in a snapshot. Our design utilizes spatially engineered diffractive layers, optimized through deep learning, to encode the phase profile of the input object at a predetermined set of wavelengths into spatial intensity variations at the output plane, allowing multispectral QPI using a monochrome focal plane array. Through numerical simulations, we demonstrate diffractive multispectral processors to simultaneously perform quantitative phase imaging at 9 and 16 target spectral bands in the visible spectrum. These diffractive multispectral processors maintain uniform performance across all the wavelength channels, revealing a decent QPI performance at each target wavelength. The generalization of these diffractive processor designs is validated through numerical tests on unseen objects, including thin Pap smear images. Due to its all-optical processing capability using passive dielectric diffractive materials, this diffractive multispectral QPI processor offers a compact and power-efficient solution for high-throughput quantitative phase microscopy and spectroscopy. This framework can operate at different parts of the electromagnetic spectrum and be used for a wide range of phase imaging and sensing applications.

MomentaMorph: Unsupervised Spatial-Temporal Registration with Momenta, Shooting, and Correction

  • paper_url: http://arxiv.org/abs/2308.02949
  • repo_url: None
  • paper_authors: Zhangxing Bian, Shuwen Wei, Yihao Liu, Junyu Chen, Jiachen Zhuo, Fangxu Xing, Jonghye Woo, Aaron Carass, Jerry L. Prince
  • For: This paper aims to address the challenges of registering tagged magnetic resonance imaging (tMRI) data with large motion and repetitive patterns, which can lead to motion estimation errors.
  • Methods: The proposed method uses a "momenta, shooting, and correction" framework grounded in Lie algebra and Lie group principles. This framework accumulates momenta in the tangent vector space and employs exponential mapping in the diffeomorphic space for rapid approximation towards true optima, circumventing local optima.
  • Results: The method is demonstrated to be efficient in estimating accurate, dense, and diffeomorphic 2D/3D motion fields amidst large motion and repetitive patterns on both a 2D synthetic dataset and a real 3D tMRI dataset.
    Abstract Tagged magnetic resonance imaging (tMRI) has been employed for decades to measure the motion of tissue undergoing deformation. However, registration-based motion estimation from tMRI is difficult due to the periodic patterns in these images, particularly when the motion is large. With a larger motion the registration approach gets trapped in a local optima, leading to motion estimation errors. We introduce a novel "momenta, shooting, and correction" framework for Lagrangian motion estimation in the presence of repetitive patterns and large motion. This framework, grounded in Lie algebra and Lie group principles, accumulates momenta in the tangent vector space and employs exponential mapping in the diffeomorphic space for rapid approximation towards true optima, circumventing local optima. A subsequent correction step ensures convergence to true optima. The results on a 2D synthetic dataset and a real 3D tMRI dataset demonstrate our method's efficiency in estimating accurate, dense, and diffeomorphic 2D/3D motion fields amidst large motion and repetitive patterns.
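
The exponential mapping in the diffeomorphic space referred to above is typically computed with scaling and squaring of a stationary velocity field. A generic 2D sketch of that step follows; this is standard registration machinery, not the paper's full momenta-shooting-correction pipeline.

```python
import torch
import torch.nn.functional as F

def exp_map(velocity, n_steps=6):
    # Scaling-and-squaring integration of a stationary velocity field into a
    # diffeomorphic displacement field. velocity: (B, 2, H, W), in pixels,
    # channel 0 = x displacement, channel 1 = y displacement.
    disp = velocity / (2 ** n_steps)
    B, _, H, W = disp.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()[None].to(disp)   # identity grid (1, 2, H, W)
    for _ in range(n_steps):
        # Compose the field with itself: disp(x) <- disp(x) + disp(x + disp(x)).
        coords = grid + disp
        norm = torch.stack((2 * coords[:, 0] / (W - 1) - 1,
                            2 * coords[:, 1] / (H - 1) - 1), dim=-1)   # (B, H, W, 2)
        disp = disp + F.grid_sample(disp, norm, align_corners=True,
                                    padding_mode="border")
    return disp
```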

Blind Motion Deblurring with Pixel-Wise Kernel Estimation via Kernel Prediction Networks

  • paper_url: http://arxiv.org/abs/2308.02947
  • repo_url: https://github.com/guillermocarbajal/j-mkpd
  • paper_authors: Guillermo Carbajal, Patricia Vitoria, José Lezama, Pablo Musé
  • for: Removing motion blur from photographs.
  • methods: A learning-based pipeline that first estimates dense, non-uniform per-pixel motion blur kernels (represented by image-adaptive basis kernels and mixing coefficients) and then applies a non-blind deconvolution.
  • results: Restorations of real blurred images that are competitive with or superior to those obtained with existing end-to-end deep learning-based methods.
    Abstract In recent years, the removal of motion blur in photographs has seen impressive progress in the hands of deep learning-based methods, trained to map directly from blurry to sharp images. For this reason, approaches that explicitly use a forward degradation model received significantly less attention. However, a well-defined specification of the blur genesis, as an intermediate step, promotes the generalization and explainability of the method. Towards this goal, we propose a learning-based motion deblurring method based on dense non-uniform motion blur estimation followed by a non-blind deconvolution approach. Specifically, given a blurry image, a first network estimates the dense per-pixel motion blur kernels using a lightweight representation composed of a set of image-adaptive basis motion kernels and the corresponding mixing coefficients. Then, a second network trained jointly with the first one, unrolls a non-blind deconvolution method using the motion kernel field estimated by the first network. The model-driven aspect is further promoted by training the networks on sharp/blurry pairs synthesized according to a convolution-based, non-uniform motion blur degradation model. Qualitative and quantitative evaluation shows that the kernel prediction network produces accurate motion blur estimates, and that the deblurring pipeline leads to restorations of real blurred images that are competitive or superior to those obtained with existing end-to-end deep learning-based methods. Code and trained models are available at https://github.com/GuillermoCarbajal/J-MKPD/.
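
The degradation model behind the pipeline is a per-pixel blur kernel expressed as a mixture of basis kernels. A sketch of applying such a kernel field to a sharp image is shown below; tensor shapes are assumed for illustration, and the networks that actually predict the basis and coefficients are not included.

```python
import torch
import torch.nn.functional as F

def apply_pixelwise_kernels(sharp, basis, coeffs):
    # sharp:  (B, C, H, W) image
    # basis:  (K, k, k) basis motion kernels (assumed normalized)
    # coeffs: (B, K, H, W) per-pixel mixing coefficients (e.g. softmax over K)
    # Blurs each pixel with its own kernel, expressed as a mixture of the basis:
    # blurred(x) = sum_k coeffs_k(x) * (sharp * basis_k)(x).
    B, C, H, W = sharp.shape
    K, k, _ = basis.shape
    kernels = basis.view(K, 1, k, k).repeat_interleave(C, dim=0)     # (K*C, 1, k, k)
    x = sharp.repeat(1, K, 1, 1)                                     # (B, K*C, H, W)
    blurred_per_basis = F.conv2d(x, kernels, padding=k // 2, groups=K * C)
    blurred_per_basis = blurred_per_basis.view(B, K, C, H, W)
    return (coeffs.unsqueeze(2) * blurred_per_basis).sum(dim=1)      # (B, C, H, W)
```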

Automatic registration with continuous pose updates for marker-less surgical navigation in spine surgery

  • paper_url: http://arxiv.org/abs/2308.02917
  • repo_url: None
  • paper_authors: Florentin Liebmann, Marco von Atzigen, Dominik Stütz, Julian Wolf, Lukas Zingg, Daniel Suter, Laura Leoty, Hooman Esfandiari, Jess G. Snedeker, Martin R. Oswald, Marc Pollefeys, Mazda Farshad, Philipp Fürnstahl
  • for: Developing a radiation-free, marker-less surgical navigation system to improve the accuracy of pedicle screw placement in lumbar spinal fusion surgery.
  • methods: A deep neural network segments the lumbar spine and predicts its orientation to initialize registration of preoperative models, which is then refined for each vertebra individually and updated in real time with GPU acceleration while handling surgeon occlusions; guidance is delivered through an augmented reality navigation system.
  • results: On a public dataset, 96% successful registrations, a target registration error of 2.73 mm, a screw trajectory error of 1.79°, and a screw entry point error of 2.43 mm; an ex-vivo surgery yielded 100% screw accuracy and a registration accuracy of 1.20 mm.
    Abstract Established surgical navigation systems for pedicle screw placement have been proven to be accurate, but still reveal limitations in registration or surgical guidance. Registration of preoperative data to the intraoperative anatomy remains a time-consuming, error-prone task that includes exposure to harmful radiation. Surgical guidance through conventional displays has well-known drawbacks, as information cannot be presented in-situ and from the surgeon's perspective. Consequently, radiation-free and more automatic registration methods with subsequent surgeon-centric navigation feedback are desirable. In this work, we present an approach that automatically solves the registration problem for lumbar spinal fusion surgery in a radiation-free manner. A deep neural network was trained to segment the lumbar spine and simultaneously predict its orientation, yielding an initial pose for preoperative models, which then is refined for each vertebra individually and updated in real-time with GPU acceleration while handling surgeon occlusions. An intuitive surgical guidance is provided thanks to the integration into an augmented reality based navigation system. The registration method was verified on a public dataset with a mean of 96\% successful registrations, a target registration error of 2.73 mm, a screw trajectory error of 1.79{\deg} and a screw entry point error of 2.43 mm. Additionally, the whole pipeline was validated in an ex-vivo surgery, yielding a 100\% screw accuracy and a registration accuracy of 1.20 mm. Our results meet clinical demands and emphasize the potential of RGB-D data for fully automatic registration approaches in combination with augmented reality guidance.

DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

  • paper_url: http://arxiv.org/abs/2308.02915
  • repo_url: None
  • paper_authors: Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan
  • for: Generating high-resolution, long-form dance sequences that align effectively with the input music.
  • methods: A cascaded motion diffusion model (DiffDance) comprising a music-to-dance diffusion model and a sequence super-resolution diffusion model; a pretrained audio representation learning model extracts music embeddings whose space is aligned to motion via a contrastive loss, while multiple geometric losses and a dynamic loss weight over diffusion timesteps keep outputs physically plausible and diverse.
  • results: Extensive experiments on the AIST++ benchmark show that DiffDance generates realistic dance sequences that align effectively with the input music, with results comparable to state-of-the-art autoregressive methods.
    Abstract When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods.
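
The contrastive alignment between music embeddings and motion mentioned above can be illustrated with a symmetric InfoNCE loss; the temperature and exact formulation here are assumptions rather than DiffDance's implementation.

```python
import torch
import torch.nn.functional as F

def music_motion_contrastive_loss(music_emb, motion_emb, temperature=0.07):
    # music_emb, motion_emb: (B, D) embeddings of matched music/motion clips.
    # CLIP-style symmetric InfoNCE: matched pairs are positives, the rest negatives.
    music = F.normalize(music_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)
    logits = music @ motion.t() / temperature
    targets = torch.arange(len(music), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```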