cs.CV - 2023-11-11

3DFusion, A real-time 3D object reconstruction pipeline based on streamed instance segmented data

paper_url: http://arxiv.org/abs/2311.06659
repo_url: None
paper_authors: Xi Sun, Derek Jacoby, Yvonne Coady
for:这篇论文旨在提供一个实时分割和重建系统，该系统使用RGB-D图像来生成准确和详细的indoor环境中对象的三维模型。methods:该系统采用了当今最佳实例分割技术，对RGB-D数据进行像素级分割，以分割背景和前景对象。然后，通过高性能计算平台，对分割的对象进行三维重建。results:该系统可以实现实时三维模elling，并且可以应用于各种领域，如增强现实/虚拟现实、内部设计、城市规划、公路协助、安全系统等。为了实现实时性，论文提出了一种方法，通过采样连续帧来减少网络负担，保证重建质量。此外，该系统采用了并行SLAM管道，以高效地将分割对象切割成个体。该系统使用了领先的框架YOLO进行实例分割，并对YOLO进行修改，以解决类似对象的重复或假检测问题，确保重建模型与目标对象保持一致。总之，该工作建立了一个可靠的实时系统，对indoor环境中对象的分割和重建进行了显著提高。它可能会扩展到户外场景，开启了许多实际应用的可能性。

Abstract
This paper presents a real-time segmentation and reconstruction system that utilizes RGB-D images to generate accurate and detailed individual 3D models of objects within a captured scene. Leveraging state-of-the-art instance segmentation techniques, the system performs pixel-level segmentation on RGB-D data, effectively separating foreground objects from the background. The segmented objects are then reconstructed into distinct 3D models in a high-performance computation platform. The real-time 3D modelling can be applied across various domains, including augmented/virtual reality, interior design, urban planning, road assistance, security systems, and more. To achieve real-time performance, the paper proposes a method that effectively samples consecutive frames to reduce network load while ensuring reconstruction quality. Additionally, a multi-process SLAM pipeline is adopted for parallel 3D reconstruction, enabling efficient cutting of the clustering objects into individuals. This system employs the industry-leading framework YOLO for instance segmentation. To improve YOLO's performance and accuracy, modifications were made to resolve duplicated or false detection of similar objects, ensuring the reconstructed models align with the targets. Overall, this work establishes a robust real-time system with a significant enhancement for object segmentation and reconstruction in the indoor environment. It can potentially be extended to the outdoor scenario, opening up numerous opportunities for real-world applications.

摘要
To achieve real-time performance, the paper proposes a method that effectively samples consecutive frames to reduce network load while ensuring reconstruction quality. Additionally, a multi-process SLAM pipeline is adopted for parallel 3D reconstruction, enabling efficient cutting of the clustering objects into individuals. The system employs the industry-leading framework YOLO for instance segmentation, with modifications made to resolve duplicated or false detection of similar objects, ensuring the reconstructed models align with the targets.Overall, this work establishes a robust real-time system with a significant enhancement for object segmentation and reconstruction in the indoor environment. It can potentially be extended to the outdoor scenario, opening up numerous opportunities for real-world applications.

Unsupervised and semi-supervised co-salient object detection via segmentation frequency statistics

paper_url: http://arxiv.org/abs/2311.06654
repo_url: https://github.com/sourachakra/uscosod-sscosod
paper_authors: Souradeep Chakraborty, Shujon Naha, Muhammet Bastan, Amit Kumar K C, Dimitris Samaras
for: 本文提出了一种不需要监督的对象共同突出物检测方法（CoSOD），用于检测图像集中的共同突出物。
methods: 本文使用了自动学习的自编码器和自注意力机制，将图像分割成不同类别的单个像素，并计算每个类别的对象共同突出度。
results: 本文的方法可以在不具备监督的情况下，基于图像集的频率统计学习，实现高度精度的对象共同突出物检测。 compared with existing methods, the proposed method has a significant improvement in performance.

Abstract
In this paper, we address the detection of co-occurring salient objects (CoSOD) in an image group using frequency statistics in an unsupervised manner, which further enable us to develop a semi-supervised method. While previous works have mostly focused on fully supervised CoSOD, less attention has been allocated to detecting co-salient objects when limited segmentation annotations are available for training. Our simple yet effective unsupervised method US-CoSOD combines the object co-occurrence frequency statistics of unsupervised single-image semantic segmentations with salient foreground detections using self-supervised feature learning. For the first time, we show that a large unlabeled dataset e.g. ImageNet-1k can be effectively leveraged to significantly improve unsupervised CoSOD performance. Our unsupervised model is a great pre-training initialization for our semi-supervised model SS-CoSOD, especially when very limited labeled data is available for training. To avoid propagating erroneous signals from predictions on unlabeled data, we propose a confidence estimation module to guide our semi-supervised training. Extensive experiments on three CoSOD benchmark datasets show that both of our unsupervised and semi-supervised models outperform the corresponding state-of-the-art models by a significant margin (e.g., on the Cosal2015 dataset, our US-CoSOD model has an 8.8% F-measure gain over a SOTA unsupervised co-segmentation model and our SS-CoSOD model has an 11.81% F-measure gain over a SOTA semi-supervised CoSOD model).

摘要
在这篇论文中，我们 Addresses the detection of co-occurring salient objects (CoSOD) in an image group using frequency statistics in an unsupervised manner, which further enables us to develop a semi-supervised method. Previous works have mostly focused on fully supervised CoSOD, but less attention has been allocated to detecting co-salient objects when limited segmentation annotations are available for training. Our simple yet effective unsupervised method US-CoSOD combines the object co-occurrence frequency statistics of unsupervised single-image semantic segmentations with salient foreground detections using self-supervised feature learning. For the first time, we show that a large unlabeled dataset e.g. ImageNet-1k can be effectively leveraged to significantly improve unsupervised CoSOD performance. Our unsupervised model is a great pre-training initialization for our semi-supervised model SS-CoSOD, especially when very limited labeled data is available for training. To avoid propagating erroneous signals from predictions on unlabeled data, we propose a confidence estimation module to guide our semi-supervised training. Extensive experiments on three CoSOD benchmark datasets show that both of our unsupervised and semi-supervised models outperform the corresponding state-of-the-art models by a significant margin (e.g., on the Cosal2015 dataset, our US-CoSOD model has an 8.8% F-measure gain over a SOTA unsupervised co-segmentation model and our SS-CoSOD model has an 11.81% F-measure gain over a SOTA semi-supervised CoSOD model).Here's the translation in Traditional Chinese:在这篇论文中，我们 Addresses the detection of co-occurring salient objects (CoSOD) in an image group using frequency statistics in an unsupervised manner, which further enables us to develop a semi-supervised method. 前一些工作主要集中在完全supervised CoSOD，但是对于有限的分类标注available for training时，对于检测共同突出的物件更少的注意力。我们的简单 yet effective unsupervised method US-CoSOD combines the object co-occurrence frequency statistics of unsupervised single-image semantic segmentations with salient foreground detections using self-supervised feature learning. 我们首次显示了一个大量的无标注数据集例如ImageNet-1k可以对不supervised CoSOD performance进行明显改善。我们的无supervised model是一个优秀的预训初始化 для我们的半supervised model SS-CoSOD，特别是当有很少的标注数据available for training时。为了避免对无标注数据预测中的错误信号传播，我们提出了一个信任估计模组来引导我们的半supervised训练。广泛的实验表明我们的无supervised和半supervised模型在三个CoSOD benchmark dataset上都大比前一些state-of-the-art模型进行了明显改善 (e.g., on the Cosal2015 dataset, our US-CoSOD model has an 8.8% F-measure gain over a SOTA unsupervised co-segmentation model and our SS-CoSOD model has an 11.81% F-measure gain over a SOTA semi-supervised CoSOD model).

Traffic Sign Recognition Using Local Vision Transformer

paper_url: http://arxiv.org/abs/2311.06651
repo_url: None
paper_authors: Ali Farzipour, Omid Nejati Manzari, Shahriar B. Shokouhi
for: 提高自驾车和驾手助手系统中的交通标志识别率
methods: combining convolutional blocks和 transformer-based blocks，并添加了一个局部模块以提高局部感知
results: 在德国交通标志识别benchmark和波斯尼亚交通标志数据集上达到了99.66%和99.8%的准确率，高于最佳 convolutional models，同时具有快速推理速度和实际应用场景适用性。

Abstract
Recognition of traffic signs is a crucial aspect of self-driving cars and driver assistance systems, and machine vision tasks such as traffic sign recognition have gained significant attention. CNNs have been frequently used in machine vision, but introducing vision transformers has provided an alternative approach to global feature learning. This paper proposes a new novel model that blends the advantages of both convolutional and transformer-based networks for traffic sign recognition. The proposed model includes convolutional blocks for capturing local correlations and transformer-based blocks for learning global dependencies. Additionally, a locality module is incorporated to enhance local perception. The performance of the suggested model is evaluated on the Persian Traffic Sign Dataset and German Traffic Sign Recognition Benchmark and compared with SOTA convolutional and transformer-based models. The experimental evaluations demonstrate that the hybrid network with the locality module outperforms pure transformer-based models and some of the best convolutional networks in accuracy. Specifically, our proposed final model reached 99.66% accuracy in the German traffic sign recognition benchmark and 99.8% in the Persian traffic sign dataset, higher than the best convolutional models. Moreover, it outperforms existing CNNs and ViTs while maintaining fast inference speed. Consequently, the proposed model proves to be significantly faster and more suitable for real-world applications.

摘要
自驾车和助手系统中识别交通标识是一项关键性的任务，机器视觉任务如交通标识已经吸引了广泛的关注。CNNs在机器视觉中经常被使用，但是引入视Transformers提供了一种全局特征学习的替代方法。这篇论文提议了一种新的 hybrid 模型，将 convolutional 块和 transformer-based 块结合在一起，以便捕捉本地相关性和全球依赖关系。此外，还包含了一个 locality 模块，以增强本地感知。提议的模型在 Persian Traffic Sign Dataset 和 German Traffic Sign Recognition Benchmark 上进行了实验评估，与 SOTA convolutional 和 transformer-based 模型进行比较。实验结果表明，提议的 hybrid 网络在准确率方面与最佳 convolutional 模型和一些 transformer-based 模型相当，而且具有更快的推理速度。特别是，我们的最终模型在 German traffic sign recognition benchmark 上达到了 99.66% 的准确率，而 Persian traffic sign dataset 上达到了 99.8%。此外，它还超越了现有的 CNNs 和 ViTs，而且保持了快速的推理速度。因此，我们的提议模型在实际应用中具有优势。

Back to Basics: Fast Denoising Iterative Algorithm

paper_url: http://arxiv.org/abs/2311.06634
repo_url: None
paper_authors: Deborah Pereg
for: 降低噪音，提高图像质量
methods: 使用Back to Basics（BTB）快迭代算法，不需训练或真实数据，可应用于独立噪音和相关噪音环境中
results: 对三个研究 caso进行了实验，包括自然图像噪音纠正、POisson分布图像纠正和optical coherence tomography（OCT）干涉抑制，实验结果表明提案方法可以有效提高图像质量，在噪音设定中展现出良好的性能，并提供了理论保证。

Abstract
We introduce Back to Basics (BTB), a fast iterative algorithm for noise reduction. Our method is computationally efficient, does not require training or ground truth data, and can be applied in the presence of independent noise, as well as correlated (coherent) noise, where the noise level is unknown. We examine three study cases: natural image denoising in the presence of additive white Gaussian noise, Poisson-distributed image denoising, and speckle suppression in optical coherence tomography (OCT). Experimental results demonstrate that the proposed approach can effectively improve image quality, in challenging noise settings. Theoretical guarantees are provided for convergence stability.

摘要
我们介绍Back to Basics（BTB）算法，它是一种快速迭代的噪声减少方法。我们的方法不需要训练或真实数据，可以在独立噪声和相关噪声（相对干扰）的情况下应用，而且噪声水平未知。我们在自然图像噪声 removing中使用了三个研究 caso：在添加白噪声的情况下的自然图像净化、Poisson分布图像净化和optical coherence tomography（OCT）中的斑点消除。实验结果表明，我们提出的方法可以有效地提高图像质量，在具有挑战性噪声的情况下。我们也提供了理论保证对 converges 稳定性。

A 3D Conditional Diffusion Model for Image Quality Transfer – An Application to Low-Field MRI

paper_url: http://arxiv.org/abs/2311.06631
repo_url: https://github.com/edshkim98/diffusioniqt
paper_authors: Seunghoi Kim, Henry F. J. Tregidgo, Ahmed K. Eldaly, Matteo Figini, Daniel C. Alexander
for: 提高低场磁共振成像质量
methods: 使用3Dconditional扩散模型和cross-batch机制提高自注意力和填充
results: 在HCP数据集上比较出色，远超过现有方法 both quantitatively and qualitatively

Abstract
Low-field (LF) MRI scanners (<1T) are still prevalent in settings with limited resources or unreliable power supply. However, they often yield images with lower spatial resolution and contrast than high-field (HF) scanners. This quality disparity can result in inaccurate clinician interpretations. Image Quality Transfer (IQT) has been developed to enhance the quality of images by learning a mapping function between low and high-quality images. Existing IQT models often fail to restore high-frequency features, leading to blurry output. In this paper, we propose a 3D conditional diffusion model to improve 3D volumetric data, specifically LF MR images. Additionally, we incorporate a cross-batch mechanism into the self-attention and padding of our network, ensuring broader contextual awareness even under small 3D patches. Experiments on the publicly available Human Connectome Project (HCP) dataset for IQT and brain parcellation demonstrate that our model outperforms existing methods both quantitatively and qualitatively. The code is publicly available at \url{https://github.com/edshkim98/DiffusionIQT}.

摘要
低场（LF）MRI仪器（<1T）仍然广泛用于具有限制的资源或不可靠的电力供应的设置。然而，它们经常生成图像的空间分辨率和对比度较低，从而导致临床医生的解释不准确。图像质量传输（IQT）已经被开发来提高图像质量，学习映射函数 между低质量和高质量图像。现有的IQT模型经常无法恢复高频特征，导致输出模糊不清。在这篇论文中，我们提议一种3D条件扩散模型，用于改进3D积分数据，特别是LF MR图像。此外，我们在网络中包含了跨批机制，使其在小3D片段中保持更广泛的Contextual awareness。实验结果表明，我们的模型在公共可用的人类连接组计划（HCP）数据集上的IQT和脑分割 task中，与现有方法相比，具有较高的数量和质量性能。代码可以在 \url{https://github.com/edshkim98/DiffusionIQT} 上获取。

Computer Vision for Particle Size Analysis of Coarse-Grained Soils

paper_url: http://arxiv.org/abs/2311.06613
repo_url: None
paper_authors: Sompote Youwai, Parchya Makam
for: 本研究用computer vision技术和Python编程语言进行粒子大小分析，以提高土壤物理特性的评估效率。
methods: 使用OPENCV库对普通照明条件下拍摄的土壤粒子进行检测和测量，并使用标准手持式摄像头。
results: 相比传统筛分分析方法，该方法在大于2mm粒子上表现出良好的准确性（MAPE约6%），但是小于2mm粒子的MAPE可达60%，建议使用更高分辨率的摄像头进行拍摄。

Abstract
Particle size analysis (PSA) is a fundamental technique for evaluating the physical characteristics of soils. However, traditional methods like sieving can be time-consuming and labor-intensive. In this study, we present a novel approach that utilizes computer vision (CV) and the Python programming language for PSA of coarse-grained soils, employing a standard mobile phone camera. By eliminating the need for a high-performance camera, our method offers convenience and cost savings. Our methodology involves using the OPENCV library to detect and measure soil particles in digital photographs taken under ordinary lighting conditions. For accurate particle size determination, a calibration target with known dimensions is placed on a plain paper alongside 20 different sand samples. The proposed method is compared with traditional sieve analysis and exhibits satisfactory performance for soil particles larger than 2 mm, with a mean absolute percent error (MAPE) of approximately 6%. However, particles smaller than 2 mm result in higher MAPE, reaching up to 60%. To address this limitation, we recommend using a higher-resolution camera to capture images of the smaller soil particles. Furthermore, we discuss the advantages, limitations, and potential future improvements of our method. Remarkably, the program can be executed on a mobile phone, providing immediate results without the need to send soil samples to a laboratory. This field-friendly feature makes our approach highly convenient for on-site usage, outside of a traditional laboratory setting. Ultimately, this novel method represents an initial disruption to the industry, enabling efficient particle size analysis of soil without the reliance on laboratory-based sieve analysis. KEYWORDS: Computer vision, Grain size, ARUCO

摘要
计量粒子分析（PSA）是土壤物理特性的基本技术。然而，传统方法如筛分可能是时间consuming和人力成本高。在这项研究中，我们提出了一种新的方法，利用计算机视觉（CV）和Python编程语言进行PSA，使用标准的移动电话摄像头。它消除了高性能摄像头的需求，从而提供了便利和成本节省。我们的方法ология是使用OPENCV库检测和测量在普通照明条件下拍摄的土壤粒子。为了准确地确定粒子大小，我们使用了一个标准化的检测目标，并与20个不同的砂样进行比较。我们的方法与传统筛分分析相比，对土壤粒子大于2毫米的粒子大小具有较好的性能，具有约6%的平均绝对百分比误差（MAPE）。然而，粒子小于2毫米的误差较高，可达60%。为了解决这个限制，我们建议使用更高分辨率的摄像头拍摄小粒子土壤。此外，我们还讨论了我们的方法的优缺点，以及未来可能的改进。值得注意的是，我们的方法可以在移动电话上执行，无需将土壤样本送往实验室进行分析。这种场地友好的特点使我们的方法在实验室外实现了高效的粒子分析。最后，我们的新方法代表了它在业界的初步干扰，允许不需要实验室基础的粒子分析。关键词：计算机视觉、粒子大小、ARUCO

Swin UNETR++: Advancing Transformer-Based Dense Dose Prediction Towards Fully Automated Radiation Oncology Treatments

paper_url: http://arxiv.org/abs/2311.06572
repo_url: None
paper_authors: Kuancheng Wang, Hai Siong Tan, Rafe Mcbeth
For: The paper aims to develop a deep learning model for automating the creation of radiation treatment plans for cancer therapy.* Methods: The proposed model, called Swin UNETR++, uses a lightweight 3D Dual Cross-Attention (DCA) module to capture the intra and inter-volume relationships of each patient’s unique anatomy. The model was trained, validated, and tested on the Open Knowledge-Based Planning dataset.* Results: Swin UNETR++ demonstrates near-state-of-the-art performance on the validation and test datasets, with average volume-wise acceptance rates of 88.58% and 90.50%, and average patient-wise clinical acceptance rates of 100.0% and 98.0%. The results establish a basis for future studies to translate 3D dose predictions into a deliverable treatment plan, facilitating full automation.Here are the three points in Simplified Chinese text:* For: 本文旨在开发一种深度学习模型，用于自动生成 radiation therapy 的辐射治疗计划。* Methods: 提议的模型叫做 Swin UNETR++，它使用轻量级的 3D 双重跨参量 (DCA) 模块，以捕捉每个病人唯一的 анатомиче关系。模型在 Open Knowledge-Based Planning 数据集上进行了训练、验证和测试。* Results: Swin UNETR++ 在验证数据集和测试数据集上达到了 near-state-of-the-art 性能，具体来说，average volume-wise acceptance rate 为 88.58% 和 90.50%，average patient-wise clinical acceptance rate 为 100.0% 和 98.0%。结果为未来的研究提供了一个基础，以便将 3D 剂量预测翻译成可实施的治疗计划，实现了自动化。

Abstract
The field of Radiation Oncology is uniquely positioned to benefit from the use of artificial intelligence to fully automate the creation of radiation treatment plans for cancer therapy. This time-consuming and specialized task combines patient imaging with organ and tumor segmentation to generate a 3D radiation dose distribution to meet clinical treatment goals, similar to voxel-level dense prediction. In this work, we propose Swin UNETR++, that contains a lightweight 3D Dual Cross-Attention (DCA) module to capture the intra and inter-volume relationships of each patient's unique anatomy, which fully convolutional neural networks lack. Our model was trained, validated, and tested on the Open Knowledge-Based Planning dataset. In addition to metrics of Dose Score $\overline{S_{\text{Dose}}$ and DVH Score $\overline{S_{\text{DVH}}$ that quantitatively measure the difference between the predicted and ground-truth 3D radiation dose distribution, we propose the qualitative metrics of average volume-wise acceptance rate $\overline{R_{\text{VA}}$ and average patient-wise clinical acceptance rate $\overline{R_{\text{PA}}$ to assess the clinical reliability of the predictions. Swin UNETR++ demonstrates near-state-of-the-art performance on validation and test dataset (validation: $\overline{S_{\text{DVH}}$=1.492 Gy, $\overline{S_{\text{Dose}}$=2.649 Gy, $\overline{R_{\text{VA}}$=88.58%, $\overline{R_{\text{PA}}$=100.0%; test: $\overline{S_{\text{DVH}}$=1.634 Gy, $\overline{S_{\text{Dose}}$=2.757 Gy, $\overline{R_{\text{VA}}$=90.50%, $\overline{R_{\text{PA}}$=98.0%), establishing a basis for future studies to translate 3D dose predictions into a deliverable treatment plan, facilitating full automation.

摘要
领域 Radiation Oncology 可以充分利用人工智能来自动生成抑肿治疗计划。这个时间consuming 和专业化的任务涉及到病人图像与器官和肿瘤分割，以生成符合临床治疗目标的3D辐射剂量分布。在这种工作中，我们提议Swin UNITR++模型，它包含了轻量级3D双交叉关注（DCA）模块，以捕捉每个患者独特的生物学结构关系。与普通的完全 convolutional neural networks 不同，这种模型可以更好地考虑患者的多个方面和尺度。我们的模型在Open Knowledge-Based Planning数据集上进行训练、验证和测试，并且使用了量化评价指标，包括辐射剂量分布的DOSE Score和DVH Score，以及评价预测结果的临床可靠性指标，包括平均体积级别接受率（RVA）和平均患者级别接受率（RPA）。Swin UNITR++在验证和测试数据集上达到了近似于状态之arte的性能（验证数据集：DOSE Score=1.492 Gy，DVH Score=2.649 Gy，RVA=88.58%，RPA=100.0%; 测试数据集：DOSE Score=1.634 Gy，DVH Score=2.757 Gy，RVA=90.50%，RPA=98.0%），为未来的研究提供了一个基础，以便将3D剂量预测翻译成可实施的治疗计划，实现全自动化。

OR Residual Connection Achieving Comparable Accuracy to ADD Residual Connection in Deep Residual Spiking Neural Networks

paper_url: http://arxiv.org/abs/2311.06570
repo_url: https://github.com/ym-shan/orrc-syna-natural-pruning
paper_authors: Yimeng Shan, Xuerui Qiu, Rui-jie Zhu, Ruike Li, Meng Wang, Haicheng Qu
for: This paper aims to improve the performance and energy efficiency of deep residual spiking neural networks (SNNs) for brain-like computing.
methods: The authors introduce the OR Residual connection (ORRC) and the Synergistic Attention (SynA) module to the SEW-ResNet architecture, and integrate natural pruning to reduce computational overhead.
results: The enhanced OR-Spiking ResNet achieved single-sample classification with as little as 0.8 spikes per neuron, outperforming other spike residual models in accuracy and power consumption.Here is the same information in Simplified Chinese:
for: 这篇论文目标是改进深度待遇刺激神经网络（SNNs）的性能和能效性，用于脑类计算。
methods: 作者们引入 OR Residual connection（ORRC）和Synergistic Attention（SynA）模块到 SEW-ResNet 架构中，并实现自然减少计算开销。
results: 提升后的 OR-Spiking ResNet 实现单个样本分类，只需0.8个神经元发射，与其他刺激剩余模型相比，具有更高的准确率和更低的电力消耗。

Abstract
Spiking Neural Networks (SNNs) have garnered substantial attention in brain-like computing for their biological fidelity and the capacity to execute energy-efficient spike-driven operations. As the demand for heightened performance in SNNs surges, the trend towards training deeper networks becomes imperative, while residual learning stands as a pivotal method for training deep neural networks. In our investigation, we identified that the SEW-ResNet, a prominent representative of deep residual spiking neural networks, incorporates non-event-driven operations. To rectify this, we introduce the OR Residual connection (ORRC) to the architecture. Additionally, we propose the Synergistic Attention (SynA) module, an amalgamation of the Inhibitory Attention (IA) module and the Multi-dimensional Attention (MA) module, to offset energy loss stemming from high quantization. When integrating SynA into the network, we observed the phenomenon of "natural pruning", where after training, some or all of the shortcuts in the network naturally drop out without affecting the model's classification accuracy. This significantly reduces computational overhead and makes it more suitable for deployment on edge devices. Experimental results on various public datasets confirmed that the SynA enhanced OR-Spiking ResNet achieved single-sample classification with as little as 0.8 spikes per neuron. Moreover, when compared to other spike residual models, it exhibited higher accuracy and lower power consumption. Codes are available at https://github.com/Ym-Shan/ORRC-SynA-natural-pruning.

摘要
神经网络（SNN）在脑如计算中备受关注，因其生物准确性和能效地执行脉冲驱动操作。随着深度SNN的需求增加，训练深度网络变得必要，而剩余学习成为训练深度网络的重要方法。在我们的研究中，我们发现SEW-ResNet，一种深度剩余神经网络的代表，包含非事件驱动操作。为解决这问题，我们引入了OR隐藏连接（ORRC）到架构中。此外，我们提出了协同注意（SynA）模块，它是禁忌注意模块和多维注意模块的组合，以弥补因高量化而导致的能量损失。在将SynA模块 incorporated into the network时，我们观察到了自然减少现象，即在训练后，网络中的减少减少自然而无需影响模型的分类精度。这显著减少计算开销，使其更适合边缘设备部署。实验结果表明，对多个公共数据集进行训练后，使用SynA进行增强的OR-Spiking ResNet可以在0.8脉冲每个神经元单个样本分类。此外，与其他脉冲剩余模型相比，它表现出更高的准确率和更低的能 consumption。代码可以在https://github.com/Ym-Shan/ORRC-SynA-natural-pruning中找到。

Artificial Intelligence in Assessing Cardiovascular Diseases and Risk Factors via Retinal Fundus Images: A Review of the Last Decade

paper_url: http://arxiv.org/abs/2311.07609
repo_url: None
paper_authors: Mirsaeed Abdollahi, Ali Jafarizadeh, Amirhosein Ghafouri Asbagh, Navid Sobhi, Keysan Pourmoghtader, Siamak Pedrammehr, Houshyar Asadi, Roohallah Alizadehsani, Ru-San Tan, U. Rajendra Acharya
For: The paper aims to provide an overview of the current advancements and challenges in employing retinal imaging and artificial intelligence to identify cardiovascular disorders.* Methods: The paper uses a comprehensive search of various databases, including PubMed, Medline, Google Scholar, Scopus, Web of Sciences, IEEE Xplore, and ACM Digital Library, to identify relevant publications related to cardiovascular diseases and artificial intelligence.* Results: The study includes 87 English-language publications that provide insights into the current state of research in this field and highlights the potential of AI and deep learning for early detection and prediction of cardiovascular diseases.Here are the three points in Simplified Chinese text:* For: 这篇论文目的是为了提供Cardiovascular diseases (CVDs)的现状和挑战，以及使用Retinal imaging和人工智能来识别CVDs的概述。* Methods: 这篇论文使用了多种数据库的检索，包括PubMed、Medline、Google Scholar、Scopus、Web of Sciences、IEEE Xplore和ACM Digital Library，以确定相关的Cardiovascular diseases和人工智能publications。* Results: 这篇论文包含87篇英文文献，提供了这个领域的当前进展和挑战，并指出了人工智能和深度学习在早期检测和预测Cardiovascular diseases方面的潜在潜力。

Abstract
Background: Cardiovascular diseases (CVDs) continue to be the leading cause of mortality on a global scale. In recent years, the application of artificial intelligence (AI) techniques, particularly deep learning (DL), has gained considerable popularity for evaluating the various aspects of CVDs. Moreover, using fundus images and optical coherence tomography angiography (OCTA) to diagnose retinal diseases has been extensively studied. To better understand heart function and anticipate changes based on microvascular characteristics and function, researchers are currently exploring the integration of AI with non-invasive retinal scanning. Leveraging AI-assisted early detection and prediction of cardiovascular diseases on a large scale holds excellent potential to mitigate cardiovascular events and alleviate the economic burden on healthcare systems. Method: A comprehensive search was conducted across various databases, including PubMed, Medline, Google Scholar, Scopus, Web of Sciences, IEEE Xplore, and ACM Digital Library, using specific keywords related to cardiovascular diseases and artificial intelligence. Results: A total of 87 English-language publications, selected for relevance were included in the study, and additional references were considered. This study presents an overview of the current advancements and challenges in employing retinal imaging and artificial intelligence to identify cardiovascular disorders and provides insights for further exploration in this field. Conclusion: Researchers aim to develop precise disease prognosis patterns as the aging population and global CVD burden increase. AI and deep learning are transforming healthcare, offering the potential for single retinal image-based diagnosis of various CVDs, albeit with the need for accelerated adoption in healthcare systems.

摘要
背景：心血管疾病（CVD）仍然是全球范围内最主要的死亡原因。在过去几年，人工智能（AI）技术，特别是深度学习（DL），在评估CVD多方面的方面得到了广泛的应用。此外，使用眼膜图像和光共振成像（OCTA）诊断视网膜疾病已经得到了广泛的研究。为了更好地理解心脏功能和预测基于微血管特征和功能的变化，研究人员正在探索将AI与不侵入性的眼膜扫描结合起来。利用AI助成早期检测和预测心血管疾病的大规模应用拥有很大的潜在价值，可以减少心血管事件和减轻医疗系统的负担。方法：我们对多种数据库进行了总体检索，包括PubMed、Medline、Google学术搜索、Scopus、Web of Sciences、IEEE Xplore和ACM数字图书馆，使用与心血管疾病相关的特定关键词。结果：共选择了87篇英文文献，包括其他参考文献。本研究提供了目前AI与眼膜成像在诊断心血管疾病方面的进展和挑战，以及此领域的更多可能性的探索。结论：研究人员目标是通过随着人口老龄化和全球CVD荷重的增加，发展精准的疾病诊断模式。AI和深度学习在医疗领域中发挥了重要作用，尝试通过单一的眼膜图像诊断多种CVD，尽管需要加速在医疗系统中的采用。

Identification of vortex in unstructured mesh with graph neural networks

paper_url: http://arxiv.org/abs/2311.06557
repo_url: None
paper_authors: Lianfa Wang, Yvan Fournier, Jean-Francois Wald, Youssef Mesri
for: 用于identifying flow characteristics from Computational Fluid Dynamics (CFD) databases，帮助研究者更好地理解流场，优化geometry设计和选择合适的CFD配置。
methods: 使用Graph Neural Network (GNN) with U-Net architecture，通过 algebraic multigrid method生成图并构建图层结构，对2D CFD网格中的涡旋区域进行自动标签。
results: 对CFD结果进行了vortex自动标签，并评估了GNN矩阵的分类精度、训练效率和标注结果的流场特征。最后，demonstrated the approach的可扩展性和通用性，可应用于不同的液体动力学模型和 Reynolds 数。

Abstract
Deep learning has been employed to identify flow characteristics from Computational Fluid Dynamics (CFD) databases to assist the researcher to better understand the flow field, to optimize the geometry design and to select the correct CFD configuration for corresponding flow characteristics. Convolutional Neural Network (CNN) is one of the most popular algorithms used to extract and identify flow features. However its use, without any additional flow field interpolation, is limited to the simple domain geometry and regular meshes which limits its application to real industrial cases where complex geometry and irregular meshes are usually used. Aiming at the aforementioned problems, we present a Graph Neural Network (GNN) based model with U-Net architecture to identify the vortex in CFD results on unstructured meshes. The graph generation and graph hierarchy construction using algebraic multigrid method from CFD meshes are introduced. A vortex auto-labeling method is proposed to label vortex regions in 2D CFD meshes. We precise our approach by firstly optimizing the input set on CNNs, then benchmarking current GNN kernels against CNN model and evaluating the performances of GNN kernels in terms of classification accuracy, training efficiency and identified vortex morphology. Finally, we demonstrate the adaptability of our approach to unstructured meshes and generality to unseen cases with different turbulence models at different Reynolds numbers.

摘要
深度学习已经在计算流体动力学（CFD）数据库中使用来识别流体特性，以 помо助研究人员更好地理解流场，优化geometry设计和选择相应的CFD配置。抽象神经网络（CNN）是最受欢迎的算法之一，用于提取和识别流体特征。然而，不带任何流场插值的CNN使用，受限于简单的域几何和规则的网格，因此对实际工业案例中的复杂几何和不规则网格的应用有限。为此，我们提出了基于图神经网络（GNN）的模型，使用U-Net架构来识别CFD结果中的涡。我们首先优化输入集，然后对现有GNN核 compare with CNN模型，并评估GNN核的性能。最后，我们证明我们的方法可以适用于不结构化网格和未看到的情况中的不同的湍流模型和不同的 Reynolds 数。

Visual Commonsense based Heterogeneous Graph Contrastive Learning

paper_url: http://arxiv.org/abs/2311.06553
repo_url: None
paper_authors: Zongzhao Li, Xiangyu Zhu, Xi Zhang, Zhaoxiang Zhang, Zhen Lei
for: 提高多modal应用中视语关系的理解和语言领域关系的抽象
methods: 使用heterogeneous graph contrastive learning方法，包括Visual Commonsense Information和Graph Relation Network，以提高视觉理解任务的完成
results: 对四个 benchmark 进行了广泛的实验，显示了方法的效果和通用性，可以大幅提高七种代表性 VQA 模型的性能

Abstract
How to select relevant key objects and reason about the complex relationships cross vision and linguistic domain are two key issues in many multi-modality applications such as visual question answering (VQA). In this work, we incorporate the visual commonsense information and propose a heterogeneous graph contrastive learning method to better finish the visual reasoning task. Our method is designed as a plug-and-play way, so that it can be quickly and easily combined with a wide range of representative methods. Specifically, our model contains two key components: the Commonsense-based Contrastive Learning and the Graph Relation Network. Using contrastive learning, we guide the model concentrate more on discriminative objects and relevant visual commonsense attributes. Besides, thanks to the introduction of the Graph Relation Network, the model reasons about the correlations between homogeneous edges and the similarities between heterogeneous edges, which makes information transmission more effective. Extensive experiments on four benchmarks show that our method greatly improves seven representative VQA models, demonstrating its effectiveness and generalizability.

摘要
多Modalitate应用中，选择相关的关键对象并理解跨视听域关系是两个关键问题。在这种情况下，我们将视觉常识信息 integrate到模型中，并提出一种多态图像异构学习方法来更好地完成视觉理解任务。我们的方法设计为可插入式的方式，以便快速和方便地与各种代表性方法结合使用。具体来说，我们的模型包括两个关键组件： Commonsense-based Contrastive Learning和图像关系网络。通过对比学习，我们引导模型更多地关注特征对象和相关的视觉常识特征。此外，图像关系网络的引入使得模型可以更好地理解同型边的相互关系，使信息传递更加有效。我们在四个标准测试集上进行了广泛的实验，并证明了我们的方法可以大幅提高七种代表VQA模型的性能，表明其有效性和普适性。

Stain Consistency Learning: Handling Stain Variation for Automatic Digital Pathology Segmentation

paper_url: http://arxiv.org/abs/2311.06552
repo_url: https://github.com/mlyg/stain_consistency_learning
paper_authors: Michael Yeung, Todd Watts, Sean YW Tan, Pedro F. Ferreira, Andrew D. Scott, Sonia Nielles-Vallespin, Guang Yang
for: 本研究旨在提高机器学习方法对染色谱变化的可靠性，并对各种方法进行比较性评估，以便选择最佳方法。
methods: 本研究提出了一种新的染色协调学习框架，即染色特征归一化学习法，该法结合染色特征归一化和染色一致损失函数来学习染色颜色无关的特征。
results: 对于 Masson 染色和 H&E 染色的细胞和核lei datasets，本研究对各种染色变化处理方法进行了首次、广泛的比较，并证明了提案的方法能够获得最佳性能。

Abstract
Stain variation is a unique challenge associated with automated analysis of digital pathology. Numerous methods have been developed to improve the robustness of machine learning methods to stain variation, but comparative studies have demonstrated limited benefits to performance. Moreover, methods to handle stain variation were largely developed for H&E stained data, with evaluation generally limited to classification tasks. Here we propose Stain Consistency Learning, a novel framework combining stain-specific augmentation with a stain consistency loss function to learn stain colour invariant features. We perform the first, extensive comparison of methods to handle stain variation for segmentation tasks, comparing ten methods on Masson's trichrome and H&E stained cell and nuclei datasets, respectively. We observed that stain normalisation methods resulted in equivalent or worse performance, while stain augmentation or stain adversarial methods demonstrated improved performance, with the best performance consistently achieved by our proposed approach. The code is available at: https://github.com/mlyg/stain_consistency_learning

摘要
颜色差异是数字病理学自动分析中的一个独特挑战。许多方法已经开发来改善机器学习方法对颜色差异的Robustness，但是比较研究表明这些方法具有有限的效果。此外，处理颜色差异的方法主要是为H&E染料数据而开发，评估通常是限定为分类任务。我们提出了一种新的框架，即颜色一致学习（Stain Consistency Learning），它将颜色特异的扩充与颜色一致损失函数结合在一起，以学习颜色不变的特征。我们对 Masson的三色染料和H&E染料分别进行了细胞和核lei数据集的比较，结果表明，颜色normal化方法的性能相当或更差，而颜色扩充或颜色对抗方法的性能则得到了改善，我们的提议方法得到了最佳性能。代码可以在以下链接下获取：https://github.com/mlyg/stain_consistency_learning。

FDNet: Feature Decoupled Segmentation Network for Tooth CBCT Image

paper_url: http://arxiv.org/abs/2311.06551
repo_url: None
paper_authors: Xiang Feng, Chengkai Wang, Chengyu Wu, Yunxiang Li, Yongbo He, Shuai Wang, Yaiqi Wang
for: 本研究旨在提高CBCT影像的精确分割，以便正确评估牙齿准备治疗计划。
methods: 该研究提出了一种新的Feature Decoupled Segmentation Network（FDNet），通过结合低频波峰变换（LF-Wavelet）和SAM编码器，以提高牙齿边界的精度和细节分割的准确性。
results: 研究表明，FDNet可以在CBCT影像中提供高达85.28%的Dice分数和75.23%的IoU分数，表明该方法可以有效地减少semantic gap，提供精确的牙齿分割结果。

Abstract
Precise Tooth Cone Beam Computed Tomography (CBCT) image segmentation is crucial for orthodontic treatment planning. In this paper, we propose FDNet, a Feature Decoupled Segmentation Network, to excel in the face of the variable dental conditions encountered in CBCT scans, such as complex artifacts and indistinct tooth boundaries. The Low-Frequency Wavelet Transform (LF-Wavelet) is employed to enrich the semantic content by emphasizing the global structural integrity of the teeth, while the SAM encoder is leveraged to refine the boundary delineation, thus improving the contrast between adjacent dental structures. By integrating these dual aspects, FDNet adeptly addresses the semantic gap, providing a detailed and accurate segmentation. The framework's effectiveness is validated through rigorous benchmarks, achieving the top Dice and IoU scores of 85.28% and 75.23%, respectively. This innovative decoupling of semantic and boundary features capitalizes on the unique strengths of each element to significantly elevate the quality of segmentation performance.

摘要
精准牙齿 cone beam computed tomography（CBCT）图像分割是正确的orthodontic treatment planning的关键。在这篇论文中，我们提出了FDNet，一种特征解coupled Segmentation Network，以便在CBCT扫描中遇到的变化牙齿条件下表现出色，例如复杂的遗产物和不明确的牙齿界限。使用低频波лет变换（LF-Wavelet）可以增强牙齿的semantic内容，同时使用SAM编码器可以进一步改善边界定义，从而提高牙齿结构之间的对比度。通过这种双重方法，FDNet能够有效地bridging semantic gap，提供精确和详细的分割。该框架的效果被证明通过严格的 benchmark，实现了Dice和IoU分割分别达到85.28%和75.23%的最高分。这种创新的特征解coupling技术可以 Capitalize on Each element的特点，提高分割性能的质量。

Generation Of Colors using Bidirectional Long Short Term Memory Networks

paper_url: http://arxiv.org/abs/2311.06542
repo_url: https://github.com/chungimungi/color-prediction
paper_authors: A. Sinha
For: This paper aims to bridge the gap between human visual perception of countless shades of colours and our ability to name and describe them accurately, using a novel model based on Bidirectional Long Short-Term Memory (BiLSTM) networks with Active learning.* Methods: The paper develops a novel model that operates on a proprietary dataset curated for this study, using BiLSTM networks with Active learning to categorize and name previously unnamed colours or identify intermediate shades that elude traditional colour terminology.* Results: The findings of the study demonstrate the potential of this innovative approach in revolutionizing our understanding of colour perception and language, with the potential to extend the applications of Natural Language Processing (NLP) beyond conventional boundaries.

Abstract
Human vision can distinguish between a vast spectrum of colours, estimated to be between 2 to 7 million discernible shades. However, this impressive range does not inherently imply that all these colours have been precisely named and described within our lexicon. We often associate colours with familiar objects and concepts in our daily lives. This research endeavors to bridge the gap between our visual perception of countless shades and our ability to articulate and name them accurately. A novel model has been developed to achieve this goal, leveraging Bidirectional Long Short-Term Memory (BiLSTM) networks with Active learning. This model operates on a proprietary dataset meticulously curated for this study. The primary objective of this research is to create a versatile tool for categorizing and naming previously unnamed colours or identifying intermediate shades that elude traditional colour terminology. The findings underscore the potential of this innovative approach in revolutionizing our understanding of colour perception and language. Through rigorous experimentation and analysis, this study illuminates a promising avenue for Natural Language Processing (NLP) applications in diverse industries. By facilitating the exploration of the vast colour spectrum the potential applications of NLP are extended beyond conventional boundaries.

摘要
人类视觉可以分辨出各种颜色，估计有2到7百万个不同的颜色。然而，这一各种颜色的范围并不意味着所有的颜色都有被精确地命名和描述在我们的语言中。我们常常将颜色与日常生活中的familiar对象和概念相关联。这项研究的目的是将视觉中的 countless 颜色与我们的语言之间的差距bridged。为此，该研究开发了一种基于BiLSTM网络和活动学习的新模型。该模型运用了专门为本研究制作的专有数据集。本研究的主要目标是开发一种可以分类和命名未命名颜色或者描述不能被传统颜色术语捕捉的颜色的工具。研究结果表明这种创新的方法在改变我们对颜色识别和语言的理解方面具有潜力。通过严格的实验和分析，本研究探讨了NLP应用的新途径，扩展了NLP应用的边界。

CrashCar101: Procedural Generation for Damage Assessment

paper_url: http://arxiv.org/abs/2311.06536
repo_url: None
paper_authors: Jens Parslov, Erik Riise, Dim P. Papadopoulos
for: 本研究旨在解决汽车损害评估中的问题，包括检测损害的位置和程度以及特定的损害部分。
methods: 我们提议使用生成过程来训练计算机视觉系统，使其能够进行Semantic part和损害分 segmentation。我们使用生成的3D汽车模型和Synthetic Data来生成高度多样化的样本，并为每个样本提供高精度的像素注释。
results: 我们采用这种方法并生成了CrashCar101数据集。我们在三个实际数据集上进行了实验，并证明了在part segmentation任务上，使用实际数据和Synthetic Data进行训练的模型比使用实际数据进行训练的模型表现更好。在损害 segmentation任务上，我们证明了CrashCar101数据集的sim2real转移能力。

Abstract
In this paper, we are interested in addressing the problem of damage assessment for vehicles, such as cars. This task requires not only detecting the location and the extent of the damage but also identifying the damaged part. To train a computer vision system for the semantic part and damage segmentation in images, we need to manually annotate images with costly pixel annotations for both part categories and damage types. To overcome this need, we propose to use synthetic data to train these models. Synthetic data can provide samples with high variability, pixel-accurate annotations, and arbitrarily large training sets without any human intervention. We propose a procedural generation pipeline that damages 3D car models and we obtain synthetic 2D images of damaged cars paired with pixel-accurate annotations for part and damage categories. To validate our idea, we execute our pipeline and render our CrashCar101 dataset. We run experiments on three real datasets for the tasks of part and damage segmentation. For part segmentation, we show that the segmentation models trained on a combination of real data and our synthetic data outperform all models trained only on real data. For damage segmentation, we show the sim2real transfer ability of CrashCar101.

摘要
在这篇论文中，我们关注了汽车损害评估问题，例如汽车受损的部分和程度的识别。为了训练计算机视觉系统进行semantic部分和损害分割，我们需要手动标注图像，以获得价值的像素注释。为了缓解这个需求，我们提议使用生成数据。生成数据可以提供高度多样性的样本，高精度的像素注释，并且可以在人工干预下生成无限大的训练集。我们提出了一个生成过程，用于损害3D汽车模型，并从而获得了损害2D图像和高精度的像素注释。为了验证我们的想法，我们执行了我们的管道，并生成了CrashCar101数据集。我们在三个实际数据集上进行了实验，以评估part和损害分割任务。对于part分割任务，我们显示了将real数据和我们生成的数据混合训练的模型，与只使用实际数据训练的模型相比，具有更高的性能。对于损害分割任务，我们显示了CrashCar101数据集的sim2real传送能力。

Band-wise Hyperspectral Image Pansharpening using CNN Model Propagation

paper_url: http://arxiv.org/abs/2311.06510
repo_url: https://github.com/giu-guarino/r-pnn
paper_authors: Giuseppe Guarino, Matteo Ciotola, Gemine Vivone, Giuseppe Scarpa
for: 本研究的目的是提出一种深度学习方法，用于解决高spectral缩进问题。
methods: 该方法基于单 banda unsupervised pansharpening模型，通过在排序band-wise adaptive scheme中嵌入该模型，以适应不同 spectral band的数据。
results: 对于我们的数据集，该方法达到了非常好的结果，超过了传统和深度学习参考方法。代码实现可以在https://github.com/giu-guarino/R-PNN找到。

Abstract
Hyperspectral pansharpening is receiving a growing interest since the last few years as testified by a large number of research papers and challenges. It consists in a pixel-level fusion between a lower-resolution hyperspectral datacube and a higher-resolution single-band image, the panchromatic image, with the goal of providing a hyperspectral datacube at panchromatic resolution. Thanks to their powerful representational capabilities, deep learning models have succeeded to provide unprecedented results on many general purpose image processing tasks. However, when moving to domain specific problems, as in this case, the advantages with respect to traditional model-based approaches are much lesser clear-cut due to several contextual reasons. Scarcity of training data, lack of ground-truth, data shape variability, are some such factors that limit the generalization capacity of the state-of-the-art deep learning networks for hyperspectral pansharpening. To cope with these limitations, in this work we propose a new deep learning method which inherits a simple single-band unsupervised pansharpening model nested in a sequential band-wise adaptive scheme, where each band is pansharpened refining the model tuned on the preceding one. By doing so, a simple model is propagated along the wavelength dimension, adaptively and flexibly, with no need to have a fixed number of spectral bands, and, with no need to dispose of large, expensive and labeled training datasets. The proposed method achieves very good results on our datasets, outperforming both traditional and deep learning reference methods. The implementation of the proposed method can be found on https://github.com/giu-guarino/R-PNN

摘要
“几年前，几何spectral pansharpening已经受到了越来越多的关注，可以看到许多研究论文和挑战。它的目的是将lower-resolution的几何spectral数据 кубы和高分辨率的单色图像（panchromatic image）进行像素级融合，以获得高分辨率的几何spectral数据库。由于深度学习模型具有强大的表示能力，它们在许多通用图像处理任务上取得了无PRECEDENT的成绩。但是，当转移到域特定问题时，例如这个案例中，深度学习模型的优势与传统的模型基于方法相比较难明确地表现出来，因为一些contextual因素的限制。数据缺乏、缺乏标注、数据形态变化等因素，都会限制深度学习网络在域特定问题上的总体化能力。为了缓解这些限制，在这个工作中，我们提出了一种新的深度学习方法。这种方法基于单色图像无监督的宽渠扩充模型，通过在带宽维度上逐步进行适应式的band-wise适应方案，使得每个带都可以细化和适应，无需具备固定的 spectral 带数量，也无需具备大量、昂贵和标注的训练数据。提议的方法在我们的数据集上实现了非常好的效果，超越了传统和深度学习参考方法。实现方法可以在https://github.com/giu-guarino/R-PNN 找到。”

Self-supervised Context Learning for Visual Inspection of Industrial Defects

paper_url: http://arxiv.org/abs/2311.06504
repo_url: None
paper_authors: Peng Wang, Haiming Yao, Wenyong Yu
for: 本研究旨在提出一种基于自我监督学习的检测方法，以解决现有的无监督模型在产品表面变化大的情况下检测缺陷的问题。
methods: 我们提出一种自我监督学习算法，通过将目标图像分割成9个 patches，并让编码器预测每两个 patch 的相对位置关系，以提取丰富的 semantics。我们还提出一种帮助函数-加 augmentation 方法，以强调正常和异常的 latent 表示之间的差异。
results: 我们的方法在 widely 使用的 MVTec AD 数据集上实现了出色的检测和 segmentation 性能，即 95.8% 和 96.8% 分别，创造了当今无监督检测领域的状元标准。广泛的实验证明了我们的方法在多种工业应用中的有效性。

Abstract
The unsupervised visual inspection of defects in industrial products poses a significant challenge due to substantial variations in product surfaces. Current unsupervised models struggle to strike a balance between detecting texture and object defects, lacking the capacity to discern latent representations and intricate features. In this paper, we present a novel self-supervised learning algorithm designed to derive an optimal encoder by tackling the renowned jigsaw puzzle. Our approach involves dividing the target image into nine patches, tasking the encoder with predicting the relative position relationships between any two patches to extract rich semantics. Subsequently, we introduce an affinity-augmentation method to accentuate differences between normal and abnormal latent representations. Leveraging the classic support vector data description algorithm yields final detection results. Experimental outcomes demonstrate that our proposed method achieves outstanding detection and segmentation performance on the widely used MVTec AD dataset, with rates of 95.8% and 96.8%, respectively, establishing a state-of-the-art benchmark for both texture and object defects. Comprehensive experimentation underscores the effectiveness of our approach in diverse industrial applications.

摘要
“无监督的视觉检测工业产品上的瑕疵具有严重的挑战，因为产品表面会有很大的变化。现有的无监督模型对于检测文字和物体瑕疵具有困难，因为它们缺乏能够捕捉实际特征和细节的能力。在这篇论文中，我们提出了一个新的自类学习算法，用于从熔毙难以分辨的图像中提取有用的 semantics。我们的方法是将目标图像分成九块，让算法预测两块之间的相对位置关系，以提取丰富的 semantics。然后，我们引入了一个增强不同于正常的latent representation的方法，以提高分辨率。通过使用了经典支持向量描述算法，获得最终的检测结果。实验结果显示，我们的提案方法在广泛使用的MVTec AD dataset上实现了95.8%和96.8%的检测和分类性能，成为瑕疵和物体瑕疵检测的现代标准。实验结果显示，我们的方法在不同的工业应用中具有广泛的适用范围。”

LayoutPrompter: Awaken the Design Ability of Large Language Models

paper_url: http://arxiv.org/abs/2311.06495
repo_url: https://github.com/microsoft/layoutgeneration
paper_authors: Jiawei Lin, Jiaqi Guo, Shizhao Sun, Zijiang James Yang, Jian-Guang Lou, Dongmei Zhang
for: 这个论文是为了提出一种基于大语言模型（LLM）的 Conditional Graphic Layout Generation 方法，以解决现有方法缺乏灵活性和数据效率问题。
methods: 该方法包括三个关键组件：输入输出序列化、动态示例选择和布局排序。具体来说，输入输出序列化组件 меiculously 设计了每个布局生成任务的输入和输出格式。动态示例选择负责选择对于给定输入最有帮助的提示示例。布局排序则是用于从多个 LLM 的输出中选择最高质量的布局。
results: 经过实验表明，LayoutPrompter 可以在所有现有的布局生成任务上与或超越当前状态的方法，无需训练或调整模型。此外，对比baseline方法，LayoutPrompter 在低数据情况下表现更出色，进一步证明了该方法的数据效率。

Abstract
Conditional graphic layout generation, which automatically maps user constraints to high-quality layouts, has attracted widespread attention today. Although recent works have achieved promising performance, the lack of versatility and data efficiency hinders their practical applications. In this work, we propose LayoutPrompter, which leverages large language models (LLMs) to address the above problems through in-context learning. LayoutPrompter is made up of three key components, namely input-output serialization, dynamic exemplar selection and layout ranking. Specifically, the input-output serialization component meticulously designs the input and output formats for each layout generation task. Dynamic exemplar selection is responsible for selecting the most helpful prompting exemplars for a given input. And a layout ranker is used to pick the highest quality layout from multiple outputs of LLMs. We conduct experiments on all existing layout generation tasks using four public datasets. Despite the simplicity of our approach, experimental results show that LayoutPrompter can compete with or even outperform state-of-the-art approaches on these tasks without any model training or fine-tuning. This demonstrates the effectiveness of this versatile and training-free approach. In addition, the ablation studies show that LayoutPrompter is significantly superior to the training-based baseline in a low-data regime, further indicating the data efficiency of LayoutPrompter. Our project is available at https://github.com/microsoft/LayoutGeneration/tree/main/LayoutPrompter.

摘要
《 conditional graphic layout generation 》，即自动将用户约束映射到高质量的布局，在今天已经吸引了广泛的关注。虽然 latest works 已经实现了可观的性能，但缺乏实用性和数据效率限制了它们的实际应用。在这项工作中，我们提出了 LayoutPrompter，它利用大型语言模型（LLMs）来解决上述问题通过在线上学习。LayoutPrompter 由三个关键组成部分：输入输出序列化、动态示例选择和布局排名。具体来说，输入输出序列化部分仔细设计了每个布局生成任务的输入和输出格式。动态示例选择部分选择给定输入的最有用的推动示例。而布局排名部分则用来从多个 LLMS 的输出中选择最高质量的布局。我们在所有现有的布局生成任务上进行了实验，使用四个公共数据集。尽管我们的方法简单，但实验结果显示，LayoutPrompter 可以与或 même outperform 当前状态的方法这些任务上，无需任何模型训练或调整。这说明 LayoutPrompter 是一种灵活且无需训练的方法。此外，我们的剖析研究表明，LayoutPrompter 在低数据情况下表现 significatively 优于基eline，这再次证明了 LayoutPrompter 的数据效率。您可以在 https://github.com/microsoft/LayoutGeneration/tree/main/LayoutPrompter 上查看我们的项目。

PECoP: Parameter Efficient Continual Pretraining for Action Quality Assessment

paper_url: http://arxiv.org/abs/2311.07603
repo_url: https://github.com/plrbear/pecop
paper_authors: Amirhossein Dadashzadeh, Shuchao Duan, Alan Whone, Majid Mirmehdi
for: 本研究的目的是提高Action Quality Assessment（AQA）中的模型表现，特别是在预测过程中处理具有域对错的资料时。
methods: 我们提出了一个新的、效率高的普遍预训架构，名为PECoP，并在其中引入3D-Adapters来学习预测当中的体域特征。在PECoP中，仅对Adapter模组的参数进行更新，以减少域对错的影响。
results: 我们在几个benchmark dataset上进行了实验，包括JIGSAWS、MTL-AQA和FineDiving，并取得了较好的成绩（比如JIGSAWS中提高6.0%）。此外，我们还提供了一个新的Parkinson’s Disease dataset，PD4T，并在其上进行了比较，并与之前的最佳成绩进行了比较（提高3.56%）。

Abstract
The limited availability of labelled data in Action Quality Assessment (AQA), has forced previous works to fine-tune their models pretrained on large-scale domain-general datasets. This common approach results in weak generalisation, particularly when there is a significant domain shift. We propose a novel, parameter efficient, continual pretraining framework, PECoP, to reduce such domain shift via an additional pretraining stage. In PECoP, we introduce 3D-Adapters, inserted into the pretrained model, to learn spatiotemporal, in-domain information via self-supervised learning where only the adapter modules' parameters are updated. We demonstrate PECoP's ability to enhance the performance of recent state-of-the-art methods (MUSDL, CoRe, and TSA) applied to AQA, leading to considerable improvements on benchmark datasets, JIGSAWS ($\uparrow6.0\%$), MTL-AQA ($\uparrow0.99\%$), and FineDiving ($\uparrow2.54\%$). We also present a new Parkinson's Disease dataset, PD4T, of real patients performing four various actions, where we surpass ($\uparrow3.56\%$) the state-of-the-art in comparison. Our code, pretrained models, and the PD4T dataset are available at https://github.com/Plrbear/PECoP.

摘要
因为Action Quality Assessment（AQA）的标注数据有限，previous works通常是 fine-tune 在大规模的领域通用数据集上预训练的模型。这种常见的方法会导致弱化泛化，特别是当领域shift很大时。我们提出了一种新的、效率的 continual pretraining 框架，PECoP，以减少领域shift。在 PECoP 中，我们引入了 3D-Adapters，用于在预训练模型中学习空间时间领域信息，通过自我超vised learning，只有 adapter modules 的参数被更新。我们证明了 PECoP 能够提高最近state-of-the-art方法（MUSDL、CoRe 和 TSA）在 AQA 中的性能，在 benchmark 数据集（JIGSAWS、MTL-AQA 和 FineDiving）上实现了显著提高（$\uparrow6.0\%$, $\uparrow0.99\%$ 和 $\uparrow2.54\%$）。我们还发布了一个新的 Parkinson's Disease 数据集，PD4T，包含了四种不同的动作，我们在 comparison 中超过了 state-of-the-art（$\uparrow3.56\%$）。我们的代码、预训练模型和 PD4T 数据集可以在 GitHub 上获取：https://github.com/Plrbear/PECoP。

Polarimetric PatchMatch Multi-View Stereo

paper_url: http://arxiv.org/abs/2311.07600
repo_url: None
paper_authors: Jinyu Zhao, Jumpei Oishi, Yusuke Monno, Masatoshi Okutomi
for: 这paper是为了提高多视图ステレオ（MVS）的准确性和完整性而设计的。
methods: 这paper使用的方法是PatchMatch multi-view Stereo（PatchMatch MVS），该方法通过生成深度和法向假设，并效率地在多视图图像中寻找最佳假设，以确定物体的三维模型。此外，这paper还引入了抗licht极化信息来评估假设的正确性。
results: 实验结果表明，对比现有PatchMatch MVS方法，PolarPMS可以提高三维模型的准确性和完整性，特别是对于无文本表面。

Abstract
PatchMatch Multi-View Stereo (PatchMatch MVS) is one of the popular MVS approaches, owing to its balanced accuracy and efficiency. In this paper, we propose Polarimetric PatchMatch multi-view Stereo (PolarPMS), which is the first method exploiting polarization cues to PatchMatch MVS. The key of PatchMatch MVS is to generate depth and normal hypotheses, which form local 3D planes and slanted stereo matching windows, and efficiently search for the best hypothesis based on the consistency among multi-view images. In addition to standard photometric consistency, our PolarPMS evaluates polarimetric consistency to assess the validness of a depth and normal hypothesis, motivated by the physical property that the polarimetric information is related to the object's surface normal. Experimental results demonstrate that our PolarPMS can improve the accuracy and the completeness of reconstructed 3D models, especially for texture-less surfaces, compared with state-of-the-art PatchMatch MVS methods.

摘要
patchmatch多视图雷达（PatchMatch MVS）是一种受欢迎的MVS方法，因为它的平衡准确性和效率。在这篇论文中，我们提出了抗 polarimetric PatchMatch多视图雷达（PolarPMS），这是第一种利用抗 polarimetric 信号来PatchMatch MVS的方法。patchmatch MVS的关键在于生成深度和法向假设，形成局部三维平面和斜视匹配窗口，然后高效地搜索最佳假设，基于多视图图像的一致性。除了标准光度一致性外，我们的PolarPMS还评估抗 polarimetric一致性，以评估假设的有效性，这是因为物体表面法向的物理特性和抗 polarimetric信号之间存在关系。实验结果表明，我们的PolarPMS可以提高准确性和完整性的三维模型重建，特别是面粗糙表面，相比现有的PatchMatch MVS方法。

CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer

paper_url: http://arxiv.org/abs/2311.06443
repo_url: https://github.com/HowieMa/CVTHead
paper_authors: Haoyu Ma, Tong Zhang, Shanlin Sun, Xiangyi Yan, Kun Han, Xiaohui Xie
for: 本研究旨在 reconstruction 个性化动画人头模型，以便在 AR/VR 领域中实现真实时间的人脸动画。
methods: 本研究使用 point-based 神经渲染技术，从单个参考图像中生成可控的神经头像。该方法利用 mesh 中稀疏的顶点点集，并采用提出的 Vertex-feature Transformer 来学习每个顶点的本地特征描述符。这使得可以模型所有顶点之间的长距离依赖关系。
results: 实验结果表明，CVTHead 可以与现状的图形学基于方法相比，实现相似的性能。此外，它还允许在不同的表情、头部姿态和摄像头视图下，高效地渲染出新的人头模型。这些属性可以通过 3DMM 的偏置系数进行控制，以实现多样化和真实的动画在真实时间enario中。

Abstract
Reconstructing personalized animatable head avatars has significant implications in the fields of AR/VR. Existing methods for achieving explicit face control of 3D Morphable Models (3DMM) typically rely on multi-view images or videos of a single subject, making the reconstruction process complex. Additionally, the traditional rendering pipeline is time-consuming, limiting real-time animation possibilities. In this paper, we introduce CVTHead, a novel approach that generates controllable neural head avatars from a single reference image using point-based neural rendering. CVTHead considers the sparse vertices of mesh as the point set and employs the proposed Vertex-feature Transformer to learn local feature descriptors for each vertex. This enables the modeling of long-range dependencies among all the vertices. Experimental results on the VoxCeleb dataset demonstrate that CVTHead achieves comparable performance to state-of-the-art graphics-based methods. Moreover, it enables efficient rendering of novel human heads with various expressions, head poses, and camera views. These attributes can be explicitly controlled using the coefficients of 3DMMs, facilitating versatile and realistic animation in real-time scenarios.

摘要
<>将个性化动画头模型重建为有关AR/VR的研究领域有着重要意义。现有的实现方法通常需要多视图图像或视频，这使得重建过程变得复杂。另外，传统的渲染管道时间consuming，限制了实时动画的可能性。在这篇论文中，我们介绍CVTHead，一种新的方法，可以从单个参考图像中生成可控的神经头模型。CVTHead使用点集为网格的稀疏顶点来学习本地特征描述符，这使得模型可以学习所有顶点之间的长距离依赖关系。实验结果表明，CVTHead可以与现有的图形学基于方法相比，在VOXCELEB数据集上实现相似的性能。此外，它可以高效地渲染 novel human head 模型，包括不同的表情、头部姿态和摄像头视角。这些特性可以通过3DMM的系数来控制，从而实现有效的实时动画。Note: The translation is in Simplified Chinese, which is the standard written form of Chinese used in mainland China. If you prefer Traditional Chinese, please let me know and I can provide the translation in that form instead.