cs.CV - 2023-10-31

Decodable and Sample Invariant Continuous Object Encoder

  • paper_url: http://arxiv.org/abs/2311.00187
  • repo_url: https://github.com/dhyuan99/hdfe
  • paper_authors: Dehao Yuan, Furong Huang, Cornelia Fermüller, Yiannis Aloimonos
  • for: This paper proposes Hyper-Dimensional Function Encoding (HDFE), a technique for turning continuous objects (e.g. functions) into explicit vector representations that neural networks can consume as input.
  • methods: HDFE uses sample-distribution and density invariance to produce explicit vector representations of continuous objects without any training. It maps objects into an organized embedding space, which facilitates training of downstream tasks, and the encoding is decodable.
  • results: HDFE achieves competitive performance on function-to-function mapping and yields 12% and 15% error reductions on two point cloud surface normal estimation benchmarks. Furthermore, integrating HDFE into the PointNet-based SOTA network improves the SOTA baseline by 2.5% and 1.7% on the same benchmarks.
    Abstract We propose Hyper-Dimensional Function Encoding (HDFE). Given samples of a continuous object (e.g. a function), HDFE produces an explicit vector representation of the given object, invariant to the sample distribution and density. Sample distribution and density invariance enables HDFE to consistently encode continuous objects regardless of their sampling, and therefore allows neural networks to receive continuous objects as inputs for machine learning tasks, such as classification and regression. Besides, HDFE does not require any training and is proved to map the object into an organized embedding space, which facilitates the training of the downstream tasks. In addition, the encoding is decodable, which enables neural networks to regress continuous objects by regressing their encodings. Therefore, HDFE serves as an interface for processing continuous objects. We apply HDFE to function-to-function mapping, where vanilla HDFE achieves competitive performance as the state-of-the-art algorithm. We apply HDFE to point cloud surface normal estimation, where a simple replacement from PointNet to HDFE leads to immediate 12% and 15% error reductions in two benchmarks. In addition, by integrating HDFE into the PointNet-based SOTA network, we improve the SOTA baseline by 2.5% and 1.7% in the same benchmarks.
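A minimal sketch of the flavor of HDFE, using fractional power encoding from the hyperdimensional-computing literature. The dimension, bandwidth, and the simple mean-and-normalize bundling below are illustrative assumptions rather than the paper's exact construction (see the linked repo for that):

```python
import numpy as np

D = 4096
rng = np.random.default_rng(0)
theta_x = rng.normal(0.0, 20.0, D)  # random frequencies encoding inputs x
theta_y = rng.normal(0.0, 20.0, D)  # random frequencies encoding values f(x)

def fpe(v, theta):
    """Fractional power encoding: scalar -> unit-modulus complex hypervector."""
    return np.exp(1j * v * theta)

def encode_function(xs, ys):
    """Bind each (x, f(x)) pair elementwise, bundle by averaging, normalize.
    Normalizing removes the dependence on the raw sample count; the paper's
    construction additionally handles non-uniform sample distributions."""
    F = np.mean([fpe(x, theta_x) * fpe(y, theta_y)
                 for x, y in zip(xs, ys)], axis=0)
    return F / (np.linalg.norm(F) + 1e-12)

def decode(F, x, candidates):
    """Unbind the query x from the encoding, then score candidate values."""
    probe = F * np.conj(fpe(x, theta_x))  # ~ fpe(f(x), theta_y) + noise
    scores = [np.real(np.vdot(fpe(c, theta_y), probe)) for c in candidates]
    return candidates[int(np.argmax(scores))]

xs = rng.uniform(0, 1, 500)
F = encode_function(xs, np.sin(2 * np.pi * xs))
print(decode(F, 0.25, np.linspace(-1.2, 1.2, 241)))  # approx. sin(pi/2) = 1.0
```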

Image Restoration with Point Spread Function Regularization and Active Learning

  • paper_url: http://arxiv.org/abs/2311.00186
  • repo_url: None
  • paper_authors: Peng Jia, Jiameng Lv, Runyu Ning, Yu Song, Nan Li, Kaifan Ji, Chenzhou Cui, Shanshan Li
  • for: This paper aims to improve the accuracy and efficiency of information extraction from astronomical images by connecting a deep learning-based restoration algorithm with a high-fidelity telescope simulator.
  • methods: During the training stage, the simulator generates images with different levels of blur and noise, and the neural network is trained on them based on the quality of the restored images.
  • results: The algorithm effectively enhances fine structures in blurry images and increases the quality of observation images; it can be applied to large-scale sky survey data such as that from LSST, Euclid, and CSST.
    Abstract Large-scale astronomical surveys can capture numerous images of celestial objects, including galaxies and nebulae. Analysing and processing these images can reveal intricate internal structures of these objects, allowing researchers to conduct comprehensive studies on their morphology, evolution, and physical properties. However, varying noise levels and point spread functions can hamper the accuracy and efficiency of information extraction from these images. To mitigate these effects, we propose a novel image restoration algorithm that connects a deep learning-based restoration algorithm with a high-fidelity telescope simulator. During the training stage, the simulator generates images with different levels of blur and noise to train the neural network based on the quality of restored images. After training, the neural network can directly restore images obtained by the telescope, as represented by the simulator. We have tested the algorithm using real and simulated observation data and have found that it effectively enhances fine structures in blurry images and increases the quality of observation images. This algorithm can be applied to large-scale sky survey data, such as data obtained by LSST, Euclid, and CSST, to further improve the accuracy and efficiency of information extraction, promoting advances in the field of astronomical research.
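A hedged sketch of the simulator side of such a training pipeline: degrade a clean image with a point spread function (PSF) and noise at varying levels to build restoration training pairs. The Gaussian PSF model and noise parameters are placeholder assumptions, not the paper's telescope simulator:

```python
import numpy as np
from scipy.signal import fftconvolve

def gaussian_psf(size=21, fwhm=3.0):
    """Simple Gaussian PSF; real simulators use measured/optical PSF models."""
    sigma = fwhm / 2.355
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()

def degrade(clean, fwhm, sky_level, gain=1.0, read_noise=5.0, rng=None):
    """Blur with the PSF, then add Poisson shot noise and Gaussian read noise."""
    rng = rng if rng is not None else np.random.default_rng()
    blurred = fftconvolve(clean, gaussian_psf(fwhm=fwhm), mode="same")
    counts = rng.poisson(np.clip(blurred + sky_level, 0, None) * gain) / gain
    return counts + rng.normal(0.0, read_noise, clean.shape)

# training pairs at varying blur/noise levels, as in the paper's training stage
clean = np.random.default_rng(0).uniform(0, 100, (128, 128))
pairs = [(degrade(clean, fwhm=f, sky_level=s), clean)
         for f in (2.0, 3.5, 5.0) for s in (10.0, 50.0)]
```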

Object-centric Video Representation for Long-term Action Anticipation

  • paper_url: http://arxiv.org/abs/2311.00180
  • repo_url: https://github.com/brown-palm/objectprompt
  • paper_authors: Ce Zhang, Changcheng Fu, Shijie Wang, Nakul Agarwal, Kwonjoon Lee, Chiho Choi, Chen Sun
  • for: This paper builds object-centric video representations for long-term action anticipation.
  • methods: It uses visual-language pretrained models to extract task-specific object-centric representations via "object prompts", without finetuning and without in-domain supervised object detectors or a fully weakly-supervised pipeline.
  • results: A Transformer-based neural architecture retrieves relevant objects at various time scales to predict human-object interactions; extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks confirm the method's effectiveness.
    Abstract This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the predictions are longer term, as an observed "background" object could be used by the human actor in the future. We observe that existing object-based video recognition frameworks either assume the existence of in-domain supervised object detectors or follow a fully weakly-supervised pipeline to infer object locations from action labels. We propose to build object-centric video representations by leveraging visual-language pretrained models. This is achieved by "object prompts", an approach to extract task-specific object-centric representations from general-purpose pretrained models without finetuning. To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales. We conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both quantitative and qualitative results confirm the effectiveness of our proposed method.
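A rough sketch of the "object prompts" idea, using OpenAI's CLIP as a stand-in for the visual-language pretrained model; the prompt template and task vocabulary are assumptions:

```python
import torch
import clip  # OpenAI CLIP (github.com/openai/CLIP), a stand-in VL model

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
objects = ["knife", "cutting board", "tomato", "bowl"]  # assumed task vocabulary
prompts = clip.tokenize([f"a photo of a {o}" for o in objects]).to(device)

@torch.no_grad()
def object_centric_features(frame):  # frame: PIL.Image
    """Score task-relevant objects in a frame with a frozen VL model
    (no finetuning, no in-domain detector) and build a frame descriptor."""
    img = preprocess(frame).unsqueeze(0).to(device)
    img_f = model.encode_image(img)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    sims = (img_f @ txt_f.T).squeeze(0)  # per-object relevance scores
    # weight text embeddings by relevance -> object-centric frame descriptor
    return (sims.softmax(dim=0).unsqueeze(1) * txt_f).sum(dim=0)
```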

Multi-task Deep Convolutional Network to Predict Sea Ice Concentration and Drift in the Arctic Ocean

  • paper_url: http://arxiv.org/abs/2311.00167
  • repo_url: None
  • paper_authors: Younghyun Koo, Maryam Rahnemoonfar
  • for: This paper aims to improve the prediction of sea ice concentration (SIC) and sea ice drift (SID) in the Arctic Ocean using a novel multi-task fully convolutional network architecture called hierarchical information-sharing U-net (HIS-Unet).
  • methods: The HIS-Unet model uses weighting attention modules (WAMs) to allow the SIC and SID layers to share information and assist each other's prediction. The model is trained on a large dataset of satellite images and compared to statistical approaches, sea ice physical models, and neural networks without information-sharing units.
  • results: The HIS-Unet model outperforms the other methods in predicting both SIC and SID, particularly in areas with seasonal sea ice changes. The weight values of the WAMs suggest that SIC information is more important for SID prediction than vice versa, and that information sharing is more active at sea ice edges than in the central Arctic.
    Abstract Forecasting sea ice concentration (SIC) and sea ice drift (SID) in the Arctic Ocean is of great significance as the Arctic environment has been changed by the recent warming climate. Given that physical sea ice models require high computational costs with complex parameterization, deep learning techniques can effectively replace the physical model and improve the performance of sea ice prediction. This study proposes a novel multi-task fully convolutional network architecture named hierarchical information-sharing U-net (HIS-Unet) to predict daily SIC and SID. Instead of learning SIC and SID separately at each branch, we allow the SIC and SID layers to share their information and assist each other's prediction through the weighting attention modules (WAMs). Consequently, our HIS-Unet outperforms other statistical approaches, sea ice physical models, and neural networks without such information-sharing units. The improvement of HIS-Unet is obvious both for SIC and SID prediction when and where sea ice conditions change seasonally, which implies that the information sharing through WAMs allows the model to learn the sudden changes of SIC and SID. The weight values of the WAMs imply that SIC information plays a more critical role in SID prediction, compared to that of SID information in SIC prediction, and information sharing is more active in sea ice edges (seasonal sea ice) than in the central Arctic (multi-year sea ice).
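An illustrative sketch of a weighting attention module (WAM) letting two task branches exchange features; the exact form and placement of the WAMs in HIS-Unet may differ:

```python
import torch
import torch.nn as nn

class WeightingAttention(nn.Module):
    """Each branch keeps its own features plus a learned, gated share of the
    other branch's features (a plausible reading of the WAM idea)."""
    def __init__(self, channels):
        super().__init__()
        self.gate_sic = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(channels, channels, 1),
                                      nn.Sigmoid())
        self.gate_sid = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(channels, channels, 1),
                                      nn.Sigmoid())

    def forward(self, f_sic, f_sid):
        f_sic_out = f_sic + self.gate_sic(f_sid) * f_sid
        f_sid_out = f_sid + self.gate_sid(f_sic) * f_sic
        return f_sic_out, f_sid_out

wam = WeightingAttention(64)
sic, sid = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
sic, sid = wam(sic, sid)  # information-shared features for the two task heads
```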

Medi-CAT: Contrastive Adversarial Training for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2311.00154
  • repo_url: None
  • paper_authors: Pervaiz Iqbal Khan, Andreas Dengel, Sheraz Ahmed
  • for: This paper addresses the underfitting and overfitting problems that arise when training deep models on small medical image datasets.
  • methods: It proposes a training strategy, Medi-CAT, that uses large pretrained vision transformers to overcome underfitting and adversarial and contrastive learning techniques to prevent overfitting.
  • results: On four medical image classification datasets from the MedMNIST collection, the method improves accuracy by up to 2% over well-known approaches on three benchmark datasets and by up to 4.1% over baseline methods.
    Abstract There are not many large medical image datasets available. For these datasets, too small deep learning models can't learn useful features, so they don't work well due to underfitting, and too big models tend to overfit the limited data. As a result, there is a compromise between the two issues. This paper proposes a training strategy Medi-CAT to overcome the underfitting and overfitting phenomena in medical imaging datasets. Specifically, the proposed training methodology employs large pre-trained vision transformers to overcome underfitting and adversarial and contrastive learning techniques to prevent overfitting. The proposed method is trained and evaluated on four medical image classification datasets from the MedMNIST collection. Our experimental results indicate that the proposed approach improves the accuracy up to 2% on three benchmark datasets compared to well-known approaches, whereas it increases the performance up to 4.1% over the baseline methods.
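A compact sketch combining the two ingredients named in the abstract into one training loss; the FGSM attack step, the use of logits as contrastive embeddings, the temperature, and the weighting are all simplifying assumptions, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def medi_cat_loss(model, x, y, eps=2 / 255, tau=0.1, w_con=0.5):
    """Clean CE + adversarial CE + a contrastive term pulling clean and
    adversarial views of the same image together."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # adversarial view: one FGSM step on the input
    (g,) = torch.autograd.grad(ce, x, retain_graph=True)
    x_adv = (x + eps * g.sign()).detach()
    logits_adv = model(x_adv)
    ce_adv = F.cross_entropy(logits_adv, y)
    # contrastive term (logits used as embeddings for brevity; a projection
    # head would normally be used)
    z = F.normalize(logits, dim=1)
    z_adv = F.normalize(logits_adv, dim=1)
    sim = z @ z_adv.t() / tau
    con = F.cross_entropy(sim, torch.arange(len(x), device=x.device))
    return ce + ce_adv + w_con * con
```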

Joint Depth Prediction and Semantic Segmentation with Multi-View SAM

  • paper_url: http://arxiv.org/abs/2311.00134
  • repo_url: None
  • paper_authors: Mykhailo Shvets, Dongxu Zhao, Marc Niethammer, Roni Sengupta, Alexander C. Berg
  • for: This paper aims to improve multi-view depth prediction and semantic segmentation jointly.
  • methods: It proposes a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM); the enhanced depth prediction in turn serves as a prompt to a Transformer-based semantic segmentation decoder.
  • results: In quantitative and qualitative studies on the ScanNet dataset, the method outperforms single-task MVS and semantic segmentation models as well as multi-task monocular methods.
    Abstract Multi-task approaches to joint depth and segmentation prediction are well-studied for monocular images. Yet, predictions from a single-view are inherently limited, while multiple views are available in many robotics applications. On the other end of the spectrum, video-based and full 3D methods require numerous frames to perform reconstruction and segmentation. With this work we propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM). This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder. We report the mutual benefit that both tasks enjoy in our quantitative and qualitative studies on the ScanNet dataset. Our approach consistently outperforms single-task MVS and segmentation models, along with multi-task monocular methods.

Spuriosity Rankings for Free: A Simple Framework for Last Layer Retraining Based on Object Detection

  • paper_url: http://arxiv.org/abs/2311.00079
  • repo_url: None
  • paper_authors: Mohammad Azizmalayeri, Reza Abbasi, Amir Hosein Haji Mohammad rezaie, Reihaneh Zohrabi, Mahdi Amiri, Mohammad Taghi Manzuri, Mohammad Hossein Rohban
  • for: This paper addresses the reliance of deep neural network models on spurious features by proposing a new ranking framework for last-layer retraining.
  • methods: It uses an open-vocabulary object detection technique to score the presence of the target object in each image, sorts images by this score, and retrains the last layer on the highest-scoring subset.
  • results: Experiments on ImageNet-1k show that the framework effectively sorts images by spuriousness and that using the sorted images for last-layer retraining improves model reliability.
    Abstract Deep neural networks have exhibited remarkable performance in various domains. However, the reliance of these models on spurious features has raised concerns about their reliability. A promising solution to this problem is last-layer retraining, which involves retraining the linear classifier head on a small subset of data without spurious cues. Nevertheless, selecting this subset requires human supervision, which reduces its scalability. Moreover, spurious cues may still exist in the selected subset. As a solution to this problem, we propose a novel ranking framework that leverages an open vocabulary object detection technique to identify images without spurious cues. More specifically, we use the object detector as a measure to score the presence of the target object in the images. Next, the images are sorted based on this score, and the last-layer of the model is retrained on a subset of the data with the highest scores. Our experiments on the ImageNet-1k dataset demonstrate the effectiveness of this ranking framework in sorting images based on spuriousness and using them for last-layer retraining.
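The recipe condenses to a few lines. Here `detector_score` stands in for an open-vocabulary detector (e.g. OWL-ViT) returning the confidence that the image's own class object is visibly present, and `head` for a linear classifier such as scikit-learn's LogisticRegression; both names are hypothetical:

```python
import numpy as np

def rank_and_retrain(images, labels, features, detector_score, head, keep=0.2):
    # 1) score each image by the detector's confidence for its own class
    scores = np.array([detector_score(img, cls)
                       for img, cls in zip(images, labels)])
    # 2) keep the top fraction: images whose label object is clearly present
    #    are the least likely to rely on spurious cues
    idx = np.argsort(-scores)[: int(keep * len(images))]
    # 3) retrain only the linear classifier head on that subset
    head.fit(features[idx], labels[idx])
    return head
```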

YOLOv8-Based Visual Detection of Road Hazards: Potholes, Sewer Covers, and Manholes

  • paper_url: http://arxiv.org/abs/2311.00073
  • repo_url: None
  • paper_authors: Om M. Khare, Shubham Gandhi, Aditya M. Rahalkar, Sunil Mane
  • for: This study comprehensively evaluates YOLOv8 for detecting road hazards such as potholes, sewer covers, and manholes, in support of road maintenance and safety.
  • methods: The YOLOv8 object detection model is compared against its predecessors YOLOv5 and YOLOv7; image preprocessing techniques and hyperparameter tuning are applied to improve detection accuracy.
  • results: YOLOv8 shows strong detection accuracy and generalization across variations in lighting, road type, and hazard size and type.
    Abstract Effective detection of road hazards plays a pivotal role in road infrastructure maintenance and ensuring road safety. This research paper provides a comprehensive evaluation of YOLOv8, an object detection model, in the context of detecting road hazards such as potholes, Sewer Covers, and Man Holes. A comparative analysis with previous iterations, YOLOv5 and YOLOv7, is conducted, emphasizing the importance of computational efficiency in various applications. The paper delves into the architecture of YOLOv8 and explores image preprocessing techniques aimed at enhancing detection accuracy across diverse conditions, including variations in lighting, road types, hazard sizes, and types. Furthermore, hyperparameter tuning experiments are performed to optimize model performance through adjustments in learning rates, batch sizes, anchor box sizes, and augmentation strategies. Model evaluation is based on Mean Average Precision (mAP), a widely accepted metric for object detection performance. The research assesses the robustness and generalization capabilities of the models through mAP scores calculated across the diverse test scenarios, underlining the significance of YOLOv8 in road hazard detection and infrastructure maintenance.

View Classification and Object Detection in Cardiac Ultrasound to Localize Valves via Deep Learning

  • paper_url: http://arxiv.org/abs/2311.00068
  • repo_url: None
  • paper_authors: Derya Gol Gungor, Bimba Rao, Cynthia Wolverton, Ismayil Guracar
  • for: This paper presents a deep learning-based pipeline for view classification and valve localization in echocardiography, giving clinicians a real-time, low-cost, radiation-free tool for observing cardiac function.
  • methods: The pipeline has two steps: first, view classification sorts echocardiograms into ten unique anatomic views; second, deep learning-based object detection both localizes and identifies the heart valves.
  • results: Object detection experiments on the apical views show that multiple heart valves can be localized and identified precisely.
    Abstract Echocardiography provides an important tool for clinicians to observe the function of the heart in real time, at low cost, and without harmful radiation. Automated localization and classification of heart valves enables automatic extraction of quantities associated with heart mechanical function and related blood flow measurements. We propose a machine learning pipeline that uses deep neural networks for separate classification and localization steps. As the first step in the pipeline, we apply view classification to echocardiograms with ten unique anatomic views of the heart. In the second step, we apply deep learning-based object detection to both localize and identify the valves. Image segmentation based object detection in echocardiography has been shown in many earlier studies but, to the best of our knowledge, this is the first study that predicts the bounding boxes around the valves along with classification from 2D ultrasound images with the help of deep neural networks. Our object detection experiments applied to the Apical views suggest that it is possible to localize and identify multiple valves precisely.

FPO++: Efficient Encoding and Rendering of Dynamic Neural Radiance Fields by Analyzing and Enhancing Fourier PlenOctrees

  • paper_url: http://arxiv.org/abs/2310.20710
  • repo_url: None
  • paper_authors: Saskia Rabich, Patrick Stotko, Reinhard Klein
  • for: This work aims at efficient encoding and real-time rendering of dynamic Neural Radiance Fields (NeRF) using Fourier PlenOctrees.
  • methods: It introduces a novel density encoding that adapts the Fourier-based compression to the characteristics of the transfer function used by the underlying volume rendering procedure, substantially reducing compression artifacts, and augments the training data to relax the periodicity assumption of the compression.
  • results: Quantitative and qualitative evaluations on synthetic and real-world scenes demonstrate the effectiveness of the enhanced Fourier PlenOctrees.
    Abstract Fourier PlenOctrees have shown to be an efficient representation for real-time rendering of dynamic Neural Radiance Fields (NeRF). Despite its many advantages, this method suffers from artifacts introduced by the involved compression when combining it with recent state-of-the-art techniques for training the static per-frame NeRF models. In this paper, we perform an in-depth analysis of these artifacts and leverage the resulting insights to propose an improved representation. In particular, we present a novel density encoding that adapts the Fourier-based compression to the characteristics of the transfer function used by the underlying volume rendering procedure and leads to a substantial reduction of artifacts in the dynamic model. Furthermore, we show an augmentation of the training data that relaxes the periodicity assumption of the compression. We demonstrate the effectiveness of our enhanced Fourier PlenOctrees in the scope of quantitative and qualitative evaluations on synthetic and real-world scenes.
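A toy illustration of the Fourier compression underlying Fourier PlenOctrees: a per-voxel time series (e.g. density over frames) is stored as a few DFT coefficients and reconstructed on the fly. The paper's transfer-function-aware density encoding is not reproduced here:

```python
import numpy as np

def compress(series, k):
    """Keep only the k lowest-frequency DFT coefficients of a time series."""
    return np.fft.rfft(series)[:k]

def reconstruct(coeffs, n):
    """Zero-pad the kept coefficients and invert the transform."""
    full = np.zeros(n // 2 + 1, dtype=complex)
    full[: len(coeffs)] = coeffs
    return np.fft.irfft(full, n)

# a per-voxel density over 60 frames, truncated to 8 coefficients
density = np.clip(np.sin(np.linspace(0, 4 * np.pi, 60)) + 0.3, 0, None)
approx = reconstruct(compress(density, 8), 60)
# the ringing around the clipped regions of `approx` is the kind of
# compression artifact the paper's density encoding is designed to reduce
```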

DDAM-PS: Diligent Domain Adaptive Mixer for Person Search

  • paper_url: http://arxiv.org/abs/2310.20706
  • repo_url: https://github.com/mustansarfiaz/ddam-ps
  • paper_authors: Mohammed Khaleed Almansoori, Mustansar Fiaz, Hisham Cholakkal
  • for: This paper proposes a diligent domain adaptive mixer (DDAM) framework for person search (PS) that improves knowledge transfer from a labeled source domain to an unlabeled target domain.
  • methods: A novel DDAM module combines source- and target-domain representations to generate moderate mixed-domain representations, encouraging domain mixing to ease the ReID task. Two bridge losses keep the mixed-domain representations at an appropriate distance from both the source and target representations, and a disparity loss prevents them from being biased toward either domain, thereby avoiding overfitting.
  • results: Experiments validate the effectiveness of the proposed method, which performs favorably on the challenging PRW and CUHK-SYSU datasets.
    Abstract Person search (PS) is a challenging computer vision problem where the objective is to achieve joint optimization for pedestrian detection and re-identification (ReID). Although previous advancements have shown promising performance in the field under fully and weakly supervised learning fashion, there exists a major gap in investigating the domain adaptation ability of PS models. In this paper, we propose a diligent domain adaptive mixer (DDAM) for person search (DDAP-PS) framework that aims to bridge a gap to improve knowledge transfer from the labeled source domain to the unlabeled target domain. Specifically, we introduce a novel DDAM module that generates moderate mixed-domain representations by combining source and target domain representations. The proposed DDAM module encourages domain mixing to minimize the distance between the two extreme domains, thereby enhancing the ReID task. To achieve this, we introduce two bridge losses and a disparity loss. The objective of the two bridge losses is to guide the moderate mixed-domain representations to maintain an appropriate distance from both the source and target domain representations. The disparity loss aims to prevent the moderate mixed-domain representations from being biased towards either the source or target domains, thereby avoiding overfitting. Furthermore, we address the conflict between the two subtasks, localization and ReID, during domain adaptation. To handle this cross-task conflict, we forcefully decouple the norm-aware embedding, which aids in better learning of the moderate mixed-domain representation. We conduct experiments to validate the effectiveness of our proposed method. Our approach demonstrates favorable performance on the challenging PRW and CUHK-SYSU datasets. Our source code is publicly available at \url{https://github.com/mustansarfiaz/DDAM-PS}.
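A loose sketch of the mixed-domain representation with bridge and disparity losses as described in the abstract; the mixer architecture, distance choices, and weighting are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DDAMixer(nn.Module):
    """Produces moderate mixed-domain features from source/target features."""
    def __init__(self, dim=256):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, f_src, f_tgt):
        return self.mix(torch.cat([f_src, f_tgt], dim=1))

def ddam_losses(f_mix, f_src, f_tgt):
    d_src = F.mse_loss(f_mix, f_src.detach())
    d_tgt = F.mse_loss(f_mix, f_tgt.detach())
    bridge = d_src + d_tgt             # stay close enough to both domains
    disparity = (d_src - d_tgt).abs()  # ...but not biased toward either one
    return bridge, disparity

mixer = DDAMixer()
f_src, f_tgt = torch.randn(8, 256), torch.randn(8, 256)
bridge, disparity = ddam_losses(mixer(f_src, f_tgt), f_src, f_tgt)
loss = bridge + 0.1 * disparity        # weighting is a placeholder
```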

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

  • paper_url: http://arxiv.org/abs/2310.20700
  • repo_url: None
  • paper_authors: Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu
  • for: This work aims to generate high-quality long ("story-level") videos with smooth, creative transitions between scenes, generated automatically from textual descriptions.
  • methods: The model, SEINE, is a random-mask video diffusion model: given images of different scenes as inputs, combined with text-based control, it generates transition videos that ensure coherence and visual quality.
  • results: Extensive experiments validate the approach for creating smooth, creative long videos; the model also extends to tasks such as image-to-video animation and autoregressive video prediction.
    Abstract Recently video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/ .
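A sketch of the conditioning setup for generative transition: keep the first and last frames (from two different scenes) visible and let the diffusion model inpaint the frames in between. The shapes and the denoiser interface are assumptions:

```python
import torch

def transition_inputs(scene_a_frame, scene_b_frame, num_frames=16):
    """Build a partially observed clip plus a mask: 1 = observed, 0 = generate."""
    C, H, W = scene_a_frame.shape
    video = torch.zeros(num_frames, C, H, W)
    mask = torch.zeros(num_frames, 1, H, W)
    video[0], mask[0] = scene_a_frame, 1.0
    video[-1], mask[-1] = scene_b_frame, 1.0
    return video, mask

# a denoiser(noisy_video, t, video, mask, text_emb) would then predict the
# masked frames conditioned on the visible endpoints and the text prompt
video, mask = transition_inputs(torch.rand(3, 256, 256), torch.rand(3, 256, 256))
```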

NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

  • paper_url: http://arxiv.org/abs/2310.20685
  • repo_url: https://github.com/mikacuy/PL-NeRF
  • paper_authors: Mikaela Angelina Uy, Kiyohiro Nakayama, Guandao Yang, Rahul Krishna Thomas, Leonidas Guibas, Ke Li
  • for: This work addresses the quadrature instability of volume rendering in NeRF, improving rendering fidelity and depth supervision.
  • methods: It proposes a mathematically principled solution that reformulates the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves conflicts between samples along different rays, imprecise hierarchical sampling, and the non-differentiability of quantiles of ray termination distances with respect to model parameters.
  • results: Compared with the classical sample-based rendering equation, the method yields sharper textures, better geometric reconstruction, and stronger depth supervision, and it can serve as a drop-in replacement for the volume rendering equation of existing NeRF-based methods. Detailed results and the project page are at pl-nerf.github.io.
    Abstract Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon that we dub quadrature instability. We propose a mathematically principled solution by reformulating the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves multiple issues: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of quantiles of ray termination distances w.r.t. model parameters. We demonstrate several benefits over the classical sample-based rendering equation, such as sharper textures, better geometric reconstruction, and stronger depth supervision. Our proposed formulation can be also be used as a drop-in replacement to the volume rendering equation of existing NeRF-based methods. Our project page can be found at pl-nerf.github.io.
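For context, the classical quadrature the paper revisits is the piecewise-constant approximation of the volume rendering integral; writing it out makes the source of the instability concrete:

```latex
% Classical quadrature of the volume rendering integral under
% piecewise-constant density \sigma_i on intervals of length \delta_i:
\[
\hat{C} = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i,
\qquad
T_i = \exp\Big(-\sum_{j<i} \sigma_j \delta_j\Big).
\]
% Because these weights depend on where the samples fall, \hat{C} changes
% with sample placement; the paper derives the analogous exact weights under
% piecewise-linear density, which removes this quadrature instability.
```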

StairNet: Visual Recognition of Stairs for Human-Robot Locomotion

  • paper_url: http://arxiv.org/abs/2310.20666
  • repo_url: None
  • paper_authors: Andrew Garrett Kurbis, Dmytro Kuzmenko, Bogdan Ivanyuk-Skulskiy, Alex Mihailidis, Brokoslaw Laschowski
  • for: This paper develops new deep learning models for visual sensing and recognition of stairs, to improve human-robot walking with prosthetic legs and exoskeletons.
  • methods: The authors built a large-scale dataset with over 515,000 manually labeled images and developed different deep learning models (e.g., 2D and 3D CNNs, hybrid CNN-LSTM, and ViT networks) and training methods (e.g., supervised learning with temporal data and semi-supervised learning with unlabeled images).
  • results: The models consistently achieved high classification accuracy (up to 98.8%) across designs, offering trade-offs between model accuracy and size. On mobile devices with GPU and NPU accelerators, inference speeds reached 2.8 ms; on custom CPU-powered smart glasses, however, limits of the embedded hardware yielded slower inference of 1.5 s, a trade-off between human-centered design and performance.
    Abstract Human-robot walking with prosthetic legs and exoskeletons, especially over complex terrains such as stairs, remains a significant challenge. Egocentric vision has the unique potential to detect the walking environment prior to physical interactions, which can improve transitions to and from stairs. This motivated us to create the StairNet initiative to support the development of new deep learning models for visual sensing and recognition of stairs, with an emphasis on lightweight and efficient neural networks for onboard real-time inference. In this study, we present an overview of the development of our large-scale dataset with over 515,000 manually labeled images, as well as our development of different deep learning models (e.g., 2D and 3D CNN, hybrid CNN and LSTM, and ViT networks) and training methods (e.g., supervised learning with temporal data and semi-supervised learning with unlabeled images) using our new dataset. We consistently achieved high classification accuracy (i.e., up to 98.8%) with different designs, offering trade-offs between model accuracy and size. When deployed on mobile devices with GPU and NPU accelerators, our deep learning models achieved inference speeds up to 2.8 ms. We also deployed our models on custom-designed CPU-powered smart glasses. However, limitations in the embedded hardware yielded slower inference speeds of 1.5 seconds, presenting a trade-off between human-centered design and performance. Overall, we showed that StairNet can be an effective platform to develop and study new visual perception systems for human-robot locomotion with applications in exoskeleton and prosthetic leg control.

Addressing Limitations of State-Aware Imitation Learning for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.20650
  • repo_url: None
  • paper_authors: Luca Cultrera, Federico Becattini, Lorenzo Seidenari, Pietro Pala, Alberto Del Bimbo
  • for: This work tackles two problems in training autonomous driving agents with conditional imitation learning: the inertia problem, where the agent mistakenly correlates low speed with no acceleration, and the low correlation between offline and online performance caused by the accumulation of small errors that drive the agent into previously unseen states.
  • methods: It proposes a multi-task learning agent based on a multi-stage vision transformer with state token propagation: the vehicle state, together with the representation of the environment, is fed as a special token of the transformer and propagated throughout the network. This allows guiding the driving policy with learned stop/go information, performing data augmentation directly on the vehicle state, and visually explaining the model's decisions.
  • results: Experiments show a drastic decrease in inertia and a high correlation between offline and online metrics.
    Abstract Conditional Imitation learning is a common and effective approach to train autonomous driving agents. However, two issues limit the full potential of this approach: (i) the inertia problem, a special case of causal confusion where the agent mistakenly correlates low speed with no acceleration, and (ii) low correlation between offline and online performance due to the accumulation of small errors that brings the agent in a previously unseen state. Both issues are critical for state-aware models, yet informing the driving agent of its internal state as well as the state of the environment is of crucial importance. In this paper we propose a multi-task learning agent based on a multi-stage vision transformer with state token propagation. We feed the state of the vehicle along with the representation of the environment as a special token of the transformer and propagate it throughout the network. This allows us to tackle the aforementioned issues from different angles: guiding the driving policy with learned stop/go information, performing data augmentation directly on the state of the vehicle and visually explaining the model's decisions. We report a drastic decrease in inertia and a high correlation between offline and online metrics.
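A minimal sketch of state token propagation in a ViT-style encoder; the state fields, dimensions, and single-stage structure are simplifying assumptions (the paper uses a multi-stage transformer):

```python
import torch
import torch.nn as nn

class StateTokenViT(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, n_patches=196):
        super().__init__()
        self.state_proj = nn.Linear(4, dim)  # e.g. speed, steer, accel, brake
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens, state):
        # prepend the state token so every block can attend to it
        # ("propagating" the vehicle state throughout the network)
        s = self.state_proj(state).unsqueeze(1)
        x = torch.cat([s, patch_tokens], dim=1) + self.pos
        x = self.encoder(x)
        return x[:, 0], x[:, 1:]  # refined state token + visual tokens

model = StateTokenViT()
out_state, out_vis = model(torch.randn(2, 196, 256), torch.randn(2, 4))
```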

Dynamic Batch Norm Statistics Update for Natural Robustness

  • paper_url: http://arxiv.org/abs/2310.20649
  • repo_url: None
  • paper_authors: Shahbaz Rezaei, Mohammad Sadegh Norouzzadeh
  • for: Improving the performance of deep neural networks (DNNs) on corrupted samples.
  • methods: The corruption type is detected in the Fourier domain, and the batch normalization (BN) statistics of an off-the-shelf model are updated accordingly.
  • results: The framework yields roughly 8% and 4% accuracy improvements on CIFAR10-C and ImageNet-C, respectively, and can further improve existing robust models such as AugMix and DeepAug.
    Abstract DNNs trained on natural clean samples have been shown to perform poorly on corrupted samples, such as noisy or blurry images. Various data augmentation methods have been recently proposed to improve DNN's robustness against common corruptions. Despite their success, they require computationally expensive training and cannot be applied to off-the-shelf trained models. Recently, it has been shown that updating BatchNorm (BN) statistics of an off-the-shelf model on a single corruption improves its accuracy on that corruption significantly. However, adopting the idea at inference time when the type of corruption is unknown and changing decreases the effectiveness of this method. In this paper, we harness the Fourier domain to detect the corruption type, a challenging task in the image domain. We propose a unified framework consisting of a corruption-detection model and BN statistics update that improves the corruption accuracy of any off-the-shelf trained model. We benchmark our framework on different models and datasets. Our results demonstrate about 8% and 4% accuracy improvement on CIFAR10-C and ImageNet-C, respectively. Furthermore, our framework can further improve the accuracy of state-of-the-art robust models, such as AugMix and DeepAug.
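The BN-statistics update at the heart of the method fits in a few lines of PyTorch; the momentum value and the choice of update batch are left to the corruption-detection step:

```python
import torch

@torch.no_grad()
def update_bn_stats(model, images, momentum=0.1):
    """Refresh BatchNorm running statistics of a frozen, off-the-shelf model
    using a batch of (corrupted) test images; only the BN buffers change."""
    was_training = model.training
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.train()              # train mode -> running stats get updated
            m.momentum = momentum
    model(images)                  # one forward pass updates the buffers
    model.train(was_training)

# usage: detect the corruption type (the paper does this in the Fourier
# domain), gather a matching batch, then call update_bn_stats(model, batch)
```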

Using Higher-Order Moments to Assess the Quality of GAN-generated Image Features

  • paper_url: http://arxiv.org/abs/2310.20636
  • repo_url: None
  • paper_authors: Lorenzo Luzi, Helen Jenne, Ryan Murray, Carlos Ortiz Marrero
  • for: Robustly evaluating generative adversarial networks (GANs).
  • methods: Starting from the Fréchet Inception Distance (FID), the paper defines a new measure, the Skew Inception Distance (SID), that incorporates third moments of image feature data.
  • results: Third-moment information of image features can be used to define the new measure, which either tracks with FID or, in some cases, aligns more closely with human perception.
    Abstract The rapid advancement of Generative Adversarial Networks (GANs) necessitates the need to robustly evaluate these models. Among the established evaluation criteria, the Fr\'{e}chet Inception Distance (FID) has been widely adopted due to its conceptual simplicity, fast computation time, and strong correlation with human perception. However, FID has inherent limitations, mainly stemming from its assumption that feature embeddings follow a Gaussian distribution, and therefore can be defined by their first two moments. As this does not hold in practice, in this paper we explore the importance of third-moments in image feature data and use this information to define a new measure, which we call the Skew Inception Distance (SID). We prove that SID is a pseudometric on probability distributions, show how it extends FID, and present a practical method for its computation. Our numerical experiments support that SID either tracks with FID or, in some cases, aligns more closely with human perception when evaluating image features of ImageNet data.
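For reference, FID compares only the first two moments of Inception features under a Gaussian assumption, which is exactly the information SID extends:

```latex
% Fr\'echet Inception Distance between feature distributions with means
% \mu_1, \mu_2 and covariances \Sigma_1, \Sigma_2 (Gaussian assumption):
\[
\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_1 + \Sigma_2
  - 2\,(\Sigma_1 \Sigma_2)^{1/2}\right).
\]
% SID augments this comparison with third-moment (skewness) information,
% which the Gaussian assumption discards.
```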

Deepfake detection by exploiting surface anomalies: the SurFake approach

  • paper_url: http://arxiv.org/abs/2310.20621
  • repo_url: None
  • paper_authors: Andrea Ciamarra, Roberto Caldelli, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
  • for: This paper investigates deepfake detection to curb the spread of manipulated media across different sectors of everyday life.
  • methods: It proposes a detection approach, SurFake, based on scene characteristics: the surface features depicted in an image are analyzed to obtain a descriptor usable to train a deep learning model for deepfake detection.
  • results: Experiments on the FF++ dataset show that the descriptor accurately discriminates between pristine and altered images and can be combined with visual data to further improve detection accuracy.
    Abstract The ever-increasing use of synthetically generated content in different sectors of our everyday life, one for all media information, poses a strong need for deepfake detection tools in order to avoid the proliferation of altered messages. The process to identify manipulated content, in particular images and videos, is basically performed by looking for the presence of some inconsistencies and/or anomalies specifically due to the fake generation process. Different techniques exist in the scientific literature that exploit diverse ad-hoc features in order to highlight possible modifications. In this paper, we propose to investigate how deepfake creation can impact on the characteristics that the whole scene had at the time of the acquisition. In particular, when an image (video) is captured the overall geometry of the scene (e.g. surfaces) and the acquisition process (e.g. illumination) determine a univocal environment that is directly represented by the image pixel values; all these intrinsic relations are possibly changed by the deepfake generation process. By resorting to the analysis of the characteristics of the surfaces depicted in the image it is possible to obtain a descriptor usable to train a CNN for deepfake detection: we refer to such an approach as SurFake. Experimental results carried out on the FF++ dataset for different kinds of deepfake forgeries and diverse deep learning models confirm that such a feature can be adopted to discriminate between pristine and altered images; furthermore, experiments witness that it can also be combined with visual data to provide a certain improvement in terms of detection accuracy.

Diffusion Reconstruction of Ultrasound Images with Informative Uncertainty

  • paper_url: http://arxiv.org/abs/2310.20618
  • repo_url: https://github.com/Yuxin-Zhang-Jasmine/DRUS-v2
  • paper_authors: Yuxin Zhang, Clément Huneau, Jérôme Idier, Diana Mateus
  • for: Improving ultrasound image quality by addressing its poor signal-to-noise ratio and multiple sources of noise and artifacts.
  • methods: A hybrid of model-based and learning-based approaches: diffusion models are adapted for the enhancement, incorporating ultrasound physics.
  • results: Experiments on simulated, in-vitro, and in-vivo data show high-quality image reconstructions that outperform state-of-the-art methods; an in-depth analysis of the statistical properties of single- and multiple-sample reconstructions shows that their variance is informative and relates empirically to speckle noise.
    Abstract Despite its wide use in medicine, ultrasound imaging faces several challenges related to its poor signal-to-noise ratio and several sources of noise and artefacts. Enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. In recent years, there has been progress both in model-based and learning-based approaches to improve ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid approach leveraging advances in diffusion models. To this end, we adapt Denoising Diffusion Restoration Models (DDRM) to incorporate ultrasound physics through a linear direct model and an unsupervised fine-tuning of the prior diffusion model. We conduct comprehensive experiments on simulated, in-vitro, and in-vivo data, demonstrating the efficacy of our approach in achieving high-quality image reconstructions from a single plane wave input and in comparison to state-of-the-art methods. Finally, given the stochastic nature of the method, we analyse in depth the statistical properties of single and multiple-sample reconstructions, experimentally show the informativeness of their variance, and provide an empirical model relating this behaviour to speckle noise. The code and data are available at: (upon acceptance).

Enhanced Synthetic MRI Generation from CT Scans Using CycleGAN with Feature Extraction

  • paper_url: http://arxiv.org/abs/2310.20604
  • repo_url: None
  • paper_authors: Saba Nikbakhsh, Lachin Naghashyar, Morteza Valizadeh, Mehdi Chehel Amirani
  • for: This paper addresses the challenges of multimodal alignment in radiotherapy planning by introducing an approach for enhanced monomodal registration using synthetic MRI images.
  • methods: Using unpaired data, it combines CycleGANs and feature extractors to produce synthetic MRI images from CT scans.
  • results: The approach outperforms several state-of-the-art methods, with promising results validated by multiple comparison metrics.
    Abstract In the field of radiotherapy, accurate imaging and image registration are of utmost importance for precise treatment planning. Magnetic Resonance Imaging (MRI) offers detailed imaging without being invasive and excels in soft-tissue contrast, making it a preferred modality for radiotherapy planning. However, the high cost of MRI, longer acquisition time, and certain health considerations for patients pose challenges. Conversely, Computed Tomography (CT) scans offer a quicker and less expensive imaging solution. To bridge these modalities and address multimodal alignment challenges, we introduce an approach for enhanced monomodal registration using synthetic MRI images. Utilizing unpaired data, this paper proposes a novel method to produce these synthetic MRI images from CT scans, leveraging CycleGANs and feature extractors. By building upon the foundational work on Cycle-Consistent Adversarial Networks and incorporating advancements from related literature, our methodology shows promising results, outperforming several state-of-the-art methods. The efficacy of our approach is validated by multiple comparison metrics.
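A high-level sketch of a CycleGAN objective augmented with a feature-extraction term; `G_ct2mr`, `G_mr2ct`, and `feat` are assumed networks, and the loss forms and weights are illustrative rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cycle_losses(ct, mr, G_ct2mr, G_mr2ct, feat, w_cyc=10.0, w_feat=1.0):
    """Cycle consistency on images plus a perceptual term on deep features."""
    fake_mr = G_ct2mr(ct)
    fake_ct = G_mr2ct(mr)
    rec_ct = G_mr2ct(fake_mr)          # CT -> MR -> CT
    rec_mr = G_ct2mr(fake_ct)          # MR -> CT -> MR
    cyc = F.l1_loss(rec_ct, ct) + F.l1_loss(rec_mr, mr)
    # feature-extractor term: keep anatomy consistent in a deep feature space
    percep = (F.l1_loss(feat(rec_ct), feat(ct))
              + F.l1_loss(feat(rec_mr), feat(mr)))
    return w_cyc * cyc + w_feat * percep  # adversarial terms omitted for brevity
```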

Brain-like Flexible Visual Inference by Harnessing Feedback-Feedforward Alignment

  • paper_url: http://arxiv.org/abs/2310.20599
  • repo_url: https://github.com/toosi/feedback_feedforward_alignment
  • paper_authors: Tahereh Toosi, Elias B. Issa
  • for: This paper explores the mechanisms by which feedback connections in the visual cortex support flexible visual functions, and proposes a learning algorithm, Feedback-Feedforward Alignment (FFA), that co-optimizes classification and reconstruction tasks.
  • methods: FFA leverages the feedback and feedforward pathways as mutual credit assignment computational graphs, enabling alignment and co-optimization of their objectives.
  • results: FFA effectively co-optimizes classification and reconstruction on the widely used MNIST and CIFAR10 datasets; its alignment mechanism endows feedback connections with emergent visual inference functions such as denoising, resolving occlusions, hallucination, and imagination. Additionally, FFA is more bio-plausible than traditional backpropagation (BP).
    Abstract In natural vision, feedback connections support versatile visual inference capabilities such as making sense of the occluded or noisy bottom-up sensory information or mediating pure top-down processes such as imagination. However, the mechanisms by which the feedback pathway learns to give rise to these capabilities flexibly are not clear. We propose that top-down effects emerge through alignment between feedforward and feedback pathways, each optimizing its own objectives. To achieve this co-optimization, we introduce Feedback-Feedforward Alignment (FFA), a learning algorithm that leverages feedback and feedforward pathways as mutual credit assignment computational graphs, enabling alignment. In our study, we demonstrate the effectiveness of FFA in co-optimizing classification and reconstruction tasks on widely used MNIST and CIFAR10 datasets. Notably, the alignment mechanism in FFA endows feedback connections with emergent visual inference functions, including denoising, resolving occlusions, hallucination, and imagination. Moreover, FFA offers bio-plausibility compared to traditional backpropagation (BP) methods in implementation. By repurposing the computational graph of credit assignment into a goal-driven feedback pathway, FFA alleviates weight transport problems encountered in BP, enhancing the bio-plausibility of the learning algorithm. Our study presents FFA as a promising proof-of-concept for the mechanisms underlying how feedback connections in the visual cortex support flexible visual functions. This work also contributes to the broader field of visual inference underlying perceptual phenomena and has implications for developing more biologically inspired learning algorithms.

FLODCAST: Flow and Depth Forecasting via Multimodal Recurrent Architectures

  • paper_url: http://arxiv.org/abs/2310.20593
  • repo_url: None
  • paper_authors: Andrea Ciamarra, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
  • for: Forecasting the motion and spatial positions of objects, which is fundamental in safety-critical settings such as autonomous driving.
  • methods: The proposed FLODCAST model forecasts two complementary modalities, optical flow and depth, jointly with a multitask recurrent architecture, and predicts several timesteps into the future.
  • results: On the challenging Cityscapes dataset, the model obtains state-of-the-art results for both flow and depth forecasting, and the generated flows also benefit the downstream task of segmentation forecasting within a flow-based mask-warping framework.
    Abstract Forecasting motion and spatial positions of objects is of fundamental importance, especially in safety-critical settings such as autonomous driving. In this work, we address the issue by forecasting two different modalities that carry complementary information, namely optical flow and depth. To this end we propose FLODCAST a flow and depth forecasting model that leverages a multitask recurrent architecture, trained to jointly forecast both modalities at once. We stress the importance of training using flows and depth maps together, demonstrating that both tasks improve when the model is informed of the other modality. We train the proposed model to also perform predictions for several timesteps in the future. This provides better supervision and leads to more precise predictions, retaining the capability of the model to yield outputs autoregressively for any future time horizon. We test our model on the challenging Cityscapes dataset, obtaining state of the art results for both flow and depth forecasting. Thanks to the high quality of the generated flows, we also report benefits on the downstream task of segmentation forecasting, injecting our predictions in a flow-based mask-warping framework.
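As a rough illustration of a multitask recurrent forecaster with autoregressive rollout (a drastically simplified stand-in for FLODCAST, which operates convolutionally on full-resolution maps), the following PyTorch sketch shares one GRU between a flow head and a depth head; all dimensions and the dummy data are assumptions.

```python
import torch
import torch.nn as nn

class ToyFlowDepthForecaster(nn.Module):
    """Toy multitask recurrent forecaster: one shared GRU, two heads.
    Inputs are flattened flow (2*H*W) and depth (H*W) maps per frame."""
    def __init__(self, flow_dim, depth_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Linear(flow_dim + depth_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.flow_head = nn.Linear(hidden, flow_dim)
        self.depth_head = nn.Linear(hidden, depth_dim)

    def forward(self, flow_seq, depth_seq, future_steps=3):
        # Encode the observed frames and predict the first future step.
        x = torch.relu(self.encoder(torch.cat([flow_seq, depth_seq], -1)))
        out, state = self.rnn(x)
        f = self.flow_head(out[:, -1])
        d = self.depth_head(out[:, -1])
        flows, depths = [f], [d]
        for _ in range(future_steps - 1):  # feed predictions back in
            step = torch.relu(self.encoder(torch.cat([f, d], -1)))
            out, state = self.rnn(step.unsqueeze(1), state)
            f = self.flow_head(out[:, -1])
            d = self.depth_head(out[:, -1])
            flows.append(f)
            depths.append(d)
        return torch.stack(flows, 1), torch.stack(depths, 1)

model = ToyFlowDepthForecaster(flow_dim=2 * 8 * 8, depth_dim=8 * 8)
flow = torch.randn(4, 5, 2 * 8 * 8)    # 5 observed frames
depth = torch.randn(4, 5, 8 * 8)
pred_flow, pred_depth = model(flow, depth, future_steps=3)
# Joint multitask loss over all forecast horizons (dummy targets here).
loss = (pred_flow - torch.randn_like(pred_flow)).abs().mean() \
     + (pred_depth - torch.randn_like(pred_depth)).abs().mean()
```

Supervising several future steps at once, then rolling out autoregressively at inference, mirrors the training regime the abstract describes.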

Long-Tailed Learning as Multi-Objective Optimization

  • paper_url: http://arxiv.org/abs/2310.20490
  • repo_url: None
  • paper_authors: Weiqi Li, Fan Lyu, Fanhua Shang, Liang Wan, Wei Feng
  • for: improving the performance of long-tailed classification models by resolving the seesaw trade-off inherent in long-tailed recognition.
  • methods: a multi-objective optimization formulation with a Gradient-Balancing Grouping (GBG) strategy that mitigates gradient conflicts between classes.
  • results: comparisons with existing SOTA methods on commonly used long-tailed learning benchmarks demonstrate the superiority of the method.
    Abstract Real-world data is extremely imbalanced and presents a long-tailed distribution, resulting in models that are biased towards classes with sufficient samples and perform poorly on rare classes. Recent methods propose to rebalance classes but they face the seesaw dilemma (increasing performance on tail classes may decrease that of head classes, and vice versa). In this paper, we argue that the seesaw dilemma derives from gradient imbalance across classes: the gradients of inappropriate classes are weighted as important for updating, and are thus prone to overcompensation or undercompensation on tail classes. To achieve ideal compensation, we formulate long-tailed recognition as a multi-objective optimization problem, which fairly respects the contributions of head and tail classes simultaneously. For efficiency, we propose a Gradient-Balancing Grouping (GBG) strategy to gather classes with similar gradient directions, thus approximately making every update follow a Pareto descent direction. Our GBG method drives classes with similar gradient directions to form a more representative gradient and provides ideal compensation to the tail classes. Moreover, we conduct extensive experiments on commonly used benchmarks in long-tailed learning and demonstrate the superiority of our method over existing SOTA methods.
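The abstract only names the Gradient-Balancing Grouping idea; the numpy sketch below shows one plausible reading under stated assumptions: per-class gradients are greedily grouped by cosine similarity, and the normalized group means are averaged into a balanced update direction, a crude stand-in for a Pareto descent direction.

```python
import numpy as np

def gradient_balancing_grouping(class_grads, sim_threshold=0.5):
    """Greedily group per-class gradient vectors by cosine similarity.
    class_grads: (C, D) array, one flattened gradient per class."""
    groups = []  # each group is a list of class indices
    for c, g in enumerate(class_grads):
        placed = False
        for grp in groups:
            mean = class_grads[grp].mean(axis=0)
            cos = g @ mean / (np.linalg.norm(g) * np.linalg.norm(mean) + 1e-12)
            if cos > sim_threshold:
                grp.append(c)
                placed = True
                break
        if not placed:
            groups.append([c])
    # One representative gradient per group; normalize so no single
    # group dominates, then average into a balanced descent direction.
    reps = [class_grads[grp].mean(axis=0) for grp in groups]
    reps = [r / (np.linalg.norm(r) + 1e-12) for r in reps]
    return groups, np.mean(reps, axis=0)

rng = np.random.default_rng(0)
grads = rng.normal(size=(10, 128))   # 10 classes, toy gradients
groups, update = gradient_balancing_grouping(grads)
print(len(groups), update.shape)
```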

LAVSS: Location-Guided Audio-Visual Spatial Audio Separation

  • paper_url: http://arxiv.org/abs/2310.20446
  • repo_url: https://github.com/YYX666660/LAVSS
  • paper_authors: Yuxin Ye, Wenming Yang, Yapeng Tian
  • for: generalizing monaural audio-visual separation (MAVS) to spatial audio separation for VR/AR scenarios, so listeners can distinguish similar sound sources located in different directions.
  • methods: proposes LAVSS, a location-guided audio-visual spatial audio separator that exploits the correlation between spatial audio and visual location. It uses the phase difference carried by binaural audio as a spatial cue, takes positional representations of sounding objects as additional modality guidance, and performs visual-positional collaboration via multi-level cross-modal attention.
  • results: experiments show that LAVSS outperforms existing audio-visual separation (AVS) benchmarks and remains effective across different acoustic conditions.
    Abstract Existing machine learning research has achieved promising results in monaural audio-visual separation (MAVS). However, most MAVS methods purely consider what the sound source is, not where it is located. This can be a problem in VR/AR scenarios, where listeners need to be able to distinguish between similar audio sources located in different directions. To address this limitation, we have generalized MAVS to spatial audio separation and proposed LAVSS: a location-guided audio-visual spatial audio separator. LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference carried by binaural audio as spatial cues, and we utilize positional representations of sounding objects as additional modality guidance. We also leverage multi-level cross-modal attention to perform visual-positional collaboration with audio features. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds to boost spatial audio separation. This exploits the correlation between monaural and binaural channels. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS over existing benchmarks of audio-visual separation. Our project page: https://yyx666660.github.io/LAVSS/.
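The inter-channel phase difference that LAVSS uses as a spatial cue can be computed directly from a stereo STFT; here is a minimal sketch (the frame size and the toy interaural delay are assumptions):

```python
import numpy as np
from scipy.signal import stft

def binaural_phase_difference(left, right, fs=16000, nperseg=512):
    """Inter-channel phase difference (IPD) spectrogram, a cue for
    where a source sits relative to the two ears/microphones."""
    _, _, Zl = stft(left, fs=fs, nperseg=nperseg)
    _, _, Zr = stft(right, fs=fs, nperseg=nperseg)
    # angle(L * conj(R)) is numerically safer than angle(L) - angle(R).
    return np.angle(Zl * np.conj(Zr))

fs = 16000
t = np.arange(fs) / fs
src = np.sin(2 * np.pi * 440 * t)
left = src
right = np.roll(src, 8)            # 0.5 ms interaural delay (toy)
ipd = binaural_phase_difference(left, right, fs)
print(ipd.shape)                   # (freq_bins, time_frames)
```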

SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

  • paper_url: http://arxiv.org/abs/2310.20436
  • repo_url: https://github.com/ZhengdiYu/SignAvatars
  • paper_authors: Zhengdi Yu, Shaoli Huang, Yongkang Cheng, Tolga Birdal
  • for: bridging the communication gap for hearing-impaired individuals by providing a large-scale multi-prompt 3D sign language (SL) motion dataset.
  • methods: compiling and curating the SignAvatars dataset, which includes 70,000 videos from 153 signers, and introducing an automated annotation pipeline to yield 3D holistic annotations.
  • results: facilitating various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs, and providing a unified benchmark for evaluating the potential of SignAvatars.
    Abstract In this paper, we present SignAvatars, the first large-scale multi-prompt 3D sign language (SL) motion dataset designed to bridge the communication gap for hearing-impaired individuals. While there has been an exponentially growing body of research on digital communication, the majority of existing communication technologies primarily cater to spoken or written languages, instead of SL, the essential communication method for hearing-impaired communities. Existing SL datasets, dictionaries, and sign language production (SLP) methods are typically limited to 2D as annotating 3D models and avatars for SL is usually an entirely manual and labor-intensive process conducted by SL experts, often resulting in unnatural avatars. In response to these challenges, we compile and curate the SignAvatars dataset, which comprises 70,000 videos from 153 signers, totaling 8.34 million frames, covering both isolated signs and continuous, co-articulated signs, with multiple prompts including HamNoSys, spoken language, and words. To yield 3D holistic annotations, including meshes and biomechanically-valid poses of body, hands, and face, as well as 2D and 3D keypoints, we introduce an automated annotation pipeline operating on our large corpus of SL videos. SignAvatars facilitates various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars, we further propose a unified benchmark of 3D SL holistic motion production. We believe that this work is a significant step forward towards bringing the digital world to the hearing-impaired communities. Our project page is at https://signavatars.github.io/

Assessing and Enhancing Robustness of Deep Learning Models with Corruption Emulation in Digital Pathology

  • paper_url: http://arxiv.org/abs/2310.20427
  • repo_url: None
  • paper_authors: Peixiang Huang, Songtao Zhang, Yulu Gan, Rui Xu, Rongqi Zhu, Wenkang Qin, Limei Guo, Shan Jiang, Lin Luo
  • for: assessing and enhancing the robustness of deep learning models in digital pathology, so that reliable results can be obtained in clinical diagnosis.
  • methods: analyzes the physical causes of full-stack corruptions throughout the pathological life-cycle and proposes an Omni-Corruption Emulation (OmniCE) method that reproduces 21 types of corruption quantified at 5 severity levels.
  • results: training with OmniCE-corrupted data as augmentation significantly enhances the generalization ability of models in both classification and segmentation tasks.
    Abstract Deep learning in digital pathology brings intelligence and automation as substantial enhancements to pathological analysis, the gold standard of clinical diagnosis. However, multiple steps from tissue preparation to slide imaging introduce various image corruptions, making it difficult for deep neural network (DNN) models to achieve stable diagnostic results for clinical use. In order to assess and further enhance the robustness of the models, we analyze the physical causes of the full-stack corruptions throughout the pathological life-cycle and propose an Omni-Corruption Emulation (OmniCE) method to reproduce 21 types of corruptions quantified with 5-level severity. We then construct three OmniCE-corrupted benchmark datasets at both patch level and slide level and assess the robustness of popular DNNs in classification and segmentation tasks. Further, we explore to use the OmniCE-corrupted datasets as augmentation data for training and experiments to verify that the generalization ability of the models has been significantly enhanced.
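The full OmniCE method reproduces 21 corruption types derived from the pathology pipeline; the sketch below only illustrates the general pattern of a corruption registry parameterized by a 1-5 severity level. The specific corruption functions and parameter ladders are assumptions, not the paper's exact definitions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Severity 1..5 maps to increasingly strong corruption parameters.
def gaussian_noise(img, severity):
    sigma = [0.02, 0.05, 0.08, 0.12, 0.18][severity - 1]
    return np.clip(img + np.random.normal(0, sigma, img.shape), 0, 1)

def defocus_blur(img, severity):
    radius = [0.5, 1.0, 1.5, 2.5, 4.0][severity - 1]
    return gaussian_filter(img, sigma=(radius, radius, 0))

def brightness_shift(img, severity):  # e.g. uneven staining / exposure
    delta = [0.05, 0.1, 0.15, 0.2, 0.3][severity - 1]
    return np.clip(img + delta, 0, 1)

CORRUPTIONS = {
    "gaussian_noise": gaussian_noise,
    "defocus_blur": defocus_blur,
    "brightness_shift": brightness_shift,
}

def corrupt(img, name, severity):
    assert 1 <= severity <= 5
    return CORRUPTIONS[name](img.astype(np.float32), severity)

patch = np.random.rand(256, 256, 3)            # stand-in pathology patch
augmented = corrupt(patch, "defocus_blur", severity=3)
```

Used as a training-time augmentation, such a registry lets one sample corruption type and severity per patch, which is the usage the results describe.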

Thermal-Infrared Remote Target Detection System for Maritime Rescue based on Data Augmentation with 3D Synthetic Data

  • paper_url: http://arxiv.org/abs/2310.20412
  • repo_url: None
  • paper_authors: Sungjin Cheong, Wonho Jung, Yoon Seop Lim, Yong-Hwa Park
  • for: designing a thermal-infrared remote target detection system for maritime rescue based on deep learning and data augmentation.
  • methods: collects a multi-scene TIR dataset with a FLIR camera, augments it with synthetic data generated from the 3D game ARMA3, and proposes a generative-model-based domain adaptation algorithm to bridge the gap between synthetic and real TIR images.
  • results: experiments show that the network trained on augmented data outperforms one trained on real TIR data alone, and the proposed segmentation model surpasses existing segmentation methods.
    Abstract This paper proposes a thermal-infrared (TIR) remote target detection system for maritime rescue using deep learning and data augmentation. We established a self-collected TIR dataset consisting of multiple scenes imitating human rescue situations using a TIR camera (FLIR). Additionally, to address dataset scarcity and improve model robustness, a synthetic dataset from a 3D game (ARMA3) is further collected to augment the data. However, a significant domain gap exists between synthetic TIR and real TIR images. Hence, a proper domain adaptation algorithm is essential to overcome the gap. Therefore, we suggest a domain adaptation algorithm in a target-background separated manner from 3D game-to-real, based on a generative model, to address this issue. Furthermore, a segmentation network with fixed-weight kernels at the head is proposed to improve the signal-to-noise ratio (SNR) and provide weak attention, as remote TIR targets inherently suffer from unclear boundaries. Experimental results reveal that the network trained on augmented data consisting of translated synthetic and real TIR data outperforms that trained on only real TIR data by a large margin. Furthermore, the proposed segmentation model surpasses the performance of state-of-the-art segmentation methods.
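The "fixed-weight kernels at the head" idea can be sketched as a frozen depthwise smoothing convolution after the class projection; the Gaussian choice and sizes below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def gaussian_kernel2d(size=5, sigma=1.5):
    ax = torch.arange(size) - size // 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return (k / k.sum()).float()

class FixedKernelHead(nn.Module):
    """Segmentation head whose final smoothing conv has frozen Gaussian
    weights, suppressing pixel noise around faint remote targets."""
    def __init__(self, in_ch, n_classes):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, n_classes, kernel_size=1)
        k = gaussian_kernel2d()
        self.smooth = nn.Conv2d(n_classes, n_classes, kernel_size=5,
                                padding=2, groups=n_classes, bias=False)
        self.smooth.weight.data.copy_(k.expand(n_classes, 1, 5, 5))
        self.smooth.weight.requires_grad_(False)  # fixed weights

    def forward(self, feats):
        return self.smooth(self.proj(feats))

head = FixedKernelHead(in_ch=64, n_classes=2)
logits = head(torch.randn(1, 64, 128, 128))    # (1, 2, 128, 128)
```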

High-Resolution Reference Image Assisted Volumetric Super-Resolution of Cardiac Diffusion Weighted Imaging

  • paper_url: http://arxiv.org/abs/2310.20389
  • repo_url: None
  • paper_authors: Yinzhe Wu, Jiahao Huang, Fanwen Wang, Pedro Ferreira, Andrew Scott, Sonia Nielles-Vallespin, Guang Yang
  • for: improving the image quality of DT-CMR to better understand the cardiac microstructure and its relation to cardiac function.
  • methods: applies deep-learning-based volumetric super-resolution to DT-CMR images, taking a high-resolution b0 reference image as an additional model input.
  • results: the high-resolution reference input yields higher super-resolved image quality, and the model also super-resolves DWIs of unseen b-values.
    Abstract Diffusion Tensor Cardiac Magnetic Resonance (DT-CMR) is the only in vivo method to non-invasively examine the microstructure of the human heart. Current research in DT-CMR aims to improve the understanding of how the cardiac microstructure relates to the macroscopic function of the healthy heart as well as how microstructural dysfunction contributes to disease. To get the final DT-CMR metrics, we need to acquire diffusion weighted images of at least 6 directions. However, due to DWI's low signal-to-noise ratio, the standard voxel size is quite big on the scale for microstructures. In this study, we explored the potential of deep-learning-based methods in improving the image quality volumetrically (x4 in all dimensions). This study proposed a novel framework to enable volumetric super-resolution, with an additional model input of high-resolution b0 DWI. We demonstrated that the additional input could offer higher super-resolved image quality. Going beyond, the model is also able to super-resolve DWIs of unseen b-values, proving the model framework's generalizability for cardiac DWI superresolution. In conclusion, we would then recommend giving the model a high-resolution reference image as an additional input to the low-resolution image for training and inference to guide all super-resolution frameworks for parametric imaging where a reference image is available.
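The core input arrangement, concatenating the upsampled low-resolution DWI with the high-resolution b0 reference, can be sketched as follows (2D and a tiny conv body for brevity; the paper works volumetrically, and all sizes here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefAssistedSR(nn.Module):
    """Toy reference-assisted super-resolution: the low-res DWI is
    upsampled and concatenated with the high-res b0 reference, so the
    network can borrow structural detail from the reference."""
    def __init__(self, feats=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, feats, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feats, 1, 3, padding=1),
        )

    def forward(self, lr_dwi, hr_b0):
        up = F.interpolate(lr_dwi, size=hr_b0.shape[-2:],
                           mode="bilinear", align_corners=False)
        return up + self.body(torch.cat([up, hr_b0], dim=1))  # residual

model = RefAssistedSR()
lr_dwi = torch.randn(2, 1, 32, 32)   # low-res diffusion-weighted image
hr_b0 = torch.randn(2, 1, 128, 128)  # high-res b0 reference (x4)
sr = model(lr_dwi, hr_b0)            # (2, 1, 128, 128)
```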

A Low-cost Strategic Monitoring Approach for Scalable and Interpretable Error Detection in Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2310.20349
  • repo_url: None
  • paper_authors: Florian Geissler, Syed Qutub, Michael Paulitsch, Karthik Pattabiraman
  • for: proposes a highly compact run-time monitoring approach for detecting silent data corruption in deep computer vision networks.
  • methods: builds on the observation that critical faults manifest as peak or bulk shifts in the activation distributions of affected layers, placing quantile markers at selected layers to detect hardware memory and input faults.
  • results: detects silent data corruption with up to ~96% precision and ~98% recall while adding minimal compute overhead (as little as 0.3% of non-supervised inference time).
    Abstract We present a highly compact run-time monitoring approach for deep computer vision networks that extracts selected knowledge from only a few (down to merely two) hidden layers, yet can efficiently detect silent data corruption originating from both hardware memory and input faults. Building on the insight that critical faults typically manifest as peak or bulk shifts in the activation distribution of the affected network layers, we use strategically placed quantile markers to make accurate estimates about the anomaly of the current inference as a whole. Importantly, the detector component itself is kept algorithmically transparent to render the categorization of regular and abnormal behavior interpretable to a human. Our technique achieves up to ~96% precision and ~98% recall of detection. Compared to state-of-the-art anomaly detection techniques, this approach requires minimal compute overhead (as little as 0.3% with respect to non-supervised inference time) and contributes to the explainability of the model.
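A minimal sketch of quantile-marker monitoring, assuming a particular set of quantiles and a z-score tolerance (both illustrative): quantiles of a layer's activations are calibrated on clean data, and an inference is flagged when any marker shifts far outside its calibrated range.

```python
import numpy as np

QUANTILES = [0.25, 0.5, 0.75, 0.99]  # strategically placed markers

def calibrate(clean_activations):
    """clean_activations: (N, D) activations of one monitored layer."""
    q = np.quantile(clean_activations, QUANTILES, axis=1)  # (Q, N)
    return q.mean(axis=1), q.std(axis=1) + 1e-8

def is_anomalous(activation, mean_q, std_q, z_thresh=6.0):
    """Flag an inference if any quantile marker shifts far from its
    calibrated range (a peak/bulk shift in the distribution)."""
    q = np.quantile(activation, QUANTILES)
    z = np.abs(q - mean_q) / std_q
    return bool((z > z_thresh).any())

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 4096))
mean_q, std_q = calibrate(clean)

ok = rng.normal(size=4096)
faulty = ok.copy()
faulty[:100] = 50.0   # bulk shift, e.g. from a stuck memory region
print(is_anomalous(ok, mean_q, std_q))      # False (typically)
print(is_anomalous(faulty, mean_q, std_q))  # True (0.99 marker shifts)
```

Because only a few quantiles per monitored layer are compared, the check stays transparent to a human and adds almost no compute, which matches the overhead claim in the abstract.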

Class Incremental Learning with Pre-trained Vision-Language Models

  • paper_url: http://arxiv.org/abs/2310.20348
  • repo_url: None
  • paper_authors: Xialei Liu, Xusheng Cao, Haori Lu, Jia-wen Xiao, Andrew D. Bagdanov, Ming-Ming Cheng
  • for: adapting and exploiting large-scale pre-trained models in continual learning scenarios.
  • methods: builds on pre-trained vision-language models (e.g. CLIP) to enable further adaptation beyond zero-shot learning of new tasks, adding layers after the Image Encoder or before the Text Encoder and investigating three strategies: a Linear Adapter, a Self-attention Adapter, and Prompt Tuning. A parameter-retention method based on parameter importance maintains stability and plasticity during incremental learning.
  • results: experiments show that the simplest solution -- a single Linear Adapter layer with parameter retention -- produces the best results, consistently outperforming the current state of the art by a significant margin on several conventional benchmarks.
    Abstract With the advent of large-scale pre-trained models, interest in adapting and exploiting them for continual learning scenarios has grown. In this paper, we propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation instead of only using zero-shot learning of new tasks. We augment a pre-trained CLIP model with additional layers after the Image Encoder or before the Text Encoder. We investigate three different strategies: a Linear Adapter, a Self-attention Adapter, each operating on the image embedding, and Prompt Tuning which instead modifies prompts input to the CLIP text encoder. We also propose a method for parameter retention in the adapter layers that uses a measure of parameter importance to better maintain stability and plasticity during incremental learning. Our experiments demonstrate that the simplest solution -- a single Linear Adapter layer with parameter retention -- produces the best results. Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
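A toy sketch of a residual linear adapter on frozen CLIP-style embeddings with importance-weighted parameter retention. The importance measure (normalized squared gradients) and the blending rule are assumptions inspired by the description, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Single linear adapter on top of frozen CLIP-style image embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, emb):
        return emb + self.fc(emb)  # residual adaptation

def retain_parameters(adapter, old_state, importance, keep=0.5):
    """After a task, pull important parameters back toward their old
    values: the more important a weight was, the more it is retained."""
    with torch.no_grad():
        for name, p in adapter.named_parameters():
            w = keep * importance[name]            # in [0, keep]
            p.mul_(1 - w).add_(w * old_state[name])

adapter = LinearAdapter()
old_state = {k: v.clone() for k, v in adapter.named_parameters()}

# Toy importance: normalized squared gradients per weight after one step.
emb = torch.randn(16, 512)                 # frozen-backbone embeddings
loss = adapter(emb).pow(2).mean()          # placeholder task loss
loss.backward()
importance = {k: p.grad**2 / (p.grad**2).max()
              for k, p in adapter.named_parameters()}
retain_parameters(adapter, old_state, importance)
```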

Recaptured Raw Screen Image and Video Demoiréing via Channel and Spatial Modulations

  • paper_url: http://arxiv.org/abs/2310.20332
  • repo_url: https://github.com/tju-chengyijia/vd_raw
  • paper_authors: Huanjing Yue, Yijia Cheng, Xin Liu, Jingyu Yang
  • for: improving the quality of screen images and videos captured by smartphone cameras by removing the moiré patterns introduced by frequency aliasing between the camera filter array and display grids.
  • methods: proposes an image and video demoiréing network tailored for raw inputs, with a color-separated feature branch fused into the traditional feature-mixed branch via channel and spatial modulations: channel modulation uses modulated color-separated features to enhance the color-mixed features, and spatial modulation uses features with a large receptive field to modulate features with a small receptive field.
  • results: experiments show state-of-the-art performance for both image and video demoiréing. The work also introduces the first well-aligned raw video demoiréing (RawVDemoiré) dataset and an efficient temporal alignment method based on inserting alternating patterns. Code and dataset are available at https://github.com/tju-chengyijia/VD_raw.
    Abstract Capturing screen contents by smartphone cameras has become a common way for information sharing. However, these images and videos are often degraded by moiré patterns, which are caused by frequency aliasing between the camera filter array and digital display grids. We observe that the moiré patterns in the raw domain are simpler than those in the sRGB domain, and the moiré patterns in the raw color channels have different properties. Therefore, we propose an image and video demoiréing network tailored for raw inputs. We introduce a color-separated feature branch, and it is fused with the traditional feature-mixed branch via channel and spatial modulations. Specifically, the channel modulation utilizes modulated color-separated features to enhance the color-mixed features. The spatial modulation utilizes the feature with a large receptive field to modulate the feature with a small receptive field. In addition, we build the first well-aligned raw video demoiréing (RawVDemoiré) dataset and propose an efficient temporal alignment method by inserting alternating patterns. Experiments demonstrate that our method achieves state-of-the-art performance for both image and video demoiréing. We have released the code and dataset in https://github.com/tju-chengyijia/VD_raw.
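The channel modulation can be sketched as squeeze-and-excitation-style gating in which the gates come from the color-separated branch and act on the color-mixed branch; the exact gate design below is an assumption.

```python
import torch
import torch.nn as nn

class ChannelModulation(nn.Module):
    """Color-separated branch features produce per-channel gates that
    modulate (enhance) the color-mixed branch features."""
    def __init__(self, ch):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid(),
        )

    def forward(self, mixed, separated):
        # Gates computed from the color-separated features.
        g = self.gate(self.pool(separated))     # (B, C, 1, 1)
        return mixed * g + mixed                # modulated + identity

mod = ChannelModulation(ch=64)
mixed = torch.randn(2, 64, 32, 32)       # feature-mixed branch
separated = torch.randn(2, 64, 32, 32)   # color-separated branch
out = mod(mixed, separated)
```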

GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data

  • paper_url: http://arxiv.org/abs/2310.20319
  • repo_url: https://github.com/dschinagl/gace
  • paper_authors: David Schinagl, Georg Krispel, Christian Fruhwirth-Reisinger, Horst Possegger, Horst Bischof
  • for: improving the confidence estimation of black-box 3D object detectors on LiDAR data.
  • methods: aggregates geometric cues of detections and their spatial relationships to assess their plausibility and thereby improve confidence estimation.
  • results: consistent performance gains across a variety of state-of-the-art detectors, especially for vulnerable road user classes such as pedestrians and cyclists.
    Abstract Widely-used LiDAR-based 3D object detectors often neglect fundamental geometric information readily available from the object proposals in their confidence estimation. This is mostly due to architectural design choices, which were often adopted from the 2D image domain, where geometric context is rarely available. In 3D, however, considering the object properties and its surroundings in a holistic way is important to distinguish between true and false positive detections, e.g. occluded pedestrians in a group. To address this, we present GACE, an intuitive and highly efficient method to improve the confidence estimation of a given black-box 3D object detector. We aggregate geometric cues of detections and their spatial relationships, which enables us to properly assess their plausibility and consequently, improve the confidence estimation. This leads to consistent performance gains over a variety of state-of-the-art detectors. Across all evaluated detectors, GACE proves to be especially beneficial for the vulnerable road user classes, i.e. pedestrians and cyclists.

HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

  • paper_url: http://arxiv.org/abs/2310.20316
  • repo_url: https://github.com/aimagelab/hwd
  • paper_authors: Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Rita Cucchiara
  • for: devising an evaluation score suited to styled handwritten text generation (Styled HTG) models, to foster progress in this research area.
  • methods: maps handwritten text images of variable length into the feature space of a network specifically trained to extract handwriting style features, and uses a perceptual distance in that space to compare the subtle geometric features of handwriting.
  • results: extensive experiments on word-level and line-level datasets of handwritten text images show that the proposed Handwriting Distance (HWD) is a suitable and reliable score for Styled HTG.
    Abstract Styled Handwritten Text Generation (Styled HTG) is an important task in document analysis, aiming to generate text images with the handwriting of given reference images. In recent years, there has been significant progress in the development of deep learning models for tackling this task. Being able to measure the performance of HTG models via a meaningful and representative criterion is key for fostering the development of this research topic. However, despite the current adoption of scores for natural image generation evaluation, assessing the quality of generated handwriting remains challenging. In light of this, we devise the Handwriting Distance (HWD), tailored for HTG evaluation. In particular, it works in the feature space of a network specifically trained to extract handwriting style features from the variable-length input images and exploits a perceptual distance to compare the subtle geometric features of handwriting. Through extensive experimental evaluation on different word-level and line-level datasets of handwritten text images, we demonstrate the suitability of the proposed HWD as a score for Styled HTG. The pretrained model used as backbone will be released to ease the adoption of the score, aiming to provide a valuable tool for evaluating HTG models and thus contributing to advancing this important research area.

Bilateral Network with Residual U-blocks and Dual-Guided Attention for Real-time Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.20305
  • repo_url: https://github.com/likelidoa/bidganet
  • paper_authors: Liang Liao, Liang Wan, Mingsheng Liu, Shusheng Li
  • for: achieving a good trade-off between real-time performance and accuracy in semantic segmentation for applications such as autonomous driving.
  • methods: proposes a two-branch architecture in which one branch handles spatial information and the other semantic information, keeping the model lightweight. A new attention-guided fusion mechanism, the Dual-Guided Attention (DGA) module, replaces some multi-scale transformations with a few near-linear-complexity attention layers, and Residual U-blocks (RSU) build one of the two branches to obtain better multi-scale features.
  • results: extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of the method, with better time efficiency than the commonly used multi-layer fusion.
    Abstract When application scenarios such as autonomous driving need semantic segmentation technology, the primary concern is real-time performance rather than extremely high segmentation accuracy. To achieve a good trade-off between speed and accuracy, two-branch architectures have been proposed in recent years. They treat spatial information and semantic information separately, which allows the model to be composed of two lightweight networks. However, fusing features at two different scales becomes a performance bottleneck for many current two-branch models. In this research, we design a new fusion mechanism for two-branch architectures which is guided by attention computation. To be precise, we use the proposed Dual-Guided Attention (DGA) module to replace some multi-scale transformations with attention computation, meaning we only use several attention layers of near-linear complexity to achieve performance comparable to the frequently used multi-layer fusion. To ensure that our module is effective, we use Residual U-blocks (RSU) to build one of the two branches in our networks, which aims to obtain better multi-scale features. Extensive experiments on the Cityscapes and CamVid datasets show the effectiveness of our method.

Annotator: A Generic Active Learning Baseline for LiDAR Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.20293
  • repo_url: None
  • paper_authors: Binhui Xie, Shuang Li, Qingju Guo, Chi Harold Liu, Xinjing Cheng
  • for: This paper is written for the task of LiDAR semantic segmentation, specifically addressing the challenges of annotation laboriousness and cost-prohibitiveness in large-scale point cloud data.
  • methods: The paper proposes a novel active learning baseline called Annotator, which utilizes a voxel-centric online selection strategy to efficiently probe and annotate salient and exemplar voxel grids within each LiDAR scan. The method also employs a voxel confusion degree (VCD) to leverage local topology relations and structures of point clouds.
  • results: Annotator achieves exceptional performance across various LiDAR semantic segmentation benchmarks, including active learning, active source-free domain adaptation, and active domain adaptation. Specifically, it achieves 87.8% fully-supervised performance under AL, 88.5% under ASFDA, and 94.4% under ADA, with significantly fewer annotations required (e.g., just labeling five voxels per scan in the SynLiDAR-to-SemanticKITTI task).
    Abstract Active learning, a label-efficient paradigm, empowers models to interactively query an oracle for labeling new data. In the realm of LiDAR semantic segmentation, the challenges stem from the sheer volume of point clouds, rendering annotation labor-intensive and cost-prohibitive. This paper presents Annotator, a general and efficient active learning baseline, in which a voxel-centric online selection strategy is tailored to efficiently probe and annotate the salient and exemplar voxel grids within each LiDAR scan, even under distribution shift. Concretely, we first execute an in-depth analysis of several common selection strategies such as Random, Entropy, Margin, and then develop voxel confusion degree (VCD) to exploit the local topology relations and structures of point clouds. Annotator excels in diverse settings, with a particular focus on active learning (AL), active source-free domain adaptation (ASFDA), and active domain adaptation (ADA). It consistently delivers exceptional performance across LiDAR semantic segmentation benchmarks, spanning both simulation-to-real and real-to-real scenarios. Surprisingly, Annotator exhibits remarkable efficiency, requiring significantly fewer annotations, e.g., just labeling five voxels per scan in the SynLiDAR-to-SemanticKITTI task. This results in impressive performance, achieving 87.8% fully-supervised performance under AL, 88.5% under ASFDA, and 94.4% under ADA. We envision that Annotator will offer a simple, general, and efficient solution for label-efficient 3D applications. Project page: https://binhuixie.github.io/annotator-web
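One plausible reading of a voxel-confusion-degree-style score, under stated assumptions: the entropy of predicted labels within each voxel's spatial neighborhood, with the highest-entropy voxels selected for annotation. The neighborhood radius and selection budget here are illustrative.

```python
import numpy as np

def voxel_confusion_degree(pred_labels, coords, radius=1.0, n_classes=5):
    """Entropy of predicted labels among each voxel's neighbors: a high
    value indicates confused local topology, hence worth annotating.
    pred_labels: (N,) int predictions; coords: (N, 3) voxel centers."""
    scores = np.zeros(len(coords))
    for i, c in enumerate(coords):
        near = np.linalg.norm(coords - c, axis=1) < radius
        hist = np.bincount(pred_labels[near], minlength=n_classes)
        p = hist / hist.sum()
        p = p[p > 0]
        scores[i] = -(p * np.log(p)).sum()
    return scores

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(1000, 3))
preds = rng.integers(0, 5, size=1000)
vcd = voxel_confusion_degree(preds, coords)
to_annotate = np.argsort(vcd)[-5:]   # e.g. five voxels per scan
```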

IARS SegNet: Interpretable Attention Residual Skip connection SegNet for melanoma segmentation

  • paper_url: http://arxiv.org/abs/2310.20292
  • repo_url: None
  • paper_authors: Shankara Narayanan V, Sikha OK, Raul Benitez
  • for: improving the accuracy of skin lesion segmentation to support the computer-aided diagnosis of melanoma.
  • methods: proposes IARS SegNet, an advanced segmentation framework built on the SegNet baseline that incorporates skip connections, residual convolutions, and a global attention mechanism. These components accentuate clinically relevant regions, particularly the contours of skin lesions.
  • results: the framework improves segmentation accuracy while offering better interpretability, helping clinicians understand the model's decision process.
    Abstract Skin lesion segmentation plays a crucial role in the computer-aided diagnosis of melanoma. Deep Learning models have shown promise in accurately segmenting skin lesions, but their widespread adoption in real-life clinical settings is hindered by their inherent black-box nature. In domains as critical as healthcare, interpretability is not merely a feature but a fundamental requirement for model adoption. This paper proposes IARS SegNet an advanced segmentation framework built upon the SegNet baseline model. Our approach incorporates three critical components: Skip connections, residual convolutions, and a global attention mechanism onto the baseline Segnet architecture. These elements play a pivotal role in accentuating the significance of clinically relevant regions, particularly the contours of skin lesions. The inclusion of skip connections enhances the model's capacity to learn intricate contour details, while the use of residual convolutions allows for the construction of a deeper model while preserving essential image features. The global attention mechanism further contributes by extracting refined feature maps from each convolutional and deconvolutional block, thereby elevating the model's interpretability. This enhancement highlights critical regions, fosters better understanding, and leads to more accurate skin lesion segmentation for melanoma diagnosis.

Machine learning refinement of in situ images acquired by low electron dose LC-TEM

  • paper_url: http://arxiv.org/abs/2310.20279
  • repo_url: None
  • paper_authors: Hiroyasu Katsuno, Yuki Kimura, Tomoya Yamazaki, Ichigaku Takigawa
  • for: a machine learning (ML) technique for refining the quality of images acquired during in situ observation by low-electron-dose liquid-cell transmission electron microscopy (LC-TEM).
  • methods: uses a U-Net architecture with a ResNet encoder, trained on a custom dataset of image pairs acquired with and without a solution present (noisy images and corresponding ground-truth images).
  • results: the trained model converts a noisy image into a clear one in on the order of 10 ms, and can be applied to in situ observation in real time through the Gatan DigitalMicrograph (DM) software.
    Abstract We study a machine learning (ML) technique for refining images acquired during in situ observation using liquid-cell transmission electron microscopy (LC-TEM). Our model is constructed using a U-Net architecture and a ResNet encoder. For training our ML model, we prepared an original image dataset that contained pairs of images of samples acquired with and without a solution present. The former images were used as noisy images and the latter images were used as corresponding ground truth images. The number of pairs of image sets was 1,204 and the image sets included images acquired at several different magnifications and electron doses. The trained model converted a noisy image into a clear image. The time necessary for the conversion was on the order of 10 ms, and we applied the model to in situ observations using the software Gatan DigitalMicrograph (DM). Even if a nanoparticle was not visible in a view window in the DM software because of the low electron dose, it was visible in a successive refined image generated by our ML model.

From Denoising Training to Test-Time Adaptation: Enhancing Domain Generalization for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.20271
  • repo_url: https://github.com/wenruxue/detta
  • paper_authors: Ruxue Wen, Hangjie Yuan, Dong Ni, Wenbo Xiao, Yaoyao Wu
  • for: addressing domain generalization in medical image segmentation, particularly in the common single-source-domain setting imposed by healthcare privacy concerns.
  • methods: proposes the Denoising Y-Net (DeY-Net), which adds an auxiliary denoising decoder to the basic U-Net architecture; denoising training augments the domain-invariant representation and makes it possible to exploit unlabeled data. Building on this, Denoising Test-Time Adaptation (DeTTA) further adapts the model to the target domain in a sample-wise manner and to the noise-corrupted input.
  • results: experiments on widely adopted liver segmentation benchmarks show significant domain generalization improvements over the U-Net baseline and state-of-the-art results compared with other methods. Code is available at https://github.com/WenRuxue/DeTTA.
    Abstract In medical image segmentation, domain generalization poses a significant challenge due to domain shifts caused by variations in data acquisition devices and other factors. These shifts are particularly pronounced in the most common scenario, which involves only single-source domain data due to privacy concerns. To address this, we draw inspiration from the self-supervised learning paradigm that effectively discourages overfitting to the source domain. We propose the Denoising Y-Net (DeY-Net), a novel approach incorporating an auxiliary denoising decoder into the basic U-Net architecture. The auxiliary decoder aims to perform denoising training, augmenting the domain-invariant representation that facilitates domain generalization. Furthermore, this paradigm provides the potential to utilize unlabeled data. Building upon denoising training, we propose Denoising Test Time Adaptation (DeTTA) that further: (i) adapts the model to the target domain in a sample-wise manner, and (ii) adapts to the noise-corrupted input. Extensive experiments conducted on widely-adopted liver segmentation benchmarks demonstrate significant domain generalization improvements over our baseline and state-of-the-art results compared to other methods. Code is available at https://github.com/WenRuxue/DeTTA.

Low-Dose CT Image Enhancement Using Deep Learning

  • paper_url: http://arxiv.org/abs/2310.20265
  • repo_url: None
  • paper_authors: A. Demir, M. M. A. Shames, O. N. Gerek, S. Ergin, M. Fidan, M. Koc, M. B. Gulmezoglu, A. Barkana, C. Calisir
  • for: reducing the ionizing radiation dose, especially in computed tomography (CT) imaging systems, to avoid risk to body tissues.
  • methods: uses a U-NET deep learning model to enhance quarter-dose CT images and correct low-dose artifacts.
  • results: compared with full-dose CT images, the U-NET-enhanced quarter-dose CT images not only show a large visual improvement but are also diagnostically preferable.
    Abstract The application of ionizing radiation for diagnostic imaging is common around the globe. However, the imaging process itself remains a relatively hazardous operation. Therefore, it is preferable to use as low a dose of ionizing radiation as possible, particularly in computed tomography (CT) imaging systems, where multiple x-ray operations are performed for the reconstruction of slices of body tissues. A popular method for radiation dose reduction in CT imaging is known as the quarter-dose technique, which reduces the x-ray dose but can cause a loss of image sharpness. Since CT image reconstruction from directional x-rays is a nonlinear process, it is analytically difficult to correct the effect of dose reduction on image quality. Recent and popular deep-learning approaches provide an intriguing possibility of image enhancement for low-dose artifacts. Some recent works propose combinations of multiple deep-learning and classical methods for this purpose, which over-complicate the process. However, it is observed here that the straightforward utilization of the well-known U-NET provides very successful results for the correction of low-dose artifacts. Blind tests with actual radiologists reveal that the U-NET enhanced quarter-dose CT images not only provide an immense visual improvement over the low-dose versions, but also become diagnostically preferable images, even when compared to their full-dose CT versions.

Pose-to-Motion: Cross-Domain Motion Retargeting with Pose Prior

  • paper_url: http://arxiv.org/abs/2310.20249
  • repo_url: None
  • paper_authors: Qingqing Zhao, Peizhuo Li, Wang Yifan, Olga Sorkine-Hornung, Gordon Wetzstein
  • for: The goal of this research is to create realistic motions for various characters in computer graphics.
  • methods: We use pose data as an alternative data source and introduce a neural motion synthesis approach through retargeting.
  • results: Our method can effectively combine the motion features of the source character with the pose features of the target character, even with small or noisy pose data sets. A user study showed that most participants found our retargeted motion to be more enjoyable to watch, more lifelike, and with fewer artifacts.
    Abstract Creating believable motions for various characters has long been a goal in computer graphics. Current learning-based motion synthesis methods depend on extensive motion datasets, which are often challenging, if not impossible, to obtain. On the other hand, pose data is more accessible, since static posed characters are easier to create and can even be extracted from images using recent advancements in computer vision. In this paper, we utilize this alternative data source and introduce a neural motion synthesis approach through retargeting. Our method generates plausible motions for characters that have only pose data by transferring motion from an existing motion capture dataset of another character, which can have drastically different skeletons. Our experiments show that our method effectively combines the motion features of the source character with the pose features of the target character, and performs robustly with small or noisy pose data sets, ranging from a few artist-created poses to noisy poses estimated directly from images. Additionally, a conducted user study indicated that a majority of participants found our retargeted motion to be more enjoyable to watch, more lifelike in appearance, and exhibiting fewer artifacts. Project page: https://cyanzhao42.github.io/pose2motion

Contrast-agent-induced deterministic component of CT-density in the abdominal aorta during routine angiography: proof of concept study

  • paper_url: http://arxiv.org/abs/2310.20243
  • repo_url: None
  • paper_authors: Maria R. Kodenko, Yuriy A. Vasilev, Nicholas S. Kulberg, Andrey V. Samorodov, Anton V. Vladzimirskyy, Olga V. Omelyanskaya, Roman V. Reshetnikov
  • for: developing a hemodynamic model of contrast-agent dynamics from routine CTA data, so the procedure can be investigated and optimized without additional perfusion CT studies.
  • methods: based on the Beer-Lambert law and the absence of chemical interaction between blood and contrast agent, proposes a double-sigmoid model with six coefficients relevant to the properties of hemodynamics.
  • results: analysis of 594 CTA images (4 studies, median 144 slices, IQR [134; 158.5], 1:1 normal:pathology balance) shows the model correctly simulates normal blood flow and the disturbances caused by local abnormalities (aneurysm, thrombus, and arterial branching).
    Abstract Background and objective: CTA is a gold standard of preoperative diagnosis of abdominal aorta and typically used for geometric-only characteristic extraction. We assume that a model describing the dynamic behavior of the contrast agent in the vessel can be developed from the data of routine CTA studies, allowing the procedure to be investigated and optimized without the need for additional perfusion CT studies. Obtained spatial distribution of CA can be valuable for both increasing the diagnostic value of a particular study and improving the CT data processing tools. Methods: In accordance with the Beer-Lambert law and the absence of chemical interaction between blood and CA, we postulated the existence of a deterministic CA-induced component in the CT signal density. The proposed model, having a double-sigmoid structure, contains six coefficients relevant to the properties of hemodynamics. To validate the model, expert segmentation was performed using the 3D Slicer application for the CTA data obtained from publicly available source. The model was fitted to the data using the non-linear least square method with Levenberg-Marquardt optimization. Results: We analyzed 594 CTA images (4 studies with median size of 144 slices, IQR [134; 158.5]; 1:1 normal:pathology balance). Goodness-of-fit was proved by Wilcox test (p-value > 0.05 for all cases). The proposed model correctly simulated normal blood flow and hemodynamics disturbances caused by local abnormalities (aneurysm, thrombus and arterial branching). Conclusions: Proposed approach can be useful for personalized CA modeling of vessels, improvement of CTA image processing and preparation of synthetic CT training data for artificial intelligence.
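The abstract specifies a six-coefficient double-sigmoid fitted with Levenberg-Marquardt but not the exact parameterization, so the sketch below assumes a difference of two logistic curves (wash-in minus wash-out) and fits it with scipy on synthetic data.

```python
import numpy as np
from scipy.optimize import curve_fit

def double_sigmoid(t, a1, k1, t1, a2, k2, t2):
    """Assumed six-coefficient wash-in / wash-out model of the
    contrast-agent-induced CT density component."""
    rise = a1 / (1 + np.exp(-k1 * (t - t1)))
    fall = a2 / (1 + np.exp(-k2 * (t - t2)))
    return rise - fall

# Synthetic "CT density over slice position / time" with noise.
t = np.linspace(0, 40, 200)
true_params = (300, 0.8, 10.0, 250, 0.5, 25.0)
hu = double_sigmoid(t, *true_params) + np.random.normal(0, 5, t.size)

p0 = (200, 1.0, 8.0, 200, 1.0, 20.0)   # initial guess
popt, _ = curve_fit(double_sigmoid, t, hu, p0=p0, method="lm")
print(np.round(popt, 2))
```

method="lm" selects scipy's Levenberg-Marquardt backend, matching the optimization named in the abstract.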

HEDNet: A Hierarchical Encoder-Decoder Network for 3D Object Detection in Point Clouds

  • paper_url: http://arxiv.org/abs/2310.20234
  • repo_url: https://github.com/zhanggang001/hednet
  • paper_authors: Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Xiaolin Hu
  • for: 3D object detection in point clouds for autonomous driving systems
  • methods: hierarchical encoder-decoder network (HEDNet) with encoder-decoder blocks to capture long-range dependencies among features in the spatial space
  • results: superior detection accuracy on the Waymo Open and nuScenes datasets compared to previous state-of-the-art methods, with competitive efficiency
    Abstract 3D object detection in point clouds is important for autonomous driving systems. A primary challenge in 3D object detection stems from the sparse distribution of points within the 3D scene. Existing high-performance methods typically employ 3D sparse convolutional neural networks with small kernels to extract features. To reduce computational costs, these methods resort to submanifold sparse convolutions, which prevent the information exchange among spatially disconnected features. Some recent approaches have attempted to address this problem by introducing large-kernel convolutions or self-attention mechanisms, but they either achieve limited accuracy improvements or incur excessive computational costs. We propose HEDNet, a hierarchical encoder-decoder network for 3D object detection, which leverages encoder-decoder blocks to capture long-range dependencies among features in the spatial space, particularly for large and distant objects. We conducted extensive experiments on the Waymo Open and nuScenes datasets. HEDNet achieved superior detection accuracy to previous state-of-the-art methods on both datasets, with competitive efficiency. The code is available at https://github.com/zhanggang001/HEDNet.

UWFormer: Underwater Image Enhancement via a Semi-Supervised Multi-Scale Transformer

  • paper_url: http://arxiv.org/abs/2310.20210
  • repo_url: None
  • paper_authors: Xuhang Chen, Zinuo Li, Shenghong Luo, Weiwen Chen, Shuqiang Wang, Chi-Man Pun
  • for: enhancing underwater images, which often suffer from poor quality, imbalanced coloration, and low contrast, with multi-scale enhancement and a global perception field suited to real-world underwater scenarios.
  • methods: proposes UWFormer, a multi-scale Transformer-based network trained via semi-supervised learning, featuring a Nonlinear Frequency-aware Attention mechanism, a Multi-Scale Fusion Feed-forward Network for low-frequency enhancement, and a Subaqueous Perceptual Loss for generating reliable pseudo labels.
  • results: experiments on full-reference and non-reference underwater benchmarks show that the method outperforms state-of-the-art approaches in both quantitative metrics and visual quality.
    Abstract Underwater images often exhibit poor quality, imbalanced coloration, and low contrast due to the complex and intricate interaction of light, water, and objects. Despite the significant contributions of previous underwater enhancement techniques, there exist several problems that demand further improvement: (i) Current deep learning methodologies depend on Convolutional Neural Networks (CNNs) that lack multi-scale enhancement and also have limited global perception fields. (ii) The scarcity of paired real-world underwater datasets poses a considerable challenge, and the utilization of synthetic image pairs risks overfitting. To address the aforementioned issues, this paper presents a Multi-scale Transformer-based Network called UWFormer for enhancing images at multiple frequencies via semi-supervised learning, in which we propose a Nonlinear Frequency-aware Attention mechanism and a Multi-Scale Fusion Feed-forward Network for low-frequency enhancement. Additionally, we introduce a specialized underwater semi-supervised training strategy, proposing a Subaqueous Perceptual Loss function to generate reliable pseudo labels. Experiments using full-reference and non-reference underwater benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of both quantitative metrics and visual quality.
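A frequency-aware branch needs a low/high-frequency decomposition of its input; below is a minimal FFT-based split (the circular mask and cutoff are assumptions for illustration, not UWFormer's actual mechanism).

```python
import numpy as np

def frequency_split(img, cutoff=0.1):
    """Split a grayscale image into low- and high-frequency parts with a
    circular mask in the FFT domain. cutoff is a fraction of Nyquist."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    mask = r <= cutoff * min(h, w) / 2
    low = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
    high = img - low        # residual carries edges and fine texture
    return low, high

img = np.random.rand(128, 128)
low, high = frequency_split(img)
assert np.allclose(low + high, img)   # lossless decomposition
```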

ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2310.20208
  • repo_url: https://github.com/lartpang/ZoomNeXt
  • paper_authors: Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, Huchuan Lu
  • for: improving camouflaged object detection in real-world scenes, where objects are highly similar to their background, diverse in scale, fuzzy in appearance, and often severely occluded.
  • methods: proposes an effective unified collaborative pyramid network that mimics human zooming in and out when observing vague images and videos. It learns discriminative mixed-scale semantics via multi-head scale integration and rich granularity perception units, which fully explore imperceptible clues between candidate objects and their surroundings, and adds an uncertainty awareness loss to encourage confident predictions in candidate regions.
  • results: the unified architecture for static and dynamic COD consistently outperforms existing state-of-the-art methods on image and video COD benchmarks.
    Abstract Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network which mimics human behavior when observing vague images and videos, i.e., zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns, while the latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and adaptively ignore static representations. Together they provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, an uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods on image and video COD benchmarks. The code will be available at https://github.com/lartpang/ZoomNeXt.
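
One plausible reading of the uncertainty awareness loss is a regularizer that penalizes sigmoid predictions lingering near 0.5, pushing the network toward confident foreground/background decisions in candidate regions. This is our formulation for illustration, not necessarily the paper's exact loss:

```python
import torch

def uncertainty_awareness_loss(logits: torch.Tensor) -> torch.Tensor:
    """0 when predictions are fully confident (p in {0, 1}), maximal at p = 0.5."""
    p = torch.sigmoid(logits)
    return (1.0 - (2.0 * p - 1.0).abs()).mean()

loss = uncertainty_awareness_loss(torch.randn(2, 1, 128, 128))  # per-pixel logits
```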

Visible to Thermal Image Translation for Improving Visual Tasks in Low Light Conditions

  • paper_url: http://arxiv.org/abs/2310.20190
  • repo_url: None
  • paper_authors: Md Azim Khan
  • for: overcome the challenge of low-light conditions in visual tasks such as pedestrian detection and image-to-image translation
  • methods: proposed an end-to-end framework that translates RGB images into thermal ones using a generative network and a detector network
  • results: feasible to translate RGB training data to thermal data using GAN, which can produce thermal data more quickly and affordably for security and surveillance applications
    Abstract Several visual tasks, such as pedestrian detection and image-to-image translation, are challenging to accomplish in low light using RGB images. Heat variation of objects in thermal images can be used to overcome this. In this work, an end-to-end framework, which consists of a generative network and a detector network, is proposed to translate RGB images into thermal ones and compare the generated thermal images with real data. We collected images from two different locations using the Parrot Anafi Thermal drone. After that, we created a two-stream network, preprocessed and augmented the image data, and trained the generator and discriminator models from scratch. The findings demonstrate that it is feasible to translate RGB training data to thermal data using a GAN. As a result, thermal data can now be produced more quickly and affordably, which is useful for security and surveillance applications.
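
A minimal pix2pix-style training step for the RGB-to-thermal generator and its discriminator is sketched below. The conditional-GAN losses and the L1 weight are common defaults, not the settings reported in the paper:

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

def train_step(G, D, opt_g, opt_d, rgb, thermal, lambda_l1=100.0):
    fake = G(rgb)                                     # translated thermal image
    # Discriminator: real (rgb, thermal) pairs vs. generated pairs.
    opt_d.zero_grad()
    d_real = D(torch.cat([rgb, thermal], dim=1))
    d_fake = D(torch.cat([rgb, fake.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()
    # Generator: fool D while staying close to the real thermal image.
    opt_g.zero_grad()
    d_fake = D(torch.cat([rgb, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lambda_l1 * l1(fake, thermal)
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```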

LFAA: Crafting Transferable Targeted Adversarial Examples with Low-Frequency Perturbations

  • paper_url: http://arxiv.org/abs/2310.20175
  • repo_url: None
  • paper_authors: Kunyu Wang, Juluan Shi, Wenxuan Wang
  • for: Crafting transferable targeted adversarial examples against deep neural networks.
  • methods: Exploits the vulnerability of deep neural networks to perturbations of the high-frequency components of images; proposes Low-Frequency Adversarial Attack (LFAA), which trains a conditional generator to produce targeted perturbations that are added to the low-frequency component of the image.
  • results: On the ImageNet dataset, the proposed method significantly outperforms state-of-the-art methods, improving targeted attack success rates by margins of 3.2% to 15.5%.
    Abstract Deep neural networks are susceptible to adversarial attacks, which pose a significant threat to their security and reliability in real-world applications. The most notable adversarial attacks are transfer-based attacks, where an adversary crafts an adversarial example to fool one model, which can also fool other models. While previous research has made progress in improving the transferability of untargeted adversarial examples, the generation of targeted adversarial examples that can transfer between models remains a challenging task. In this work, we present a novel approach to generate transferable targeted adversarial examples by exploiting the vulnerability of deep neural networks to perturbations on high-frequency components of images. We observe that replacing the high-frequency component of an image with that of another image can mislead deep models, motivating us to craft perturbations containing high-frequency information to achieve targeted attacks. To this end, we propose a method called Low-Frequency Adversarial Attack (LFAA), which trains a conditional generator to generate targeted adversarial perturbations that are then added to the low-frequency component of the image. Extensive experiments on ImageNet demonstrate that our proposed approach significantly outperforms state-of-the-art methods, improving targeted attack success rates by margins ranging from 3.2% to 15.5%.
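
The motivating observation, that grafting another image's high frequencies onto a clean image can mislead a model, can be reproduced with a simple FFT-domain swap. The cutoff radius is an assumed hyperparameter, and the actual attack trains a conditional generator rather than using a fixed donor image:

```python
import torch

def swap_high_frequency(src: torch.Tensor, donor: torch.Tensor, radius: int = 24):
    """Keep src's low frequencies, graft in donor's high frequencies."""
    spec_s = torch.fft.fftshift(torch.fft.fft2(src), dim=(-2, -1))
    spec_d = torch.fft.fftshift(torch.fft.fft2(donor), dim=(-2, -1))
    h, w = src.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    low = (((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2).float()
    mixed = spec_s * low + spec_d * (1.0 - low)    # low from src, high from donor
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real

adv = swap_high_frequency(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
```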

Synthesizing Diabetic Foot Ulcer Images with Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.20140
  • repo_url: None
  • paper_authors: Reza Basiri, Karim Manji, Francois Harton, Alisha Poonja, Milos R. Popovic, Shehroz S. Khan
  • for: Synthesizing artificial DFU images with a diffusion model and evaluating their authenticity.
  • methods: A diffusion model is trained on 2,000 DFU images, and synthetic images are generated through the diffusion process.
  • results: The diffusion model generates images that are visually indistinguishable from real DFU images; clinicians marked synthetic images as real 70% of the time, though they rated real images with higher unanimous confidence. FID and KID metrics did not align with the clinicians' assessments.
    Abstract Diabetic Foot Ulcer (DFU) is a serious skin wound requiring specialized care. However, real DFU datasets are limited, hindering clinical training and research activities. In recent years, generative adversarial networks and diffusion models have emerged as powerful tools for generating synthetic images with remarkable realism and diversity in many applications. This paper explores the potential of diffusion models for synthesizing DFU images and evaluates their authenticity through expert clinician assessments. Additionally, evaluation metrics such as Frechet Inception Distance (FID) and Kernel Inception Distance (KID) are examined to assess the quality of the synthetic DFU images. A dataset of 2,000 DFU images is used for training the diffusion model, and the synthetic images are generated by applying diffusion processes. The results indicate that the diffusion model successfully synthesizes visually indistinguishable DFU images. 70% of the time, clinicians marked synthetic DFU images as real DFUs. However, clinicians demonstrate higher unanimous confidence in rating real images than synthetic ones. The study also reveals that FID and KID metrics do not significantly align with clinicians' assessments, suggesting alternative evaluation approaches are needed. The findings highlight the potential of diffusion models for generating synthetic DFU images and their impact on medical training programs and research in wound detection and classification.
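
For reference, training a diffusion model on the 2,000 DFU images amounts to fitting a noise predictor on samples from the standard DDPM forward process, sketched below. The linear beta schedule is a common default, not a detail reported in the abstract:

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)    # one scalar per batch item
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

betas = torch.linspace(1e-4, 0.02, 1000)           # standard linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.rand(4, 3, 64, 64)                      # stand-in for a DFU image batch
x_t, eps = q_sample(x0, torch.tensor([10, 100, 500, 999]), alphas_cumprod)
# A denoiser would be trained to predict eps from (x_t, t).
```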

Team I2R-VI-FF Technical Report on EPIC-KITCHENS VISOR Hand Object Segmentation Challenge 2023

  • paper_url: http://arxiv.org/abs/2310.20120
  • repo_url: None
  • paper_authors: Fen Fang, Yi Cheng, Ying Sun, Qianli Xu
  • for: The hand and active object segmentation task of the EPIC-KITCHENS VISOR Challenge, i.e., predicting hand and in-contact object segments from a single input frame.
  • methods: Combines the baseline method, Point-based Rendering (PointRend), with the Segment Anything Model (SAM) to improve the accuracy of hand and active object segmentation while reducing missed detections.
  • results: The submission achieved 1st place on the evaluation criteria of the EPIC-KITCHENS VISOR HOS Challenge, showing that combining the strengths of existing methods with targeted refinements improves hand and active object segmentation accuracy.
    Abstract In this report, we present our approach to the EPIC-KITCHENS VISOR Hand Object Segmentation Challenge, which focuses on the estimation of the relation between the hands and the objects given a single frame as input. The EPIC-KITCHENS VISOR dataset provides pixel-wise annotations and serves as a benchmark for hand and active object segmentation in egocentric video. Our approach combines the baseline method, Point-based Rendering (PointRend), with the Segment Anything Model (SAM), aiming to enhance the accuracy of hand and object segmentation outcomes while also minimizing instances of missed detections. We leverage accurate hand segmentation maps obtained from the baseline method to extract more precise hand and in-contact object segments. We utilize the class-agnostic segmentation provided by SAM and apply specific hand-crafted constraints to enhance the results. In cases where the baseline model misses the detection of hands or objects, we re-train an object detector on the training set to enhance the detection accuracy. The detected hand and in-contact object bounding boxes are then used as prompts to extract their respective segments from the output of SAM. By effectively combining the strengths of existing methods and applying our refinements, our submission achieved 1st place on the evaluation criteria of the VISOR HOS Challenge.
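
The SAM prompting step described above, feeding detected hand and in-contact-object boxes to SAM to extract their segments, looks roughly like the sketch below; the checkpoint path, image, and box coordinates are placeholders:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_rgb = np.zeros((480, 854, 3), dtype=np.uint8)   # placeholder video frame
predictor.set_image(image_rgb)                        # HxWx3 uint8 RGB array

hand_box = np.array([120, 200, 380, 460])             # [x0, y0, x1, y1] from the detector
masks, scores, _ = predictor.predict(box=hand_box, multimask_output=False)
hand_mask = masks[0]                                  # boolean HxW segment
```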

Refined Equivalent Pinhole Model for Large-scale 3D Reconstruction from Spaceborne CCD Imagery

  • paper_url: http://arxiv.org/abs/2310.20117
  • repo_url: None
  • paper_authors: Hong Danyang, Yu Anzhu, Ji Song, Cao Xuefeng, Quan Yujun, Guo Wenyue, Qiu Chunping
  • for: A large-scale earth-surface reconstruction pipeline for linear-array Charge-Coupled Device (CCD) satellite imagery with a rigorous physical interpretation, compatible with newer reconstruction pipelines in computer vision.
  • methods: Introduces a method that makes the Rational Function Model (RFM) equivalent to the Pinhole Camera Model (PCM), derives an error formula for this equivalent pinhole model showing how reconstruction accuracy relates to image size, and proposes a polynomial image refinement model that minimizes the equivalent error via least squares.
  • results: Experiments on four datasets (WHU-TLC, DFC2019, ISPRS-ZY3, and GF7) show that reconstruction accuracy is proportional to image size, and that the polynomial image refinement model improves the accuracy and completeness of the reconstruction, especially for larger images.
    Abstract In this study, we present a large-scale earth surface reconstruction pipeline for linear-array charge-coupled device (CCD) satellite imagery. While mainstream satellite image-based reconstruction approaches perform exceptionally well, the rational function model (RFM) is subject to several limitations. For example, the RFM has no rigorous physical interpretation and differs significantly from the pinhole imaging model; hence, it cannot be directly applied to learning-based 3D reconstruction networks or to more novel reconstruction pipelines in computer vision. Hence, in this study, we introduce a method in which the RFM is made equivalent to the pinhole camera model (PCM), meaning that the internal and external parameters of the pinhole camera are used instead of the rational polynomial coefficient parameters. We then derive an error formula for this equivalent pinhole model for the first time, demonstrating the influence of the image size on the accuracy of the reconstruction. In addition, we propose a polynomial image refinement model that minimizes equivalent errors via the least squares method. The experiments were conducted using four image datasets: WHU-TLC, DFC2019, ISPRS-ZY3, and GF7. The results demonstrated that the reconstruction accuracy was proportional to the image size. Our polynomial image refinement model significantly enhanced the accuracy and completeness of the reconstruction, and achieved more significant improvements for larger-scale images.
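
For readers unfamiliar with the target model, the pinhole camera model that the RFM is made equivalent to can be stated in standard notation (ours, not necessarily the paper's):

$$ s\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K\,[R \mid t]\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, $$

and the polynomial refinement then amounts to a least-squares fit of image-plane corrections, $\min_{a} \sum_i \lVert \Delta\mathbf{x}_i - P(\mathbf{x}_i; a) \rVert^2$, where $P$ is the refinement polynomial with coefficients $a$ and $\Delta\mathbf{x}_i$ are the equivalent-model residuals.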

Medical Image Denoising via Explainable AI Feature Preserving Loss

  • paper_url: http://arxiv.org/abs/2310.20101
  • repo_url: None
  • paper_authors: Guanfang Dong, Anup Basu
  • for: Proposing a new medical image denoising method that efficiently removes various types of noise while preserving critical medical features.
  • methods: Uses a gradient-based eXplainable Artificial Intelligence (XAI) approach to design a feature-preserving loss function. Motivated by the noise sensitivity of gradient-based XAI, the loss keeps medical image features consistent before and after denoising via backpropagation.
  • results: Extensive experiments on three available medical image datasets, covering 13 synthesized types of noise and artifacts, demonstrate superiority in denoising performance, model explainability, and generalization.
    Abstract Denoising algorithms play a crucial role in medical image processing and analysis. However, classical denoising algorithms often ignore the preservation of explanatory and critical medical features, which may lead to misdiagnosis and legal liabilities. In this work, we propose a new denoising method for medical images that not only efficiently removes various types of noise, but also preserves key medical features throughout the process. To achieve this goal, we utilize a gradient-based eXplainable Artificial Intelligence (XAI) approach to design a feature-preserving loss function. Our feature-preserving loss function is motivated by the characteristic that gradient-based XAI is sensitive to noise. Through backpropagation, medical image features before and after denoising can be kept consistent. We conducted extensive experiments on three available medical image datasets, including 13 synthesized types of noise and artifacts. The experimental results demonstrate the superiority of our method in terms of denoising performance, model explainability, and generalization.
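
One way to realize a gradient-based XAI feature-preserving term, our reading of the idea rather than the paper's exact loss, is to require input-gradient saliency maps of a frozen downstream model to agree before and after denoising:

```python
import torch
import torch.nn.functional as F

def feature_preserving_loss(model, noisy, denoised):
    """Penalize disagreement between saliency maps of noisy and denoised inputs."""
    def saliency(x):
        x = x.clone().requires_grad_(True)
        score = model(x).max(dim=1).values.sum()      # top-class scores
        (grad,) = torch.autograd.grad(score, x, create_graph=True)
        return grad.abs()
    return F.l1_loss(saliency(denoised), saliency(noisy.detach()))
```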

$p$-Poisson surface reconstruction in curl-free flow from point clouds

  • paper_url: http://arxiv.org/abs/2310.20095
  • repo_url: None
  • paper_authors: Yesom Park, Taekyung Lee, Jooyoung Hahn, Myungjoo Kang
  • for: Reconstructing a smooth surface from an unorganized point cloud sampled from a closed surface, preserving geometric shape, without any information other than the points.
  • methods: Uses an implicit neural representation (INR) supervised by the p-Poisson partial differential equation and fundamental properties of differential vector fields, including a curl-free constraint on an auxiliary gradient variable, to robustly learn a signed distance function whose zero-level set is the surface.
  • results: Experiments show that the proposed INR provides high-quality and robust surface reconstruction. Code is available at https://github.com/dhyuan99/hdfe is not related; the code is at https://github.com/Yebbi/PINC.
    Abstract The aim of this paper is the reconstruction of a smooth surface from an unorganized point cloud sampled from a closed surface, with the preservation of geometric shapes and without any further information other than the point cloud. Implicit neural representations (INRs) have recently emerged as a promising approach to surface reconstruction. However, the reconstruction quality of existing methods relies on ground truth implicit function values or surface normal vectors. In this paper, we show that proper supervision of partial differential equations and fundamental properties of differential vector fields are sufficient to robustly reconstruct high-quality surfaces. We cast the $p$-Poisson equation to learn a signed distance function (SDF), and the reconstructed surface is implicitly represented by the zero-level set of the SDF. For efficient training, we develop a variable splitting structure by introducing the gradient of the SDF as an auxiliary variable and impose the $p$-Poisson equation directly on the auxiliary variable as a hard constraint. Based on the curl-free property of the gradient field, we impose a curl-free constraint on the auxiliary variable, which leads to a more faithful reconstruction. Experiments on standard benchmark datasets show that the proposed INR provides a superior and robust reconstruction. The code is available at https://github.com/Yebbi/PINC.
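
Schematically, the variable-splitting formulation described in the abstract reads as follows, in generic $p$-Poisson form; the source term $f$ and the constraint weighting are paper-specific details not reproduced here:

$$ \Delta_p u := \nabla \cdot \big( |\nabla u|^{p-2} \nabla u \big) = f \quad \longrightarrow \quad \mathbf{G} \approx \nabla u, \qquad \nabla \cdot \big( |\mathbf{G}|^{p-2} \mathbf{G} \big) = f, \qquad \nabla \times \mathbf{G} = \mathbf{0}, $$

with the surface recovered as the zero-level set $\{\, \mathbf{x} : u(\mathbf{x}) = 0 \,\}$ of the learned SDF $u$.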

Beyond U: Making Diffusion Models Faster & Lighter

  • paper_url: http://arxiv.org/abs/2310.20092
  • repo_url: None
  • paper_authors: Sergio Calvo-Ordonez, Jiahao Huang, Lipei Zhang, Guang Yang, Carola-Bibiane Schonlieb, Angelica I Aviles-Rivero
  • for: Designing a novel denoising network for diffusion models based on continuous dynamical systems, to make diffusion models faster and lighter.
  • methods: Leverages continuous dynamical systems to design a denoising network that is more parameter-efficient, converges faster, and is more robust to noise than a standard U-Net.
  • results: In denoising probabilistic diffusion models, the network operates with roughly a quarter of the parameters and 30% of the floating point operations (FLOPs) of standard U-Nets in DDPMs, and is up to 70% faster at inference under equal conditions while converging to better-quality solutions.
    Abstract Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs. In this work, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with denoising probabilistic diffusion models, our framework operates with approximately a quarter of the parameters and 30% of the Floating Point Operations (FLOPs) compared to standard U-Nets in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, our model is up to 70% faster in inference than the baseline models when measured under equal conditions, while converging to better-quality solutions.
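
A toy illustration of the continuous-dynamical-systems view: share one small vector field and unroll a few explicit Euler steps instead of stacking independent layers, which is one way such a design saves parameters. The step count, step size, and widths below are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ODEDenoiserBlock(nn.Module):
    """Unrolled explicit Euler integration of a shared convolutional vector field."""
    def __init__(self, channels: int, steps: int = 4, dt: float = 0.25):
        super().__init__()
        self.field = nn.Sequential(                # f(x): one reused set of weights
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.steps, self.dt = steps, dt

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.steps):                # x_{k+1} = x_k + dt * f(x_k)
            x = x + self.dt * self.field(x)
        return x

x = torch.randn(2, 64, 32, 32)
y = ODEDenoiserBlock(64)(x)                        # 4 steps, one layer's parameters
```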