cs.CV - 2023-10-09

DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2310.06196
  • repo_url: https://github.com/shakeebmurtaza/dips
  • paper_authors: Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Aydin Sarraf, Eric Granger
  • for: This work tackles weakly supervised object localization, where self-supervised vision transformers (SSTs) yield rich localization maps that nevertheless remain class-agnostic.
  • methods: The paper proposes Discriminative Pseudo-label Sampling (DiPS), which takes multiple attention maps and uses a pre-trained classifier to identify the most discriminative regions, ensuring that the selected ROIs cover the correct image object rather than background noise objects.
  • results: Experiments show that the proposed architecture, combined with the transformer-based proposals, achieves better localization performance than state-of-the-art methods on the CUB, ILSVRC, OpenImages, and TelDrone datasets, while performing both classification and localization using only image-class labels.
    Abstract Self-supervised vision transformers (SSTs) have shown great potential to yield rich localization maps that highlight different objects in an image. However, these maps remain class-agnostic since the model is unsupervised. They often tend to decompose the image into multiple maps containing different objects while being unable to distinguish the object of interest from background noise objects. In this paper, Discriminative Pseudo-label Sampling (DiPS) is introduced to leverage these class-agnostic maps for weakly-supervised object localization (WSOL), where only image-class labels are available. Given multiple attention maps, DiPS relies on a pre-trained classifier to identify the most discriminative regions of each attention map. This ensures that the selected ROIs cover the correct image object while discarding the background ones, and, as such, provides a rich pool of diverse and discriminative proposals to cover different parts of the object. Subsequently, these proposals are used as pseudo-labels to train our new transformer-based WSOL model designed to perform classification and localization tasks. Unlike standard WSOL methods, DiPS optimizes performance in both tasks by using a transformer encoder and a dedicated output head for each task, each trained using dedicated loss functions. To avoid overfitting a single proposal and promote better object coverage, a single proposal is randomly selected among the top ones for a training image at each training step. Experimental results on the challenging CUB, ILSVRC, OpenImages, and TelDrone datasets indicate that our architecture, in combination with our transformer-based proposals, can yield better localization performance than state-of-the-art methods.
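
As a concrete illustration of the sampling step described in the abstract (one proposal randomly selected among the top ones at each training step), here is a minimal Python sketch; the `top_k` value and the (box, score) representation are assumptions for illustration, not the paper's exact settings.

```python
import random

def sample_pseudo_label(proposals, scores, top_k=5):
    """Keep the top-k most discriminative proposals (classifier-scored ROIs
    from the SST attention maps) and pick one at random as this step's
    localization pseudo-label, so no single box is overfitted.
    `top_k` and the (box, score) representation are illustrative assumptions."""
    ranked = sorted(zip(scores, proposals), key=lambda pair: pair[0], reverse=True)
    _, box = random.choice(ranked[:top_k])
    return box  # e.g. an (x1, y1, x2, y2) box used as the pseudo-label this step

# usage: box = sample_pseudo_label(boxes_for_image, classifier_scores)
```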

HydraViT: Adaptive Multi-Branch Transformer for Multi-Label Disease Classification from Chest X-ray Images

  • paper_url: http://arxiv.org/abs/2310.06143
  • repo_url: https://github.com/YigitTurali/HydraViT
  • paper_authors: Şaban Öztürk, M. Yiğit Turalı, Tolga Çukur
  • for: This paper aims to improve multi-label disease classification performance on chest X-ray images.
  • methods: The method synergistically combines a transformer backbone with a multi-branch output module with learned weighting; the self-attention mechanism adaptively focuses on task-critical regions, while an independent branch is dedicated to each disease label along with an aggregated branch across labels.
  • results: Experiments show that, on average, HydraViT outperforms competing attention-guided methods by 1.2%, region-guided methods by 1.4%, and semantic-guided methods by 1.0% in multi-label classification performance.
    Abstract Chest X-ray is an essential diagnostic tool in the identification of chest diseases given its high sensitivity to pathological abnormalities in the lungs. However, image-driven diagnosis is still challenging due to heterogeneity in size and location of pathology, as well as visual similarities and co-occurrence of separate pathology. Since disease-related regions often occupy a relatively small portion of diagnostic images, classification models based on traditional convolutional neural networks (CNNs) are adversely affected given their locality bias. While CNNs were previously augmented with attention maps or spatial masks to guide focus on potentially critical regions, learning localization guidance under heterogeneity in the spatial distribution of pathology is challenging. To improve multi-label classification performance, here we propose a novel method, HydraViT, that synergistically combines a transformer backbone with a multi-branch output module with learned weighting. The transformer backbone enhances sensitivity to long-range context in X-ray images, while using the self-attention mechanism to adaptively focus on task-critical regions. The multi-branch output module dedicates an independent branch to each disease label to attain robust learning across separate disease classes, along with an aggregated branch across labels to maintain sensitivity to co-occurrence relationships among pathology. Experiments demonstrate that, on average, HydraViT outperforms competing attention-guided methods by 1.2%, region-guided methods by 1.4%, and semantic-guided methods by 1.0% in multi-label classification performance.
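
The multi-branch output module can be sketched as one independent branch per disease label, an aggregated branch across labels, and a learned weighting that mixes them. This is a hedged sketch: the feature dimension, the 14-label head, and the sigmoid-based mixing are illustrative assumptions, not HydraViT's exact design.

```python
import torch
import torch.nn as nn

class MultiBranchHead(nn.Module):
    """Per-label branches plus an aggregated branch, mixed by a learned
    per-label weight (illustrative sketch, not HydraViT's exact module)."""
    def __init__(self, feat_dim=768, num_labels=14):
        super().__init__()
        self.per_label = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_labels)])
        self.aggregated = nn.Linear(feat_dim, num_labels)
        self.mix = nn.Parameter(torch.zeros(num_labels))  # learned weighting

    def forward(self, feats):  # feats: (B, feat_dim) pooled transformer features
        solo = torch.cat([branch(feats) for branch in self.per_label], dim=1)  # (B, L)
        joint = self.aggregated(feats)                                          # (B, L)
        alpha = torch.sigmoid(self.mix)                                         # (L,)
        return alpha * solo + (1.0 - alpha) * joint  # multi-label logits
```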

WinSyn: A High Resolution Testbed for Synthetic Data

  • paper_url: http://arxiv.org/abs/2310.08471
  • repo_url: https://github.com/twak/winsyn_metadata
  • paper_authors: Tom Kelly, John Femiani, Peter Wonka
  • for: Research on synthetic-to-real learning and synthetic data generation.
  • methods: A dataset (WinSyn) of high-resolution photographs and renderings, comprising 75,739 photographs of building windows and 89,318 cropped subimages, of which 9,002 are semantically labeled.
  • results: Provides a domain-matched photorealistic procedural model for experimentation over a variety of parameter distributions and engineering approaches, together with a second corresponding dataset of 21,290 synthetic images; the dataset serves as a testbed for research on synthetic data generation.
    Abstract We present WinSyn, a dataset consisting of high-resolution photographs and renderings of 3D models as a testbed for synthetic-to-real research. The dataset consists of 75,739 high-resolution photographs of building windows, including traditional and modern designs, captured globally. These include 89,318 cropped subimages of windows, of which 9,002 are semantically labeled. Further, we present our domain-matched photorealistic procedural model which enables experimentation over a variety of parameter distributions and engineering approaches. Our procedural model provides a second corresponding dataset of 21,290 synthetic images. This jointly developed dataset is designed to facilitate research in the field of synthetic-to-real learning and synthetic data generation. WinSyn allows experimentation into the factors that make it challenging for synthetic data to compete with real-world data. We perform ablations using our synthetic model to identify the salient rendering, materials, and geometric factors pertinent to accuracy within the labeling task. We chose windows as a benchmark because they exhibit a large variability of geometry and materials in their design, making them ideal to study synthetic data generation in a constrained setting. We argue that the dataset is a crucial step to enable future research in synthetic data generation for deep learning.

Factorized Tensor Networks for Multi-Task and Multi-Domain Learning

  • paper_url: http://arxiv.org/abs/2310.06124
  • repo_url: https://github.com/yashgarg98/FTN
  • paper_authors: Yash Garg, Nebiyou Yismaw, Rakib Hyder, Ashley Prater-Bennette, M. Salman Asif
  • for: FTN is a multi-task and multi-domain learning method that learns multiple tasks and domains with a single unified network.
  • methods: FTN uses a frozen backbone network from a source model and incrementally adds task/domain-specific low-rank tensor factors to the shared frozen network.
  • results: FTN achieves accuracy comparable to independent single-task/domain networks on multiple target domains and tasks while requiring far fewer additional parameters; experiments on widely used multi-domain and multi-task datasets show similar accuracy across different convolutional backbones and transformer-based architectures.
    Abstract Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The key challenge and opportunity is to exploit shared information across tasks and domains to improve the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we propose a factorized tensor network (FTN) that can achieve accuracy comparable to independent single-task/domain networks with a small number of additional parameters. FTN uses a frozen backbone network from a source model and incrementally adds task/domain-specific low-rank tensor factors to the shared frozen network. This approach can adapt to a large number of target domains and tasks without catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets. We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture. We observed that FTN achieves similar accuracy as single-task/domain methods while using only a fraction of additional parameters per task.
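
A minimal sketch of the core idea, keeping a backbone layer frozen and adding a task-specific low-rank factor to its weights, is shown below; the rank, the additive parameterization, and the initialization are assumptions for illustration rather than FTN's exact factorization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankAdaptedConv(nn.Module):
    """A frozen backbone convolution augmented with a task-specific low-rank
    weight factor (up @ down); illustrative sketch of the FTN idea."""
    def __init__(self, frozen_conv: nn.Conv2d, rank: int = 4):
        super().__init__()
        self.frozen = frozen_conv
        for p in self.frozen.parameters():
            p.requires_grad_(False)                       # shared backbone stays frozen
        out_ch = frozen_conv.weight.shape[0]
        fan_in = frozen_conv.weight[0].numel()            # (in_channels/groups) * kH * kW
        self.down = nn.Parameter(torch.zeros(rank, fan_in))      # task-specific factors
        self.up = nn.Parameter(torch.randn(out_ch, rank) * 0.01)

    def forward(self, x):
        delta = (self.up @ self.down).view_as(self.frozen.weight)
        weight = self.frozen.weight + delta               # frozen kernel + low-rank update
        return F.conv2d(x, weight, self.frozen.bias,
                        stride=self.frozen.stride, padding=self.frozen.padding,
                        dilation=self.frozen.dilation, groups=self.frozen.groups)
```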

QR-Tag: Angular Measurement and Tracking with a QR-Design Marker

  • paper_url: http://arxiv.org/abs/2310.06109
  • repo_url: None
  • paper_authors: Simeng Qiu, Hadi Amata, Wolfgang Heidrich
  • for: This paper presents a non-contact object tracking method for applications in robotics, virtual and augmented reality, and industrial computer vision.
  • methods: The method exploits the parallax effect between binary structures printed on both sides of a glass plate, together with a scannable QR-code design, to measure and track angular shifts of an object from a snapshot.
  • results: Simulations show that the proposed method is computationally efficient and measures angular information with high accuracy.
    Abstract Directional information measurement has many applications in domains such as robotics, virtual and augmented reality, and industrial computer vision. Conventional methods either require pre-calibration or necessitate controlled environments. The state-of-the-art MoireTag approach exploits the Moire effect and QR-design to continuously track the angular shift precisely. However, it is still not a fully QR code design. To overcome the above challenges, we propose a novel snapshot method for discrete angular measurement and tracking with scannable QR-design patterns that are generated by binary structures printed on both sides of a glass plate. The QR codes, resulting from the parallax effect due to the geometry alignment between two layers, can be readily measured as angular information using a phone camera. The simulation results show that the proposed non-contact object tracking framework is computationally efficient with high accuracy.
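
The underlying parallax geometry can be illustrated in a few lines of Python: a back-layer pattern seen through a plate of thickness t appears shifted by s = t * tan(theta) at viewing angle theta, so the angle can be recovered from the observed shift. This toy sketch ignores refraction inside the glass and the discrete QR encoding.

```python
import math

def viewing_angle_from_parallax(lateral_shift_mm, glass_thickness_mm):
    """Recover the viewing angle from the apparent shift between front- and
    back-layer patterns on a plate (toy geometry; refraction ignored)."""
    return math.degrees(math.atan2(lateral_shift_mm, glass_thickness_mm))

print(viewing_angle_from_parallax(1.0, 5.0))  # roughly 11.3 degrees
```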

Developing and Refining a Multifunctional Facial Recognition System for Older Adults with Cognitive Impairments: A Journey Towards Enhanced Quality of Life

  • paper_url: http://arxiv.org/abs/2310.06107
  • repo_url: https://github.com/li-8023/multi-function-face-recognition
  • paper_authors: Li He
  • for: This paper presents a Multifunctional Facial Recognition System (MFRS) designed to assist older adults with cognitive impairments in their daily lives.
  • methods: The system builds on face_recognition, a free open-source library for extracting, identifying, and manipulating facial features, and adds image capture and voice memo recording to improve usability and versatility.
  • results: The implementation and evaluation suggest the system can help older adults with everyday tasks such as recognizing family and friends, recording daily activities and photos, and retrieving misplaced items.
    Abstract In an era where the global population is aging significantly, cognitive impairments among the elderly have become a major health concern. The need for effective assistive technologies is clear, and facial recognition systems are emerging as promising tools to address this issue. This document discusses the development and evaluation of a new Multifunctional Facial Recognition System (MFRS), designed specifically to assist older adults with cognitive impairments. The MFRS leverages face_recognition [1], a powerful open-source library capable of extracting, identifying, and manipulating facial features. Our system integrates the face recognition and retrieval capabilities of face_recognition, along with additional functionalities to capture images and record voice memos. This combination of features notably enhances the system's usability and versatility, making it a more user-friendly and universally applicable tool for end-users. The source code for this project can be accessed at https://github.com/Li-8023/Multi-function-face-recognition.git.
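
Since the system is built on the face_recognition library, a minimal identification loop of the kind it relies on might look like the following; the file names and the tolerance value are illustrative assumptions.

```python
import face_recognition

# Enroll one known person (file names and tolerance are illustrative).
known_image = face_recognition.load_image_file("family_member.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]

# Identify every face found in a new camera snapshot.
frame = face_recognition.load_image_file("camera_snapshot.jpg")
for encoding in face_recognition.face_encodings(frame):
    is_match = face_recognition.compare_faces([known_encoding], encoding, tolerance=0.6)[0]
    distance = face_recognition.face_distance([known_encoding], encoding)[0]
    print("known person" if is_match else "unknown face", f"(distance {distance:.2f})")
```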

Advancing Diagnostic Precision: Leveraging Machine Learning Techniques for Accurate Detection of Covid-19, Pneumonia, and Tuberculosis in Chest X-Ray Images

  • paper_url: http://arxiv.org/abs/2310.06080
  • repo_url: None
  • paper_authors: Aditya Kulkarni, Guruprasad Parasnis, Harish Balasubramanian, Vansh Jain, Anmol Chokshi, Reena Sonkusare
  • for: This study proposes a multiclass classification approach for early diagnosis of lung diseases such as COVID-19, tuberculosis (TB), and pneumonia.
  • methods: The approach uses state-of-the-art deep learning and image processing techniques, including a new convolutional neural network and several transfer-learning pre-trained models.
  • results: Rigorous testing on the publicly available multiclass Kaggle dataset and the NIH dataset yields AUC values of 0.95 for COVID-19, 0.99 for TB, and 0.98 for pneumonia, with similarly high recall and precision.
    Abstract Lung diseases such as COVID-19, tuberculosis (TB), and pneumonia continue to be serious global health concerns that affect millions of people worldwide. In medical practice, chest X-ray examinations have emerged as the norm for diagnosing diseases, particularly chest infections such as COVID-19. Paramedics and scientists are working intensively to create a reliable and precise approach for early-stage COVID-19 diagnosis in order to save lives. But with a variety of symptoms, medical diagnosis of these disorders poses special difficulties. It is essential to address their identification and timely diagnosis in order to successfully treat and prevent these illnesses. In this research, a multiclass classification approach using state-of-the-art methods for deep learning and image processing is proposed. This method takes into account the robustness and efficiency of the system in order to increase diagnostic precision of chest diseases. A comparison between a brand-new convolution neural network (CNN) and several transfer learning pre-trained models including VGG19, ResNet, DenseNet, EfficientNet, and InceptionNet is recommended. Publicly available and widely used research datasets like Shenzen, Montogomery, the multiclass Kaggle dataset and the NIH dataset were used to rigorously test the model. Recall, precision, F1-score, and Area Under Curve (AUC) score are used to evaluate and compare the performance of the proposed model. An AUC value of 0.95 for COVID-19, 0.99 for TB, and 0.98 for pneumonia is obtained using the proposed network. Recall and precision ratings of 0.95, 0.98, and 0.97, respectively, likewise met high standards.
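
A hedged sketch of the kind of transfer-learning baseline the study compares against (load an ImageNet-pretrained backbone, replace the classifier head, evaluate with AUC) is shown below; the choice of DenseNet-121 and a four-class head (COVID-19, TB, pneumonia, normal) is an assumption for illustration.

```python
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

# Load an ImageNet-pretrained DenseNet-121 and swap in a 4-class head
# (the class layout is an illustrative assumption).
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 4)

# After fine-tuning on chest X-rays, per-class AUC can be reported with e.g.:
# roc_auc_score(y_true_onehot, y_pred_scores, average=None)
```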

FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

  • paper_url: http://arxiv.org/abs/2310.05922
  • repo_url: None
  • paper_authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He
  • for: The paper aims to improve visual consistency in the text-to-video editing task.
  • methods: The authors introduce optical flow into the attention module of the diffusion model's U-Net for the first time, addressing the consistency problem in text-to-video editing.
  • results: Experiments show that the method reduces inconsistency and achieves new state-of-the-art performance on existing text-to-video editing benchmarks.
    Abstract Text-to-video editing aims to edit the visual appearance of a source video conditional on textual prompts. A major challenge in this task is to ensure that all frames in the edited video are visually consistent. Most recent works apply advanced text-to-image diffusion models to this task by inflating 2D spatial attention in the U-Net into spatio-temporal attention. Although temporal context can be added through spatio-temporal attention, it may introduce some irrelevant information for each patch and therefore cause inconsistency in the edited video. In this paper, for the first time, we introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing. Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module, thus improving the visual consistency in the edited videos. Additionally, our method is training-free and can be seamlessly integrated into any diffusion-based text-to-video editing methods and improve their visual consistency. Experiment results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance. In particular, our method excels in maintaining the visual consistency in the edited videos.
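
To make the core idea concrete, the sketch below restricts attention to patches lying on the same optical-flow path across frames. It is a simplified, self-contained illustration: the `next_patch` mapping is assumed to be precomputed from optical flow, and the actual method integrates this into the U-Net's spatio-temporal attention rather than as a standalone module.

```python
import torch

def flow_guided_attention(tokens, next_patch):
    """tokens: (T, N, C) patch features for T frames; next_patch: (T-1, N)
    long tensor where next_patch[t, i] is the patch in frame t+1 that patch i
    of frame t moves to under optical flow (assumed precomputed).
    Each patch attends only to patches on its own flow path."""
    T, N, C = tokens.shape
    paths = torch.empty(N, T, dtype=torch.long)
    paths[:, 0] = torch.arange(N)
    for t in range(T - 1):                       # trace flow paths starting from frame 0
        paths[:, t + 1] = next_patch[t, paths[:, t]]
    path_tokens = tokens[torch.arange(T), paths]            # (N, T, C) along each path
    attn = torch.softmax(path_tokens @ path_tokens.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ path_tokens                                 # attention restricted to the path
    result = tokens.clone()
    result[torch.arange(T), paths] = out                     # write results back per frame/patch
    return result
```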

SimPLR: A Simple and Plain Transformer for Object Detection and Segmentation

  • paper_url: http://arxiv.org/abs/2310.05920
  • repo_url: None
  • paper_authors: Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek
  • for: This paper aims to improve object detection in images by removing the need for feature pyramids and multi-scale feature maps, which are commonly used in modern object detectors.
  • methods: The paper proposes a transformer-based detector with scale-aware attention, which allows the detector to operate on single-scale features.
  • results: The proposed method, called SimPLR, achieves strong performance compared to other object detectors, including end-to-end detectors and plain-backbone detectors, while being faster.
    Abstract The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing handcrafted components using transformers, multi-scale feature maps remain a key factor for their empirical success, even with a plain backbone like the Vision Transformer (ViT). In this paper, we show that this reliance on feature pyramids is unnecessary and a transformer-based detector with scale-aware attention enables the plain detector `SimPLR' whose backbone and detection head both operate on single-scale features. The plain architecture allows SimPLR to effectively take advantages of self-supervised learning and scaling approaches with ViTs, yielding strong performance compared to multi-scale counterparts. We demonstrate through our experiments that when scaling to larger backbones, SimPLR indicates better performance than end-to-end detectors (Mask2Former) and plain-backbone detectors (ViTDet), while consistently being faster. The code will be released.

Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input

  • paper_url: http://arxiv.org/abs/2310.05917
  • repo_url: None
  • paper_authors: Donglai Xiang, Fabian Prada, Zhe Cao, Kaiwen Guo, Chenglei Wu, Jessica Hodgins, Timur Bagautdinov
  • for: The goal is to create photorealistic avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D input together with body and face motion.
  • methods: The paper proposes a Neural Iterative Closest Point (N-ICP) algorithm that efficiently tracks the coarse garment shape from sparse depth input; the input RGB-D images are then remapped to texel-aligned features and fed into drivable avatar models to reconstruct appearance details.
  • results: Experiments show that N-ICP efficiently produces faithful, high-fidelity clothing dynamics and appearance, and that the method generalizes to a novel testing environment.
    Abstract Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance.

CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird’s-Eye View Fusion

  • paper_url: http://arxiv.org/abs/2310.06008
  • repo_url: None
  • paper_authors: Donghao Qiao, Farhana Zulkernine
  • for: Improving the safety and reliability of autonomous vehicles through cooperative perception, where connected autonomous vehicles (CAVs) share multi-sensor data.
  • methods: LiDAR and camera data are fused with a Dual Window-based Cross-Attention (DWCA) module into a unified Bird's-Eye View (BEV) representation; the fused BEV feature maps are shared among CAVs and aggregated with a 3D convolutional neural network.
  • results: Experiments on the OPV2V dataset for two perception tasks (BEV semantic segmentation and 3D object detection) show that the DWCA LiDAR-camera fusion model outperforms single-modality perception models and state-of-the-art BEV fusion models, while the overall architecture, CoBEVFusion, achieves performance comparable to other cooperative perception models.
    Abstract Autonomous Vehicles (AVs) use multiple sensors to gather information about their surroundings. By sharing sensor data between Connected Autonomous Vehicles (CAVs), the safety and reliability of these vehicles can be improved through a concept known as cooperative perception. However, recent approaches in cooperative perception only share single sensor information such as cameras or LiDAR. In this research, we explore the fusion of multiple sensor data sources and present a framework, called CoBEVFusion, that fuses LiDAR and camera data to create a Bird's-Eye View (BEV) representation. The CAVs process the multi-modal data locally and utilize a Dual Window-based Cross-Attention (DWCA) module to fuse the LiDAR and camera features into a unified BEV representation. The fused BEV feature maps are shared among the CAVs, and a 3D Convolutional Neural Network is applied to aggregate the features from the CAVs. Our CoBEVFusion framework was evaluated on the cooperative perception dataset OPV2V for two perception tasks: BEV semantic segmentation and 3D object detection. The results show that our DWCA LiDAR-camera fusion model outperforms perception models with single-modal data and state-of-the-art BEV fusion models. Our overall cooperative perception architecture, CoBEVFusion, also achieves comparable performance with other cooperative perception models.

Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.05873
  • repo_url: None
  • paper_authors: Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James Kwok
  • for: Removing unintended implicit concepts (such as watermarks and QR codes) that diffusion models pick up when fine-tuned on personalized datasets to improve generation quality.
  • methods: Geometric information about the implicit concepts is encoded into the text domain with the help of an additional accessible classifier or detector model.
  • results: The method successfully identifies and removes implicit concepts, showing a significant improvement over existing approaches.
    Abstract Fine-tuning diffusion models through personalized datasets is an acknowledged method for improving generation quality across downstream tasks, which, however, often inadvertently generates unintended concepts such as watermarks and QR codes, attributed to the limitations in image sources and collecting methods within specific downstream tasks. Existing solutions suffer from eliminating these unintentionally learned implicit concepts, primarily due to the dependency on the model's ability to recognize concepts that it actually cannot discern. In this work, we introduce Geom-Erasing, a novel approach that successfully removes the implicit concepts with either an additional accessible classifier or detector model to encode geometric information of these concepts into text domain. Moreover, we propose Implicit Concept, a novel image-text dataset imbued with three implicit concepts (i.e., watermarks, QR codes, and text) for training and evaluation. Experimental results demonstrate that Geom-Erasing not only identifies but also proficiently eradicates implicit concepts, revealing a significant improvement over the existing methods. The integration of geometric information marks a substantial progression in the precise removal of implicit concepts in diffusion models.

Domain-wise Invariant Learning for Panoptic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2310.05867
  • repo_url: None
  • paper_authors: Li Li, You Qin, Wei Ji, Yuxiao Zhou, Roger Zimmermann
  • for: Improving the practical utility and real-world applicability of Panoptic Scene Graph Generation (PSG) models by addressing biased predicate annotations.
  • methods: A novel framework that infers potentially biased annotations by measuring the predicate prediction risk within each subject-object pair (domain) and adaptively transfers biased annotations to consistent ones by learning invariant predicate representation embeddings.
  • results: Experiments show that the method significantly improves benchmark models, achieves new state-of-the-art performance, and generalizes well on the PSG dataset.
    Abstract Panoptic Scene Graph Generation (PSG) involves the detection of objects and the prediction of their corresponding relationships (predicates). However, the presence of biased predicate annotations poses a significant challenge for PSG models, as it hinders their ability to establish a clear decision boundary among different predicates. This issue substantially impedes the practical utility and real-world applicability of PSG models. To address the intrinsic bias above, we propose a novel framework to infer potentially biased annotations by measuring the predicate prediction risks within each subject-object pair (domain), and adaptively transfer the biased annotations to consistent ones by learning invariant predicate representation embeddings. Experiments show that our method significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on PSG dataset.

A Real-time Method for Inserting Virtual Objects into Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.05837
  • repo_url: None
  • paper_authors: Keyang Ye, Hongzhi Wu, Xin Tong, Kun Zhou
  • for: Inserting virtual objects into scenes represented by a NeRF with realistic lighting and shadowing effects while allowing interactive manipulation of the objects.
  • methods: The method exploits the rich lighting and geometry information in a NeRF to overcome several challenges of object insertion in augmented reality, including lighting estimation, occlusion, and shadows.
  • results: The approach produces lighting and shadowing effects superior to state-of-the-art techniques, runs in real time, and has strong potential for augmented reality systems.
    Abstract We present the first real-time method for inserting a rigid virtual object into a neural radiance field, which produces realistic lighting and shadowing effects, as well as allows interactive manipulation of the object. By exploiting the rich information about lighting and geometry in a NeRF, our method overcomes several challenges of object insertion in augmented reality. For lighting estimation, we produce accurate, robust and 3D spatially-varying incident lighting that combines the near-field lighting from NeRF and an environment lighting to account for sources not covered by the NeRF. For occlusion, we blend the rendered virtual object with the background scene using an opacity map integrated from the NeRF. For shadows, with a precomputed field of spherical signed distance field, we query the visibility term for any point around the virtual object, and cast soft, detailed shadows onto 3D surfaces. Compared with state-of-the-art techniques, our approach can insert virtual object into scenes with superior fidelity, and has a great potential to be further applied to augmented reality systems.

Revisiting the Temporal Modeling in Spatio-Temporal Predictive Learning under A Unified View

  • paper_url: http://arxiv.org/abs/2310.05829
  • repo_url: None
  • paper_authors: Cheng Tan, Jue Wang, Zhangyang Gao, Siyuan Li, Lirong Wu, Jun Xia, Stan Z. Li
  • for: This paper revisits spatio-temporal predictive learning, which plays a crucial role in self-supervised learning and has wide-ranging applications.
  • methods: The paper examines the two mainstream temporal modeling approaches, recurrent-based and recurrent-free methods, under a unified view and introduces USTEP (Unified Spatio-TEmporal Predictive learning), which integrates both micro-temporal and macro-temporal scales.
  • results: Extensive experiments show that USTEP achieves significant improvements over existing temporal modeling approaches across a wide range of spatio-temporal predictive learning tasks, establishing it as a robust solution.
    Abstract Spatio-temporal predictive learning plays a crucial role in self-supervised learning, with wide-ranging applications across a diverse range of fields. Previous approaches for temporal modeling fall into two categories: recurrent-based and recurrent-free methods. The former, while meticulously processing frames one by one, neglect short-term spatio-temporal information redundancies, leading to inefficiencies. The latter naively stack frames sequentially, overlooking the inherent temporal dependencies. In this paper, we re-examine the two dominant temporal modeling approaches within the realm of spatio-temporal predictive learning, offering a unified perspective. Building upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales. Extensive experiments on a wide range of spatio-temporal predictive learning demonstrate that USTEP achieves significant improvements over existing temporal modeling approaches, thereby establishing it as a robust solution for a wide range of spatio-temporal applications.

Provably Convergent Data-Driven Convex-Nonconvex Regularization

  • paper_url: http://arxiv.org/abs/2310.05812
  • repo_url: None
  • paper_authors: Zakhar Shumaylov, Jeremy Budd, Subhadip Mukherjee, Carola-Bibiane Schönlieb
  • for: An emerging paradigm for solving inverse problems is to learn a regularizer from data with deep learning, which yields high-quality results but often lacks provable guarantees.
  • methods: The authors work within the convex-nonconvex (CNC) framework and introduce a novel input weakly convex neural network (IWCNN) construction to adapt learned adversarial regularization to this framework.
  • results: The approach attains well-posedness and convergent regularization, and empirically overcomes numerical issues of previous adversarial methods.
    Abstract An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arises within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.

Joint object detection and re-identification for 3D obstacle multi-camera systems

  • paper_url: http://arxiv.org/abs/2310.05785
  • repo_url: None
  • paper_authors: Irene Cortés, Jorge Beltrán, Arturo de la Escalera, Fernando García
  • for: Improving the accuracy and efficiency of object detection and re-identification in autonomous driving systems.
  • methods: An object detection network that uses camera and LiDAR information is extended with an additional branch for re-identifying objects across adjacent cameras on the same vehicle, followed by a 3D box estimator operating on the filtered point cloud generated from the network's detections.
  • results: Compared with traditional Non-Maximum Suppression (NMS), the proposed method improves performance by more than 5% in the car category in overlapping areas.
    Abstract In recent years, the field of autonomous driving has witnessed remarkable advancements, driven by the integration of a multitude of sensors, including cameras and LiDAR systems, in different prototypes. However, with the proliferation of sensor data comes the pressing need for more sophisticated information processing techniques. This research paper introduces a novel modification to an object detection network that uses camera and lidar information, incorporating an additional branch designed for the task of re-identifying objects across adjacent cameras within the same vehicle while elevating the quality of the baseline 3D object detection outcomes. The proposed methodology employs a two-step detection pipeline: initially, an object detection network is employed, followed by a 3D box estimator that operates on the filtered point cloud generated from the network's detections. Extensive experimental evaluations encompassing both 2D and 3D domains validate the effectiveness of the proposed approach and the results underscore the superiority of this method over traditional Non-Maximum Suppression (NMS) techniques, with an improvement of more than 5\% in the car category in the overlapping areas.

Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

  • paper_url: http://arxiv.org/abs/2310.05773
  • repo_url: https://github.com/GzyAftermath/DATM
  • paper_authors: Ziyao Guo, Kai Wang, George Cazenavette, Hui Li, Kaipeng Zhang, Yang You
  • for: The goal is lossless dataset distillation: synthesizing a small synthetic dataset such that a model trained on it performs as well as a model trained on the full, real dataset.
  • methods: A trajectory-matching-based method that optimizes the synthetic data to induce training dynamics similar to the real data, aligning the difficulty of the matched expert trajectories (early versus late training stages) with the size of the synthetic dataset.
  • results: The method achieves lossless dataset distillation for the first time by scaling trajectory matching to larger synthetic sets; the authors also explain why existing methods fail to generate larger, high-quality synthetic sets.
    Abstract The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on this synthetic set will perform equally well as a model trained on the full, real dataset. Until now, no method of Dataset Distillation has reached this completely lossless goal, in part due to the fact that previous methods only remain effective when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, it seems that to achieve truly loss dataset distillation, we must develop a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory-matching, or optimizing the synthetic data to induce similar long-term training dynamics as the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set since there are fewer examples wherein to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets since there are now enough samples to represent the necessary complex patterns. Based on our findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
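
The difficulty-alignment idea (match early, easy expert-trajectory stages for small synthetic sets and later, hard stages for larger ones) can be sketched as a simple schedule over the expert start epoch; the linear schedule, epoch cap, and images-per-class breakpoints below are illustrative assumptions, not the paper's values.

```python
import random

def sample_expert_start_epoch(ipc, max_start_epoch=40, ipc_small=1, ipc_large=50):
    """Pick the expert-trajectory start epoch so that small synthetic sets
    (low images-per-class, ipc) match early, easy training stages and large
    sets match later, hard stages (illustrative schedule only)."""
    frac = min(max(ipc - ipc_small, 0) / float(ipc_large - ipc_small), 1.0)
    upper = int(frac * max_start_epoch)
    return random.randint(0, upper)   # match the expert trajectory from this epoch onward
```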

3D tomatoes’ localisation with monocular cameras using histogram filters

  • paper_url: http://arxiv.org/abs/2310.05762
  • repo_url: None
  • paper_authors: Sandro Costa Magalhães, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias
  • for: Tomato position estimation in open-field environments, where RGB-D cameras are limited by lighting interference.
  • methods: Histogram Filters (Bayesian Discrete Filters) with a square kernel and a Gaussian kernel, using monocular cameras.
  • results: Mean absolute error below 10 mm in simulation and below 20 mm in a laboratory testbed at an assessment distance of about 0.5 m; the approach is viable for real environments but needs improvement at closer distances.
    Abstract Performing tasks in agriculture, such as fruit monitoring or harvesting, requires perceiving the objects' spatial position. RGB-D cameras are limited under open-field environments due to lightning interferences. Therefore, in this study, we approach the use of Histogram Filters (Bayesian Discrete Filters) to estimate the position of tomatoes in the tomato plant. Two kernel filters were studied: the square kernel and the Gaussian kernel. The implemented algorithm was essayed in simulation, with and without Gaussian noise and random noise, and in a testbed at laboratory conditions. The algorithm reported a mean absolute error lower than 10 mm in simulation and 20 mm in the testbed at laboratory conditions with an assessing distance of about 0.5 m. So, the results are viable for real environments and should be improved at closer distances.
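
A minimal 1-D histogram (Bayesian discrete) filter with a Gaussian measurement kernel, in the spirit of the paper's approach, is sketched below; the grid resolution, noise level, and absence of a motion model are illustrative assumptions.

```python
import numpy as np

def histogram_filter_1d(measurements, grid, sigma=0.02):
    """Start from a uniform belief over grid cells, multiply in the Gaussian
    likelihood of each noisy observation, renormalize, and return the MAP cell."""
    belief = np.full(len(grid), 1.0 / len(grid))
    for z in measurements:
        likelihood = np.exp(-0.5 * ((grid - z) / sigma) ** 2)
        belief *= likelihood
        belief /= belief.sum()
    return grid[np.argmax(belief)]

# Fuse 20 noisy observations of a tomato about 0.5 m away on a 1 mm grid.
grid = np.linspace(0.3, 0.7, 401)
observations = 0.5 + 0.02 * np.random.randn(20)
print(histogram_filter_1d(observations, grid))
```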

Unleashing the power of Neural Collapse for Transferability Estimation

  • paper_url: http://arxiv.org/abs/2310.05754
  • repo_url: None
  • paper_authors: Yuhe Ding, Bo Jiang, Lijun Sheng, Aihua Zheng, Jian Liang
  • for: This paper proposes a new transferability estimation method for assessing how suitable a pre-trained model is for a downstream task without fine-tuning.
  • methods: Building on the neural collapse phenomenon widely studied in the literature, the proposed method, Fair Collapse (FaCe), comprehensively measures the degree of neural collapse in the pre-trained model via two terms: a variance collapse term assessing class separation and within-class compactness, and a class fairness term quantifying how fairly the pre-trained model treats each class.
  • results: FaCe yields state-of-the-art performance across various pre-trained classification models with different network architectures, source datasets, and training losses, on tasks including image classification, semantic segmentation, and text classification, demonstrating its effectiveness and generality.
    Abstract Transferability estimation aims to provide heuristics for quantifying how suitable a pre-trained model is for a specific downstream task, without fine-tuning them all. Prior studies have revealed that well-trained models exhibit the phenomenon of Neural Collapse. Based on a widely used neural collapse metric in existing literature, we observe a strong correlation between the neural collapse of pre-trained models and their corresponding fine-tuned models. Inspired by this observation, we propose a novel method termed Fair Collapse (FaCe) for transferability estimation by comprehensively measuring the degree of neural collapse in the pre-trained model. Typically, FaCe comprises two different terms: the variance collapse term, which assesses the class separation and within-class compactness, and the class fairness term, which quantifies the fairness of the pre-trained model towards each class. We investigate FaCe on a variety of pre-trained classification models across different network architectures, source datasets, and training loss functions. Results show that FaCe yields state-of-the-art performance on different tasks including image classification, semantic segmentation, and text classification, which demonstrate the effectiveness and generalization of our method.
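
One common way to quantify the variance-collapse aspect of neural collapse is the ratio of within-class to between-class feature scatter; the generic sketch below illustrates this idea and is not FaCe's exact formula.

```python
import torch

def variance_collapse_score(features, labels):
    """Within- vs between-class scatter ratio of penultimate-layer features;
    smaller values indicate stronger neural collapse (generic sketch).
    features: (N, D) tensor, labels: (N,) long tensor."""
    classes = labels.unique()
    global_mean = features.mean(dim=0)
    within, between = 0.0, 0.0
    for c in classes:
        class_feats = features[labels == c]
        mu_c = class_feats.mean(dim=0)
        within += ((class_feats - mu_c) ** 2).sum(dim=1).mean()
        between += ((mu_c - global_mean) ** 2).sum()
    within = within / len(classes)
    between = between / len(classes)
    return (within / (between + 1e-8)).item()
```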

HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation

  • paper_url: http://arxiv.org/abs/2310.05720
  • repo_url: https://github.com/semchan/HyperLips
  • paper_authors: Yaosen Chen, Yu Yao, Zhiqiang Li, Wei Wang, Yanru Zhang, Han Yang, Xuming Wen
  • for: This work aims to improve existing audio-driven talking face generation methods by producing higher-fidelity faces with synchronized lips.
  • methods: A two-stage framework consisting of a hypernetwork (HyperNet) to control lip movements and a high-resolution decoder (HRDecoder) to render high-fidelity face content.
  • results: Experiments show that the method outperforms state-of-the-art audio-driven talking face generation with more realistic, higher-fidelity, and better lip-synchronized results.
    Abstract Talking face generation has a wide range of potential applications in the field of virtual digital humans. However, rendering high-fidelity facial video while ensuring lip synchronization is still a challenge for existing audio-driven talking face generation approaches. To address this issue, we propose HyperLips, a two-stage framework consisting of a hypernetwork for controlling lips and a high-resolution decoder for rendering high-fidelity faces. In the first stage, we construct a base face generation network that uses the hypernetwork to control the encoding latent code of the visual face information over audio. First, FaceEncoder is used to obtain latent code by extracting features from the visual face information taken from the video source containing the face frame.Then, HyperConv, which weighting parameters are updated by HyperNet with the audio features as input, will modify the latent code to synchronize the lip movement with the audio. Finally, FaceDecoder will decode the modified and synchronized latent code into visual face content. In the second stage, we obtain higher quality face videos through a high-resolution decoder. To further improve the quality of face generation, we trained a high-resolution decoder, HRDecoder, using face images and detected sketches generated from the first stage as input.Extensive quantitative and qualitative experiments show that our method outperforms state-of-the-art work with more realistic, high-fidelity, and lip synchronization. Project page: https://semchan.github.io/HyperLips Project/
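
The hypernetwork-modulated convolution can be sketched as a small network that maps audio features to the weights of a convolution applied to the visual latent code; the 1x1 kernel, dimensions, and per-sample weight generation below are illustrative assumptions rather than HyperLips' exact HyperConv design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConvSketch(nn.Module):
    """Toy hypernetwork-modulated convolution: an MLP maps an audio feature
    vector to the weights of a 1x1 conv applied to the visual latent code."""
    def __init__(self, audio_dim=128, channels=64):
        super().__init__()
        self.channels = channels
        self.weight_gen = nn.Linear(audio_dim, channels * channels)

    def forward(self, latent, audio_feat):
        # latent: (B, C, H, W) visual latent code, audio_feat: (B, audio_dim)
        B, C, H, W = latent.shape
        weights = self.weight_gen(audio_feat).view(B, C, C, 1, 1)
        out = [F.conv2d(latent[i:i + 1], weights[i]) for i in range(B)]  # per-sample kernels
        return torch.cat(out, dim=0)
```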

EdVAE: Mitigating Codebook Collapse with Evidential Discrete Variational Autoencoders

  • paper_url: http://arxiv.org/abs/2310.05718
  • repo_url: None
  • paper_authors: Gulcin Baykal, Melih Kandemir, Gozde Unal
  • for: This work addresses the codebook collapse problem that arises when training deep generative models with discrete variational autoencoders (dVAEs).
  • methods: A novel approach that replaces the softmax function with evidential deep learning (EDL) when obtaining the distribution over codebook embeddings, thereby avoiding codebook collapse in dVAEs.
  • results: Experiments show that the resulting EdVAE model mitigates codebook collapse, improves reconstruction performance, and enhances codebook usage compared to dVAE- and VQ-VAE-based models.
    Abstract Codebook collapse is a common problem in training deep generative models with discrete representation spaces like Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs) whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain a probability distribution causes the codebook collapse by assigning overconfident probabilities to the best matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) instead of softmax to combat the codebook collapse problem of dVAE. We evidentially monitor the significance of attaining the probability distribution over the codebook embeddings, in contrast to softmax usage. Our experiments using various datasets show that our model, called EdVAE, mitigates codebook collapse while improving the reconstruction performance, and enhances the codebook usage compared to dVAE and VQ-VAE based models.
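
Replacing softmax with an evidential (Dirichlet) output over the codebook can be sketched as follows, using the standard EDL parameterization (evidence plus one as Dirichlet concentrations); the softplus evidence function is a common choice and an assumption here, not necessarily EdVAE's exact formulation.

```python
import torch
import torch.nn.functional as F

def evidential_codebook_distribution(logits):
    """Evidential (Dirichlet) head over the codebook in place of softmax:
    non-negative evidence gives concentrations alpha, whose normalized mean
    is the assignment distribution and whose total mass signals certainty."""
    evidence = F.softplus(logits)                       # (B, K) non-negative evidence per code
    alpha = evidence + 1.0                              # Dirichlet concentration parameters
    probs = alpha / alpha.sum(dim=-1, keepdim=True)     # expected assignment distribution
    uncertainty = logits.shape[-1] / alpha.sum(dim=-1)  # vacuity: high when evidence is low
    return probs, uncertainty

probs, u = evidential_codebook_distribution(torch.randn(2, 512))  # e.g. a 512-entry codebook
```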

Uni3DETR: Unified 3D Detection Transformer

  • paper_url: http://arxiv.org/abs/2310.05699
  • repo_url: https://github.com/zhenyuw16/uni3detr
  • paper_authors: Zhenyu Wang, Yali Li, Xi Chen, Hengshuang Zhao, Shengjin Wang
  • for: This paper proposes a unified 3D detector that handles both indoor and outdoor 3D detection within the same framework.
  • methods: A detection transformer with point-voxel interaction predicts objects, and a mixture of query points exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor scenes; a decoupled IoU disentangles the xy and z space to provide an easy-to-optimize localization target.
  • results: Experiments show that Uni3DETR performs consistently well on both indoor and outdoor 3D detection; unlike scene-specific detectors, it exhibits strong generalization under heterogeneous conditions.
    Abstract Existing point cloud based 3D detectors are designed for the particular scene, either indoor or outdoor ones. Because of the substantial differences in object distribution and point density within point clouds collected from various environments, coupled with the intricate nature of 3D metrics, there is still a lack of a unified network architecture that can accommodate diverse scenes. In this paper, we propose Uni3DETR, a unified 3D detector that addresses indoor and outdoor 3D detection within the same framework. Specifically, we employ the detection transformer with point-voxel interaction for object prediction, which leverages voxel features and points for cross-attention and behaves resistant to the discrepancies from data. We then propose the mixture of query points, which sufficiently exploits global information for dense small-range indoor scenes and local information for large-range sparse outdoor ones. Furthermore, our proposed decoupled IoU provides an easy-to-optimize training target for localization by disentangling the xy and z space. Extensive experiments validate that Uni3DETR exhibits excellent performance consistently on both indoor and outdoor 3D detection. In contrast to previous specialized detectors, which may perform well on some particular datasets but suffer a substantial degradation on different scenes, Uni3DETR demonstrates the strong generalization ability under heterogeneous conditions (Fig. 1). Codes are available at \href{https://github.com/zhenyuw16/Uni3DETR}{https://github.com/zhenyuw16/Uni3DETR}.
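
One simple way to realize a decoupled IoU for axis-aligned 3D boxes is to compute IoU separately in the ground (xy) plane and along z and then combine the two; the averaging used in the sketch below is an assumption, since the abstract only states that the xy and z spaces are disentangled.

```python
def decoupled_iou(box_a, box_b):
    """Illustrative decoupled IoU for axis-aligned 3D boxes given as
    (x1, y1, z1, x2, y2, z2): IoU in the xy plane and along z, then averaged."""
    def overlap(lo_a, hi_a, lo_b, hi_b):
        return max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))

    ax1, ay1, az1, ax2, ay2, az2 = box_a
    bx1, by1, bz1, bx2, by2, bz2 = box_b
    inter_xy = overlap(ax1, ax2, bx1, bx2) * overlap(ay1, ay2, by1, by2)
    union_xy = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter_xy
    inter_z = overlap(az1, az2, bz1, bz2)
    union_z = (az2 - az1) + (bz2 - bz1) - inter_z
    iou_xy = inter_xy / union_xy if union_xy > 0 else 0.0
    iou_z = inter_z / union_z if union_z > 0 else 0.0
    return 0.5 * (iou_xy + iou_z)
```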

Combining recurrent and residual learning for deforestation monitoring using multitemporal SAR images

  • paper_url: http://arxiv.org/abs/2310.05697
  • repo_url: None
  • paper_authors: Carla Nascimento Neves, Raul Queiroz Feitosa, Mabel X. Ortega Adarme, Gilson Antonio Giraldi
  • for: Improving deforestation detection in the Amazon rainforest using multitemporal Synthetic Aperture Radar (SAR) data, which is unaffected by cloud cover.
  • methods: Three novel recurrent fully convolutional network architectures, RRCNN-1, RRCNN-2, and RRCNN-3, are proposed for deforestation monitoring, replacing bitemporal with multitemporal SAR sequences to improve detection accuracy.
  • results: Experiments on a Sentinel-1 multitemporal sequence show that analyzing a sequence of SAR images reveals deforestation spots undetectable in image pairs; the multitemporal approach improves F1-Score by about five percent across all tested architectures, with RRCNN-1 achieving the highest accuracy at half the processing time of its closest counterpart.
    Abstract With its vast expanse, exceeding that of Western Europe by twice, the Amazon rainforest stands as the largest forest of the Earth, holding immense importance in global climate regulation. Yet, deforestation detection from remote sensing data in this region poses a critical challenge, often hindered by the persistent cloud cover that obscures optical satellite data for much of the year. Addressing this need, this paper proposes three deep-learning models tailored for deforestation monitoring, utilizing SAR (Synthetic Aperture Radar) multitemporal data moved by its independence on atmospheric conditions. Specifically, the study proposes three novel recurrent fully convolutional network architectures-namely, RRCNN-1, RRCNN-2, and RRCNN-3, crafted to enhance the accuracy of deforestation detection. Additionally, this research explores replacing a bitemporal with multitemporal SAR sequences, motivated by the hypothesis that deforestation signs quickly fade in SAR images over time. A comprehensive assessment of the proposed approaches was conducted using a Sentinel-1 multitemporal sequence from a sample site in the Brazilian rainforest. The experimental analysis confirmed that analyzing a sequence of SAR images over an observation period can reveal deforestation spots undetectable in a pair of images. Notably, experimental results underscored the superiority of the multitemporal approach, yielding approximately a five percent enhancement in F1-Score across all tested network architectures. Particularly the RRCNN-1 achieved the highest accuracy and also boasted half the processing time of its closest counterpart.

Climate-sensitive Urban Planning through Optimization of Tree Placements

  • paper_url: http://arxiv.org/abs/2310.05691
  • repo_url: https://github.com/lmb-freiburg/tree-planting
  • paper_authors: Simon Schrodi, Ferdinand Briegel, Max Argus, Andreas Christen, Thomas Brox
  • for: Mitigating heat stress in urban areas through optimal placement of urban trees.
  • methods: Using neural networks to simulate point-wise mean radiant temperatures and an iterated local search framework with tailored adaptations to optimize tree placements.
  • results: Empirical efficacy of the approach across a wide spectrum of study areas and time scales, demonstrating the potential of urban trees to mitigate heat stress.
    Abstract Climate change is increasing the intensity and frequency of many extreme weather events, including heatwaves, which results in increased thermal discomfort and mortality rates. While global mitigation action is undoubtedly necessary, so is climate adaptation, e.g., through climate-sensitive urban planning. Among the most promising strategies is harnessing the benefits of urban trees in shading and cooling pedestrian-level environments. Our work investigates the challenge of optimal placement of such trees. Physical simulations can estimate the radiative and thermal impact of trees on human thermal comfort but induce high computational costs. This rules out optimization of tree placements over large areas and considering effects over longer time scales. Hence, we employ neural networks to simulate the point-wise mean radiant temperatures--a driving factor of outdoor human thermal comfort--across various time scales, spanning from daily variations to extended time scales of heatwave events and even decades. To optimize tree placements, we harness the innate local effect of trees within the iterated local search framework with tailored adaptations. We show the efficacy of our approach across a wide spectrum of study areas and time scales. We believe that our approach is a step towards empowering decision-makers, urban designers and planners to proactively and effectively assess the potential of urban trees to mitigate heat stress.
    摘要 气候变化正在加剧包括热浪在内的多种极端天气事件的强度和频率,导致热不适和死亡率上升。全球减缓行动固然必要,气候适应同样不可或缺,例如通过气候敏感的城市规划。其中最有前景的策略之一是利用城市树木为行人尺度环境提供遮荫和降温。本研究探讨了此类树木的最优布置问题。物理模拟可以估计树木对人体热舒适的辐射和热影响,但计算成本很高,难以在大范围区域和较长时间尺度上优化树木布置。因此,我们使用神经网络来模拟逐点平均辐射温度(户外人体热舒适的主要驱动因素),覆盖从日变化到热浪事件乃至数十年的多种时间尺度。为优化树木布置,我们在迭代局部搜索(iterated local search)框架中利用树木固有的局部效应,并做了针对性的调整。我们在多种研究区域和时间尺度上验证了该方法的有效性。我们相信这一方法有助于决策者、城市设计师和规划者主动且有效地评估城市树木缓解热应激的潜力。
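
The optimization step relies on iterated local search (ILS). The sketch below shows a generic ILS loop over candidate tree positions with a stand-in cost function; in the paper the cost would come from the neural mean-radiant-temperature simulator, and the swap neighbourhood and kick size used here are illustrative choices.

```python
import random

def ils_tree_placement(candidates, n_trees, cost_fn, iters=50, kick=2, seed=0):
    """Iterated local search: repeatedly perturb the current placement and
    re-run a greedy local search, keeping the best solution found."""
    rng = random.Random(seed)
    best = set(rng.sample(candidates, n_trees))

    def local_search(sol):
        improved = True
        while improved:
            improved = False
            for out in list(sol):
                for cand in candidates:
                    if cand in sol:
                        continue
                    new = (sol - {out}) | {cand}      # swap one tree position
                    if cost_fn(new) < cost_fn(sol):
                        sol, improved = new, True
                        break
                if improved:
                    break
        return sol

    best = local_search(best)
    for _ in range(iters):
        kicked = set(best)
        outs = rng.sample(list(kicked), kick)
        ins = rng.sample([c for c in candidates if c not in kicked], kick)
        for out, cand in zip(outs, ins):              # perturbation ("kick")
            kicked.remove(out); kicked.add(cand)
        kicked = local_search(kicked)
        if cost_fn(kicked) < cost_fn(best):
            best = kicked
    return best

# toy usage: place 3 trees on a 1D street, cost = summed distance to nearest tree
spots = list(range(10))
cost = lambda trees: sum(min(abs(s - t) for t in trees) for s in spots)
print(ils_tree_placement(spots, 3, cost, iters=20))
```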

Analysis of Rainfall Variability and Water Extent of Selected Hydropower Reservoir Using Google Earth Engine (GEE): A Case Study from Two Tropical Countries, Sri Lanka and Vietnam

  • paper_url: http://arxiv.org/abs/2310.05682
  • repo_url: None
  • paper_authors: Punsisi Rajakaruna, Surajit Ghosh, Bunyod Holmatov
  • for: 这项研究旨在考察越南和斯里兰卡两个热带季风国家的降雨模式与选定水电水库水域面积之间的关系。
  • methods: 研究使用高分辨率光学影像和Sentinel-1合成孔径雷达(SAR)数据,观测和监测不同天气条件(特别是雨季)下的水体变化;利用CHIRPS数据集分析降雨的时空变化,并推算选定水库的水域面积。
  • results: 结果显示,雨季降雨量的增加会使随后月份的水库水域面积增大,而降雨较少时水域面积相应减小。这些结果揭示了降雨模式对水库水资源的影响,可为两国在水电、防洪和灌溉等方面的决策提供参考。
    Abstract This study presents a comprehensive remote sensing analysis of rainfall patterns and selected hydropower reservoir water extent in two tropical monsoon countries, Vietnam and Sri Lanka. The aim is to understand the relationship between remotely sensed rainfall data and the dynamic changes (monthly) in reservoir water extent. The analysis utilizes high-resolution optical imagery and Sentinel-1 Synthetic Aperture Radar (SAR) data to observe and monitor water bodies during different weather conditions, especially during the monsoon season. The average annual rainfall for both countries is determined, and spatiotemporal variations in monthly average rainfall are examined at regional and reservoir basin levels using the Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) dataset from 1981 to 2022. Water extents are derived for selected reservoirs using Sentinel-1 SAR Ground Range Detected (GRD) images in Vietnam and Sri Lanka from 2017 to 2022. The images are pre-processed and corrected using terrain correction and refined Lee filter. An automated thresholding algorithm, OTSU, distinguishes water and land, taking advantage of both VV and VH polarization data. The connected pixel count threshold is applied to enhance result accuracy. The results indicate a clear relationship between rainfall patterns and reservoir water extent, with increased precipitation during the monsoon season leading to higher water extents in the later months. This study contributes to understanding how rainfall variability impacts reservoir water resources in tropical monsoon regions. The preliminary findings can inform water resource management strategies and support these countries' decision-making processes related to hydropower generation, flood management, and irrigation.
    摘要 本研究对越南和斯里兰卡两个热带季风国家的降雨模式与选定水电水库水域面积进行了全面的遥感分析,目的是了解遥感降雨数据与水库水域面积月度变化之间的关系。研究使用高分辨率光学影像和Sentinel-1合成孔径雷达(SAR)数据,观测和监测不同天气条件(特别是雨季)下的水体。研究基于1981年至2022年的CHIRPS数据集计算了两国的年平均降雨量,并在区域和水库流域尺度上分析了月平均降雨的时空变化。水域面积则由2017年至2022年越南和斯里兰卡的Sentinel-1 SAR Ground Range Detected(GRD)影像推算。影像经过地形校正和改进Lee滤波预处理,随后采用OTSU自动阈值算法,结合VV和VH两种极化数据区分水体与陆地,并应用连通像元数阈值以提高结果准确性。结果表明,降雨模式与水库水域面积之间存在明显的关系:雨季降雨量增加会导致随后月份水域面积增大。本研究有助于理解降雨变率如何影响热带季风地区的水库水资源,初步结果可为两国的水资源管理以及水电、防洪和灌溉相关决策提供参考。
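
The water/land separation step uses Otsu's automatic threshold on SAR backscatter. A small NumPy sketch of Otsu's method on toy VV backscatter values is given below; the study additionally combines VV and VH polarisations and applies a connected-pixel-count filter, which this example omits.

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Otsu's method: pick the threshold that maximises between-class variance."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    hist = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                       # weight of the low (water) class
    w1 = 1.0 - w0                              # weight of the high (land) class
    cum_mean = np.cumsum(hist * centers)
    mu0 = cum_mean / np.maximum(w0, 1e-12)
    mu1 = (cum_mean[-1] - cum_mean) / np.maximum(w1, 1e-12)
    between = w0 * w1 * (mu0 - mu1) ** 2       # between-class variance per split
    return centers[np.argmax(between[:-1])]    # ignore the degenerate last bin

# toy SAR backscatter in dB: dark water (~-20 dB) vs brighter land (~-8 dB)
vv = np.concatenate([np.random.normal(-20, 1.5, 5000), np.random.normal(-8, 2.0, 5000)])
t = otsu_threshold(vv)
water_mask = vv < t                            # low backscatter -> water
print(round(float(t), 2), water_mask.mean())
```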

Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection

  • paper_url: http://arxiv.org/abs/2310.05666
  • repo_url: https://github.com/yilonglv/aid
  • paper_authors: Yilong Lv, Min Li, Yujie He, Shaopeng Li, Zhuzhen He, Aitao Yang
  • for: 本文提出了一种新的检测推理策略——盒子解耦-耦合(Box Decouple-Couple, BDC),以提高目标检测中边界框的定位精度。
  • methods: 该方法不再把每个边界框当作独立个体,而是让多个重叠的框共同作用:先将这些框的角点解耦,再根据角点得分将其重新耦合成更准确的预测框。为配合BDC策略,作者设计了一个简单而新颖的模型——锚点-中间检测器(Anchor-Intermediate Detector, AID),它包含两个头网络:一个基于锚点的检测头和一个无锚点的角点感知头(corner-aware head)。
  • results: 在 MS COCO test-dev 数据集上,该模型在不添加任何额外技巧的情况下,分别比基线 RetinaNet 和 GFL 方法提升约 $\sim$2.4 和 $\sim$1.2 AP。
    Abstract Anchor-based detectors have been continuously developed for object detection. However, the individual anchor box makes it difficult to predict the boundary's offset accurately. Instead of taking each bounding box as a closed individual, we consider using multiple boxes together to get prediction boxes. To this end, this paper proposes the \textbf{Box Decouple-Couple(BDC) strategy} in the inference, which no longer discards the overlapping boxes, but decouples the corner points of these boxes. Then, according to each corner's score, we couple the corner points to select the most accurate corner pairs. To meet the BDC strategy, a simple but novel model is designed named the \textbf{Anchor-Intermediate Detector(AID)}, which contains two head networks, i.e., an anchor-based head and an anchor-free \textbf{Corner-aware head}. The corner-aware head is able to score the corners of each bounding box to facilitate the coupling between corner points. Extensive experiments on MS COCO show that the proposed anchor-intermediate detector respectively outperforms their baseline RetinaNet and GFL method by $\sim$2.4 and $\sim$1.2 AP on the MS COCO test-dev dataset without any bells and whistles. Code is available at: https://github.com/YilongLv/AID.
    摘要 基于锚点的检测器在目标检测中不断发展,但单个锚框难以准确预测边界的偏移量。与其把每个边界框视为封闭的个体,我们考虑让多个框共同作用来得到预测框。为此,本文在推理阶段提出了盒子解耦-耦合(Box Decouple-Couple, BDC)策略:不再丢弃重叠的框,而是将这些框的角点解耦,再根据每个角点的得分将角点耦合,选出最准确的角点对。为配合BDC策略,我们设计了一个简单而新颖的模型——锚点-中间检测器(Anchor-Intermediate Detector, AID),其包含两个头网络,即基于锚点的检测头和无锚点的角点感知头(corner-aware head)。角点感知头能够为每个边界框的角点打分,从而便于角点之间的耦合。在 MS COCO test-dev 数据集上的大量实验表明,所提出的检测器在不添加任何额外技巧的情况下,分别比基线 RetinaNet 和 GFL 方法提升约 $\sim$2.4 和 $\sim$1.2 AP。代码见:https://github.com/YilongLv/AID。
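
The decouple-couple idea can be illustrated with a few lines of NumPy: pool the corners of the overlapping boxes, then couple the highest-scoring top-left and bottom-right corners into one refined box. The per-corner scores are assumed to come from the corner-aware head, and the exact coupling rule in AID may differ from this simple argmax.

```python
import numpy as np

def decouple_couple(boxes, tl_scores, br_scores):
    """Box Decouple-Couple sketch: pool the top-left and bottom-right corners of
    overlapping boxes, then couple the best-scoring pair into one refined box."""
    tl = boxes[:, :2]                      # (N, 2) top-left corners
    br = boxes[:, 2:]                      # (N, 2) bottom-right corners
    best_tl = tl[np.argmax(tl_scores)]
    best_br = br[np.argmax(br_scores)]
    return np.concatenate([best_tl, best_br])

boxes = np.array([[10, 12, 50, 60], [8, 10, 52, 58], [11, 11, 49, 62]], dtype=float)
tl_scores = np.array([0.7, 0.9, 0.6])      # assumed corner-aware head outputs
br_scores = np.array([0.5, 0.4, 0.8])
print(decouple_couple(boxes, tl_scores, br_scores))   # -> [ 8. 10. 49. 62.]
```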

Exploiting Manifold Structured Data Priors for Improved MR Fingerprinting Reconstruction

  • paper_url: http://arxiv.org/abs/2310.05647
  • repo_url: None
  • paper_authors: Peng Li, Yuping Ji, Yue Hu
  • for: 解决从高度欠采样测量中高精度重建磁共振指纹(MRF)数据的问题
  • methods: 提出一种基于流形结构数据先验的新型MRF重建框架,将组织参数建模为低维参数流形上的点,并利用指纹流形与参数流形之间的对应关系来提升重建性能
  • results: 实验结果显示,该方法在非笛卡尔采样场景下既缩短了计算时间,又取得了优于现有方法的重建性能
    Abstract Estimating tissue parameter maps with high accuracy and precision from highly undersampled measurements presents one of the major challenges in MR fingerprinting (MRF). Many existing works project the recovered voxel fingerprints onto the Bloch manifold to improve reconstruction performance. However, little research focuses on exploiting the latent manifold structure priors among fingerprints. To fill this gap, we propose a novel MRF reconstruction framework based on manifold structured data priors. Since it is difficult to directly estimate the fingerprint manifold structure, we model the tissue parameters as points on a low-dimensional parameter manifold. We reveal that the fingerprint manifold shares the same intrinsic topology as the parameter manifold, although being embedded in different Euclidean spaces. To exploit the non-linear and non-local redundancies in MRF data, we divide the MRF data into spatial patches, and the similarity measurement among data patches can be accurately obtained using the Euclidean distance between the corresponding patches in the parameter manifold. The measured similarity is then used to construct the graph Laplacian operator, which represents the fingerprint manifold structure. Thus, the fingerprint manifold structure is introduced in the reconstruction framework by using the low-dimensional parameter manifold. Additionally, we incorporate the locally low-rank prior in the reconstruction framework to further utilize the local correlations within each patch for improved reconstruction performance. We also adopt a GPU-accelerated NUFFT library to accelerate reconstruction in non-Cartesian sampling scenarios. Experimental results demonstrate that our method can achieve significantly improved reconstruction performance with reduced computational time over the state-of-the-art methods.
    摘要 从高度欠采样的测量中以高准确度和高精度估计组织参数图,是磁共振指纹(MRF)成像的主要挑战之一。许多现有工作将恢复出的体素指纹投影到Bloch流形上以提升重建性能,但很少有研究关注指纹之间潜在的流形结构先验。为填补这一空白,我们提出了一种基于流形结构数据先验的新型MRF重建框架。由于难以直接估计指纹流形的结构,我们将组织参数建模为低维参数流形上的点。我们发现,尽管嵌入在不同的欧氏空间中,指纹流形与参数流形具有相同的内在拓扑结构。为利用MRF数据中的非线性与非局部冗余,我们将MRF数据划分为空间块,数据块之间的相似度可以通过参数流形中对应块之间的欧氏距离准确获得;随后利用该相似度构建图拉普拉斯算子,以表示指纹流形结构。这样,指纹流形结构便借助低维参数流形被引入重建框架。此外,我们在重建框架中引入局部低秩先验,进一步利用每个块内部的局部相关性来提升重建性能,并采用GPU加速的NUFFT库来加速非笛卡尔采样场景下的重建。实验结果表明,与现有最优方法相比,我们的方法在显著缩短计算时间的同时取得了明显更好的重建性能。
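
The manifold-structured prior is encoded through a graph Laplacian built from patch similarities measured in the low-dimensional parameter manifold. The sketch below builds such a Laplacian from pairwise Euclidean distances with a k-nearest-neighbour Gaussian kernel; the parameter dimensionality, neighbourhood size, and kernel width are illustrative assumptions.

```python
import numpy as np

def patch_graph_laplacian(params, k=4, sigma=1.0):
    """Graph Laplacian over spatial patches, with edge weights derived from
    Euclidean distances between the patches' tissue-parameter representations."""
    n = params.shape[0]
    d2 = ((params[:, None, :] - params[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]              # k nearest patches (skip self)
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                             # symmetrise the graph
    L = np.diag(W.sum(1)) - W                          # combinatorial Laplacian
    return L

patches = np.random.rand(50, 3)   # e.g. 50 patches with (T1, T2, PD)-like parameters (toy)
L = patch_graph_laplacian(patches)
print(L.shape, np.allclose(L, L.T), np.abs(L.sum(1)).max() < 1e-9)
```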

Diagnosing Catastrophe: Large parts of accuracy loss in continual learning can be accounted for by readout misalignment

  • paper_url: http://arxiv.org/abs/2310.05644
  • repo_url: None
  • paper_authors: Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann
  • for: This paper investigates the representational changes that underlie the phenomenon of catastrophic forgetting in artificial neural networks (ANNs) when trained on changing data distributions.
  • methods: The paper uses a combination of theoretical analysis and experimental studies to identify the three distinct processes that contribute to catastrophic forgetting.
  • results: The study finds that the largest component of catastrophic forgetting is a misalignment between hidden representations and readout layers, which causes internal representations to shift. Additionally, the study shows that representational geometry is partially conserved under this misalignment, but a small part of the information is irrecoverably lost. The findings have implications for deep learning applications that need to be continuously updated.
  • for: 这篇论文研究人工神经网络(ANN)在数据分布变化训练下的表达变化,以解释它们快速忘记旧任务的现象。
  • methods: 论文使用理论分析和实验研究,确定快速忘记的三种主要过程。
  • results: 研究发现,快速忘记的主要组成部分是隐藏层和输出层之间的不同,导致内部表示的变化。此外,研究还发现,表示的几何结构在这种不同下仍有一定保留,但有一小部分信息丢失不可回收。这些发现对深度学习应用,需要不断更新,有益。
    Abstract Unlike primates, training artificial neural networks on changing data distributions leads to a rapid decrease in performance on old tasks. This phenomenon is commonly referred to as catastrophic forgetting. In this paper, we investigate the representational changes that underlie this performance decrease and identify three distinct processes that together account for the phenomenon. The largest component is a misalignment between hidden representations and readout layers. Misalignment occurs due to learning on additional tasks and causes internal representations to shift. Representational geometry is partially conserved under this misalignment and only a small part of the information is irrecoverably lost. All types of representational changes scale with the dimensionality of hidden representations. These insights have implications for deep learning applications that need to be continuously updated, but may also aid aligning ANN models to the rather robust biological vision.
    摘要 与灵长类不同,人工神经网络在不断变化的数据分布上训练时,旧任务上的性能会迅速下降,这一现象通常被称为灾难性遗忘。本文研究了导致这种性能下降的表征变化,并识别出共同造成该现象的三个不同过程。其中最大的组成部分是隐藏层表征与读出层之间的错位:在额外任务上学习会使内部表征发生偏移,从而产生错位。在这种错位下,表征几何结构得到部分保留,只有一小部分信息被不可恢复地丢失。各类表征变化的幅度都随隐藏表征的维度而变化。这些发现对需要持续更新的深度学习应用具有启示意义,也有助于将人工神经网络模型与相对稳健的生物视觉对齐。
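
The central claim, that much of the accuracy loss reflects readout misalignment rather than lost information, can be probed with a simple diagnostic: train on task A, continue on task B, then refit only a linear readout on the frozen, shifted features of task A. The toy script below follows that recipe on synthetic data; the architecture and data are placeholders and the printed numbers are only illustrative.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)

def task(offset, n=600):                     # toy 2-class task
    x = torch.randn(n, 20); y = (torch.rand(n) > 0.5).long()
    x[y == 1, :3] += offset
    return x, y

xa, ya = task(3.0)                           # task A
xb, yb = task(-3.0)                          # task B

trunk = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
head_a, head_b = nn.Linear(64, 2), nn.Linear(64, 2)

def train(params, head, x, y, steps=200):
    opt = torch.optim.Adam(params, lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(head(trunk(x)), y).backward()
        opt.step()

acc = lambda logits, y: (logits.argmax(1) == y).float().mean().item()

train(list(trunk.parameters()) + list(head_a.parameters()), head_a, xa, ya)   # learn A
print("A before B:", acc(head_a(trunk(xa)), ya))
train(list(trunk.parameters()) + list(head_b.parameters()), head_b, xb, yb)   # learn B
print("A with stale readout:", acc(head_a(trunk(xa)), ya))                    # "forgetting"

# Refit only the readout on the shifted (frozen) features: if most accuracy
# returns, the loss was largely readout misalignment rather than lost information.
with torch.no_grad():
    feats = trunk(xa).numpy()
probe = LogisticRegression(max_iter=2000).fit(feats, ya.numpy())
print("A after readout realignment:", probe.score(feats, ya.numpy()))
```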

High Accuracy and Cost-Saving Active Learning 3D WD-UNet for Airway Segmentation

  • paper_url: http://arxiv.org/abs/2310.05638
  • repo_url: None
  • paper_authors: Shiyi Wang, Yang Nan, Simon Walsh, Guang Yang
  • for: 降低医疗3D计算机断层成像(CT)分割的标注努力。
  • methods: 提出了一种新的深度主动学习(DeepAL)模型——3D Wasserstein Discriminative UNet(WD-UNet),以半监督方式学习并加速学习收敛,使其预测指标达到或超过监督学习模型。
  • results: 在3D肺气道CT扫描图像的医学分割任务中,将不确定性度量作为查询策略的输入,可以获得比3D UNet和3D CEUNet等最先进的监督深度学习模型更准确的预测结果。与这些监督方法相比,WD-UNet不仅节省了放射科医生的标注成本,也节省了计算资源:它仅使用35%的标注数据即可取得更好的预测指标。
    Abstract We propose a novel Deep Active Learning (DeepAL) model-3D Wasserstein Discriminative UNet (WD-UNet) for reducing the annotation effort of medical 3D Computed Tomography (CT) segmentation. The proposed WD-UNet learns in a semi-supervised way and accelerates learning convergence to meet or exceed the prediction metrics of supervised learning models. Our method can be embedded with different Active Learning (AL) strategies and different network structures. The model is evaluated on 3D lung airway CT scans for medical segmentation and show that the use of uncertainty metric, which is parametrized as an input of query strategy, leads to more accurate prediction results than some state-of-the-art Deep Learning (DL) supervised models, e.g.,3DUNet and 3D CEUNet. Compared to the above supervised DL methods, our WD-UNet not only saves the cost of annotation for radiologists but also saves computational resources. WD-UNet uses a limited amount of annotated data (35% of the total) to achieve better predictive metrics with a more efficient deep learning model algorithm.
    摘要 我们提出了一种新的深度主动学习(DeepAL)模型——3D Wasserstein Discriminative UNet(WD-UNet),用于降低医学3D计算机断层扫描(CT)分割的标注成本。WD-UNet以半监督方式学习,并加速学习收敛,使其预测指标达到或超过监督学习模型。该方法可以与不同的主动学习(AL)策略和不同的网络结构结合使用。我们在3D肺气道CT扫描图像的医学分割任务上对模型进行了评估,结果表明,将不确定性度量作为查询策略的输入,可以获得比3D UNet和3D CEUNet等最先进的监督深度学习(DL)模型更准确的预测结果。与上述监督DL方法相比,WD-UNet不仅为放射科医生节省了标注成本,也节省了计算资源:它仅使用有限的标注数据(总量的35%),即可凭借更高效的深度学习算法取得更好的预测指标。
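
A generic uncertainty-driven query step of such a deep active learning loop can be sketched as follows: score each unlabeled scan by its mean voxel-wise predictive entropy and request annotation for the top-k scans. WD-UNet's actual acquisition strategy may combine further criteria; this snippet only shows the entropy-based variant.

```python
import torch

def entropy_query(prob_maps, k):
    """Pick the k unlabeled scans with the highest mean voxel-wise predictive
    entropy - the uncertainty-driven query step of a DeepAL loop."""
    eps = 1e-8
    ent = -(prob_maps * (prob_maps + eps).log()).sum(dim=1)   # (N, D, H, W) entropy
    scores = ent.flatten(1).mean(dim=1)                       # one score per scan
    return scores.topk(k).indices

# toy pool: 10 scans, softmax probabilities over 2 classes on a 16^3 grid
logits = torch.randn(10, 2, 16, 16, 16)
probs = logits.softmax(dim=1)
print(entropy_query(probs, k=3))
```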

Locality-Aware Generalizable Implicit Neural Representation

  • paper_url: http://arxiv.org/abs/2310.05624
  • repo_url: None
  • paper_authors: Doyup Lee, Chiheon Kim, Minsu Cho, Wook-Shin Han
  • for: 这 paper 的目的是提出一种新的普适隐藏表示(INR)框架,以便单个连续函数可以表示多个数据实例。
  • methods: 该框架结合了一个Transformer编码器和一个具有局部感知能力的INR解码器。Transformer编码器从数据实例中预测一组潜在token,将局部信息编码到每个token中;INR解码器则针对坐标输入通过交叉注意力有选择地聚合这些潜在token,并以由粗到细的多频段调制逐步解码得到输出。
  • results: 该框架显著超越了以往的可泛化INR,并验证了局部感知潜在表示对图像生成等下游任务的有用性。
    Abstract Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.
    摘要 通用隐藏神经表示(INR)可以使一个连续函数,即坐标基本神经网络,表示多个数据实例。通过调整其权重或中间特征使用隐藏代码来模块化其权重。然而,现有的模块化表达能力有限,因为它无法当地化和捕捉数据实体细节,如特定像素和射线。为解决这个问题,我们提出了一种新的框架 для通用INR,该框架将transformer编码器与本地性感知INR解码器结合。transformer编码器预测一个数据实例的latent token集,从而对每个latent token进行本地信息编码。本地性感知INR解码器通过对坐标输入进行交叉注意力选择性聚合latent token,然后逐渐解码为多个频率带宽,进行进一步的坐标解码和特征修饰。这种选择性的token聚合和多频特征修饰使得我们可以学习本地特征表示,并在空间和频率方面进行本地化。我们的框架在前期INR的表达能力方面显著超越了现有的方法,并证明了隐藏代码的有用性 для下游任务,如图像生成。
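
The selective token aggregation can be pictured as a cross-attention in which coordinate queries attend over the latent tokens to obtain per-coordinate modulation vectors. Below is a minimal single-head version; positional encodings, multi-head attention and the coarse-to-fine multi-band decoding of the full model are omitted.

```python
import torch
import torch.nn as nn

class CoordCrossAttention(nn.Module):
    """Sketch of locality-aware modulation: a coordinate embedding queries the
    latent tokens via cross-attention and returns a per-coordinate modulation."""
    def __init__(self, coord_dim=2, token_dim=64):
        super().__init__()
        self.q = nn.Linear(coord_dim, token_dim)
        self.k = nn.Linear(token_dim, token_dim)
        self.v = nn.Linear(token_dim, token_dim)

    def forward(self, coords, tokens):          # coords: (B, N, 2), tokens: (B, M, D)
        q, k, v = self.q(coords), self.k(tokens), self.v(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                          # (B, N, D) modulation vectors

mod = CoordCrossAttention()(torch.rand(1, 128, 2), torch.randn(1, 16, 64))
print(mod.shape)   # torch.Size([1, 128, 64])
```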

ASM: Adaptive Sample Mining for In-The-Wild Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.05618
  • repo_url: None
  • paper_authors: Ziyang Zhang, Xiao Sun, Liuwei An, Meng Wang
  • for: 提高 facial expression recognition (FER) 数据集中表达不确定性和标注错误的问题。
  • methods: 提出了一种名为 Adaptive Sample Mining (ASM) 的新方法,通过动态地处理每个表达类别中的不确定性和噪声来解决这个问题。
  • results: 经验证明,该方法可以有效地挖掘不确定性和噪声,并在 synthetic noisy 和原始数据集上超越 State-of-the-Art (SOTA) 方法。
    Abstract Given the similarity between facial expression categories, the presence of compound facial expressions, and the subjectivity of annotators, facial expression recognition (FER) datasets often suffer from ambiguity and noisy labels. Ambiguous expressions are challenging to differentiate from expressions with noisy labels, which hurt the robustness of FER models. Furthermore, the difficulty of recognition varies across different expression categories, rendering a uniform approach unfair for all expressions. In this paper, we introduce a novel approach called Adaptive Sample Mining (ASM) to dynamically address ambiguity and noise within each expression category. First, the Adaptive Threshold Learning module generates two thresholds, namely the clean and noisy thresholds, for each category. These thresholds are based on the mean class probabilities at each training epoch. Next, the Sample Mining module partitions the dataset into three subsets: clean, ambiguity, and noise, by comparing the sample confidence with the clean and noisy thresholds. Finally, the Tri-Regularization module employs a mutual learning strategy for the ambiguity subset to enhance discrimination ability, and an unsupervised learning strategy for the noise subset to mitigate the impact of noisy labels. Extensive experiments prove that our method can effectively mine both ambiguity and noise, and outperform SOTA methods on both synthetic noisy and original datasets. The supplement material is available at https://github.com/zzzzzzyang/ASM.
    摘要 由于表情表达category之间的相似性,合成表情和评分器的 Subjectivity, facial expression recognition(FER)数据集经常受到模糊和噪声的影响。模糊的表情和噪声标签降低FER模型的稳定性。此外,不同表情类别的识别难度不同,这使得一种一致的方法不公平对所有表情。在这篇论文中,我们提出了一种新的方法called Adaptive Sample Mining(ASM),用于动态地处理表情中的模糊和噪声。首先,Adaptive Threshold Learning模块生成了每个类别的两个阈值:净和噪声阈值。这些阈值基于每个训练epoch的类别mean probability。接着,Sample Mining模块将数据集分为三个子集:净、模糊和噪声,根据样本信息与净和噪声阈值进行比较。最后,Tri-Regularization模块使用了一种互学习策略来增强模糊子集的推理能力,并使用了一种无监督学习策略来抑制噪声标签的影响。我们的方法能够有效地挖掘模糊和噪声,并在 synthetic noisy和原始数据集上超越了State-of-the-Art(SOTA)方法。详细实验结果可以在https://github.com/zzzzzzyang/ASM中找到。
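
The Adaptive Threshold Learning and Sample Mining steps amount to deriving per-class clean/noisy thresholds from the model's own confidence statistics and splitting the data three ways. The sketch below uses a fixed margin around the per-class mean confidence as the two thresholds; the actual ASM thresholds are epoch-dependent, so this form is an assumption.

```python
import numpy as np

def partition_samples(probs, labels, margin=0.1):
    """ASM-style split: per-class clean/noisy thresholds derived from the mean
    predicted probability of that class, then a three-way partition."""
    clean, ambiguous, noisy = [], [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        conf = probs[idx, c]                      # confidence on the annotated class
        t_clean = conf.mean() + margin            # assumed form of the clean threshold
        t_noisy = conf.mean() - margin            # assumed form of the noisy threshold
        clean += list(idx[conf >= t_clean])
        noisy += list(idx[conf < t_noisy])
        ambiguous += list(idx[(conf >= t_noisy) & (conf < t_clean)])
    return clean, ambiguous, noisy

probs = np.random.dirichlet(np.ones(7), size=100)   # 7 expression classes (toy)
labels = np.random.randint(0, 7, size=100)
c, a, n = partition_samples(probs, labels)
print(len(c), len(a), len(n))
```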

Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care Environments

  • paper_url: http://arxiv.org/abs/2310.05600
  • repo_url: https://github.com/m-g-a/care3d
  • paper_authors: Michael G. Adam, Sebastian Eger, Martin Piccolrovazzi, Maged Iskandar, Joern Vogel, Alexander Dietrich, Seongjien Bien, Jon Skerlj, Abdeldjallil Naceri, Eckehard Steinbach, Alin Albu-Schaeffer, Sami Haddadin, Wolfram Burgard
  • for: 为了弥补医疗领域人才匮乏的问题,这篇短文提供了一个帮助Robotics开发的注释数据集。
  • methods: 本文使用了真实环境中的捕捉数据,以及一个医疗机器人内部的真实环境进行描述。
  • results: 本文提供了一个可靠的SLAM算法评估数据集,以便在医疗机器人上运行SLAM算法。
    Abstract As labor shortage increases in the health sector, the demand for assistive robotics grows. However, the needed test data to develop those robots is scarce, especially for the application of active 3D object detection, where no real data exists at all. This short paper counters this by introducing such an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We further provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.
    摘要 随着医疗领域劳动力短缺的加剧,对辅助机器人的需求也在增长。然而,开发这类机器人所需的测试数据十分匮乏,尤其是在主动3D目标检测方面,目前完全没有真实数据。本短文针对这一问题,发布了一个由真实环境构成的标注数据集。所采集的环境对应于机器人护理研究领域中已在实际使用的区域。我们还提供了一个房间内的真值数据,用于评估直接运行在护理机器人上的SLAM算法。

Perceptual Artifacts Localization for Image Synthesis Tasks

  • paper_url: http://arxiv.org/abs/2310.05590
  • repo_url: https://github.com/open-mmlab/mmsegmentation
  • paper_authors: Lingzhi Zhang, Zhengjie Xu, Connelly Barnes, Yuqian Zhou, Qing Liu, He Zhang, Sohrab Amirghodsi, Zhe Lin, Eli Shechtman, Jianbo Shi
  • for: 这个论文主要是为了研究图像生成模型中的感知缺陷,以及如何自动修复这些缺陷。
  • methods: 该论文使用了一种新的数据集,并提出了一种基于分割模型的感知缺陷地图生成方法。
  • results: 该论文的实验结果显示,该方法可以有效地检测和修复图像生成中的感知缺陷,并且可以适应不同的图像生成模型。
    Abstract Recent advancements in deep generative models have facilitated the creation of photo-realistic images across various tasks. However, these generated images often exhibit perceptual artifacts in specific regions, necessitating manual correction. In this study, we present a comprehensive empirical examination of Perceptual Artifacts Localization (PAL) spanning diverse image synthesis endeavors. We introduce a novel dataset comprising 10,168 generated images, each annotated with per-pixel perceptual artifact labels across ten synthesis tasks. A segmentation model, trained on our proposed dataset, effectively localizes artifacts across a range of tasks. Additionally, we illustrate its proficiency in adapting to previously unseen models using minimal training samples. We further propose an innovative zoom-in inpainting pipeline that seamlessly rectifies perceptual artifacts in the generated images. Through our experimental analyses, we elucidate several practical downstream applications, such as automated artifact rectification, non-referential image quality evaluation, and abnormal region detection in images. The dataset and code are released.
    摘要 深度生成模型的最新进展使得在各种任务中生成照片级逼真的图像成为可能,但这些生成图像往往在特定区域出现感知伪影,需要手动修正。在这项研究中,我们对感知伪影定位(PAL)进行了覆盖多种图像生成任务的全面实证研究。我们提出了一个新的数据集,包含10,168张生成图像,每张图像在十类生成任务上均带有逐像素的感知伪影标注。在该数据集上训练的分割模型能够在多种任务中有效地定位伪影;我们还展示了它仅需极少的训练样本即可适应此前未见过的生成模型。此外,我们提出了一种新颖的放大修补(zoom-in inpainting)流程,可以无缝修正生成图像中的感知伪影。通过实验分析,我们阐述了若干实用的下游应用,例如伪影自动修正、非参考图像质量评估以及图像异常区域检测。数据集和代码均已发布。

A review of uncertainty quantification in medical image analysis: probabilistic and non-probabilistic methods

  • paper_url: http://arxiv.org/abs/2310.06873
  • repo_url: None
  • paper_authors: Ling Huang, Su Ruan, Yucheng Xing, Mengling Feng
  • for: 这种评估机器学习模型可靠性的方法,可以帮助医生更好地理解和accept机器学习模型的结果。
  • methods: 这篇文章介绍了多种评估机器学习模型的不确定性方法,包括概率方法和非概率方法。
  • results: 文章提供了一个全面的审视,涵盖了各种医学图像任务中机器学习模型的不确定性评估方法。
    Abstract The comprehensive integration of machine learning healthcare models within clinical practice remains suboptimal, notwithstanding the proliferation of high-performing solutions reported in the literature. A predominant factor hindering widespread adoption pertains to an insufficiency of evidence affirming the reliability of the aforementioned models. Recently, uncertainty quantification methods have been proposed as a potential solution to quantify the reliability of machine learning models and thus increase the interpretability and acceptability of the result. In this review, we offer a comprehensive overview of prevailing methods proposed to quantify uncertainty inherent in machine learning models developed for various medical image tasks. Contrary to earlier reviews that exclusively focused on probabilistic methods, this review also explores non-probabilistic approaches, thereby furnishing a more holistic survey of research pertaining to uncertainty quantification for machine learning models. Analysis of medical images with the summary and discussion on medical applications and the corresponding uncertainty evaluation protocols are presented, which focus on the specific challenges of uncertainty in medical image analysis. We also highlight some potential future research work at the end. Generally, this review aims to allow researchers from both clinical and technical backgrounds to gain a quick and yet in-depth understanding of the research in uncertainty quantification for medical image analysis machine learning models.
    摘要 文章概要:随着机器学习模型在医疗实践中的普及,其完整性仍然受到限制,尽管文献中报道了高性能解决方案。主要阻碍广泛采用的因素是证据不足,证明机器学习模型的可靠性。最近,不确定性评估方法得到了关注,以量化机器学习模型中的不确定性。本文提供了各种不确定性评估方法的总览,包括非概率方法,以提供更全面的研究报告。分析医疗影像的总结和讨论,以及相应的不确定性评估协议,将注重医疗影像分析中的特殊挑战。此外,我们还提出了未来研究的可能性。本文的目标是为医疗和技术背景的研究人员提供快速但深入了解机器学习模型在医疗影像分析中的不确定性评估研究。

A Simple and Robust Framework for Cross-Modality Medical Image Segmentation applied to Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.05572
  • repo_url: https://github.com/matteo-bastico/mi-seg
  • paper_authors: Matteo Bastico, David Ryckelynck, Laurent Corté, Yannick Tillier, Etienne Decencière
  • for: This paper aims to address the challenge of cross-modality image segmentation, where a single model can perform well on multiple types of images.
  • methods: The proposed method uses a simple framework that adapts normalization layers based on the input type, trained with non-registered interleaved mixed data.
  • results: The proposed method outperforms other cross-modality segmentation methods on the Multi-Modality Whole Heart Segmentation Challenge, with an improvement of up to 6.87% in Dice accuracy. Additionally, the proposed Conditional Vision Transformer (C-ViT) encoder brings significant improvements to the resulting segmentation.
    Abstract When it comes to clinical images, automatic segmentation has a wide variety of applications and a considerable diversity of input domains, such as different types of Magnetic Resonance Images (MRIs) and Computerized Tomography (CT) scans. This heterogeneity is a challenge for cross-modality algorithms that should equally perform independently of the input image type fed to them. Often, segmentation models are trained using a single modality, preventing generalization to other types of input data without resorting to transfer learning techniques. Furthermore, the multi-modal or cross-modality architectures proposed in the literature frequently require registered images, which are not easy to collect in clinical environments, or need additional processing steps, such as synthetic image generation. In this work, we propose a simple framework to achieve fair image segmentation of multiple modalities using a single conditional model that adapts its normalization layers based on the input type, trained with non-registered interleaved mixed data. We show that our framework outperforms other cross-modality segmentation methods, when applied to the same 3D UNet baseline model, on the Multi-Modality Whole Heart Segmentation Challenge. Furthermore, we define the Conditional Vision Transformer (C-ViT) encoder, based on the proposed cross-modality framework, and we show that it brings significant improvements to the resulting segmentation, up to 6.87\% of Dice accuracy, with respect to its baseline reference. The code to reproduce our experiments and the trained model weights are available at https://github.com/matteo-bastico/MI-Seg.
    摘要 当来到临床图像时,自动分割有很多应用场景和各种输入领域的多样性,如不同类型的核磁共振成像(MRI)和计算机成像(CT)扫描。这种多样性是跨模态算法的挑战,这些算法需要独立于输入图像类型来运行。常见的分割模型在训练时使用单一模态,这使得它们无法泛化到其他输入数据类型,需要使用传输学习技术。此外,在文献中提出的多模态或跨模态架构常需要已经注册的图像,这些图像在临床环境中很难获得,或需要额外的处理步骤,如生成synthetic图像。在这种工作中,我们提出了一个简单的框架,可以实现多模态图像分割的公平性,使用单个条件模型,该模型根据输入类型调整正规化层。我们表明,我们的框架在与同一个3D UNet基础模型进行比较时,已经超越了其他跨模态分割方法。此外,我们定义了 Conditional Vision Transformer(C-ViT)Encoder,基于我们的跨模态框架,并证明它对结果的分割带来了显著改进,达到6.87%的Dice准确率,相比其参照基eline。我们在GitHub上提供了实验代码和训练模型参数,请参考https://github.com/matteo-bastico/MI-Seg。
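
The key mechanism, normalization layers that adapt to the input type, can be sketched as a conditional normalization whose affine parameters are indexed by a modality id. The choice of instance normalization and the embedding-based parameterization below are assumptions for illustration, not the exact layers used in C-ViT.

```python
import torch
import torch.nn as nn

class ModalityConditionalNorm(nn.Module):
    """Normalisation whose affine parameters are selected by the input modality
    (e.g. 0 = MRI, 1 = CT), so one network can serve several modalities."""
    def __init__(self, channels, num_modalities=2):
        super().__init__()
        self.norm = nn.InstanceNorm3d(channels, affine=False)
        self.gamma = nn.Embedding(num_modalities, channels)
        self.beta = nn.Embedding(num_modalities, channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, modality):              # x: (B, C, D, H, W), modality: (B,) long
        g = self.gamma(modality)[:, :, None, None, None]
        b = self.beta(modality)[:, :, None, None, None]
        return g * self.norm(x) + b

x = torch.randn(2, 8, 4, 16, 16)
print(ModalityConditionalNorm(8)(x, torch.tensor([0, 1])).shape)
```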

M3FPolypSegNet: Segmentation Network with Multi-frequency Feature Fusion for Polyp Localization in Colonoscopy Images

  • paper_url: http://arxiv.org/abs/2310.05538
  • repo_url: None
  • paper_authors: Ju-Hyeon Nam, Seo-Hyeong Park, Nur Suriza Syazwany, Yerim Jung, Yu-Han Im, Sang-Chul Lee
  • for: 通过自动分割息肉来降低结直肠癌的风险。
  • methods: 使用深度学习,具体而言,提出一种新的基于频率的全卷积神经网络(M3FPolypSegNet):将输入图像分解为低频/高频/全频分量,并用多个独立的多频编码器把输入图像映射到高维特征空间。
  • results: 在CVC-ClinicDB和BKAI-IGH-NeoPolyp数据集上,所有指标平均分别提升6.92%和7.52%,表明所提模型优于多种分割模型。
    Abstract Polyp segmentation is crucial for preventing colorectal cancer a common type of cancer. Deep learning has been used to segment polyps automatically, which reduces the risk of misdiagnosis. Localizing small polyps in colonoscopy images is challenging because of its complex characteristics, such as color, occlusion, and various shapes of polyps. To address this challenge, a novel frequency-based fully convolutional neural network, Multi-Frequency Feature Fusion Polyp Segmentation Network (M3FPolypSegNet) was proposed to decompose the input image into low/high/full-frequency components to use the characteristics of each component. We used three independent multi-frequency encoders to map multiple input images into a high-dimensional feature space. In the Frequency-ASPP Scalable Attention Module (F-ASPP SAM), ASPP was applied between each frequency component to preserve scale information. Subsequently, scalable attention was applied to emphasize polyp regions in a high-dimensional feature space. Finally, we designed three multi-task learning (i.e., region, edge, and distance) in four decoder blocks to learn the structural characteristics of the region. The proposed model outperformed various segmentation models with performance gains of 6.92% and 7.52% on average for all metrics on CVC-ClinicDB and BKAI-IGH-NeoPolyp, respectively.
    摘要 息肉分割对预防结直肠癌这一常见癌症至关重要。深度学习已被用于自动分割息肉,从而降低误诊风险。由于息肉具有颜色、遮挡和形状多样等复杂特性,在结肠镜图像中定位小息肉十分困难。为解决这一挑战,本文提出一种新的基于频率的全卷积神经网络——多频特征融合息肉分割网络(M3FPolypSegNet),将输入图像分解为低频/高频/全频分量,以利用各分量的特性。我们使用三个独立的多频编码器将输入图像映射到高维特征空间。在频率-ASPP可伸缩注意力模块(F-ASPP SAM)中,我们在各频率分量之间应用ASPP以保留尺度信息,随后利用可伸缩注意力在高维特征空间中突出息肉区域。最后,我们在四个解码块中设计了区域、边缘和距离三种多任务学习,以学习区域的结构特性。所提模型在CVC-ClinicDB和BKAI-IGH-NeoPolyp上所有指标平均分别比多种分割模型提升6.92%和7.52%。
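
The low/high/full-frequency decomposition can be obtained with a Fourier-domain low-pass mask, as sketched below; the cut-off radius and circular mask shape are illustrative assumptions, and M3FPolypSegNet additionally feeds each component to its own encoder.

```python
import torch

def frequency_split(img, radius=0.1):
    """Split an image batch into low- and high-frequency components with a
    circular low-pass mask in the Fourier domain."""
    b, c, h, w = img.shape
    fy = torch.fft.fftfreq(h).view(-1, 1)
    fx = torch.fft.fftfreq(w).view(1, -1)
    lowpass = ((fy ** 2 + fx ** 2).sqrt() <= radius).to(img.dtype)   # (H, W) mask
    spec = torch.fft.fft2(img)
    low = torch.fft.ifft2(spec * lowpass).real
    high = img - low                     # full = low + high by construction
    return low, high

img = torch.rand(1, 3, 64, 64)
low, high = frequency_split(img)
print(low.shape, high.shape, torch.allclose(low + high, img, atol=1e-5))
```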

Bi-directional Deformation for Parameterization of Neural Implicit Surfaces

  • paper_url: http://arxiv.org/abs/2310.05524
  • repo_url: None
  • paper_authors: Baixin Xu, Jiangbei Hu, Fei Hou, Kwan-Yee Lin, Wayne Wu, Chen Qian, Ying He
  • for: 本研究旨在提供一种能够直观编辑3D对象的新技术,特别是当3D对象以神经隐式表面表示时。
  • methods: 我们提出了一种神经参数化方法,将3D对象映射到球体、立方体或多立方体等简单参数域中,从而方便可视化和编辑。我们在3D对象与所选参数域之间计算双向形变,无需任何先验信息。
  • results: 我们的方法可以快速渲染编辑后的текxture图像,无需重新训练神经网络。此外,我们的方法还支持多对象共参数化和Texture传输。我们在人头和人工对象的图像上进行了实验,并将源代码公开发布。
    Abstract The growing capabilities of neural rendering have increased the demand for new techniques that enable the intuitive editing of 3D objects, particularly when they are represented as neural implicit surfaces. In this paper, we present a novel neural algorithm to parameterize neural implicit surfaces to simple parametric domains, such as spheres, cubes or polycubes, where 3D radiance field can be represented as a 2D field, thereby facilitating visualization and various editing tasks. Technically, our method computes a bi-directional deformation between 3D objects and their chosen parametric domains, eliminating the need for any prior information. We adopt a forward mapping of points on the zero level set of the 3D object to a parametric domain, followed by a backward mapping through inverse deformation. To ensure the map is bijective, we employ a cycle loss while optimizing the smoothness of both deformations. Additionally, we leverage a Laplacian regularizer to effectively control angle distortion and offer the flexibility to choose from a range of parametric domains for managing area distortion. Designed for compatibility, our framework integrates seamlessly with existing neural rendering pipelines, taking multi-view images as input to reconstruct 3D geometry and compute the corresponding texture map. We also introduce a simple yet effective technique for intrinsic radiance decomposition, facilitating both view-independent material editing and view-dependent shading editing. Our method allows for the immediate rendering of edited textures through volume rendering, without the need for network re-training. Moreover, our approach supports the co-parameterization of multiple objects and enables texture transfer between them. We demonstrate the effectiveness of our method on images of human heads and man-made objects. We will make the source code publicly available.
    摘要 随着神经渲染能力的不断提升,人们对能够直观编辑3D对象的新技术的需求也在增加,特别是当3D对象以神经隐式表面表示时。本文提出一种新的神经算法,将神经隐式表面参数化到球体、立方体或多立方体等简单参数域上,使3D辐射场可以表示为2D场,从而便于可视化和各种编辑任务。在技术上,我们的方法计算3D对象与所选参数域之间的双向形变,无需任何先验信息:先将3D对象零水平集上的点前向映射到参数域,再通过逆形变进行反向映射。为保证映射是双射,我们在优化两个形变平滑性的同时引入循环一致性损失;此外,我们利用拉普拉斯正则项有效控制角度畸变,并可在多种参数域之间选择以控制面积畸变。该框架与现有神经渲染流程兼容,以多视角图像作为输入重建3D几何并计算相应的纹理图。我们还提出一种简单而有效的本征辐射分解技术,既支持与视角无关的材质编辑,也支持与视角相关的着色编辑。编辑后的纹理可以通过体渲染立即呈现,无需重新训练网络。此外,我们的方法支持多个对象的共同参数化,并允许在对象之间进行纹理迁移。我们在人头和人造物体图像上验证了方法的有效性,并将公开源代码。

Proposal-based Temporal Action Localization with Point-level Supervision

  • paper_url: http://arxiv.org/abs/2310.05511
  • repo_url: None
  • paper_authors: Yuan Yin, Yifei Huang, Ryosuke Furuta, Yoichi Sato
  • for: 本研究旨在提高点级监督时序动作定位(PTAL)的性能。PTAL 指在训练数据中每个动作实例只标注单个帧(点)的情况下,对未裁剪视频中的动作进行识别和定位的技术。
  • methods: 我们提出了一种新的方法,即生成和评估动作提案,以便更好地利用动作的时间信息。此外,我们还引入了一种高效的分 clustering 算法,以生成密集的 Pseudo 标签,并使用细化的对比损失来进一步改进 Pseudo 标签的质量。
  • results: 我们的方法在四个 benchmark 上实现了比或superior的性能,比如 ActivityNet 1.3、THUMOS 14、GTEA 和 BEOID 数据集。
    Abstract Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos where only a single point (frame) within every action instance is annotated in training data. Without temporal annotations, most previous works adopt the multiple instance learning (MIL) framework, where the input video is segmented into non-overlapped short snippets, and action classification is performed independently on every short snippet. We argue that the MIL framework is suboptimal for PTAL because it operates on separated short snippets that contain limited temporal information. Therefore, the classifier only focuses on several easy-to-distinguish snippets instead of discovering the whole action instance without missing any relevant snippets. To alleviate this problem, we propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration that involve more comprehensive temporal information. Moreover, we introduce an efficient clustering algorithm to efficiently generate dense pseudo labels that provide stronger supervision, and a fine-grained contrastive loss to further refine the quality of pseudo labels. Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods and some fully-supervised methods on four benchmarks: ActivityNet 1.3, THUMOS 14, GTEA, and BEOID datasets.
    摘要 点水平监督时间动作Localization(PTAL)目标是在没有时间标注的视频中识别和地图动作实例。在训练数据中只有每个动作实例中的一帧图像被标注。大多数先前的工作采用多个实例学习(MIL)框架,将输入视频切分成不重叠的短片段,然后独立地对每个短片段进行动作分类。我们认为MIL框架不适合PTAL,因为它只处理每个短片段中含有有限的时间信息,导致分类器只关注一些易于分辨的短片段,而不是找到整个动作实例没有遗弃任何相关的短片段。为解决这个问题,我们提议一种新的方法,通过生成和评估动作提案的灵活时间长度来地图动作。此外,我们引入高效的归一化算法,以生成高密度的假标签,并引入细化的对比损失来进一步改进假标签的质量。实验结果表明,我们的提议方法可以与状态艺术方法和一些完全监督方法在四个标准测试集(ActivityNet 1.3、THUMOS 14、GTEA 和 BEOID 数据集)上达到竞争性或超过性能。

Colmap-PCD: An Open-source Tool for Fine Image-to-point cloud Registration

  • paper_url: http://arxiv.org/abs/2310.05504
  • repo_url: https://github.com/xiaobaiiiiii/colmap-pcd
  • paper_authors: Chunge Bai, Ruijie Fu, Xiang Gao
  • for: 本研究旨在解决单目相机重建中缺乏尺度信息的问题,通过利用预先建立的LiDAR地图作为固定约束,实现高精度重建。
  • methods: 本研究提出了一种新的低成本重建管线,无需相机与LiDAR数据同步采集即可将图像配准到点云地图上,并能灵活管理不同感兴趣区域的重建细节层级。基于Colmap算法,我们开发了开源工具Colmap-PCD,以便该领域的后续研究。
  • results: 该方法在无需相机与LiDAR同步采集的前提下,实现了对不同区域重建细节层级的管理;与现有重建方法相比,能够提供更高的精度和更丰富的细节。
    Abstract State-of-the-art techniques for monocular camera reconstruction predominantly rely on the Structure from Motion (SfM) pipeline. However, such methods often yield reconstruction outcomes that lack crucial scale information, and over time, accumulation of images leads to inevitable drift issues. In contrast, mapping methods based on LiDAR scans are popular in large-scale urban scene reconstruction due to their precise distance measurements, a capability fundamentally absent in visual-based approaches. Researchers have made attempts to utilize concurrent LiDAR and camera measurements in pursuit of precise scaling and color details within mapping outcomes. However, the outcomes are subject to extrinsic calibration and time synchronization precision. In this paper, we propose a novel cost-effective reconstruction pipeline that utilizes a pre-established LiDAR map as a fixed constraint to effectively address the inherent scale challenges present in monocular camera reconstruction. To our knowledge, our method is the first to register images onto the point cloud map without requiring synchronous capture of camera and LiDAR data, granting us the flexibility to manage reconstruction detail levels across various areas of interest. To facilitate further research in this domain, we have released Colmap-PCD${^{3}$, an open-source tool leveraging the Colmap algorithm, that enables precise fine-scale registration of images to the point cloud map.
    摘要 现代单目镜头重建技术主要基于结构 FROM 运动(SfM)管道。然而,这些方法经常导致重建结果缺乏重要的尺度信息,而且随着图像的增加,时间滤波问题会变得不可避免。相比之下,基于 LiDAR 扫描的映射方法在大规模城市场景重建中广泛应用,因为它们可以提供精确的距离测量,这是视觉基本缺乏的。研究人员已经尝试将同时 captured LiDAR 和镜头数据用于精确的涂抹和颜色细节,但结果受到外部尺度和时间同步精度的限制。在本文中,我们提出了一种新的成本效果重建管道,利用预先建立的 LiDAR 地图作为固定的约束,有效地解决单目镜头重建中的自然尺度挑战。与传统方法不同的是,我们的方法不需要同步捕捉镜头和 LiDAR 数据,这 grant us 更多的灵活性来管理重建细节水平。为了促进这个领域的进一步研究,我们已经发布了 Colmap-PCD${^{3}$,一款开源工具,基于 Colmap 算法,可以精确地将图像注射到点云地图上。

Semi-Supervised Object Detection with Uncurated Unlabeled Data for Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2310.05498
  • repo_url: None
  • paper_authors: Nanqing Liu, Xun Xu, Yingjie Gao, Heng-Chao Li
  • for: 本文旨在探讨如何直接利用未经筛选的无标注遥感图像进行半监督目标检测,以提高标注效率和检测准确性。
  • methods: 本文采用半监督目标检测(SSOD)思路,为无标注数据生成伪标签。考虑到实际的无标注数据中可能混有分布外(out-of-distribution, OOD)样本和分布内(in-distribution, ID)样本,本文提出直接在未筛选数据上进行开放集半监督目标检测(OSSOD)的方法:利用有标注的分布内数据动态构建类别特征库(CFB),将预测框特征与CFB中对应条目比较以计算OOD分数,并基于CFB的统计特性设计自适应阈值来滤除OOD样本。
  • results: 在两个广泛使用的遥感目标检测数据集(DIOR 和 DOTA)上的实验表明,该方法具有较高的检测性能和有效性。
    Abstract Annotating remote sensing images (RSIs) presents a notable challenge due to its labor-intensive nature. Semi-supervised object detection (SSOD) methods tackle this issue by generating pseudo-labels for the unlabeled data, assuming that all classes found in the unlabeled dataset are also represented in the labeled data. However, real-world situations introduce the possibility of out-of-distribution (OOD) samples being mixed with in-distribution (ID) samples within the unlabeled dataset. In this paper, we delve into techniques for conducting SSOD directly on uncurated unlabeled data, which is termed Open-Set Semi-Supervised Object Detection (OSSOD). Our approach commences by employing labeled in-distribution data to dynamically construct a class-wise feature bank (CFB) that captures features specific to each class. Subsequently, we compare the features of predicted object bounding boxes with the corresponding entries in the CFB to calculate OOD scores. We design an adaptive threshold based on the statistical properties of the CFB, allowing us to filter out OOD samples effectively. The effectiveness of our proposed method is substantiated through extensive experiments on two widely used remote sensing object detection datasets: DIOR and DOTA. These experiments showcase the superior performance and efficacy of our approach for OSSOD on RSIs.
    摘要 annotating remote sensing images (RSIs) 是一项具有很大挑战性的任务,尤其是由于它的劳动 INTENSIVE 性。半supervised object detection (SSOD) 方法可以解决这个问题,通过生成 pseudo-labels для没有标签数据,假设所有在没有标签数据中出现的类也出现在标签数据中。然而,现实中的情况可能会出现在不符合预期的样本(OOD)被混合到了没有标签数据中。在这篇论文中,我们深入探讨了如何直接在没有检查的数据上进行 SSOD,我们称之为 Open-Set Semi-Supervised Object Detection (OSSOD)。我们的方法开始是使用标注的 Distribution 数据来动态构建一个类别特定的特征银行 (CFB),以捕捉每个类的特定特征。然后,我们将预测的对象 bounding box 的特征与 CFB 中相应的项目进行比较,以计算 OOD 分数。我们设计了基于 CFB 的适应阈值,以有效地过滤 OOD 样本。我们的提议的方法在 DIOR 和 DOTA 两个广泛使用的 remote sensing 对象检测数据集上进行了广泛的实验,这些实验证明了我们的方法在 OSSOD 中的超越性和效果。
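
The class-wise feature bank (CFB) and OOD scoring can be sketched as running per-class prototypes, a nearest-prototype distance as the OOD score, and a threshold derived from the bank's own score statistics. The mean-plus-two-sigma threshold below is our stand-in for the paper's adaptive threshold, and the single-prototype-per-class bank is a simplification.

```python
import numpy as np

class ClassFeatureBank:
    """Class-wise feature bank: a running mean feature per ID class; a detection's
    OOD score is its distance to the nearest class prototype."""
    def __init__(self, num_classes, dim):
        self.proto = np.zeros((num_classes, dim))
        self.count = np.zeros(num_classes)

    def update(self, feat, label):
        self.count[label] += 1
        self.proto[label] += (feat - self.proto[label]) / self.count[label]

    def ood_score(self, feat):
        return np.linalg.norm(self.proto - feat, axis=1).min()

bank = ClassFeatureBank(num_classes=5, dim=16)
rng = np.random.default_rng(0)
for _ in range(200):                               # fill the bank with labeled ID features
    c = int(rng.integers(5))
    bank.update(rng.normal(c, 0.3, 16), c)

id_scores = [bank.ood_score(rng.normal(int(rng.integers(5)), 0.3, 16)) for _ in range(100)]
thr = np.mean(id_scores) + 2 * np.std(id_scores)   # assumed adaptive threshold form
print(bank.ood_score(rng.normal(10.0, 0.3, 16)) > thr)   # far-away feature -> flagged OOD
```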

Geometry-Guided Ray Augmentation for Neural Surface Reconstruction with Sparse Views

  • paper_url: http://arxiv.org/abs/2310.05483
  • repo_url: None
  • paper_authors: Jiawei Yao, Chen Wang, Tong Wu, Chuming Li
  • for: 本 paper 提出了一种基于多视图图像的 3D 场景和物体重建方法。与之前的方法不同的是,这种方法不需要额外信息,如深度或者场景特征,而是利用多视图输入中嵌入的场景特征来创建精确的 pseudo-标签,以便通过优化来获得更高的重建精度。
  • methods: 我们引入了一种几何引导的方法,利用球谐函数预测新的辐射值,并综合考虑场景中某一点的所有颜色观测,从而在稀疏视角下提高表面重建精度。此外,我们的管线利用代理几何并正确处理遮挡,以生成辐射伪标签。
  • results: 我们的方法(Ray Augmentation, RayAug)在 DTU 和 Blender 数据集上无需预先训练即取得了更优的结果,证明了其在稀疏视角重建问题上的有效性。该管线还可以与其他隐式神经重建方法结合使用。
    Abstract In this paper, we propose a novel method for 3D scene and object reconstruction from sparse multi-view images. Different from previous methods that leverage extra information such as depth or generalizable features across scenes, our approach leverages the scene properties embedded in the multi-view inputs to create precise pseudo-labels for optimization without any prior training. Specifically, we introduce a geometry-guided approach that improves surface reconstruction accuracy from sparse views by leveraging spherical harmonics to predict the novel radiance while holistically considering all color observations for a point in the scene. Also, our pipeline exploits proxy geometry and correctly handles the occlusion in generating the pseudo-labels of radiance, which previous image-warping methods fail to avoid. Our method, dubbed Ray Augmentation (RayAug), achieves superior results on DTU and Blender datasets without requiring prior training, demonstrating its effectiveness in addressing the problem of sparse view reconstruction. Our pipeline is flexible and can be integrated into other implicit neural reconstruction methods for sparse views.
    摘要 在这篇论文中,我们提出了一种新的方法用于从笔粗多视图图像中重建3D场景和物体。与前一些方法不同,我们的方法不使用Extra信息如深度或跨场景通用特征,而是利用多视图输入中嵌入的场景特性来创建精确的pseudo标签,而无需任何前期训练。我们引入了一种几何学导向的方法,通过利用球面幂函数预测新的反射灯度,同时兼容所有场景颜色观察,以提高从笔粗视图中的表面重建精度。此外,我们的管道利用代理几何和正确处理 occlusion,以生成 pseudo标签的灯度。我们的方法,命名为Ray Augmentation(RayAug),在DTU和Blender数据集上实现了无需前期训练的superior结果,证明了其在笔粗视图重建问题中的有效性。我们的管道可以与其他隐式神经重建方法结合使用。

AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential Cross Attention

  • paper_url: http://arxiv.org/abs/2310.05462
  • repo_url: https://github.com/xianming-gu/adafuse
  • paper_authors: Xianming Gu, Lihui Wang, Zeyu Deng, Ying Cao, Xingyu Huang, Yue-min Zhu
  • for: 这篇论文旨在提出一种基于深度学习的多模态医疗图像融合方法,以提高诊断和手术导航的精度。
  • methods: 该方法基于傅里叶变换的频率引导注意力机制提取单模态图像信息,并通过交叉注意力融合块自适应地融合多模态图像信息。
  • results: 在多个数据集上的对比实验表明,所提方法在视觉质量和量化指标上均优于现有的最先进方法;消融实验也验证了所提损失函数和融合策略的有效性。
    Abstract Multi-modal medical image fusion is essential for the precise clinical diagnosis and surgical navigation since it can merge the complementary information in multi-modalities into a single image. The quality of the fused image depends on the extracted single modality features as well as the fusion rules for multi-modal information. Existing deep learning-based fusion methods can fully exploit the semantic features of each modality, they cannot distinguish the effective low and high frequency information of each modality and fuse them adaptively. To address this issue, we propose AdaFuse, in which multimodal image information is fused adaptively through frequency-guided attention mechanism based on Fourier transform. Specifically, we propose the cross-attention fusion (CAF) block, which adaptively fuses features of two modalities in the spatial and frequency domains by exchanging key and query values, and then calculates the cross-attention scores between the spatial and frequency features to further guide the spatial-frequential information fusion. The CAF block enhances the high-frequency features of the different modalities so that the details in the fused images can be retained. Moreover, we design a novel loss function composed of structure loss and content loss to preserve both low and high frequency information. Extensive comparison experiments on several datasets demonstrate that the proposed method outperforms state-of-the-art methods in terms of both visual quality and quantitative metrics. The ablation experiments also validate the effectiveness of the proposed loss and fusion strategy.
    摘要 多模态医学图像融合能够将多种模态中的互补信息合并到一幅图像中,对精确的临床诊断和手术导航至关重要。融合图像的质量既取决于提取到的单模态特征,也取决于多模态信息的融合规则。现有的基于深度学习的融合方法虽能充分利用各模态的语义特征,却无法区分各模态中有效的低频与高频信息并进行自适应融合。针对这一问题,我们提出了AdaFuse:基于傅里叶变换的频率引导注意力机制,对多模态图像信息进行自适应融合。具体而言,我们提出交叉注意力融合(CAF)块,通过交换key与query值,在空间域和频率域自适应地融合两种模态的特征,并计算空间特征与频率特征之间的交叉注意力分数,进一步引导空-频信息融合。CAF块增强了不同模态的高频特征,使融合图像中的细节得以保留。此外,我们设计了由结构损失和内容损失组成的新损失函数,以同时保留低频和高频信息。在多个数据集上的大量对比实验表明,所提方法在视觉质量和量化指标上均优于现有的最先进方法;消融实验也验证了所提损失函数和融合策略的有效性。

Memory-Assisted Sub-Prototype Mining for Universal Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.05453
  • repo_url: None
  • paper_authors: Yuxiang Lai, Xinghong Liu, Tao Zhou, Yi Zhou
  • for: 本文旨在提出一种新的记忆辅助子原型挖掘方法,以解决同一类别内部存在显著概念偏移时,普通域自适应模型适应能力不足的问题。
  • methods: 该方法(MemSPM)借助记忆机制学习同一类别中样本之间的差异,在存在显著概念偏移时挖掘子类别,从而学习到更合理、更细致的特征空间,提高模型的可迁移性。
  • results: 该方法在 UniDA、OSDA 和 PDA 等多种场景下进行了评估,在四个基准数据集的大多数情况下取得了最先进的性能。
    Abstract Universal domain adaptation aims to align the classes and reduce the feature gap between the same category of the source and target domains. The target private category is set as the unknown class during the adaptation process, as it is not included in the source domain. However, most existing methods overlook the intra-class structure within a category, especially in cases where there exists significant concept shift between the samples belonging to the same category. When samples with large concept shift are forced to be pushed together, it may negatively affect the adaptation performance. Moreover, from the interpretability aspect, it is unreasonable to align visual features with significant differences, such as fighter jets and civil aircraft, into the same category. Unfortunately, due to such semantic ambiguity and annotation cost, categories are not always classified in detail, making it difficult for the model to perform precise adaptation. To address these issues, we propose a novel Memory-Assisted Sub-Prototype Mining (MemSPM) method that can learn the differences between samples belonging to the same category and mine sub-classes when there exists significant concept shift between them. By doing so, our model learns a more reasonable feature space that enhances the transferability and reflects the inherent differences among samples annotated as the same category. We evaluate the effectiveness of our MemSPM method over multiple scenarios, including UniDA, OSDA, and PDA. Our method achieves state-of-the-art performance on four benchmarks in most cases.
    摘要 通用领域适应目标是调整源领域和目标领域之间的分类和特征差异。目标私有类别设置为未知类别,因为它不包括源领域中。然而,大多数现有方法忽略了同一类别内的结构,特别是当存在严重概念变化的情况下。当这些样本被迫推 together 时,可能会对适应性产生负面影响。此外,从解释方面来说,将不同的特征,如战斗机和民用飞机,迫使其推入同一类别中,是不合理的。实际上,这种概念混乱和标签成本问题,使得类别不一定严格分类,对模型进行精确适应很困难。为解决这些问题,我们提出了一种新的记忆帮助子型别挖掘(MemSPM)方法。这个方法可以学习同一类别内的差异,并在存在严重概念变化的情况下挖掘子类别。这样,我们的模型可以学习一个更合理的特征空间,增强转移性和反映内部类别之间的差异。我们在多个情况下评估了我们的MemSPM方法,包括UniDA、OSDA和PDA四个标准库。我们的方法在大多数情况下实现了顶尖性能。

Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.05447
  • repo_url: None
  • paper_authors: Xinzhu Ma, Yongtao Wang, Yinmin Zhang, Zhiyi Xia, Yuan Meng, Zhihui Wang, Haojie Li, Wanli Ouyang
  • for: 提高图像基于3D物体检测领域的研究稳定性和可比性,并提供一套可重复性的训练标准和错误诊断工具箱。
  • methods: 使用模块化设计的代码库,定制强大的训练秘诀,并提供一套错误诊断工具箱来测量检测模型的细致特征。
  • results: 通过对当前方法进行深入分析和评估,提供一系列可重复性的结果,并解决一些开放问题,如 KITTI-3D 和 nuScenes 数据集上的差异,以便促进未来的图像基于3D物体检测研究。
    Abstract In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. In particular, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in image-based 3D object detection. Our codes will be released at \url{https://github.com/OpenGVLab/3dodi}
    摘要 在这项工作中,我们建立了模块化设计的代码基础,制定了强大的训练荷重,设计了错误诊断工具箱,并讨论了现有的图像基于3D物体检测方法。特别是,与其他已经成熟的任务,如2D物体检测,社区的图像基于3D物体检测仍在发展中,其方法经常采用不同的训练荷重和技巧,导致不公正的评估和比较。甚至,这些技巧可能会超越其提出的设计,导致错误的结论。为解决这个问题,我们建立了模块化设计的代码基础,并制定了统一的训练标准 для社区。另外,我们还设计了错误诊断工具箱,以测量检测模型的细化特征。使用这些工具,我们在不同的设置下进行了深入的分析,并提供了一些开放的问题的讨论,如基于KITTI-3D和nuScenes dataset的结论不一致问题。我们希望这项工作能够促进图像基于3D物体检测的未来研究。我们的代码将在 GitHub 上发布,请参考 \url{https://github.com/OpenGVLab/3dodi}。

RetSeg: Retention-based Colorectal Polyps Segmentation Network

  • paper_url: http://arxiv.org/abs/2310.05446
  • repo_url: None
  • paper_authors: Khaled ELKarazle, Valliappan Raman, Caslon Chua, Patrick Then
  • for: 这个研究的目的是提高医疗影像分析中的肿瘤分类、检测和分 segmentation 的精度和速度,并且应用 Transformer 架构以减少资源消耗和提高training parallelism。
  • methods: 本研究使用了 Retention Mechanism 来解决 Transformer 在医疗影像分析中的问题,包括资源消耗和training parallelism。RetSeg 是一个以 RetNet 为基础的 encoder-decoder 网络,旨在提高肿瘤分类和资源利用。
  • results: 本研究在两个公开可用的数据集上训练和验证 RetSeg,并且在多个公开数据集上显示了 RetSeg 的出色表现。尤其是在colonoscopy 影像上,RetSeg 能够提高肿瘤分类的精度和速度。
    Abstract Vision Transformers (ViTs) have revolutionized medical imaging analysis, showcasing superior efficacy compared to conventional Convolutional Neural Networks (CNNs) in vital tasks such as polyp classification, detection, and segmentation. Leveraging attention mechanisms to focus on specific image regions, ViTs exhibit contextual awareness in processing visual data, culminating in robust and precise predictions, even for intricate medical images. Moreover, the inherent self-attention mechanism in Transformers accommodates varying input sizes and resolutions, granting an unprecedented flexibility absent in traditional CNNs. However, Transformers grapple with challenges like excessive memory usage and limited training parallelism due to self-attention, rendering them impractical for real-time disease detection on resource-constrained devices. In this study, we address these hurdles by investigating the integration of the recently introduced retention mechanism into polyp segmentation, introducing RetSeg, an encoder-decoder network featuring multi-head retention blocks. Drawing inspiration from Retentive Networks (RetNet), RetSeg is designed to bridge the gap between precise polyp segmentation and resource utilization, particularly tailored for colonoscopy images. We train and validate RetSeg for polyp segmentation employing two publicly available datasets: Kvasir-SEG and CVC-ClinicDB. Additionally, we showcase RetSeg's promising performance across diverse public datasets, including CVC-ColonDB, ETIS-LaribPolypDB, CVC-300, and BKAI-IGH NeoPolyp. While our work represents an early-stage exploration, further in-depth studies are imperative to advance these promising findings.
    摘要 医学影像分析领域中,视野变换器(ViT)已经引领了革命性的改变,在重要的任务中,如肿瘤分类、检测和 segmentation 方面,表现出了Superior efficacy compared to traditional Convolutional Neural Networks(CNNs)。通过关注特定的图像区域,ViT 展示了Contextual awareness 在处理视觉数据,从而导致了Robust and precise predictions,甚至 для复杂的医学影像。此外,Transformers 内置的自注意机制可以满足不同的输入大小和分辨率,从而提供了前所未有的灵活性,与传统的 CNNs 不同。然而,Transformers 面临着大量内存使用和自注意机制限制了实时疾病检测的应用,特别是在有限的设备资源下。在本研究中,我们解决这些障碍物,通过调查Retention mechanism 的整合进行肿瘤分 segmentation,提出了RetSeg,一种基于 encoder-decoder 网络的多头保留块。 drawing inspiration from Retentive Networks(RetNet),RetSeg 旨在bridging the gap between precise polyp segmentation and resource utilization,特别适用于 colonoscopy 图像。我们使用两个公共可用的数据集进行训练和验证RetSeg:Kvasir-SEG 和 CVC-ClinicDB。此外,我们还展示了RetSeg 在多个公共数据集上的出色表现,包括 CVC-ColonDB、ETIS-LaribPolypDB、CVC-300 和 BKAI-IGH NeoPolyp。虽然我们的工作只是早期的探索,但更深入的研究是必要的,以进一步发展这些有前途的发现。

AngioMoCo: Learning-based Motion Correction in Cerebral Digital Subtraction Angiography

  • paper_url: http://arxiv.org/abs/2310.05445
  • repo_url: None
  • paper_authors: Ruisheng Su, Matthijs van der Sluijs, Sandra Cornelissen, Wim van Zwam, Aad van der Lugt, Wiro Niessen, Danny Ruijters, Theo van Walsum, Adrian Dalca
  • For: The paper aims to address the limitations of cerebral X-ray digital subtraction angiography (DSA) by developing a learning-based framework called AngioMoCo, which generates motion-compensated DSA sequences from X-ray angiography.* Methods: AngioMoCo integrates contrast extraction and motion correction, enabling differentiation between patient motion and intensity changes caused by contrast flow. The framework uses a learning-based approach that is substantially faster than iterative elastix-based methods.* Results: The paper demonstrates the effectiveness of AngioMoCo on a large national multi-center dataset (MR CLEAN Registry) of clinically acquired angiographic images through comprehensive qualitative and quantitative analyses. AngioMoCo produces high-quality motion-compensated DSA, removing motion artifacts while preserving contrast flow.
    Abstract Cerebral X-ray digital subtraction angiography (DSA) is the standard imaging technique for visualizing blood flow and guiding endovascular treatments. The quality of DSA is often negatively impacted by body motion during acquisition, leading to decreased diagnostic value. Time-consuming iterative methods address motion correction based on non-rigid registration, and employ sparse key points and non-rigidity penalties to limit vessel distortion. Recent methods alleviate subtraction artifacts by predicting the subtracted frame from the corresponding unsubtracted frame, but do not explicitly compensate for motion-induced misalignment between frames. This hinders the serial evaluation of blood flow, and often causes undesired vasculature and contrast flow alterations, leading to impeded usability in clinical practice. To address these limitations, we present AngioMoCo, a learning-based framework that generates motion-compensated DSA sequences from X-ray angiography. AngioMoCo integrates contrast extraction and motion correction, enabling differentiation between patient motion and intensity changes caused by contrast flow. This strategy improves registration quality while being substantially faster than iterative elastix-based methods. We demonstrate AngioMoCo on a large national multi-center dataset (MR CLEAN Registry) of clinically acquired angiographic images through comprehensive qualitative and quantitative analyses. AngioMoCo produces high-quality motion-compensated DSA, removing motion artifacts while preserving contrast flow. Code is publicly available at https://github.com/RuishengSu/AngioMoCo.
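As a rough illustration of the motion-compensation step, the sketch below warps an unsubtracted frame with a predicted dense displacement field using PyTorch's `grid_sample`. The network that predicts the displacement field and separates contrast changes from motion, which is the core of AngioMoCo, is assumed to exist and is not shown.

```python
import torch
import torch.nn.functional as F

def warp_frame(frame, flow):
    """Warp an X-ray frame with a dense displacement field (in pixels).
    frame: (B, 1, H, W), flow: (B, 2, H, W) holding (dx, dy) per pixel."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # (B, 2, H, W)
    # normalize sampling coordinates to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                           # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```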

Semantic-aware Temporal Channel-wise Attention for Cardiac Function Assessment

  • paper_url: http://arxiv.org/abs/2310.05428
  • repo_url: None
  • paper_authors: Guanqi Chen, Guanbin Li
  • for: Predicting left ventricular ejection fraction (LVEF) from echocardiogram videos, to improve the accuracy and automation of intelligent assisted healthcare.
  • methods: Proposes a semi-supervised auxiliary learning paradigm with a left ventricular segmentation task to aid representation learning for the left ventricular region. A temporal channel-wise attention (TCA) module models the importance of motion information, and is further reformed with semantic perception by taking the left ventricle segmentation map as input to focus on its motion patterns.
  • results: Achieves state-of-the-art performance on the Stanford dataset, improving over the baseline by 0.22 MAE, 0.26 RMSE, and 1.9% $R^2$.
    Abstract Cardiac function assessment aims at predicting left ventricular ejection fraction (LVEF) given an echocardiogram video, which requests models to focus on the changes in the left ventricle during the cardiac cycle. How to assess cardiac function accurately and automatically from an echocardiogram video is a valuable topic in intelligent assisted healthcare. Existing video-based methods do not pay much attention to the left ventricular region, nor the left ventricular changes caused by motion. In this work, we propose a semi-supervised auxiliary learning paradigm with a left ventricular segmentation task, which contributes to the representation learning for the left ventricular region. To better model the importance of motion information, we introduce a temporal channel-wise attention (TCA) module to excite those channels used to describe motion. Furthermore, we reform the TCA module with semantic perception by taking the segmentation map of the left ventricle as input to focus on the motion patterns of the left ventricle. Finally, to reduce the difficulty of direct LVEF regression, we utilize an anchor-based classification and regression method to predict LVEF. Our approach achieves state-of-the-art performance on the Stanford dataset with an improvement of 0.22 MAE, 0.26 RMSE, and 1.9% $R^2$.
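A minimal sketch of what a temporal channel-wise attention block could look like in PyTorch is given below: channels whose responses vary strongly over the cardiac cycle are treated as motion channels and re-weighted. Using the temporal standard deviation as the excitation signal and the chosen reduction ratio are assumptions for illustration; the paper's TCA module and its semantic-perception variant are more involved.

```python
import torch
import torch.nn as nn

class TemporalChannelAttention(nn.Module):
    """Sketch of temporal channel-wise attention: excite channels that vary
    over time (i.e. channels that describe motion)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame feature maps of an echocardiogram clip
        b, t, c, h, w = feats.shape
        pooled = feats.mean(dim=(3, 4))          # (B, T, C) spatial squeeze
        motion = pooled.std(dim=1)               # (B, C) temporal variation per channel
        weights = self.fc(motion).view(b, 1, c, 1, 1)
        return feats * weights                   # re-weight motion-carrying channels
```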

GradientSurf: Gradient-Domain Neural Surface Reconstruction from RGB Video

  • paper_url: http://arxiv.org/abs/2310.05406
  • repo_url: None
  • paper_authors: Crane He Chen, Joerg Liebelt
  • for: Proposes a real-time surface reconstruction method that recovers scene surfaces from monocular RGB video.
  • methods: Builds on the tight coupling between surface, volume, and oriented point cloud, and solves the reconstruction problem in the gradient domain. Unlike classical Poisson Surface Reconstruction, it solves the problem online, incrementally updating a neural network during partial scans.
  • results: For indoor scene reconstruction, visual and quantitative experiments show that the method recovers more surface detail in curved regions and reconstructs small objects with higher fidelity than previous methods.
    Abstract This paper proposes GradientSurf, a novel algorithm for real time surface reconstruction from monocular RGB video. Inspired by Poisson Surface Reconstruction, the proposed method builds on the tight coupling between surface, volume, and oriented point cloud and solves the reconstruction problem in gradient-domain. Unlike Poisson Surface Reconstruction which finds an offline solution to the Poisson equation by solving a linear system after the scanning process is finished, our method finds online solutions from partial scans with a neural network incrementally where the Poisson layer is designed to supervise both local and global reconstruction. The main challenge that existing methods suffer from when reconstructing from RGB signal is a lack of details in the reconstructed surface. We hypothesize this is due to the spectral bias of neural networks towards learning low frequency geometric features. To address this issue, the reconstruction problem is cast onto gradient domain, where zeroth-order and first-order energies are minimized. The zeroth-order term penalizes location of the surface. The first-order term penalizes the difference between the gradient of reconstructed implicit function and the vector field formulated from oriented point clouds sampled at adaptive local densities. For the task of indoor scene reconstruction, visual and quantitative experimental results show that the proposed method reconstructs surfaces with more details in curved regions and higher fidelity for small objects than previous methods.
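The zeroth- and first-order terms described above can be written compactly. The sketch below assumes a neural implicit function and an oriented point cloud are already available; the exact weighting, the Poisson-layer supervision, and the adaptive local sampling densities used in the paper are omitted.

```python
import torch

def gradient_domain_loss(implicit_fn, points, normals, lam=1.0):
    """Zeroth-order term: the implicit function should vanish on surface samples.
    First-order term: its spatial gradient should match the oriented normals.
    implicit_fn: callable mapping (N, 3) points to (N,) values (assumed).
    points, normals: oriented point-cloud samples, both (N, 3)."""
    points = points.detach().requires_grad_(True)
    values = implicit_fn(points)
    grads = torch.autograd.grad(values.sum(), points, create_graph=True)[0]
    zeroth = (values ** 2).mean()
    first = ((grads - normals) ** 2).sum(dim=-1).mean()
    return zeroth + lam * first
```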

Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.05400
  • repo_url: None
  • paper_authors: Shiyue Cao, Yueqin Yin, Lianghua Huang, Yu Liu, Xin Zhao, Deli Zhao, Kaiqi Huang
  • for: High-resolution image generation.
  • methods: A local attention-based quantization model, multi-grained feature interaction combining global and local attention, and a new generation pipeline pairing autoencoding training with an autoregressive generation strategy.
  • results: Improved image generation speed, fidelity, and resolution.
    Abstract Vector-quantized image modeling has shown great potential in synthesizing high-quality images. However, generating high-resolution images remains a challenging task due to the quadratic computational overhead of the self-attention process. In this study, we seek to explore a more efficient two-stage framework for high-resolution image generation with improvements in the following three aspects. (1) Based on the observation that the first quantization stage has a solid local property, we employ a local attention-based quantization model instead of the global attention mechanism used in previous methods, leading to better efficiency and reconstruction quality. (2) We emphasize the importance of multi-grained feature interaction during image generation and introduce an efficient attention mechanism that combines global attention (long-range semantic consistency within the whole image) and local attention (fine-grained details). This approach results in faster generation speed, higher generation fidelity, and improved resolution. (3) We propose a new generation pipeline incorporating autoencoding training and an autoregressive generation strategy, demonstrating a better paradigm for image synthesis. Extensive experiments demonstrate the superiority of our approach in high-quality and high-resolution image reconstruction and generation.

Hierarchical Side-Tuning for Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.05393
  • repo_url: https://github.com/AFeng-x/HST
  • paper_authors: Weifeng Lin, Ziheng Wu, Jiayu Chen, Wentao Yang, Mingxin Huang, Jun Huang, Lianwen Jin
  • for: Proposes Hierarchical Side-Tuning (HST), a parameter-efficient method for effectively transferring pre-trained Vision Transformers to diverse downstream tasks.
  • methods: Tunes a lightweight hierarchical side network (HSN) that leverages intermediate activations extracted from the frozen backbone and generates multi-scale features for prediction.
  • results: Extensive experiments on classification, object detection, instance segmentation, and semantic segmentation show state-of-the-art results, including 76.0% average Top-1 accuracy on VTAB-1k while fine-tuning only 0.78M parameters. On the COCO test-dev benchmark, HST even surpasses full fine-tuning, reaching 49.7 box AP and 43.2 mask AP with Cascade Mask R-CNN.
    Abstract Fine-tuning pre-trained Vision Transformers (ViT) has consistently demonstrated promising performance in the realm of visual recognition. However, adapting large pre-trained models to various tasks poses a significant challenge. This challenge arises from the need for each model to undergo an independent and comprehensive fine-tuning process, leading to substantial computational and memory demands. While recent advancements in Parameter-efficient Transfer Learning (PETL) have demonstrated their ability to achieve superior performance compared to full fine-tuning with a smaller subset of parameter updates, they tend to overlook dense prediction tasks such as object detection and segmentation. In this paper, we introduce Hierarchical Side-Tuning (HST), a novel PETL approach that enables ViT transfer to various downstream tasks effectively. Diverging from existing methods that exclusively fine-tune parameters within input spaces or certain modules connected to the backbone, we tune a lightweight and hierarchical side network (HSN) that leverages intermediate activations extracted from the backbone and generates multi-scale features to make predictions. To validate HST, we conducted extensive experiments encompassing diverse visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. Notably, our method achieves state-of-the-art average Top-1 accuracy of 76.0% on VTAB-1k, all while fine-tuning a mere 0.78M parameters. When applied to object detection tasks on COCO testdev benchmark, HST even surpasses full fine-tuning and obtains better performance with 49.7 box AP and 43.2 mask AP using Cascade Mask R-CNN.
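The core idea, keeping the ViT backbone frozen and training only a small side network fed by intermediate activations, can be sketched as follows. The tapped block indices, adapter width, hook-based feature extraction, and a timm-style backbone with an indexable `.blocks` list are illustrative assumptions; the actual HSN produces multi-scale features that also feed dense-prediction heads.

```python
import torch
import torch.nn as nn

class HierarchicalSideNetwork(nn.Module):
    """Sketch of side-tuning: the ViT backbone stays frozen; only the small
    side network (adapters + head) is trained."""

    def __init__(self, backbone, tap_layers=(3, 7, 11), dim=768, num_classes=100):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                      # backbone is frozen
        self.tap_layers = tap_layers
        self.adapters = nn.ModuleList([nn.Linear(dim, 64) for _ in tap_layers])
        self.head = nn.Linear(64 * len(tap_layers), num_classes)
        self._acts = {}
        for idx in tap_layers:                           # assumes backbone.blocks is indexable
            self.backbone.blocks[idx].register_forward_hook(self._make_hook(idx))

    def _make_hook(self, idx):
        def hook(_module, _inputs, output):
            self._acts[idx] = output                     # (B, N, dim) tokens from block idx
        return hook

    def forward(self, x):
        with torch.no_grad():
            self.backbone(x)                             # hooks capture intermediate tokens
        feats = [adapter(self._acts[idx].mean(dim=1))    # pool tokens, then adapt
                 for adapter, idx in zip(self.adapters, self.tap_layers)]
        return self.head(torch.cat(feats, dim=-1))
```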

Lightweight Full-Convolutional Siamese Tracker

  • paper_url: http://arxiv.org/abs/2310.05392
  • repo_url: https://github.com/liyunfenglyf/lightfc
  • paper_authors: Yunfeng Li, Bo Wang, Xueyi Wu, Zhuoyan Liu, Ye Li
  • for: Achieving an optimal balance between tracking accuracy and efficiency so that trackers can be deployed on resource-constrained platforms.
  • methods: Proposes LightFC, a lightweight full-convolutional Siamese tracker that uses a novel efficient cross-correlation module (ECM) and an efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline.
  • results: Experiments show that LightFC achieves an optimal balance among performance, parameters, FLOPs, and FPS, and runs 2x faster than MixFormerV2-S on CPUs.
    Abstract Although single object trackers have achieved advanced performance, their large-scale models make it difficult to apply them on platforms with limited resources. Moreover, existing lightweight trackers only balance two or three of the following factors: parameters, performance, FLOPs, and FPS. To achieve the optimal balance among these points, this paper proposes a lightweight full-convolutional Siamese tracker called LightFC. LightFC employs a novel efficient cross-correlation module (ECM) and a novel efficient rep-center head (ERH) to enhance the nonlinear expressiveness of the convolutional tracking pipeline. The ECM employs an attention-like module design, which conducts spatial and channel linear fusion of fused features and enhances the nonlinearity of the fused features. Additionally, it references successful factors of current lightweight trackers and introduces skip-connections and reuse of search area features. The ERH reparameterizes the feature dimensional stage in the standard center head and introduces channel attention to optimize the bottleneck of key feature flows. Comprehensive experiments show that LightFC achieves the optimal balance among performance, parameters, FLOPs, and FPS. The precision score of LightFC outperforms MixFormerV2-S by 3.7% and 6.5% on LaSOT and TNL2K, respectively, while using 5x fewer parameters and 4.6x fewer FLOPs. In addition, LightFC runs 2x faster than MixFormerV2-S on CPUs. Our code and raw results can be found at https://github.com/LiYunfengLYF/LightFC
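For context, the pixel-wise (depthwise) cross-correlation that Siamese trackers use to match a template against a search region can be written in a few lines of PyTorch, as sketched below. LightFC's ECM adds attention-like spatial and channel fusion, skip-connections, and reuse of search-area features on top of such a correlation, which are not reproduced here.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Depthwise cross-correlation: template features act as per-channel
    convolution kernels over the search-region features.
    search: (B, C, Hs, Ws), template: (B, C, Ht, Wt)."""
    b, c, hs, ws = search.shape
    _, _, ht, wt = template.shape
    search = search.reshape(1, b * c, hs, ws)
    kernel = template.reshape(b * c, 1, ht, wt)
    out = F.conv2d(search, kernel, groups=b * c)     # (1, B*C, H', W')
    return out.reshape(b, c, out.shape[-2], out.shape[-1])
```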

Neural Impostor: Editing Neural Radiance Fields with Explicit Shape Manipulation

  • paper_url: http://arxiv.org/abs/2310.05391
  • repo_url: None
  • paper_authors: Ruiyang Liu, Jinxu Xiang, Bowen Zhao, Ran Zhang, Jingyi Yu, Changxi Zheng
  • for: Improving the editability of Neural Radiance Fields (NeRF), in particular geometry modification.
  • methods: Introduces a hybrid representation combining an explicit tetrahedral mesh with a multigrid implicit field, and bridges explicit shape manipulation and implicit-field editing through multigrid barycentric coordinate encoding.
  • results: Provides a practical way to deform, composite, and generate neural implicit fields while maintaining a complex volumetric appearance.
    Abstract Neural Radiance Fields (NeRF) have significantly advanced the generation of highly realistic and expressive 3D scenes. However, the task of editing NeRF, particularly in terms of geometry modification, poses a significant challenge. This issue has obstructed NeRF's wider adoption across various applications. To tackle the problem of efficiently editing neural implicit fields, we introduce Neural Impostor, a hybrid representation incorporating an explicit tetrahedral mesh alongside a multigrid implicit field designated for each tetrahedron within the explicit mesh. Our framework bridges the explicit shape manipulation and the geometric editing of implicit fields by utilizing multigrid barycentric coordinate encoding, thus offering a pragmatic solution to deform, composite, and generate neural implicit fields while maintaining a complex volumetric appearance. Furthermore, we propose a comprehensive pipeline for editing neural implicit fields based on a set of explicit geometric editing operations. We show the robustness and adaptability of our system through diverse examples and experiments, including the editing of both synthetic objects and real captured data. Finally, we demonstrate the authoring process of a hybrid synthetic-captured object utilizing a variety of editing operations, underlining the transformative potential of Neural Impostor in the field of 3D content creation and manipulation.
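The barycentric encoding that ties the implicit field to the explicit tetrahedral mesh boils down to expressing a query point in the local coordinates of its enclosing tetrahedron. A hedged sketch (single tetrahedron, dense solve, no multigrid hierarchy) is shown below.

```python
import torch

def tetra_barycentric(p, verts):
    """Barycentric coordinates of points inside one tetrahedron.
    p: (N, 3) query points, verts: (4, 3) tetrahedron vertices."""
    v0 = verts[0]
    edges = (verts[1:] - v0).T                    # (3, 3), columns are edge vectors
    w = torch.linalg.solve(edges, (p - v0).T).T   # (N, 3) weights for verts 1..3
    w0 = 1.0 - w.sum(dim=-1, keepdim=True)        # weight for vert 0
    return torch.cat([w0, w], dim=-1)             # (N, 4), rows sum to 1
```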

Three-Stage Cascade Framework for Blurry Video Frame Interpolation

  • paper_url: http://arxiv.org/abs/2310.05383
  • repo_url: None
  • paper_authors: Pengcheng Lei, Zaoming Yan, Tingting Wang, Faming Fang, Guixu Zhang
  • for: Proposes a simple end-to-end three-stage framework that fully exploits the useful information in blurry videos to improve the generation of high-frame-rate clear videos.
  • methods: The model consists of a frame interpolation stage, a temporal feature fusion stage, and a deblurring stage. The frame interpolation stage uses a temporal deformable network to sample useful information directly from blurry inputs and synthesize an intermediate frame at an arbitrary time; the temporal feature fusion stage mines long-term temporal information for each target frame with a bi-directional recurrent deformable alignment network; the deblurring stage applies a transformer-empowered Taylor approximation network to recursively recover high-frequency details.
  • results: Experiments show that the model outperforms state-of-the-art methods on four benchmarks and generalizes well to real-world blurry videos.
    Abstract Blurry video frame interpolation (BVFI) aims to generate high-frame-rate clear videos from low-frame-rate blurry videos, is a challenging but important topic in the computer vision community. Blurry videos not only provide spatial and temporal information like clear videos, but also contain additional motion information hidden in each blurry frame. However, existing BVFI methods usually fail to fully leverage all valuable information, which ultimately hinders their performance. In this paper, we propose a simple end-to-end three-stage framework to fully explore useful information from blurry videos. The frame interpolation stage designs a temporal deformable network to directly sample useful information from blurry inputs and synthesize an intermediate frame at an arbitrary time interval. The temporal feature fusion stage explores the long-term temporal information for each target frame through a bi-directional recurrent deformable alignment network. And the deblurring stage applies a transformer-empowered Taylor approximation network to recursively recover the high-frequency details. The proposed three-stage framework has clear task assignment for each module and offers good expandability, the effectiveness of which are demonstrated by various experimental results. We evaluate our model on four benchmarks, including the Adobe240 dataset, GoPro dataset, YouTube240 dataset and Sony dataset. Quantitative and qualitative results indicate that our model outperforms existing SOTA methods. Besides, experiments on real-world blurry videos also indicate the good generalization ability of our model.

IPDreamer: Appearance-Controllable 3D Object Generation with Image Prompts

  • paper_url: http://arxiv.org/abs/2310.05375
  • repo_url: None
  • paper_authors: Bohan Zeng, Shanglin Li, Yutang Feng, Hong Li, Sicheng Gao, Jiaming Liu, Huaxia Li, Xu Tang, Jianzhuang Liu, Baochang Zhang
  • for: Proposes a new text-to-3D generation method that improves the controllability and quality of generated 3D objects.
  • methods: Builds on large-scale text-to-image diffusion models and score-distillation-based optimization, incorporating image prompts to provide specific and comprehensive appearance information for 3D object generation.
  • results: Experiments show that IPDreamer effectively generates high-quality 3D objects that are consistent with both the text and image prompts, demonstrating appearance-controllable 3D object generation.
    Abstract Recent advances in text-to-3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to supervise 3D generation. These methods, including the variational score distillation proposed by ProlificDreamer, enable the synthesis of detailed and photorealistic textured meshes. However, the appearance of 3D objects generated by these methods is often random and uncontrollable, posing a challenge in achieving appearance-controllable 3D objects. To address this challenge, we introduce IPDreamer, a novel approach that incorporates image prompts to provide specific and comprehensive appearance information for 3D object generation. Our results demonstrate that IPDreamer effectively generates high-quality 3D objects that are consistent with both the provided text and image prompts, demonstrating its promising capability in appearance-controllable 3D object generation.

Enhancing Prostate Cancer Diagnosis with Deep Learning: A Study using mpMRI Segmentation and Classification

  • paper_url: http://arxiv.org/abs/2310.05371
  • repo_url: None
  • paper_authors: Anil B. Gavade, Neel Kanwal, Priyanka A. Gavade, Rajendra Nerli
  • for: Early detection and precise diagnosis of prostate cancer from mpMRI, to support better treatment planning.
  • methods: Uses well-known deep learning models to classify and segment mpMRI images, comparing four pipelines that pair different segmentation backbones with different classifiers.
  • results: Experiments show that the pipeline combining U-Net with an LSTM performs best, outperforming all other combinations on both segmentation and classification tasks.
    Abstract Prostate cancer (PCa) is a severe disease among men globally. It is important to identify PCa early and make a precise diagnosis for effective treatment. For PCa diagnosis, Multi-parametric magnetic resonance imaging (mpMRI) emerged as an invaluable imaging modality that offers a precise anatomical view of the prostate gland and its tissue structure. Deep learning (DL) models can enhance existing clinical systems and improve patient care by locating regions of interest for physicians. Recently, DL techniques have been employed to develop a pipeline for segmenting and classifying different cancer types. These studies show that DL can be used to increase diagnostic precision and give objective results without variability. This work uses well-known DL models for the classification and segmentation of mpMRI images to detect PCa. Our implementation involves four pipelines; Semantic DeepSegNet with ResNet50, DeepSegNet with recurrent neural network (RNN), U-Net with RNN, and U-Net with a long short-term memory (LSTM). Each segmentation model is paired with a different classifier to evaluate the performance using different metrics. The results of our experiments show that the pipeline that uses the combination of U-Net and the LSTM model outperforms all other combinations, excelling in both segmentation and classification tasks.

SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.05370
  • repo_url: https://github.com/cocoon2wong/socialcircle
  • paper_authors: Conghao Wong, Beihao Xia, Xinge You
  • for: Analyzing and forecasting the trajectories of agents such as pedestrians and vehicles in complex scenes, for intelligent systems and applications.
  • methods: Proposes SocialCircle, a new angle-based trainable social representation that continuously reflects the context of social interactions at different angular orientations relative to the target agent.
  • results: Experiments show that training SocialCircle together with recently released trajectory prediction models quantitatively improves prediction performance and qualitatively helps the models account for social interactions when forecasting pedestrian trajectories, in a way consistent with human intuition.
    Abstract Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new anglebased trainable social representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance, but also qualitatively helps better consider social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.
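To make the angle-based idea concrete, the sketch below partitions neighbours into angular sectors around the target agent and summarizes each sector by the mean neighbour distance. The number of sectors and the per-sector statistic are illustrative assumptions; the trainable SocialCircle representation in the paper encodes richer interaction context.

```python
import torch

def social_circle(target_pos, neighbor_pos, n_sectors=8):
    """target_pos: (2,) position of the target agent.
    neighbor_pos: (M, 2) positions of neighbouring agents.
    Returns one value per angular sector (here, mean distance)."""
    rel = neighbor_pos - target_pos                    # (M, 2)
    dist = rel.norm(dim=-1)                            # (M,)
    angle = torch.atan2(rel[:, 1], rel[:, 0])          # (-pi, pi]
    sector = ((angle + torch.pi) / (2 * torch.pi) * n_sectors).long()
    sector = sector.clamp(max=n_sectors - 1)
    feat = torch.zeros(n_sectors)
    for s in range(n_sectors):
        mask = sector == s
        if mask.any():
            feat[s] = dist[mask].mean()
    return feat                                        # (n_sectors,)
```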

Rotation Matters: Generalized Monocular 3D Object Detection for Various Camera Systems

  • paper_url: http://arxiv.org/abs/2310.05366
  • repo_url: None
  • paper_authors: SungHo Moon, JinWoo Bae, SungHoon Im
  • for: Analyzes why monocular 3D object detection performance degrades when a detector is applied to a camera system different from the one used to capture the training data, and proposes a generalized 3D object detection method that works across various camera systems.
  • methods: Conducts extensive experiments to identify the causes of the degradation, finding that changes in camera pose, especially camera orientation relative to the road plane, are responsible, and proposes a compensation module that corrects the estimated 3D bounding box location and heading direction.
  • results: The proposed module can be attached to most recent 3D object detection networks and increases the AP3D score (KITTI moderate, IoU > 70%) about 6-to-10-fold over the baselines without additional training, with both quantitative and qualitative results confirming its effectiveness.
    Abstract Research on monocular 3D object detection is being actively studied, and as a result, performance has been steadily improving. However, 3D object detection performance is significantly reduced when applied to a camera system different from the system used to capture the training datasets. For example, a 3D detector trained on datasets from a passenger car mostly fails to regress accurate 3D bounding boxes for a camera mounted on a bus. In this paper, we conduct extensive experiments to analyze the factors that cause performance degradation. We find that changing the camera pose, especially camera orientation, relative to the road plane caused performance degradation. In addition, we propose a generalized 3D object detection method that can be universally applied to various camera systems. We newly design a compensation module that corrects the estimated 3D bounding box location and heading direction. The proposed module can be applied to most of the recent 3D object detection networks. It increases AP3D score (KITTI moderate, IoU $> 70\%$) about 6-to-10-times above the baselines without additional training. Both quantitative and qualitative results show the effectiveness of the proposed method.

C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network

  • paper_url: http://arxiv.org/abs/2310.05355
  • repo_url: None
  • paper_authors: Ruizhi Wang, Xiangtao Wang, Jie Zhou, Thomas Lukasiewicz, Zhenghua Xu
  • for: Improving the accuracy and reliability of medical report generation while still supporting inference from a single-view image input.
  • methods: Proposes a cross-modal consistent multi-view medical report generation model (C^2M-DoT) that uses semantic-based multi-view contrastive learning to exploit cross-view information, a domain transfer network so that the model still performs well with single-view input, and a cross-modal consistency loss to keep generated reports semantically aligned with the medical images.
  • results: Experiments on two public benchmark datasets show that C^2M-DoT substantially outperforms state-of-the-art baselines on all metrics, and ablation studies confirm the validity and necessity of each component.
    Abstract In clinical scenarios, multiple medical images with different views are usually generated simultaneously, and these images have high semantic consistency. However, most existing medical report generation methods only consider single-view data. The rich multi-view mutual information of medical images can help generate more accurate reports, however, the dependence of multi-view models on multi-view data in the inference stage severely limits their application in clinical practice. In addition, word-level optimization based on numbers ignores the semantics of reports and medical images, and the generated reports often cannot achieve good performance. Therefore, we propose a cross-modal consistent multi-view medical report generation with a domain transfer network (C^2M-DoT). Specifically, (i) a semantic-based multi-view contrastive learning medical report generation framework is adopted to utilize cross-view information to learn the semantic representation of lesions; (ii) a domain transfer network is further proposed to ensure that the multi-view report generation model can still achieve good inference performance under single-view input; (iii) meanwhile, optimization using a cross-modal consistency loss facilitates the generation of textual reports that are semantically consistent with medical images. Extensive experimental studies on two public benchmark datasets demonstrate that C^2M-DoT substantially outperforms state-of-the-art baselines in all metrics. Ablation studies also confirmed the validity and necessity of each component in C^2M-DoT.

Infrared Small Target Detection Using Double-Weighted Multi-Granularity Patch Tensor Model With Tensor-Train Decomposition

  • paper_url: http://arxiv.org/abs/2310.05347
  • repo_url: None
  • paper_authors: Guiyu Zhang, Qunbo Lv, Zui Tao, Baoyu Zhu, Zheng Tan, Yuan Ma
  • for: Infrared small target detection plays an important role in remote sensing; this paper proposes a new double-weighted multi-granularity infrared patch tensor (DWMGIPT) model to address the shortcomings of existing methods.
  • methods: Constructs a multi-granularity infrared patch tensor (MGIPT) from non-overlapping patches with tensor augmentation based on the tensor train (TT) decomposition to capture information at different granularities; an auto-weighted mechanism balances the importance of information at each granularity; and a steering kernel (SK) extracts local structure priors to suppress background interference. The model is solved with an ADMM-based optimization algorithm.
  • results: Across various challenging scenes the method is robust to noise, and under different evaluation metrics it achieves better detection performance than eight other state-of-the-art methods.
    Abstract Infrared small target detection plays an important role in the remote sensing field. Therefore, many detection algorithms have been proposed, in which the infrared patch-tensor (IPT) model has become a mainstream tool due to its excellent performance. However, most IPT-based methods face great challenges, such as inaccurate measurement of tensor low-rankness and poor robustness to complex scenes, which will lead to poor detection performance. In order to solve these problems, this paper proposes a novel double-weighted multi-granularity infrared patch tensor (DWMGIPT) model. First, to capture granularity information of the tensor from multiple modes, a multi-granularity infrared patch tensor (MGIPT) model is constructed by collecting nonoverlapping patches and tensor augmentation based on the tensor train (TT) decomposition. Second, to explore the latent structure of the tensor more efficiently, we utilize the auto-weighted mechanism to balance the importance of information at different granularities. Then, the steering kernel (SK) is employed to extract local structure priors, which suppresses background interference such as strong edges and noise. Finally, an efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) is presented to solve the model. Extensive experiments in various challenging scenes show that the proposed algorithm is robust to noise and different scenes. Compared with the other eight state-of-the-art methods, different evaluation metrics demonstrate that our method achieves better detection performance in various complex scenes.
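As a small illustration of the patch-tensor construction that this family of methods starts from, the sketch below stacks non-overlapping patches of an infrared image into a third-order tensor (one granularity level). The TT-decomposition-based augmentation, the double weighting, and the ADMM solver are not shown.

```python
import torch

def build_patch_tensor(img, patch=32, stride=32):
    """Stack non-overlapping patches of an infrared image into a third-order
    patch tensor. img: (H, W) tensor."""
    patches = img.unfold(0, patch, stride).unfold(1, patch, stride)  # (nH, nW, p, p)
    return patches.reshape(-1, patch, patch).permute(1, 2, 0)        # (p, p, n_patches)
```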

Anyview: Generalizable Indoor 3D Object Detection with Variable Frames

  • paper_url: http://arxiv.org/abs/2310.05346
  • repo_url: https://github.com/xuxw98/AnyView
  • paper_authors: Zhenyu Wu, Xiuwei Xu, Ziwei Wang, Chong Xia, Linqing Zhao, Jiwen Lu, Haibin Yan
  • for: Proposes a new network framework for indoor 3D object detection that handles variable numbers of input frames in practical application scenarios.
  • methods: Introduces AnyView, a 3D detection framework with a geometric learner that mines local geometric features from each RGB-D frame, a spatial mixture module for local-global feature interaction, and a dynamic token strategy that adaptively adjusts the number of extracted features per frame.
  • results: Extensive experiments on the ScanNet dataset show both strong generalizability and high detection accuracy, with a parameter count similar to the baselines.
    Abstract In this paper, we propose a novel network framework for indoor 3D object detection to handle variable input frame numbers in practical scenarios. Existing methods only consider fixed frames of input data for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. While in practical application scenes such as robot navigation and manipulation, the raw input to the 3D detectors is the RGB-D images with variable frame numbers instead of the reconstructed scene point cloud. However, the previous approaches can only handle fixed frame input data and have poor performance with variable frame input. In order to facilitate 3D object detection methods suitable for practical tasks, we present a novel 3D detection framework named AnyView for our practical applications, which generalizes well across different numbers of input frames with a single model. To be specific, we propose a geometric learner to mine the local geometric features of each input RGB-D image frame and implement local-global feature interaction through a designed spatial mixture module. Meanwhile, we further utilize a dynamic token strategy to adaptively adjust the number of extracted features for each frame, which ensures consistent global feature density and further enhances the generalization after fusion. Extensive experiments on the ScanNet dataset show our method achieves both great generalizability and high detection accuracy with a simple and clean architecture containing a similar amount of parameters with the baselines.

What do larger image classifiers memorise?

  • paper_url: http://arxiv.org/abs/2310.05337
  • repo_url: None
  • paper_authors: Michal Lukasik, Vaishnavh Nagarajan, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar
  • for: Studying the connection between memorisation and generalisation in modern neural networks, in particular whether larger models memorise more.
  • methods: Uses the metric proposed by Feldman to quantify the degree of memorisation of individual training examples, and empirically computes memorisation profiles across model sizes on image classification benchmarks.
  • results: Finds that training examples exhibit surprisingly diverse memorisation trajectories across model sizes: most examples are memorised less by larger models, while the rest show cap-shaped or increasing memorisation. Various proxies for the Feldman memorisation score fail to capture these trends, and knowledge distillation tends to inhibit memorisation while improving generalisation.
    Abstract The success of modern neural networks has prompted study of the connection between memorisation and generalisation: overparameterised models generalise well, despite being able to perfectly fit (memorise) completely random labels. To carefully study this issue, Feldman proposed a metric to quantify the degree of memorisation of individual training examples, and empirically computed the corresponding memorisation profile of a ResNet on image classification bench-marks. While an exciting first glimpse into what real-world models memorise, this leaves open a fundamental question: do larger neural models memorise more? We present a comprehensive empirical analysis of this question on image classification benchmarks. We find that training examples exhibit an unexpectedly diverse set of memorisation trajectories across model sizes: most samples experience decreased memorisation under larger models, while the rest exhibit cap-shaped or increasing memorisation. We show that various proxies for the Feldman memorization score fail to capture these fundamental trends. Lastly, we find that knowledge distillation, an effective and popular model compression technique, tends to inhibit memorisation, while also improving generalisation. Specifically, memorisation is mostly inhibited on examples with increasing memorisation trajectories, thus pointing at how distillation improves generalisation.
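The Feldman-style memorisation estimate that the study builds on is simple to state: for a training example, compare how often models trained with it classify it correctly against how often models trained without it do. A toy sketch under that reading is below; the variable names and the subset-training protocol producing the inputs are assumptions.

```python
import numpy as np

def memorization_score(in_correct, out_correct):
    """Estimate of the memorisation value for one training example:
    accuracy of models trained WITH the example minus accuracy of models
    trained WITHOUT it. Inputs are 0/1 correctness indicators collected from
    many models trained on random data subsets (assumed protocol)."""
    return float(np.mean(in_correct) - np.mean(out_correct))

# An example that is only classified correctly when it sits in the training
# set is heavily memorised:
print(memorization_score([1, 1, 1, 1], [0, 0, 1, 0]))   # 0.75
```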

GReAT: A Graph Regularized Adversarial Training Method

  • paper_url: http://arxiv.org/abs/2310.05336
  • repo_url: None
  • paper_authors: Samet Bayram, Kenneth Barner
  • for: Improving the classification performance and robustness of deep learning models.
  • methods: Incorporates the graph structure of the data into adversarial training.
  • results: Outperforms state-of-the-art classification methods, improving both robustness to adversarial attacks and generalization.
    Abstract This paper proposes a regularization method called GReAT, Graph Regularized Adversarial Training, to improve deep learning models' classification performance. Adversarial examples are a well-known challenge in machine learning, where small, purposeful perturbations to input data can mislead models. Adversarial training, a powerful and one of the most effective defense strategies, involves training models with both regular and adversarial examples. However, it often neglects the underlying structure of the data. In response, we propose GReAT, a method that leverages data graph structure to enhance model robustness. GReAT deploys the graph structure of the data into the adversarial training process, resulting in more robust models that better generalize its testing performance and defend against adversarial attacks. Through extensive evaluation on benchmark datasets, we demonstrate GReAT's effectiveness compared to state-of-the-art classification methods, highlighting its potential in improving deep learning models' classification performance.
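A hedged sketch of what graph-regularized adversarial training could look like in PyTorch is shown below: standard FGSM adversarial examples plus a Laplacian smoothness penalty over a batch-level graph. The FGSM attack, the logit-level smoothness term, and the batch adjacency matrix are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def great_loss(model, x, y, adj, eps=0.03, lam=0.1):
    """x: (B, ...) inputs, y: (B,) labels, adj: (B, B) batch adjacency matrix."""
    # FGSM adversarial examples
    x_req = x.detach().clone().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)[0]
    x_adv = (x + eps * grad.sign()).detach()

    # clean + adversarial classification losses
    logits_clean = model(x)
    logits_adv = model(x_adv)
    loss = F.cross_entropy(logits_clean, y) + F.cross_entropy(logits_adv, y)

    # graph smoothness: connected samples should receive similar logits
    lap = torch.diag(adj.sum(dim=1)) - adj
    smooth = torch.trace(logits_clean.T @ lap @ logits_clean) / x.shape[0]
    return loss + lam * smooth
```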

A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

  • paper_url: http://arxiv.org/abs/2310.05330
  • repo_url: None
  • paper_authors: Yang Wang, Jiaogen Zhou, Jihong Guan
  • For: This paper focuses on weakly supervised video anomaly detection, which is useful for effective and intelligent public safety management.
  • Methods: The proposed model uses an adaptive instance selection strategy and a lightweight multi-level temporal correlation attention module, reducing the number of model parameters to 0.56% of existing methods such as RTFM.
  • Results: The proposed model achieves comparable or superior AUC scores compared to state-of-the-art methods with a significantly reduced number of model parameters, making it suitable for resource-limited scenarios such as edge computing.
    Abstract Video anomaly detection is to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled whether or not they contain any anomalies, but there is no information about which frames the anomalies are located. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially the resource-limit situations such as edge-computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56\% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets UCF-Crime and ShanghaiTech show that our model can achieve comparable or even superior AUC score compared to the state-of-the-art methods, with a significantly reduced number of model parameters.
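The selection idea can be sketched in a few lines: score every snippet of an anomalous video, keep the most confident ones, and randomly pick one of them as this step's training target. The fixed top-k and the uniform random pick are simplifications of the adaptive, model-status-dependent strategy described above.

```python
import torch

def select_instance(scores, k=3):
    """scores: (T,) anomaly scores for the snippets of one anomalous video.
    Keep the k highest-scoring snippets and randomly choose one of them as
    the pseudo-label target for this training step."""
    topk = scores.topk(min(k, scores.numel())).indices
    chosen = topk[torch.randint(len(topk), (1,))].item()
    return topk, chosen
```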

Edge Computing-Enabled Road Condition Monitoring: System Development and Evaluation

  • paper_url: http://arxiv.org/abs/2310.05321
  • repo_url: None
  • paper_authors: Abdulateef Daud, Mark Amo-Boateng, Neema Jakisa Owor, Armstrong Aboah, Yaw Adu-Gyamfi
  • For: Real-time pavement condition monitoring for highway agencies to inform pavement maintenance and rehabilitation policies.
  • Methods: Utilizes affordable MEMS sensors, edge computing, and deployable machine learning models to stream live pavement condition data and reduce latency.
  • Results: Demonstrates high accuracy in predicting the International Roughness Index and classifying pavement segments based on ride quality, with the potential to provide real-time data to State Highway Agencies and Departments of Transportation.
    Abstract Real-time pavement condition monitoring provides highway agencies with timely and accurate information that could form the basis of pavement maintenance and rehabilitation policies. Existing technologies rely heavily on manual data processing, are expensive and therefore, difficult to scale for frequent, networklevel pavement condition monitoring. Additionally, these systems require sending large packets of data to the cloud which requires large storage space, are computationally expensive to process, and results in high latency. The current study proposes a solution that capitalizes on the widespread availability of affordable Micro Electro-Mechanical System (MEMS) sensors, edge computing and internet connection capabilities of microcontrollers, and deployable machine learning (ML) models to (a) design an Internet of Things (IoT)-enabled device that can be mounted on axles of vehicles to stream live pavement condition data (b) reduce latency through on-device processing and analytics of pavement condition sensor data before sending to the cloud servers. In this study, three ML models including Random Forest, LightGBM and XGBoost were trained to predict International Roughness Index (IRI) at every 0.1-mile segment. XGBoost had the highest accuracy with an RMSE and MAPE of 16.89in/mi and 20.3%, respectively. In terms of the ability to classify the IRI of pavement segments based on ride quality according to MAP-21 criteria, our proposed device achieved an average accuracy of 96.76% on I-70EB and 63.15% on South Providence. Overall, our proposed device demonstrates significant potential in providing real-time pavement condition data to State Highway Agencies (SHA) and Department of Transportation (DOTs) with a satisfactory level of accuracy.
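A minimal sketch of the regression stage, training a gradient-boosted model to predict IRI per road segment, is given below with synthetic data standing in for the MEMS-derived features. Feature values and hyper-parameters are placeholders; the on-device streaming and the MAP-21 ride-quality classification are not shown.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic stand-in for per-segment MEMS features (vibration statistics, speed, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = 80 + 30 * X[:, 0] + rng.normal(scale=5, size=1000)   # synthetic IRI in in/mi

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
print(f"IRI RMSE: {rmse:.2f} in/mi")
```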

Understanding the Feature Norm for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.05316
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, Andrew Beng Jin Teoh
  • for: Explaining a striking phenomenon in neural networks trained on classification datasets: hidden-layer features have larger vector norms for in-distribution (ID) samples and relatively lower norms for out-of-distribution (OOD) samples.
  • methods: Scrutinizes the discriminative structures concealed in the intermediate layers of a network and finds that (1) the feature norm acts as the confidence value, specifically the maximum logit, of a classifier hidden in the layer; (2) the feature norm is class-agnostic and can detect OOD samples across diverse discriminative models; and (3) the conventional feature norm fails to capture the deactivation tendency of hidden-layer neurons, which can cause ID samples to be misidentified as OOD.
  • results: Proposes a novel negative-aware norm (NAN) that captures both the activation and deactivation tendencies of hidden-layer neurons and is compatible with existing OOD detectors; extensive experiments demonstrate its efficacy and its applicability in label-free environments.
    Abstract A neural network trained on a classification dataset often exhibits a higher vector norm of hidden layer features for in-distribution (ID) samples, while producing relatively lower norm values on unseen instances from out-of-distribution (OOD). Despite this intriguing phenomenon being utilized in many applications, the underlying cause has not been thoroughly investigated. In this study, we demystify this very phenomenon by scrutinizing the discriminative structures concealed in the intermediate layers of a neural network. Our analysis leads to the following discoveries: (1) The feature norm is a confidence value of a classifier hidden in the network layer, specifically its maximum logit. Hence, the feature norm distinguishes OOD from ID in the same manner that a classifier confidence does. (2) The feature norm is class-agnostic, thus it can detect OOD samples across diverse discriminative models. (3) The conventional feature norm fails to capture the deactivation tendency of hidden layer neurons, which may lead to misidentification of ID samples as OOD instances. To resolve this drawback, we propose a novel negative-aware norm (NAN) that can capture both the activation and deactivation tendencies of hidden layer neurons. We conduct extensive experiments on NAN, demonstrating its efficacy and compatibility with existing OOD detectors, as well as its capability in label-free environments.
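A sketch of norm-based OOD scoring is given below. The plain positive-part feature norm follows the observation that ID samples tend to produce larger activations; the negative-aware variant shown, which also accounts for how strongly neurons are deactivated, is only a guess at the spirit of NAN and not the paper's exact definition.

```python
import torch

def ood_scores(feats):
    """feats: (B, D) pre-activation penultimate features.
    Returns the plain positive-part feature norm and a negative-aware variant.
    Higher score -> more likely in-distribution."""
    plain = feats.clamp(min=0).norm(dim=-1)
    negative_aware = plain - feats.clamp(max=0).norm(dim=-1)
    return plain, negative_aware
```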