cs.CV - 2023-11-06

Toward Planet-Wide Traffic Camera Calibration

  • paper_url: http://arxiv.org/abs/2311.04243
  • repo_url: None
  • paper_authors: Khiem Vuong, Robert Tamburo, Srinivasa G. Narasimhan
  • for: addresses the challenge of calibration for outdoor cameras, which has limited their potential for automated analysis.
  • methods: uses street-level imagery to reconstruct a metric 3D model and accurately localize over 100 global traffic cameras, demonstrating a scalable framework.
  • results: achieves significant enhancements over existing automatic calibration techniques and enables traffic analysis through 3D vehicle reconstruction and speed measurement.
    Abstract Despite the widespread deployment of outdoor cameras, their potential for automated analysis remains largely untapped due, in part, to calibration challenges. The absence of precise camera calibration data, including intrinsic and extrinsic parameters, hinders accurate real-world distance measurements from captured videos. To address this, we present a scalable framework that utilizes street-level imagery to reconstruct a metric 3D model, facilitating precise calibration of in-the-wild traffic cameras. Notably, our framework achieves 3D scene reconstruction and accurate localization of over 100 global traffic cameras and is scalable to any camera with sufficient street-level imagery. For evaluation, we introduce a dataset of 20 fully calibrated traffic cameras, demonstrating our method's significant enhancements over existing automatic calibration techniques. Furthermore, we highlight our approach's utility in traffic analysis by extracting insights via 3D vehicle reconstruction and speed measurement, thereby opening up the potential of using outdoor cameras for automated analysis.
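Once 2D-3D correspondences between a traffic-camera frame and the metric street-level reconstruction are available, localizing the camera reduces to a perspective-n-point solve. The sketch below illustrates only that final step with OpenCV; the correspondences, intrinsics, and point values are hypothetical, and the paper's actual pipeline (matching against the reconstructed 3D model, intrinsics estimation) is not shown.

```python
import numpy as np
import cv2

# Hypothetical 2D-3D correspondences between the traffic-camera image and the
# metric street-level reconstruction (coordinates in metres, purely illustrative).
points_3d = np.array([[0.0, 0.0, 12.0], [2.5, 0.0, 12.5], [2.5, 1.8, 13.0],
                      [0.0, 1.8, 13.5], [1.2, 0.9, 14.0], [3.0, 0.4, 15.0]], dtype=np.float64)
points_2d = np.array([[320.0, 400.0], [480.0, 395.0], [478.0, 300.0],
                      [318.0, 295.0], [400.0, 345.0], [515.0, 370.0]], dtype=np.float64)
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])  # assumed pinhole intrinsics

# Solve for the camera pose in the reconstruction's metric frame.
ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, distCoeffs=None)
print(ok, tvec.ravel())
```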

Unsupervised Region-Growing Network for Object Segmentation in Atmospheric Turbulence

  • paper_url: http://arxiv.org/abs/2311.03572
  • repo_url: None
  • paper_authors: Dehao Qin, Ripon Saha, Suren Jayasuriya, Jinwei Ye, Nianyi Li
  • for: proposes an unsupervised foreground object segmentation network for dynamic scenes degraded by atmospheric turbulence.
  • methods: averaged optical flow drives a novel region-growing algorithm that produces preliminary masks for each moving object in the video; a U-Net with consistency and grouping losses then refines these masks to improve their spatio-temporal alignment.
  • results: requires no labeled training data, works across varied turbulence strengths, and achieves higher segmentation accuracy and robustness than current unsupervised methods on a newly released moving-object segmentation dataset.
    Abstract In this paper, we present a two-stage unsupervised foreground object segmentation network tailored for dynamic scenes affected by atmospheric turbulence. In the first stage, we utilize averaged optical flow from turbulence-distorted image sequences to feed a novel region-growing algorithm, crafting preliminary masks for each moving object in the video. In the second stage, we employ a U-Net architecture with consistency and grouping losses to further refine these masks optimizing their spatio-temporal alignment. Our approach does not require labeled training data and works across varied turbulence strengths for long-range video. Furthermore, we release the first moving object segmentation dataset of turbulence-affected videos, complete with manually annotated ground truth masks. Our method, evaluated on this new dataset, demonstrates superior segmentation accuracy and robustness as compared to current state-of-the-art unsupervised methods.
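A minimal sketch of the first-stage idea: grow a mask from a seed over a temporally averaged optical-flow magnitude map. The tolerance, seeding rule, and 4-connectivity here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from collections import deque

def region_grow(flow_mag, seed, tol=0.3):
    """Grow a region from `seed` over pixels whose averaged flow magnitude
    stays within `tol` of the seed value (4-connected flood fill)."""
    h, w = flow_mag.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = flow_mag[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(flow_mag[ny, nx] - seed_val) <= tol:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask

# Illustrative usage: seed at the most strongly moving pixel.
flow_mag = np.random.rand(64, 64)  # stand-in for a temporally averaged |flow| map
seed = np.unravel_index(np.argmax(flow_mag), flow_mag.shape)
preliminary_mask = region_grow(flow_mag, seed)
```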

Cal-DETR: Calibrated Detection Transformer

  • paper_url: http://arxiv.org/abs/2311.03570
  • repo_url: https://github.com/akhtarvision/cal-detr
  • paper_authors: Muhammad Akhtar Munir, Salman Khan, Muhammad Haris Khan, Mohsen Ali, Fahad Shahbaz Khan
  • for: aims to calibrate modern transformer-based object detectors so that their confidence estimates are reliable enough for safety-critical applications.
  • methods: proposes Cal-DETR, a train-time calibration mechanism comprising a simple yet effective uncertainty quantification approach for transformer-based object detectors and an uncertainty-guided class-logit modulation mechanism.
  • results: Cal-DETR effectively calibrates detections in both in-domain and out-domain test scenarios while maintaining or even improving detection performance.
    Abstract Albeit revealing impressive predictive performance for several computer vision tasks, deep neural networks (DNNs) are prone to making overconfident predictions. This limits the adoption and wider utilization of DNNs in many safety-critical applications. There have been recent efforts toward calibrating DNNs, however, almost all of them focus on the classification task. Surprisingly, very little attention has been devoted to calibrating modern DNN-based object detectors, especially detection transformers, which have recently demonstrated promising detection performance and are influential in many decision-making systems. In this work, we address the problem by proposing a mechanism for calibrated detection transformers (Cal-DETR), particularly for Deformable-DETR, UP-DETR and DINO. We pursue the train-time calibration route and make the following contributions. First, we propose a simple yet effective approach for quantifying uncertainty in transformer-based object detectors. Second, we develop an uncertainty-guided logit modulation mechanism that leverages the uncertainty to modulate the class logits. Third, we develop a logit mixing approach that acts as a regularizer with detection-specific losses and is also complementary to the uncertainty-guided logit modulation technique to further improve the calibration performance. Lastly, we conduct extensive experiments across three in-domain and four out-domain scenarios. Results corroborate the effectiveness of Cal-DETR against the competing train-time methods in calibrating both in-domain and out-domain detections while maintaining or even improving the detection performance. Our codebase and pre-trained models can be accessed at \url{https://github.com/akhtarvision/cal-detr}.
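An illustrative sketch of uncertainty-guided logit modulation: treat the variance of a query's class logits across decoder layers as an uncertainty signal and use it to scale the final-layer logits. The paper's exact formulation (and its logit mixing regularizer) differs and is available in the released code; everything below is a simplified assumption.

```python
import numpy as np

def modulate_logits(layer_logits):
    """Hypothetical uncertainty-guided logit modulation.
    layer_logits: (L, Q, C) class logits from L decoder layers for Q queries.
    Uncertainty per query = variance of its logits across layers, averaged over
    classes; uncertain queries get their final-layer logits scaled down so their
    class scores become softer."""
    uncertainty = layer_logits.var(axis=0).mean(axis=-1)          # (Q,)
    confidence = 1.0 - uncertainty / (uncertainty.max() + 1e-8)   # (Q,), in [0, 1]
    return layer_logits[-1] * confidence[:, None]                 # (Q, C)

logits = np.random.randn(6, 100, 80)   # 6 decoder layers, 100 queries, 80 classes
modulated = modulate_logits(logits)
```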

Sea You Later: Metadata-Guided Long-Term Re-Identification for UAV-Based Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2311.03561
  • repo_url: None
  • paper_authors: Cheng-Yen Yang, Hsiang-Wei Huang, Zhongyu Jiang, Heng-Cheng Kuo, Jie Mei, Chung-I Huang, Jenq-Neng Hwang
  • for: addresses multi-object tracking for UAVs in maritime computer vision, specifically the difficulties of short-term re-identification (ReID) and long-term tracking.
  • methods: proposes an adaptable metadata-guided multi-object tracking algorithm (MG-MOT) that merges short-term tracking data into coherent long-term tracks by exploiting UAV metadata such as GPS position, drone altitude, and camera orientation.
  • results: achieves state-of-the-art performance in the latest UAV-based Maritime Object Tracking Challenge on the SeaDroneSee tracking dataset, with a HOTA of 69.5% and an IDF1 of 85.9% on the test split.
    Abstract Re-identification (ReID) in multi-object tracking (MOT) for UAVs in maritime computer vision has been challenging for several reasons. More specifically, short-term re-identification (ReID) is difficult due to the nature of the characteristics of small targets and the sudden movement of the drone's gimbal. Long-term ReID suffers from the lack of useful appearance diversity. In response to these challenges, we present an adaptable motion-based MOT algorithm, called Metadata Guided MOT (MG-MOT). This algorithm effectively merges short-term tracking data into coherent long-term tracks, harnessing crucial metadata from UAVs, including GPS position, drone altitude, and camera orientations. Extensive experiments are conducted to validate the efficacy of our MOT algorithm. Utilizing the challenging SeaDroneSee tracking dataset, which encompasses the aforementioned scenarios, we achieve a much-improved performance in the latest edition of the UAV-based Maritime Object Tracking Challenge with a state-of-the-art HOTA of 69.5% and an IDF1 of 85.9% on the testing split.
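A hypothetical simplification of how UAV metadata can place image detections in a common world frame so short-term tracklets can be stitched over time: with a nadir-facing camera, altitude and yaw suffice to map a pixel to an approximate ground position. The geometry, parameter names, and values below are illustrative, not the paper's formulation.

```python
import numpy as np

def pixel_to_world(uv, cam_gps, altitude_m, yaw_rad, fx, fy, cx, cy):
    """Project a pixel detection to approximate world coordinates, assuming a
    nadir-facing camera over a flat sea surface.
    uv: (u, v) pixel; cam_gps: (east, north) UAV position in metres."""
    # Ground-plane offset in the camera frame (metres), by similar triangles.
    dx = (uv[0] - cx) / fx * altitude_m
    dy = (uv[1] - cy) / fy * altitude_m
    # Rotate by camera yaw into the world frame and add the UAV position.
    east = cam_gps[0] + dx * np.cos(yaw_rad) - dy * np.sin(yaw_rad)
    north = cam_gps[1] + dx * np.sin(yaw_rad) + dy * np.cos(yaw_rad)
    return np.array([east, north])

# Illustrative usage with made-up intrinsics and metadata.
world_xy = pixel_to_world((700, 420), cam_gps=(120.0, 45.0), altitude_m=60.0,
                          yaw_rad=0.3, fx=1200.0, fy=1200.0, cx=640.0, cy=360.0)
```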

Spatio-Temporal Similarity Measure based Multi-Task Learning for Predicting Alzheimer’s Disease Progression using MRI Data

  • paper_url: http://arxiv.org/abs/2311.03557
  • repo_url: None
  • paper_authors: Xulong Wang, Yu Zhang, Menghui Zhou, Tong Liu, Jun Qi, Po Yang
  • for: proposes a multi-task learning approach for effectively predicting Alzheimer's disease (AD) progression and sensitively capturing the critical relationships between biomarkers as the disease evolves.
  • methods: defines a temporal measure estimating the magnitude and velocity of each biomarker's change over time, converts this trend into a vector, and compares the variability between biomarkers in a unified vector space.
  • results: experiments show the method predicts disease progression more effectively than directly ROI-based learning; it also enables longitudinal stability selection to identify the changing relationships between biomarkers, and shows that synergistically deteriorating biomarkers among cortical volumes or surface areas significantly affect cognitive prediction.
    Abstract Identifying and utilising various biomarkers for tracking Alzheimer's disease (AD) progression have received many recent attentions and enable helping clinicians make the prompt decisions. Traditional progression models focus on extracting morphological biomarkers in regions of interest (ROIs) from MRI/PET images, such as regional average cortical thickness and regional volume. They are effective but ignore the relationships between brain ROIs over time, which would lead to synergistic deterioration. For exploring the synergistic deteriorating relationship between these biomarkers, in this paper, we propose a novel spatio-temporal similarity measure based multi-task learning approach for effectively predicting AD progression and sensitively capturing the critical relationships between biomarkers. Specifically, we firstly define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicate a changing trend(temporal). Converting this trend into the vector, we then compare this variability between biomarkers in a unified vector space(spatial). The experimental results show that compared with directly ROI based learning, our proposed method is more effective in predicting disease progression. Our method also enables performing longitudinal stability selection to identify the changing relationships between biomarkers, which play a key role in disease progression. We prove that the synergistic deteriorating biomarkers between cortical volumes or surface areas have a significant effect on the cognitive prediction.
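A minimal reading of the proposed temporal measure: summarize each biomarker's trajectory by the magnitude and velocity of its change, then compare the resulting trend vectors in a shared vector space. The toy series and cosine comparison below are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def trend_vector(series, times):
    """Summarise a biomarker trajectory as (magnitude, velocity) of change."""
    magnitude = series[-1] - series[0]
    velocity = magnitude / (times[-1] - times[0])
    return np.array([magnitude, velocity])

def spatio_temporal_similarity(series_a, series_b, times):
    """Cosine similarity between two biomarkers' trend vectors in a shared space."""
    va, vb = trend_vector(series_a, times), trend_vector(series_b, times)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-8))

t = np.array([0.0, 0.5, 1.0, 1.5])            # years since baseline
thickness = np.array([2.6, 2.5, 2.4, 2.3])    # illustrative regional cortical thickness
volume = np.array([3.1, 3.0, 2.9, 2.75])      # illustrative regional volume
print(spatio_temporal_similarity(thickness, volume, t))
```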

Leveraging point annotations in segmentation learning with boundary loss

  • paper_url: http://arxiv.org/abs/2311.03537
  • repo_url: None
  • paper_authors: Eva Breznik, Hoel Kervadec, Filip Malmberg, Joel Kullberg, Håkan Ahlström, Marleen de Bruijne, Robin Strand
  • for: studies point-supervised semantic segmentation using intensity-based distance maps combined with the boundary loss.
  • methods: the boundary loss penalizes false positives more strongly the farther they occur from the object; intensity-aware distances are used to soften this penalty so that a certain amount of false positives (with respect to the weak ground truth) remains tolerable.
  • results: experiments on the multi-class ACDC and POEM datasets are encouraging; the supervision strategy outperforms a CRF-loss based approach on ACDC and performs on par with it on POEM.
    Abstract This paper investigates the combination of intensity-based distance maps with boundary loss for point-supervised semantic segmentation. By design the boundary loss imposes a stronger penalty on the false positives the farther away from the object they occur. Hence it is intuitively inappropriate for weak supervision, where the ground truth label may be much smaller than the actual object and a certain amount of false positives (w.r.t. the weak ground truth) is actually desirable. Using intensity-aware distances instead may alleviate this drawback, allowing for a certain amount of false positives without a significant increase to the training loss. The motivation for applying the boundary loss directly under weak supervision lies in its great success for fully supervised segmentation tasks, but also in not requiring extra priors or outside information that is usually required -- in some form -- with existing weakly supervised methods in the literature. This formulation also remains potentially more attractive than existing CRF-based regularizers, due to its simplicity and computational efficiency. We perform experiments on two multi-class datasets; ACDC (heart segmentation) and POEM (whole-body abdominal organ segmentation). Preliminary results are encouraging and show that this supervision strategy has great potential. On ACDC it outperforms the CRF-loss based approach, and on POEM data it performs on par with it. The code for all our experiments is openly available.
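For reference, a minimal sketch of the standard boundary loss computed with a Euclidean signed distance map; the paper's contribution is to replace this distance with an intensity-aware one, which is not shown here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_loss(probs, gt_mask):
    """Boundary loss with a Euclidean signed distance map.
    probs: (H, W) predicted foreground probabilities; gt_mask: (H, W) bool."""
    # Signed distance: positive outside the ground-truth object, negative inside.
    dist_out = distance_transform_edt(~gt_mask)
    dist_in = distance_transform_edt(gt_mask)
    signed_dist = dist_out - dist_in
    # False positives far from the object incur a large penalty.
    return float((signed_dist * probs).mean())

gt = np.zeros((64, 64), dtype=bool)
gt[20:40, 20:40] = True            # toy (weak) ground-truth region
pred = np.random.rand(64, 64)
print(boundary_loss(pred, gt))
```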

High-resolution power equipment recognition based on improved self-attention

  • paper_url: http://arxiv.org/abs/2311.03518
  • repo_url: None
  • paper_authors: Siyi Zhang, Cheng Liu, Xiang Li, Xin Zhai, Zhen Wei, Sizhe Li, Xun Ma
  • for: improves the accuracy of power-equipment (transformer) image recognition, where the parameter budgets of existing models prevent directly using high-resolution images.
  • methods: proposes an improved deep self-attention network comprising a backbone network, a region proposal network, a target-region extraction and segmentation module, and a final prediction network.
  • results: comparative experiments show the method substantially outperforms two prevalent target recognition models on power-equipment recognition, offering a new perspective for automating electrical equipment inspections.
    Abstract The current trend of automating inspections at substations has sparked a surge in interest in the field of transformer image recognition. However, due to restrictions in the number of parameters in existing models, high-resolution images can't be directly applied, leaving significant room for enhancing recognition accuracy. Addressing this challenge, the paper introduces a novel improvement on deep self-attention networks tailored for this issue. The proposed model comprises four key components: a foundational network, a region proposal network, a module for extracting and segmenting target areas, and a final prediction network. The innovative approach of this paper differentiates itself by decoupling the processes of part localization and recognition, initially using low-resolution images for localization followed by high-resolution images for recognition. Moreover, the deep self-attention network's prediction mechanism uniquely incorporates the semantic context of images, resulting in substantially improved recognition performance. Comparative experiments validate that this method outperforms the two other prevalent target recognition models, offering a groundbreaking perspective for automating electrical equipment inspections.

SoundCam: A Dataset for Finding Humans Using Room Acoustics

  • paper_url: http://arxiv.org/abs/2311.03517
  • repo_url: None
  • paper_authors: Mason Wang, Samuel Clarke, Jui-Hsien Wang, Ruohan Gao, Jiajun Wu
  • for: provides a large-scale dataset of real-world room acoustic measurements for studying the acoustic properties of rooms and enabling downstream machine perception applications.
  • methods: records 5,000 10-channel real-world room impulse response measurements and 2,000 10-channel music recordings in three different rooms (a controlled acoustic lab, an in-the-wild living room, and a conference room), with humans placed at different positions throughout each room.
  • results: shows that these measurements can be used to detect and identify humans and to track their positions.
    Abstract A room's acoustic properties are a product of the room's geometry, the objects within the room, and their specific positions. A room's acoustic properties can be characterized by its impulse response (RIR) between a source and listener location, or roughly inferred from recordings of natural signals present in the room. Variations in the positions of objects in a room can effect measurable changes in the room's acoustic properties, as characterized by the RIR. Existing datasets of RIRs either do not systematically vary positions of objects in an environment, or they consist of only simulated RIRs. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms, including a controlled acoustic lab, an in-the-wild living room, and a conference room, with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans, and tracking their positions.

Predicting Age from White Matter Diffusivity with Residual Learning

  • paper_url: http://arxiv.org/abs/2311.03500
  • repo_url: None
  • paper_authors: Chenyu Gao, Michael E. Kim, Ho Hin Lee, Qi Yang, Nazirah Mohd Khairi, Praitayini Kanakaraj, Nancy R. Newlin, Derek B. Archer, Angela L. Jefferson, Warren D. Taylor, Brian D. Boyd, Lori L. Beason-Held, Susan M. Resnick, The BIOCARD Study Team, Yuankai Huo, Katherine D. Van Schaik, Kurt G. Schilling, Daniel Moyer, Ivana Išgum, Bennett A. Landman
  • for: develops white-matter-specific brain age estimation to capture deviations from normal white matter aging.
  • methods: predicts age from DTI scalar images using two approaches: one extracts only microstructural features from regions of interest; the other applies 3D residual neural networks (ResNets) to learn features directly from images that are non-linearly registered and warped to a template to minimize macrostructural variations.
  • results: on unseen test data, the first method yields a mean absolute error (MAE) of 6.11 years for cognitively normal participants and 6.62 years for cognitively impaired participants, while the ResNet-based method achieves 4.69 and 4.96 years respectively, indicating that the ResNet model captures subtler, non-macrostructural features for age prediction.
    Abstract Imaging findings inconsistent with those expected at specific chronological age ranges may serve as early indicators of neurological disorders and increased mortality risk. Estimation of chronological age, and deviations from expected results, from structural MRI data has become an important task for developing biomarkers that are sensitive to such deviations. Complementary to structural analysis, diffusion tensor imaging (DTI) has proven effective in identifying age-related microstructural changes within the brain white matter, thereby presenting itself as a promising additional modality for brain age prediction. Although early studies have sought to harness DTI's advantages for age estimation, there is no evidence that the success of this prediction is owed to the unique microstructural and diffusivity features that DTI provides, rather than the macrostructural features that are also available in DTI data. Therefore, we seek to develop white-matter-specific age estimation to capture deviations from normal white matter aging. Specifically, we deliberately disregard the macrostructural information when predicting age from DTI scalar images, using two distinct methods. The first method relies on extracting only microstructural features from regions of interest. The second applies 3D residual neural networks (ResNets) to learn features directly from the images, which are non-linearly registered and warped to a template to minimize macrostructural variations. When tested on unseen data, the first method yields mean absolute error (MAE) of 6.11 years for cognitively normal participants and MAE of 6.62 years for cognitively impaired participants, while the second method achieves MAE of 4.69 years for cognitively normal participants and MAE of 4.96 years for cognitively impaired participants. We find that the ResNet model captures subtler, non-macrostructural features for brain age prediction.

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

  • paper_url: http://arxiv.org/abs/2311.03354
  • repo_url: https://github.com/UMass-Foundation-Model/CoVLM
  • paper_authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
  • for: improves the compositional abilities of large vision-language foundation models (VLMs) so they can correctly represent visual entities and the relations among them.
  • methods: introduces novel communication tokens that enable dynamic communication between the visual detection system and the language model, allowing the LLM to explicitly compose visual entities and relationships.
  • results: outperforms previous VLMs by a large margin on compositional reasoning benchmarks (roughly +20% HICO-DET mAP, +14% Cola top-1 accuracy, and +3% ARO top-1 accuracy) while also achieving state-of-the-art performance on traditional vision-language tasks.
    Abstract A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behaviors and inability to construct words that correctly represent visual entities and the relations among the entities. To this end, we propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text and dynamically communicate with the vision encoder and detection network to achieve vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, for dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions-of-interests (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. The vision-to-language and language-to-vision communication are iteratively performed until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% on ARO top-1 accuracy). We also achieve state-of-the-art performances on traditional vision-language tasks such as referring expression comprehension and visual question answering.

Rethinking Evaluation Metrics of Open-Vocabulary Segmentation

  • paper_url: http://arxiv.org/abs/2311.03352
  • repo_url: https://github.com/qqlu/entity
  • paper_authors: Hao Zhou, Tiancheng Shen, Xu Yang, Hai Huang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang
  • for: examines the evaluation-metric problem in open-vocabulary segmentation: evaluation still relies on closed-set metrics in zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground-truth categories.
  • methods: first surveys eleven similarity measurements between categorical words, based on WordNet linguistic statistics, text embeddings, and language models, through comprehensive quantitative analysis and a user study; building on these measurements, designs new evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored to three open-vocabulary segmentation tasks.
  • results: benchmarks twelve open-vocabulary segmentation methods and shows that, despite the inherent subjectivity of similarity distances, the proposed metrics can evaluate their open-vocabulary ability well; the evaluation code is released on GitHub.
    Abstract In this paper, we highlight a problem of evaluation metrics adopted in the open-vocabulary segmentation. That is, the evaluation process still heavily relies on closed-set metrics on zero-shot or cross-dataset pipelines without considering the similarity between predicted and ground truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words using WordNet linguistics statistics, text embedding, and language models by comprehensive quantitative analysis and user study. Built upon those explored measurements, we designed novel evaluation metrics, namely Open mIoU, Open AP, and Open PQ, tailored for three open-vocabulary segmentation tasks. We benchmarked the proposed evaluation metrics on 12 open-vocabulary methods of three segmentation tasks. Even though the relative subjectivity of similarity distance, we demonstrate that our metrics can still well evaluate the open ability of the existing open-vocabulary segmentation methods. We hope that our work can bring with the community new thinking about how to evaluate the open ability of models. The evaluation code is released in github.
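The core idea, crediting a prediction of a semantically close category instead of scoring it as a plain miss, can be sketched as a similarity-weighted accuracy. This is only an illustrative reading, not the paper's Open mIoU/AP/PQ definitions; the similarity matrix below is a toy stand-in for the surveyed WordNet/embedding/language-model measures.

```python
import numpy as np

def soft_pixel_accuracy(pred, gt, sim):
    """Similarity-weighted pixel accuracy: predicting a semantically close
    category (e.g. 'sofa' vs. 'couch') earns partial credit sim[gt, pred]
    instead of a hard 0/1."""
    return float(sim[gt.ravel(), pred.ravel()].mean())

# Toy similarity matrix for 3 categories (diagonal = exact match).
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
gt = np.random.randint(0, 3, size=(64, 64))
pred = np.random.randint(0, 3, size=(64, 64))
print(soft_pixel_accuracy(pred, gt, sim))
```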

Long-Term Invariant Local Features via Implicit Cross-Domain Correspondences

  • paper_url: http://arxiv.org/abs/2311.03345
  • repo_url: None
  • paper_authors: Zador Pataki, Mohammad Altillawi, Menelaos Kanakis, Rémi Pautrat, Fengyi Shen, Ziyuan Liu, Luc Van Gool, Marc Pollefeys
  • for: investigates how long-term visual domain variations (e.g., seasonal and daytime changes) affect visual localization and proposes a data-driven method to improve the cross-domain reliability of modern feature extraction networks.
  • methods: improves the supervision of modern feature extractors with a novel data-centric method, Implicit Cross-Domain Correspondences (iCDC), which represents the same environment with multiple Neural Radiance Fields, each fitted under an individual visual domain, and uses the underlying 3D representations to generate accurate correspondences across long-term visual conditions.
  • results: the trained networks significantly improve cross-domain localization performance, reducing the performance gap and consistently outperforming existing methods on popular long-term localization benchmarks.
    Abstract Modern learning-based visual feature extraction networks perform well in intra-domain localization, however, their performance significantly declines when image pairs are captured across long-term visual domain variations, such as different seasonal and daytime variations. In this paper, our first contribution is a benchmark to investigate the performance impact of long-term variations on visual localization. We conduct a thorough analysis of the performance of current state-of-the-art feature extraction networks under various domain changes and find a significant performance gap between intra- and cross-domain localization. We investigate different methods to close this gap by improving the supervision of modern feature extractor networks. We propose a novel data-centric method, Implicit Cross-Domain Correspondences (iCDC). iCDC represents the same environment with multiple Neural Radiance Fields, each fitting the scene under individual visual domains. It utilizes the underlying 3D representations to generate accurate correspondences across different long-term visual conditions. Our proposed method enhances cross-domain localization performance, significantly reducing the performance gap. When evaluated on popular long-term localization benchmarks, our trained networks consistently outperform existing methods. This work serves as a substantial stride toward more robust visual localization pipelines for long-term deployments, and opens up research avenues in the development of long-term invariant descriptors.

Cross-Image Attention for Zero-Shot Appearance Transfer

  • paper_url: http://arxiv.org/abs/2311.03335
  • repo_url: https://github.com/garibida/cross-image-attention
  • paper_authors: Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or
  • for: transfers the visual appearance between objects that share similar semantics but may differ significantly in shape.
  • methods: builds upon the self-attention layers of text-to-image generative models and introduces a cross-image attention mechanism that implicitly establishes semantic correspondences across images; three additional mechanisms manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
  • results: the zero-shot approach is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
    Abstract Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
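A minimal single-head sketch of the cross-image attention operation: queries come from the structure image's features while keys and values come from the appearance image's features. Learned projections, multi-head handling, and the integration into the denoising loop are omitted; the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_attention(q_struct, k_app, v_app):
    """Attend from structure-image tokens (queries) to appearance-image tokens
    (keys/values), so appearance is transferred along semantic correspondences."""
    d = q_struct.shape[-1]
    attn = softmax(q_struct @ k_app.T / np.sqrt(d))
    return attn @ v_app

q = np.random.randn(256, 64)   # tokens from the structure image
k = np.random.randn(256, 64)   # tokens from the appearance image
v = np.random.randn(256, 64)
out = cross_image_attention(q, k, v)   # (256, 64)
```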

TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding

  • paper_url: http://arxiv.org/abs/2311.03427
  • repo_url: https://github.com/tb2-sy/tsp-transformer
  • paper_authors: Shuo Wang, Jing Li, Zibo Zhao, Dongze Lian, Binbin Huang, Xiaomei Wang, Zhengxin Li, Shenghua Gao
  • for: proposes a Task-Specific Prompts Transformer (TSP-Transformer) for learning effective representations for holistic scene understanding.
  • methods: uses a vanilla transformer in the early stage followed by a task-specific prompts transformer encoder in the lateral stage, where task-specific prompts act as induced priors that help the model learn the distinct features each task requires.
  • results: experiments show state-of-the-art performance on NYUD-v2 and PASCAL-Context, validating the method's effectiveness; code is available at https://github.com/tb2-sy/TSP-Transformer.
    Abstract Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc. The key aspect of this problem is to learn representation effectively, as each subtask builds upon not only correlated but also distinct attributes. Inspired by visual-prompt tuning, we propose a Task-Specific Prompts Transformer, dubbed TSP-Transformer, for holistic scene understanding. It features a vanilla transformer in the early stage and tasks-specific prompts transformer encoder in the lateral stage, where tasks-specific prompts are augmented. By doing so, the transformer layer learns the generic information from the shared parts and is endowed with task-specific capacity. First, the tasks-specific prompts serve as induced priors for each task effectively. Moreover, the task-specific prompts can be seen as switches to favor task-specific representation learning for different tasks. Extensive experiments on NYUD-v2 and PASCAL-Context show that our method achieves state-of-the-art performance, validating the effectiveness of our method for holistic scene understanding. We also provide our code in the following link https://github.com/tb2-sy/TSP-Transformer.
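A minimal sketch of the task-specific prompt idea: a small set of learnable tokens per task is prepended to the shared token sequence before the task-specific encoder stage. The initialization, prompt length, and task names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 196, 256
tasks = ["segmentation", "depth", "normals"]

# One small set of (learnable) prompt tokens per task; random init shown here.
task_prompts = {t: rng.normal(size=(8, dim)).astype(np.float32) for t in tasks}

def with_task_prompts(shared_tokens, task):
    """Prepend the task's prompt tokens to the shared token sequence so the
    task-specific encoder stage can attend to them as induced priors."""
    return np.concatenate([task_prompts[task], shared_tokens], axis=0)

shared = rng.normal(size=(num_tokens, dim)).astype(np.float32)  # output of the shared stage
tokens_for_depth = with_task_prompts(shared, "depth")           # (8 + 196, 256)
```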

A Robust Bi-Directional Algorithm For People Count In Crowded Areas

  • paper_url: http://arxiv.org/abs/2311.03323
  • repo_url: None
  • paper_authors: Satyanarayana Penke, Gopikrishna Pavuluri, Soukhya Kunda, Satvik M, CharanKumar Y
  • for: provides an accurate people-counting system to help manage, and if necessary rescue, people in crowded areas.
  • methods: uses image processing and blob assessment to detect people and track their direction of traversal along a path, yielding separate inflow and outflow counts.
  • results: experiments show the algorithm counts people accurately and provides flow information in real-time scenarios.
    Abstract People counting system in crowded places has become a very useful practical application that can be accomplished in various ways which include many traditional methods using sensors. Examining the case of real time scenarios, the algorithm espoused should be steadfast and accurate. People counting algorithm presented in this paper, is centered on blob assessment, devoted to yield the count of the people through a path along with the direction of traversal. The system depicted is often ensconced at the entrance of a building so that the unmitigated frequency of visitors can be recorded. The core premise of this work is to extricate count of people inflow and outflow pertaining to a particular area. The tot-up achieved can be exploited for purpose of statistics in the circumstances of any calamity occurrence in that zone. Relying upon the count totaled, the population in that vicinity can be assimilated in order to take on relevant measures to rescue the people.
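A minimal sketch of the bi-directional counting logic once blobs are tracked: increment the inflow or outflow counter when a blob's centroid crosses a virtual line in either direction. The line placement and crossing rule are illustrative assumptions; blob detection itself is not shown.

```python
class BiDirectionalCounter:
    """Count entries and exits by checking when a tracked blob's centroid
    crosses a virtual line placed across the entrance."""

    def __init__(self, line_y):
        self.line_y = line_y
        self.count_in = 0
        self.count_out = 0

    def update(self, prev_y, curr_y):
        # Crossing the line downward = entering, upward = exiting.
        if prev_y < self.line_y <= curr_y:
            self.count_in += 1
        elif prev_y >= self.line_y > curr_y:
            self.count_out += 1

counter = BiDirectionalCounter(line_y=240)
for prev_y, curr_y in [(230, 245), (250, 235), (100, 120)]:  # toy centroid tracks
    counter.update(prev_y, curr_y)
print(counter.count_in, counter.count_out)   # 1 1
```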

FATE: Feature-Agnostic Transformer-based Encoder for learning generalized embedding spaces in flow cytometry data

  • paper_url: http://arxiv.org/abs/2311.03314
  • repo_url: https://github.com/lisaweijler/fate
  • paper_authors: Lisa Weijler, Florian Kowarsch, Michael Reiter, Pedro Hermosilla, Margarita Maurer-Granofszky, Michael Dworzak
  • for: handles data whose measured features vary between samples, without constraining the input space to the intersection of potential feature sets or expanding it to their union.
  • methods: proposes a set-transformer architecture augmented with feature-encoder layers that directly processes data from heterogeneous feature spaces by learning a shared latent feature space capturing the relationships between features across samples.
  • results: the model performs well for automatic cancer-cell detection in acute myeloid leukemia flow cytometry data, where the features measured during acquisition often vary between samples and data scarcity arises from the low prevalence of the disease.
    Abstract While model architectures and training strategies have become more generic and flexible with respect to different data modalities over the past years, a persistent limitation lies in the assumption of fixed quantities and arrangements of input features. This limitation becomes particularly relevant in scenarios where the attributes captured during data acquisition vary across different samples. In this work, we aim at effectively leveraging data with varying features, without the need to constrain the input space to the intersection of potential feature sets or to expand it to their union. We propose a novel architecture that can directly process data without the necessity of aligned feature modalities by learning a general embedding space that captures the relationship between features across data samples with varying sets of features. This is achieved via a set-transformer architecture augmented by feature-encoder layers, thereby enabling the learning of a shared latent feature space from data originating from heterogeneous feature spaces. The advantages of the model are demonstrated for automatic cancer cell detection in acute myeloid leukemia in flow cytometry data, where the features measured during acquisition often vary between samples. Our proposed architecture's capacity to operate seamlessly across incongruent feature spaces is particularly relevant in this context, where data scarcity arises from the low prevalence of the disease. The code is available for research purposes at https://github.com/lisaweijler/FATE.
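A minimal sketch of feature-agnostic tokenization: each available measurement becomes a token built from a feature-identity embedding plus its projected value, so samples with different marker panels simply yield token sets of different sizes for the set transformer. The marker names, dimensions, and the additive encoding are illustrative assumptions, not the paper's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32
# Hypothetical per-marker embeddings over the union of markers seen in training.
marker_ids = {"CD34": 0, "CD45": 1, "CD117": 2, "HLA-DR": 3, "SSC": 4}
marker_emb = rng.normal(size=(len(marker_ids), dim))
value_proj = rng.normal(size=(dim,))

def encode_cell(measurements):
    """Turn one cell's available measurements (dict marker -> value) into a
    variable-length set of tokens: identity embedding + value projection."""
    tokens = [marker_emb[marker_ids[m]] + v * value_proj for m, v in measurements.items()]
    return np.stack(tokens)            # (num_available_markers, dim)

tokens_a = encode_cell({"CD34": 0.8, "CD45": 0.1, "SSC": 0.4})                   # 3 tokens
tokens_b = encode_cell({"CD45": 0.9, "CD117": 0.2, "HLA-DR": 0.5, "SSC": 0.3})   # 4 tokens
```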

A Single 2D Pose with Context is Worth Hundreds for 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.03312
  • repo_url: https://github.com/QitaoZhao/ContextAware-PoseFormer
  • paper_authors: Qitao Zhao, Ce Zheng, Mengyuan Liu, Chen Chen
  • for: improves 3D human pose estimation accuracy without relying on large numbers of video frames.
  • methods: leverages the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors, with no finetuning on the 3D task.
  • results: without access to any temporal information, the proposed Context-Aware PoseFormer outperforms its context-agnostic counterpart and other methods that use up to hundreds of video frames, in both speed and precision.
    Abstract The dominant paradigm in 3D human pose estimation that lifts a 2D pose sequence to 3D heavily relies on long-term temporal clues (i.e., using a daunting number of video frames) for improved accuracy, which incurs performance saturation, intractable computation and the non-causal problem. This can be attributed to their inherent inability to perceive spatial context as plain 2D joint coordinates carry no visual cues. To address this issue, we propose a straightforward yet powerful solution: leveraging the readily available intermediate visual representations produced by off-the-shelf (pre-trained) 2D pose detectors -- no finetuning on the 3D task is even needed. The key observation is that, while the pose detector learns to localize 2D joints, such representations (e.g., feature maps) implicitly encode the joint-centric spatial context thanks to the regional operations in backbone networks. We design a simple baseline named Context-Aware PoseFormer to showcase its effectiveness. Without access to any temporal information, the proposed method significantly outperforms its context-agnostic counterpart, PoseFormer, and other state-of-the-art methods using up to hundreds of video frames regarding both speed and precision. Project page: https://qitaozhao.github.io/ContextAware-PoseFormer
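A minimal sketch of the key observation: the 2D detector's intermediate feature maps already encode joint-centric spatial context, so per-joint context can be gathered by sampling those maps at the detected joint locations. The nearest-neighbour sampling, stride, and shapes below are illustrative assumptions.

```python
import numpy as np

def joint_context_features(feature_map, joints_2d, stride=4):
    """Gather the 2D detector's intermediate features at each joint location
    (nearest-neighbour sampling for simplicity).
    feature_map: (C, H, W) from the pose detector's backbone;
    joints_2d: (J, 2) pixel coordinates (x, y) in the input crop."""
    C, H, W = feature_map.shape
    coords = np.round(joints_2d / stride).astype(int)
    coords[:, 0] = np.clip(coords[:, 0], 0, W - 1)   # x
    coords[:, 1] = np.clip(coords[:, 1], 0, H - 1)   # y
    return feature_map[:, coords[:, 1], coords[:, 0]].T   # (J, C)

fmap = np.random.randn(128, 64, 48)            # stride-4 backbone feature map
joints = np.random.rand(17, 2) * [192, 256]    # 17 detected joints in a 192x256 crop
context = joint_context_features(fmap, joints) # (17, 128) joint-centric context tokens
```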

Machine Learning-Based Tea Leaf Disease Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.03240
  • repo_url: None
  • paper_authors: Faruk Ahmed, Md. Taimur Ahad, Yousuf Rayhan Emon
  • for: reviews machine learning techniques for tea leaf disease detection, aimed at improving yield and quality in the tea industry.
  • methods: surveys image-classification approaches, including Vision Transformer variants such as Inception Convolutional Vision Transformer (ICVT), GreenViT, PlantXViT, PlantViT, MSCVT, Transfer Learning Model & Vision Transformer (TLMViT), IterationViT, and IEM-ViT, as well as other models such as Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, YOLOv5, YOLOv7, Convolutional Neural Network (CNN), Deep CNN, Non-dominated Sorting Genetic Algorithm (NSGA-II), MobileNetv2, and Lesion-Aware Visual Transformer.
  • results: the reviewed models have been tested on various datasets, demonstrating their real-world applicability.
    Abstract Tea leaf diseases are a major challenge to agricultural productivity, with far-reaching implications for yield and quality in the tea industry. The rise of machine learning has enabled the development of innovative approaches to combat these diseases. Early detection and diagnosis are crucial for effective crop management. For predicting tea leaf disease, several automated systems have already been developed using different image processing techniques. This paper delivers a systematic review of the literature on machine learning methodologies applied to diagnose tea leaf disease via image classification. It thoroughly evaluates the strengths and constraints of various Vision Transformer models, including Inception Convolutional Vision Transformer (ICVT), GreenViT, PlantXViT, PlantViT, MSCVT, Transfer Learning Model & Vision Transformer (TLMViT), IterationViT, IEM-ViT. Moreover, this paper also reviews models like Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, YOLOv5, YOLOv7, Convolutional Neural Network (CNN), Deep CNN, Non-dominated Sorting Genetic Algorithm (NSGA-II), MobileNetv2, and Lesion-Aware Visual Transformer. These machine-learning models have been tested on various datasets, demonstrating their real-world applicability. This review study not only highlights current progress in the field but also provides valuable insights for future research directions in the machine learning-based detection and classification of tea leaf diseases.

  • paper_url: http://arxiv.org/abs/2311.03233
  • repo_url: None
  • paper_authors: Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann
  • for: proposes letting a model adapt its shape during training so that a given compute budget is used optimally to maximise performance.
  • methods: guided by neural scaling laws, the model's shape is changed over the course of training so that it optimally traverses between the underlying scaling laws, reducing the compute required to reach a target performance.
  • results: experiments show that compute-optimal adaptive Vision Transformers, using patch size and width as adaptive shape parameters, beat their static counterparts.
    Abstract In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: Investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of compute. This leads to the notion of a "compute-optimal" model, i.e. a model that allocates a given level of compute during training optimally to maximise performance. In this work, we extend the concept of optimality by allowing for an "adaptive" model, i.e. a model that can change its shape during the course of training. By allowing the shape to adapt, we can optimally traverse between the underlying scaling laws, leading to a significant reduction in the required compute to reach a given target performance. We focus on vision tasks and the family of Vision Transformers, where the patch size as well as the width naturally serve as adaptive shape parameters. We demonstrate that, guided by scaling laws, we can design compute-optimal adaptive models that beat their "static" counterparts.
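A toy sketch of what an adaptive shape schedule could look like for a Vision Transformer, switching the patch size as training progresses. The paper derives the switching points from scaling laws; the fixed equal-thirds schedule below is only a placeholder.

```python
def patch_size_schedule(step, total_steps, sizes=(32, 16, 8)):
    """Illustrative adaptive-shape schedule: start with large patches (few,
    cheap tokens) and move to smaller patches (more tokens, more compute)
    as training progresses."""
    stage = min(int(step / total_steps * len(sizes)), len(sizes) - 1)
    return sizes[stage]

print([patch_size_schedule(s, 3000) for s in (0, 1000, 2000, 2999)])  # [32, 16, 8, 8]
```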

Segmentation of Drone Collision Hazards in Airborne RADAR Point Clouds Using PointNet

  • paper_url: http://arxiv.org/abs/2311.03221
  • repo_url: None
  • paper_authors: Hector Arroyo, Paul Kier, Dylan Angus, Santiago Matalonga, Svetlozar Georgiev, Mehdi Goli, Gerard Dooly, James Riordan
  • for: aims to give UAVs the enhanced situational awareness needed for safe beyond visual line of sight (BVLOS) operations in shared airspace.
  • methods: develops an end-to-end semantic segmentation approach for airborne radar point clouds, based on an adapted and optimized PointNet architecture with aerial domain insights, to identify multiple collision hazards simultaneously.
  • results: distinguishes five classes in an aerial setting, mobile drones (DJI M300 and DJI Mini), an airplane (Ikarus C42), and static returns (ground and infrastructure), achieving a robust 94% accuracy and enhancing UAV situational awareness.
    Abstract The integration of unmanned aerial vehicles (UAVs) into shared airspace for beyond visual line of sight (BVLOS) operations presents significant challenges but holds transformative potential for sectors like transportation, construction, energy and defense. A critical prerequisite for this integration is equipping UAVs with enhanced situational awareness to ensure safe operations. Current approaches mainly target single object detection or classification, or simpler sensing outputs that offer limited perceptual understanding and lack the rapid end-to-end processing needed to convert sensor data into safety-critical insights. In contrast, our study leverages radar technology for novel end-to-end semantic segmentation of aerial point clouds to simultaneously identify multiple collision hazards. By adapting and optimizing the PointNet architecture and integrating aerial domain insights, our framework distinguishes five distinct classes: mobile drones (DJI M300 and DJI Mini) and airplanes (Ikarus C42), and static returns (ground and infrastructure) which results in enhanced situational awareness for UAVs. To our knowledge, this is the first approach addressing simultaneous identification of multiple collision threats in an aerial setting, achieving a robust 94% accuracy. This work highlights the potential of radar technology to advance situational awareness in UAVs, facilitating safe and efficient BVLOS operations.

Leveraging Transformers to Improve Breast Cancer Classification and Risk Assessment with Multi-modal and Longitudinal Data

  • paper_url: http://arxiv.org/abs/2311.03217
  • repo_url: None
  • paper_authors: Yiqiu Shen, Jungkyu Park, Frank Yeung, Eliana Goldberg, Laura Heacock, Farah Shamout, Krzysztof J. Geras
  • for: improves breast cancer detection and risk assessment by fusing multi-modal imaging with longitudinal information to evaluate current disease status and future cancer risk.
  • methods: proposes a Multi-modal Transformer (MMT) that uses mammography and ultrasound synergistically to identify patients who currently have cancer and to estimate future risk for patients who are cancer-free; MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging.
  • results: trained on 1.3 million exams, MMT achieves an AUROC of 0.943 for detecting existing cancers, surpassing strong uni-modal baselines, and an AUROC of 0.826 for 5-year risk prediction, outperforming prior mammography-based risk models.
    Abstract Breast cancer screening, primarily conducted through mammography, is often supplemented with ultrasound for women with dense breast tissue. However, existing deep learning models analyze each modality independently, missing opportunities to integrate information across imaging modalities and time. In this study, we present Multi-modal Transformer (MMT), a neural network that utilizes mammography and ultrasound synergistically, to identify patients who currently have cancer and estimate the risk of future cancer for patients who are currently cancer-free. MMT aggregates multi-modal data through self-attention and tracks temporal tissue changes by comparing current exams to prior imaging. Trained on 1.3 million exams, MMT achieves an AUROC of 0.943 in detecting existing cancers, surpassing strong uni-modal baselines. For 5-year risk prediction, MMT attains an AUROC of 0.826, outperforming prior mammography-based risk models. Our research highlights the value of multi-modal and longitudinal imaging in cancer diagnosis and risk stratification.

PainSeeker: An Automated Method for Assessing Pain in Rats Through Facial Expressions

  • paper_url: http://arxiv.org/abs/2311.03205
  • repo_url: None
  • paper_authors: Liu Liu, Guang Li, Dingfan Deng, Jinhua Yu, Yuan Zong
  • for: investigate whether laboratory rats’ pain can be automatically assessed through their facial expressions.
  • methods: proposed a novel deep learning method called PainSeeker for automatically assessing pain in rats via facial expressions.
  • results: demonstrates the feasibility of assessing rats' pain from their facial expressions and verifies the effectiveness of the proposed PainSeeker in addressing this emerging but intriguing problem.
    Abstract In this letter, we aim to investigate whether laboratory rats' pain can be automatically assessed through their facial expressions. To this end, we began by presenting a publicly available dataset called RatsPain, consisting of 1,138 facial images captured from six rats that underwent an orthodontic treatment operation. Each rat' facial images in RatsPain were carefully selected from videos recorded either before or after the operation and well labeled by eight annotators according to the Rat Grimace Scale (RGS). We then proposed a novel deep learning method called PainSeeker for automatically assessing pain in rats via facial expressions. PainSeeker aims to seek pain-related facial local regions that facilitate learning both pain discriminative and head pose robust features from facial expression images. To evaluate the PainSeeker, we conducted extensive experiments on the RatsPain dataset. The results demonstrate the feasibility of assessing rats' pain from their facial expressions and also verify the effectiveness of the proposed PainSeeker in addressing this emerging but intriguing problem. The RasPain dataset can be freely obtained from https://github.com/xhzongyuan/RatsPain.

LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition

  • paper_url: http://arxiv.org/abs/2311.03198
  • repo_url: https://github.com/ZhouZijie77/LCPR
  • paper_authors: Zijie Zhou, Jingyi Xu, Guangming Xiong, Junyi Ma
  • for: improves place recognition for autonomous vehicles in GPS-denied environments, enabling them to identify previously visited places.
  • methods: fuses multimodal sensor data, LiDAR point clouds with multi-view RGB images, through a multi-scale attention-based fusion module to overcome the limitations of individual sensors.
  • results: experiments on nuScenes show the method effectively exploits multi-view camera and LiDAR data to improve place recognition performance while maintaining strong robustness to viewpoint changes.
    Abstract Place recognition is one of the most crucial modules for autonomous vehicles to identify places that were previously visited in GPS-invalid environments. Sensor fusion is considered an effective method to overcome the weaknesses of individual sensors. In recent years, multimodal place recognition fusing information from multiple sensors has gathered increasing attention. However, most existing multimodal place recognition methods only use limited field-of-view camera images, which leads to an imbalance between features from different modalities and limits the effectiveness of sensor fusion. In this paper, we present a novel neural network named LCPR for robust multimodal place recognition, which fuses LiDAR point clouds with multi-view RGB images to generate discriminative and yaw-rotation invariant representations of the environment. A multi-scale attention-based fusion module is proposed to fully exploit the panoramic views from different modalities of the environment and their correlations. We evaluate our method on the nuScenes dataset, and the experimental results show that our method can effectively utilize multi-view camera and LiDAR data to improve the place recognition performance while maintaining strong robustness to viewpoint changes. Our open-source code and pre-trained models are available at https://github.com/ZhouZijie77/LCPR .
    摘要 地点识别是自动驾驶车辆最重要的模块之一，用于在GPS失效环境中识别先前到访过的地点。传感器融合被认为是克服单一传感器缺陷的有效方法。近年来，融合多种传感器信息的多模态地点识别受到越来越多的关注。然而，现有的多模态地点识别方法大多只使用有限视场的相机图像，导致不同模态特征之间的不平衡，限制了传感器融合的有效性。在这篇论文中，我们提出了一种名为LCPR的新型神经网络，用于实现鲁棒的多模态地点识别。LCPR将LiDAR点云与多视角RGB图像融合，生成具有判别力且对偏航旋转不变的环境表示。我们提出了一种基于多尺度注意力的融合模块，以充分利用不同模态的全景视图及其相关性。我们在nuScenes数据集上进行了实验，结果表明，我们的方法可以有效利用多视角相机和LiDAR数据提高地点识别性能，同时对视角变化保持很强的鲁棒性。我们的开源代码和预训练模型可在 https://github.com/ZhouZijie77/LCPR 获取。

Few-shot Learning using Data Augmentation and Time-Frequency Transformation for Time Series Classification

  • paper_url: http://arxiv.org/abs/2311.03194
  • repo_url: None
  • paper_authors: Hao Zhang, Zhendong Pang, Jiangpeng Wang, Teng Li
  • for: 这篇论文旨在解决时间序列分类任务中的少样本问题，提出了一种基于数据增强的少样本学习框架。
  • methods: 该方法通过时频域变换和随机擦除生成合成图像，并开发了一种序列-频谱图神经网络模型（SSNN）。该模型由两个子网络组成：一个使用1D残差块提取输入序列的特征，另一个使用2D残差块提取频谱图表示的特征。
  • results: 在一个肌萎缩侧索硬化症（ALS）数据集和一个风力机故障（WTF）数据集上与现有DNN模型进行了对比研究，结果表明，所提方法能够提高少样本场景下的时间序列分类精度。
    Abstract Deep neural networks (DNNs) that tackle the time series classification (TSC) task have provided a promising framework in signal processing. In real-world applications, DNNs, as data-driven models, suffer from insufficient data. Few-shot learning has been studied to deal with this limitation. In this paper, we propose a novel few-shot learning framework based on data augmentation, which involves transformation into the time-frequency domain and the generation of synthetic images through random erasing. Additionally, we develop a sequence-spectrogram neural network (SSNN). This model is composed of two sub-networks: one uses 1D residual blocks to extract features from the input sequence, while the other employs 2D residual blocks to extract features from the spectrogram representation. In the experiments, comparison studies of different existing DNN models with and without data augmentation are conducted on an amyotrophic lateral sclerosis (ALS) dataset and a wind turbine fault (WTF) dataset. The experimental results show that our proposed method achieves a 93.75% F1 score and 93.33% accuracy on the ALS dataset, and a 95.48% F1 score and 95.59% accuracy on the WTF dataset. Our methodology demonstrates its applicability to few-shot time series classification problems.
    摘要 处理时间序列分类（TSC）任务的深度神经网络（DNN）为信号处理提供了一个有前景的框架。在实际应用中，作为数据驱动模型，DNN受到数据不足的限制，少样本学习因此被用来应对这一问题。在本文中，我们提出了一种基于数据增强的新型少样本学习框架，包括时频域变换和通过随机擦除生成合成图像。此外，我们开发了一种序列-频谱图神经网络（SSNN），该模型由两个子网络组成：一个使用1D残差块提取输入序列的特征，另一个使用2D残差块提取频谱图表示的特征。在实验中，我们在肌萎缩侧索硬化症（ALS）数据集和风力机故障（WTF）数据集上对多种现有DNN模型在有无数据增强条件下进行了对比研究。实验结果表明，所提方法在ALS数据集上取得了93.75%的F1分数和93.33%的准确率，在WTF数据集上取得了95.48%的F1分数和95.59%的准确率，证明了其在少样本时间序列分类问题上的适用性。
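The augmentation pipeline described above lends itself to a short sketch. The following Python snippet shows the two ingredients in isolation — a time-frequency (spectrogram) transform of a 1D series and random erasing on the resulting image; function names, window sizes and erasing fractions are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.signal import spectrogram

def to_spectrogram(x, fs=100.0, nperseg=64):
    """Map a 1D time series to a log-magnitude spectrogram image."""
    _, _, sxx = spectrogram(x, fs=fs, nperseg=nperseg)
    return np.log1p(sxx)  # shape: (freq_bins, time_frames)

def random_erase(img, max_frac=0.3, rng=None):
    """Zero out a random rectangle of the spectrogram (synthetic-image augmentation)."""
    rng = rng or np.random.default_rng()
    h, w = img.shape
    eh = rng.integers(1, max(2, int(h * max_frac)))
    ew = rng.integers(1, max(2, int(w * max_frac)))
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = 0.0
    return out

# Example: build several augmented samples from one labelled sequence.
rng = np.random.default_rng(0)
seq = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * rng.standard_normal(1024)
spec = to_spectrogram(seq)
augmented = [random_erase(spec, rng=rng) for _ in range(4)]
```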

Efficient and Low-Footprint Object Classification using Spatial Contrast

  • paper_url: http://arxiv.org/abs/2311.03422
  • repo_url: None
  • paper_authors: Matthew Belding, Daniel C. Stumpp, Rajkumar Kubendran
  • for: 本研究探讨了一种基于事件的视觉感知器,使用本地化的空间对比(SC),并采用了两种阈值技术,相对阈值和绝对阈值。
  • methods: 本研究使用了虚拟模拟器来研究这种硬件感知器的可能性。此外,通过使用德国交通标志数据集(GTSRB)和知名的深度神经网络(DNN)进行交通标志分类,以评估空间对比的效果。
  • results: 研究发现，空间对比度能够有效捕捉图像中的重要特征；与高精度RGB图像和DNN相比，二值化MicronNet可将输入数据量减少至少12倍、内存资源减少17.5倍，而macro F1分数仅下降约2%。因此，SC在功耗和资源受限的边缘计算环境中展现出很大的应用前景。
    Abstract Event-based vision sensors traditionally compute temporal contrast that offers potential for low-power and low-latency sensing and computing. In this research, an alternative paradigm for event-based sensors using localized spatial contrast (SC) under two different thresholding techniques, relative and absolute, is investigated. Given the slow maturity of spatial contrast in comparison to temporal-based sensors, a theoretical simulated output of such a hardware sensor is explored. Furthermore, we evaluate traffic sign classification using the German Traffic Sign dataset (GTSRB) with well-known Deep Neural Networks (DNNs). This study shows that spatial contrast can effectively capture salient image features needed for classification using a Binarized DNN with significant reduction in input data usage (at least 12X) and memory resources (17.5X), compared to high precision RGB images and DNN, with only a small loss (~2%) in macro F1-score. Binarized MicronNet achieves an F1-score of 94.4% using spatial contrast, compared to only 56.3% when using RGB input images. Thus, SC offers great promise for deployment in power and resource constrained edge computing environments.
    摘要 传统的事件相机通常计算时间对比度，这为低功耗、低延迟的感知与计算提供了潜力。在本研究中，我们探讨了一种基于局部空间对比度（SC）的事件视觉传感器替代范式，并采用相对阈值和绝对阈值两种不同的阈值技术。由于空间对比度相比基于时间的传感器尚不成熟，我们首先通过理论仿真来探索这种硬件传感器的输出。此外，我们使用德国交通标志数据集（GTSRB）和知名的深度神经网络（DNN）对交通标志分类进行评估。结果表明，空间对比度可以有效捕捉分类所需的图像显著特征：与高精度RGB图像和DNN相比，使用二值化DNN可将输入数据量减少至少12倍、内存资源减少17.5倍，而macro F1分数仅下降约2%。使用空间对比度时，二值化MicronNet取得了94.4%的F1分数，而使用RGB输入图像时仅为56.3%。因此，SC在功耗和资源受限的边缘计算环境中具有很大的应用前景。
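To make the two thresholding variants concrete, here is a minimal simulation of a spatial-contrast readout: each pixel is compared with its immediate neighbours and a binary event is emitted where the contrast exceeds a threshold, either as an absolute difference or relative to the local intensity. The neighbourhood choice and threshold values are assumptions of this sketch, not the hardware design studied in the paper.

```python
import numpy as np

def spatial_contrast_events(gray, threshold=0.15, mode="relative"):
    """Simulate a spatial-contrast sensor: compare each pixel with its right and
    lower neighbours and emit a binary event where the contrast exceeds a threshold.
    `mode` selects relative (intensity-normalised) or absolute (difference) thresholding."""
    g = gray.astype(np.float32)
    right = np.abs(g[:, 1:] - g[:, :-1])          # horizontal neighbour difference
    down = np.abs(g[1:, :] - g[:-1, :])           # vertical neighbour difference
    if mode == "relative":
        # normalise the difference by the local intensity (avoid division by zero)
        right = right / (np.maximum(g[:, 1:], g[:, :-1]) + 1e-6)
        down = down / (np.maximum(g[1:, :], g[:-1, :]) + 1e-6)
    ev_h = right > threshold
    ev_v = down > threshold
    # pad back to the input size and OR the two directions into one binary map
    ev = np.zeros_like(g, dtype=bool)
    ev[:, :-1] |= ev_h
    ev[:-1, :] |= ev_v
    return ev

# The binary event map is what would be fed to the binarized classifier.
img = np.random.default_rng(0).random((32, 32))
events = spatial_contrast_events(img, threshold=0.15, mode="relative")
```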

Frequency Domain Decomposition Translation for Enhanced Medical Image Translation Using GANs

  • paper_url: http://arxiv.org/abs/2311.03175
  • repo_url: None
  • paper_authors: Zhuhui Wang, Jianwei Zuo, Xuliang Deng, Jiajia Luo
  • for: 这篇论文主要针对医学影像转换任务,尤其是运用GAN方法实现高品质的医学影像转换。
  • methods: 本研究提出了一新的频域分解转换方法(FDDT),它将原始影像分解为高频和低频部分,并将这两个部分转换为同频域的转换结果,以保持原始影像的身份信息,同时最小化影像的形式信息损失。
  • results: 在实验中,FDDT与多个主流基eline模型进行比较,结果显示,FDDT可以将Fr'echet内部距离降低至24.4%、结构相似度降低至4.4%、峰值信号对比降低至5.8%和平均方差降低至31%,较前一方法降低23.7%、1.8%、6.8%和31.6%。
    Abstract Medical Image-to-image translation is a key task in computer vision and generative artificial intelligence, and it is highly applicable to medical image analysis. GAN-based methods are the mainstream image translation methods, but they often ignore the variation and distribution of images in the frequency domain, or only take simple measures to align high-frequency information, which can lead to distortion and low quality of the generated images. To solve these problems, we propose a novel method called frequency domain decomposition translation (FDDT). This method decomposes the original image into a high-frequency component and a low-frequency component, with the high-frequency component containing the details and identity information, and the low-frequency component containing the style information. Next, the high-frequency and low-frequency components of the transformed image are aligned with the transformed results of the high-frequency and low-frequency components of the original image in the same frequency band in the spatial domain, thus preserving the identity information of the image while destroying as little stylistic information of the image as possible. We conduct extensive experiments on MRI images and natural images with FDDT and several mainstream baseline models, and we use four evaluation metrics to assess the quality of the generated images. Compared with the baseline models, optimally, FDDT can reduce Fr\'echet inception distance by up to 24.4%, structural similarity by up to 4.4%, peak signal-to-noise ratio by up to 5.8%, and mean squared error by up to 31%. Compared with the previous method, optimally, FDDT can reduce Fr\'echet inception distance by up to 23.7%, structural similarity by up to 1.8%, peak signal-to-noise ratio by up to 6.8%, and mean squared error by up to 31.6%.
    摘要 医学图像转换是计算机视觉和生成人工智能领域的关键任务,并且具有广泛的应用前景。GAN基本方法是主流图像转换方法,但它们经常忽略图像在频率频谱中的变化和分布,或者只是使用简单的方法来对高频信息进行对齐,这可能导致图像生成的质量下降。为解决这些问题,我们提出了一种新的方法called频率频谱分解翻译(FDDT)。FDDT方法将原始图像分解成高频组件和低频组件,其中高频组件包含细节和标识信息,而低频组件包含风格信息。然后,将高频和低频组件的转换结果与原始图像的高频和低频组件在同一频率带的空间频谱中进行对齐,以保持图像的标识信息,同时尽量减少图像的风格信息损失。我们在MRI图像和自然图像上进行了广泛的实验,并使用了数个主流基eline模型进行比较。我们使用了四种评价指标来评价生成图像的质量,其中包括Fréchet吸引距离、结构相似度、峰值信号噪声比和平均平方误差。与基eline模型相比,FDDT可以最大化Fréchet吸引距离下降24.4%、结构相似度下降4.4%、峰值信号噪声比下降5.8%和平均平方误差下降31%。与之前的方法相比,FDDT可以最大化Fréchet吸引距离下降23.7%、结构相似度下降1.8%、峰值信号噪声比下降6.8%和平均平方误差下降31.6%。
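The core idea — splitting an image into a low-frequency (style) part and a high-frequency (detail/identity) part before aligning the translated result band by band — can be illustrated with a simple FFT-based decomposition. The circular mask and radius below are illustrative assumptions; the paper's full FDDT additionally defines how the decomposed bands are aligned during GAN training.

```python
import numpy as np

def frequency_decompose(img, radius_frac=0.1):
    """Split a single-channel image into low- and high-frequency components
    with a centred circular mask in the FFT domain, so that img ~= low + high."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    low_mask = dist <= radius_frac * min(h, w)
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real       # style-like content
    high = np.fft.ifft2(np.fft.ifftshift(f * (~low_mask))).real   # detail/identity content
    return low, high

# In FDDT-style training, the high/low bands of the translated image would be
# aligned with the corresponding bands of the source image; here we only show
# the decomposition itself.
rng = np.random.default_rng(0)
x = rng.random((128, 128))
low, high = frequency_decompose(x, radius_frac=0.1)
assert np.allclose(low + high, x, atol=1e-6)   # the two bands partition the spectrum
```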

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

  • paper_url: http://arxiv.org/abs/2311.03149
  • repo_url: None
  • paper_authors: Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang
  • for: 这个论文主要针对的是用自动编码器预训练小型视Transformer模型,以提高计算成本和适用范围。
  • methods: 该论文提出了一种新的偏向masked distillation(AMD)框架,用于预训练小型模型。AMD使用不同的掩码策略,让老师模型可以看到更多的上下文信息,而学生模型仍然保持高的掩码率。
  • results: AMD在IN1K数据集上使用ViT-B模型达到84.6%的分类精度，在Something-Something V2数据集上达到73.3%的分类精度，比VideoMAE的原始ViT-B模型提高3.7%。此外，AMD预训练模型还可以迁移到下游任务，并取得一致的性能提升。
    Abstract Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, these large foundation models often incur a high computational cost that can limit their deployment. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy, in which the teacher model sees more context information through a lower masking ratio, while the student model keeps the high masking ratio of the original masked pre-training. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model, and 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over standard pre-training.
    摘要 自我监督基础模型在计算机视觉领域表现出了很大的潜力,这主要归功于预训练方法的遮盖自动编码。但是,这些大型基础模型经常会导致高计算成本,这可能会限制其部署。这篇论文关注预训练相对较小的视觉转换器模型,以实现高效地适应下游任务。我们提出了一种新的异 symmetry 遮盖(AMD)框架,用于预训练相对小型模型。AMD的核心思想是设计不同的遮盖策略,使得老师模型在低遮盖率下可以看到更多的上下文信息,而学生模型仍然保持高遮盖率。我们设计了特定的多层特征对Alignment来规范学生MAE的预训练。为了证明 AMD 的有效性和多样性,我们将其应用于 ImageMAE 和 VideoMAE 中的预训练相对小型 ViT 模型。在 IN1K 上,AMD 达到了 84.6% 的分类精度,使用 ViT-B 模型。在 Something-in-Something V2 数据集上,AMD 达到了 73.3% 的分类精度,相比标准预训练 ViT-B 模型提高了 3.7%。我们还将 AMD 预训练模型转移到下游任务上,并获得了一致的性能改进。
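A minimal sketch of the asymmetric masking idea is given below: the teacher keeps far more visible patches (lower masking ratio) than the student. Drawing the student's visible patches as a subset of the teacher's is one simple way to guarantee the teacher sees strictly more context; that subset choice, and all sizes and ratios, are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch

def asymmetric_masks(num_patches, teacher_ratio=0.5, student_ratio=0.9, generator=None):
    """Return boolean keep-masks (True = patch visible) for teacher and student.
    The teacher keeps more patches (lower masking ratio); the student's visible
    patches are drawn as a subset of the teacher's -- an assumption of this sketch."""
    g = generator or torch.Generator().manual_seed(0)
    order = torch.randperm(num_patches, generator=g)
    n_teacher = int(num_patches * (1.0 - teacher_ratio))
    n_student = int(num_patches * (1.0 - student_ratio))
    teacher_keep = torch.zeros(num_patches, dtype=torch.bool)
    student_keep = torch.zeros(num_patches, dtype=torch.bool)
    teacher_keep[order[:n_teacher]] = True
    student_keep[order[:n_student]] = True   # first n_student of the same ordering
    return teacher_keep, student_keep

teacher_keep, student_keep = asymmetric_masks(196, teacher_ratio=0.5, student_ratio=0.9)
# e.g. feed tokens[teacher_keep] to the teacher encoder and tokens[student_keep] to the
# student encoder, then align the student's multi-layer features with the teacher's.
print(teacher_keep.sum().item(), student_keep.sum().item())   # 98 visible vs. 19 visible
```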

Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human Performances

  • paper_url: http://arxiv.org/abs/2311.03140
  • repo_url: None
  • paper_authors: Paul Knoll, Wieland Morgenstern, Anna Hilsmann, Peter Eisert
  • for: 本研究旨在提出一种基于NeRF的人体动作控制 renderering框架,以实现 pose-dependent 的人体表现。
  • methods: 本方法基于 NeRF 的渲染场,将场面绘制在 SMPL 人体模型上,并通过skeletal 关节参数来控制人体的动作表现。
  • results: 实验结果显示,本方法可以实现高质量的新视角和新姿势 synthesis,并且能够efficiently 学习并渲染 despite mapping ambiguities和Random visual variations。
    Abstract Creating high-quality controllable 3D human models from multi-view RGB videos poses a significant challenge. Neural radiance fields (NeRFs) have demonstrated remarkable quality in reconstructing and free-viewpoint rendering of static as well as dynamic scenes. The extension to a controllable synthesis of dynamic human performances poses an exciting research question. In this paper, we introduce a novel NeRF-based framework for pose-dependent rendering of human performances. In our approach, the radiance field is warped around an SMPL body mesh, thereby creating a new surface-aligned representation. Our representation can be animated through skeletal joint parameters that are provided to the NeRF in addition to the viewpoint for pose dependent appearances. To achieve this, our representation includes the corresponding 2D UV coordinates on the mesh texture map and the distance between the query point and the mesh. To enable efficient learning despite mapping ambiguities and random visual variations, we introduce a novel remapping process that refines the mapped coordinates. Experiments demonstrate that our approach results in high-quality renderings for novel-view and novel-pose synthesis.
    摘要 创建高质量可控3D人体模型从多视图RGB视频中提供了一个 significante挑战。神经辐射场(NeRF)已经表现出了remarkable的质量,可以重建和自由观点渲染静止和动态场景。在这篇论文中,我们介绍了一种基于NeRF的新的框架,用于基于pose的人体表现的可控渲染。在我们的方法中,辐射场被扭曲到了一个SMPL体幔网格上,创建了一个新的表面对应表示。我们的表示可以通过skeletal关节参数来动画,这些参数被传递给NeRF,以便根据pose来控制外观。为实现这一点,我们的表示包括UV坐标在Texture map上的对应2D坐标和查询点与网格之间的距离。为了实现高效的学习,我们引入了一种新的映射过程,用于修正映射的坐标。实验结果表明,我们的方法可以生成高质量的新视图和新pose синтеesis。

TAMPAR: Visual Tampering Detection for Parcel Logistics in Postal Supply Chains

  • paper_url: http://arxiv.org/abs/2311.03124
  • repo_url: None
  • paper_authors: Alexander Naumann, Felix Hertlein, Laura Dörr, Kai Furmans
  • for: 本研究探讨了用于最后一英里配送的邮件检测 tampering 的方法,使用单个 RGB 图像与现有数据库中的参考图像进行比较,检测可能出现的外观变化。
  • methods: 本研究提议了一种 tampering 检测管道,利用锚点检测来确定包裹的八个角点,然后应用平行变换创建正规化的前视图,以便对包裹的每个可见面进行比较。
  • results: 实验结果表明,锚点检测和变换检测分别达到了 75.76% AP 和 81% 的准确率,F1 分数为 0.83,在实际图像中显示了良好的结果。此外,对不同的披雨、镜头偏角和检测方法进行了敏感性分析。
    Abstract Due to the steadily rising amount of valuable goods in supply chains, tampering detection for parcels is becoming increasingly important. In this work, we focus on the use-case last-mile delivery, where only a single RGB image is taken and compared against a reference from an existing database to detect potential appearance changes that indicate tampering. We propose a tampering detection pipeline that utilizes keypoint detection to identify the eight corner points of a parcel. This permits applying a perspective transformation to create normalized fronto-parallel views for each visible parcel side surface. These viewpoint-invariant parcel side surface representations facilitate the identification of signs of tampering on parcels within the supply chain, since they reduce the problem to parcel side surface matching with pair-wise appearance change detection. Experiments with multiple classical and deep learning-based change detection approaches are performed on our newly collected TAMpering detection dataset for PARcels, called TAMPAR. We evaluate keypoint and change detection separately, as well as in a unified system for tampering detection. Our evaluation shows promising results for keypoint (Keypoint AP 75.76) and tampering detection (81% accuracy, F1-Score 0.83) on real images. Furthermore, a sensitivity analysis for tampering types, lens distortion and viewing angles is presented. Code and dataset are available at https://a-nau.github.io/tampar.
    摘要 随着供应链中高价值货物数量不断增加，包裹篡改检测变得日益重要。在本工作中，我们关注最后一公里配送场景：仅拍摄一张RGB图像，并与现有数据库中的参考图像进行比较，以检测可能表明篡改的外观变化。我们提出了一条篡改检测流水线，利用关键点检测识别包裹的八个角点，从而可以应用透视变换，为每个可见的包裹侧面生成归一化的正视图。这些与视角无关的包裹侧面表示将问题简化为包裹侧面匹配与成对外观变化检测，便于识别供应链中包裹上的篡改痕迹。我们在新收集的包裹篡改检测数据集TAMPAR上实验了多种经典与基于深度学习的变化检测方法，分别评估了关键点检测与变化检测，并在统一的篡改检测系统中进行了评估。在真实图像上，关键点检测达到75.76的Keypoint AP，篡改检测达到81%的准确率和0.83的F1分数。此外，我们还对篡改类型、镜头畸变和视角进行了敏感性分析。代码和数据集可在 https://a-nau.github.io/tampar 获取。
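The geometric step of the pipeline — warping each visible parcel face into a normalized fronto-parallel view from four of the eight detected corner points — can be sketched with OpenCV's perspective transform. The corner ordering, output size and example coordinates below are illustrative assumptions.

```python
import cv2
import numpy as np

def rectify_parcel_face(image, corners, out_size=(256, 256)):
    """Warp one visible parcel face to a normalized fronto-parallel view.
    `corners` are the four image-space corner points of that face, ordered
    top-left, top-right, bottom-right, bottom-left (e.g. taken from the eight
    detected box keypoints)."""
    w, h = out_size
    src = np.asarray(corners, dtype=np.float32)
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, H, (w, h))

# The rectified faces of the probe image and of the database reference can then be
# compared with any pair-wise change-detection model; inputs below are illustrative.
img = np.zeros((480, 640, 3), dtype=np.uint8)
face_corners = [(120, 80), (420, 95), (430, 300), (110, 290)]   # hypothetical detections
rectified = rectify_parcel_face(img, face_corners)
```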

Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

  • paper_url: http://arxiv.org/abs/2311.03106
  • repo_url: https://github.com/huiguanlab/umurl
  • paper_authors: Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun Yang, Xun Wang, Meng Wang
  • for: 本研究旨在提出一种多模态无监督学习框架,以提高skeleton基于动作理解的robust性和效率。
  • methods: 本研究使用一种称为Unified Multimodal Unsupervised Representation Learning(UmURL)的方法,它通过早期融合策略将多 modal的特征编码在单流程中,从而降低模型复杂性。此外,本研究还提出了内部和外部一致性学习来保证多modal特征不受modal bias的影响。
  • results: 实验结果表明,UmURL可以具有高效率和低复杂性,同时在不同的下游任务场景中 achieve新的state-of-the-art表现。
    Abstract Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models, then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from the complex yet redundant multi-stream model designs, each of which is also limited to the fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features for reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modal via feature decomposition and distinct alignment. In this manner, our framework is able to learn the unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, possessing the approximate complexity with the uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning.
    摘要 近些年来,无监督预训练在基于骨架的动作理解中得到了很大的成功。现有的方法通常是将不同的感知Modalities分开训练,然后通过较晚的融合策略进行动作理解。虽然这些方法已经实现了显著的性能提升,但它们受到复杂且重复的多流程模型设计的限制,每个模型都受到固定输入骨架模式的限制。为了解决这些问题,在本文中,我们提出了一种统一多模态无监督学习框架,名为 UmURL,它利用有效的早期融合策略来同时编码多个感知Modalities的多模态特征。具体来说,而不是为每个模式分别设计单modal无监督学习过程,我们将不同的模式输入feed到同一个流程中,并使用早期融合策略来学习它们的多模态特征,以降低模型复杂度。此外,为确保融合的多模态特征不受任何一个模式输入的干扰,我们还提出了内部和外部模式一致性学习,以保证每个模式的多模态特征含有完整的 semantics。因此,我们的框架能够学习单模态或多模态骨架输入的统一表示,这是对实际中不同类型的模式输入的灵活应用。我们在NTU-60、NTU-120和PKU-MMD II等三个大规模数据集上进行了广泛的实验,结果表明,UmURL具有高效性,与单 modal 方法相当,而实现了多个下游任务enario中的新的顶峰性能。
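A rough sketch of the early-fusion idea follows: each skeleton modality is embedded separately, the embeddings are merged per frame, and a single shared encoder processes the fused stream instead of one stream per modality. The summation-based fusion, dimensions and transformer settings are assumptions of this sketch, not the paper's exact architecture, and the consistency losses are omitted.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-stream encoder that fuses several skeleton modalities at the input.
    Each modality (e.g. joint / bone / motion) is linearly embedded, the embeddings
    are summed per frame, and one shared transformer encoder processes the result."""
    def __init__(self, in_dims=(75, 75, 75), d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Linear(d, d_model) for d in in_dims)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, modalities):
        # modalities: list of tensors, each (batch, frames, in_dim_m)
        fused = sum(emb(x) for emb, x in zip(self.embeds, modalities))
        return self.encoder(fused)            # (batch, frames, d_model)

model = EarlyFusionEncoder()
joint = torch.randn(2, 64, 75)
bone = torch.randn(2, 64, 75)
motion = torch.randn(2, 64, 75)
features = model([joint, bone, motion])       # one shared stream instead of three models
```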

A survey and classification of face alignment methods based on face models

  • paper_url: http://arxiv.org/abs/2311.03082
  • repo_url: https://github.com/nordlinglab/facealignment-survey
  • paper_authors: Jagmohan Meher, Hector Allende-Cid, Torbjörn E. M. Nordling
  • for: 这篇论文的目的是对不同类型的读者( beginner、实践者和研究人员)提供关于面部对齐的综述,包括面部模型的解释和训练,以及将面部模型适应到新的face图像中。
  • methods: 这篇论文使用了多种面部模型,包括基于3D的面部模型和基于深度学习的方法。这些方法的训练和应用都有所不同,例如使用热图来表示面部特征。
  • results: 研究发现,在极大的面 pose 情况下,3D-based face models更为有效,而深度学习-based方法通常使用热图来表示面部特征。此外,文章还讨论了面部模型在面 alignment 领域的未来发展方向。
    Abstract A face model is a mathematical representation of the distinct features of a human face. Traditionally, face models were built using a set of fiducial points or landmarks, each point ideally located on a facial feature, i.e., corner of the eye, tip of the nose, etc. Face alignment is the process of fitting the landmarks in a face model to the respective ground truth positions in an input image containing a face. Despite significant research on face alignment in the past decades, no review analyses various face models used in the literature. Catering to three types of readers - beginners, practitioners and researchers in face alignment, we provide a comprehensive analysis of different face models used for face alignment. We include the interpretation and training of the face models along with the examples of fitting the face model to a new face image. We found that 3D-based face models are preferred in cases of extreme face pose, whereas deep learning-based methods often use heatmaps. Moreover, we discuss the possible future directions of face models in the field of face alignment.
    摘要 一个面模型是一个数学表示人脸的特征。传统上,面模型通过一组标准点或特征点建立,每个点理想位于人脸中的一个特征处,例如眼角或鼻头等。人脸对适应是将标准点在面模型与输入图像中的真实位置进行适应的过程。Despite significant research on face alignment in the past decades, no review has analyzed various face models used in the literature. To cater to three types of readers - beginners, practitioners, and researchers in face alignment, we provide a comprehensive analysis of different face models used for face alignment. We include the interpretation and training of the face models along with examples of fitting the face model to a new face image. We found that 3D-based face models are preferred in cases of extreme face pose, whereas deep learning-based methods often use heatmaps. Moreover, we discuss the possible future directions of face models in the field of face alignment.

CogVLM: Visual Expert for Pretrained Language Models

  • paper_url: http://arxiv.org/abs/2311.03079
  • repo_url: https://github.com/thudm/cogvlm
  • paper_authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang
  • for: This paper presents a powerful open-source visual language foundation model called CogVLM, which aims to bridge the gap between frozen pretrained language models and image encoders.
  • methods: The CogVLM model uses a trainable visual expert module in the attention and FFN layers to enable deep fusion of vision language features without sacrificing any performance on NLP tasks.
  • results: The CogVLM-17B model achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B.
    Abstract We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
    摘要 我们介绍CogVLM，一个强大的开源视觉语言基础模型。与流行的浅层对齐方法（将图像特征映射到语言模型的输入空间）不同，CogVLM通过在注意力和FFN层中添加可训练的视觉专家模块，将冻结的预训练语言模型与图像编码器连接起来，从而在不损失任何NLP任务性能的前提下实现视觉语言特征的深度融合。CogVLM-17B在10个经典跨模态基准上取得了最先进的性能，包括NoCaps、Flickr30k captioning、RefCOCO、RefCOCO+、RefCOCOg、Visual7W、GQA、ScienceQA、VizWiz VQA和TDIUC，并在VQAv2、OKVQA、TextVQA、COCO captioning等基准上排名第二，超过或匹配PaLI-X 55B。代码和模型权重可在 https://github.com/THUDM/CogVLM 获取。
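The visual-expert idea can be sketched as an attention layer in which image tokens are projected by separate trainable weights while text tokens keep the language model's projections, after which all tokens attend jointly. The single-head formulation and dimensions below are simplifying assumptions of this sketch; per the abstract, the actual model places the expert in the attention and FFN layers of the pretrained language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    """Self-attention block in which image tokens are projected by a separate,
    trainable 'visual expert' QKV while text tokens keep the (frozen) language-model
    QKV; all tokens then attend jointly. Single-head and simplified for clarity."""
    def __init__(self, d_model=512):
        super().__init__()
        self.text_qkv = nn.Linear(d_model, 3 * d_model)   # stands in for frozen LM weights
        self.vis_qkv = nn.Linear(d_model, 3 * d_model)    # trainable visual expert
        self.out = nn.Linear(d_model, d_model)

    def forward(self, tokens, is_image):
        # tokens: (batch, seq, d_model); is_image: (batch, seq) boolean mask
        qkv_text = self.text_qkv(tokens)
        qkv_vis = self.vis_qkv(tokens)
        qkv = torch.where(is_image.unsqueeze(-1), qkv_vis, qkv_text)
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

layer = VisualExpertAttention()
tokens = torch.randn(1, 10, 512)
is_image = torch.tensor([[True] * 4 + [False] * 6])   # first 4 tokens come from the image encoder
out = layer(tokens, is_image)
```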

A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection

  • paper_url: http://arxiv.org/abs/2311.03074
  • repo_url: https://github.com/zhyjsiat/a-two-stage-cyclegan-ve-brats2020
  • paper_authors: Wenxin Wang, Zhuo-Xu Cui, Guanxun Cheng, Chentao Cao, Xi Xu, Ziwei Liu, Haifeng Wang, Yulong Qi, Dong Liang, Yanjie Zhu
  • for: 这篇论文的目的是提高脑肿瘤检测和分割的精度。
  • methods: 本文结合两种方法：CycleGAN 和 VE-JP。CycleGAN 在非配对数据上训练，用健康图像生成异常图像作为数据先验；VE-JP 则以合成的配对异常图像为引导重建健康图像，只改变病理区域而不影响健康区域。
  • results: 结果显示，TSGM 在 BraTS2020 数据集上的 DSC 分数为 0.8590，在 ITCS 数据集上为 0.6226，在内部数据集上为 0.7403，表明其分割性能更好并具有更好的泛化能力。
    Abstract Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images, and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework, the Two-Stage Generative Model (TSGM), that combines a Cycle Generative Adversarial Network (CycleGAN) with a Variance-Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as a data prior. VE-JP is then implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions and leaves healthy regions unchanged. Notably, our method directly learns the joint probability distribution for conditional generation. The residual between the input and reconstructed images indicates the abnormalities, and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are combined with different weights to further improve segmentation accuracy. We validated our method on three datasets and compared it with other unsupervised methods for anomaly detection and segmentation. DSC scores of 0.8590 on the BraTS2020 dataset, 0.6226 on the ITCS dataset, and 0.7403 on the in-house dataset show that our method achieves better segmentation performance and generalizes better.
    摘要 现代医学诊断中,检测和分类脑肿的精准性非常重要。然而,现有的指导学习方法需要大量的标注图像,而状态的艺术模型在无监督方法中经常无法覆盖整个数据分布。本文提出了一种新的框架Two-Stage Generative Model(TSGM),它结合了Cycling Generative Adversarial Network(CycleGAN)和变量爆发杂化方程(VE-JP)以提高脑肿检测和分类。CycleGAN在无对应数据上训练,将健康图像转换成病理图像作为数据先验。然后,VE-JP被实现以重建健康图像,使用生成的假病理图像作为引导,只有病理区域受到修改,而非健康区域。值得注意的是,我们的方法直接学习了联合概率分布 для条件生成。输入图像与重建图像之间的差异指示病理,并应用阈值方法以获取分 segmentation 结果。此外,我们使用多Modal的结果进行权重,以进一步提高分 segmentation 精度。我们在三个数据集上验证了我们的方法,并与其他无监督方法进行比较。BraTs2020 数据集的 DSC 分数为 0.8590,ITCS 数据集的 DSC 分数为 0.6226,In-house 数据集的 DSC 分数为 0.7403,表明我们的方法在 segmentation 性能方面表现出色,并且具有更好的泛化能力。

OrthoNets: Orthogonal Channel Attention Networks

  • paper_url: http://arxiv.org/abs/2311.03071
  • repo_url: https://github.com/hady1011/orthonets
  • paper_authors: Hadi Salman, Caleb Parks, Matthew Swan, John Gauch
  • for: 提高通道注意力机制的有效性，寻找一种有损压缩方法以获得最佳特征表示。
  • methods: 使用 randomly initialized orthogonal filters 构建注意机制,并将其集成到 ResNet 中。
  • results: 在 Birds、MS-COCO 和 Places365 数据集上的表现优于 FcaNet 和其他注意力机制，在 ImageNet 数据集上与当前最先进方法持平或更优。
    Abstract Designing an effective channel attention mechanism requires finding a lossy-compression method that allows for optimal feature representation. Despite recent progress in the area, it remains an open problem. FcaNet, the current state-of-the-art channel attention mechanism, attempted to find such an information-rich compression using Discrete Cosine Transforms (DCTs). One drawback of FcaNet is that there is no natural choice of the DCT frequencies. To circumvent this issue, FcaNet experimented on ImageNet to find optimal frequencies. We hypothesize that the choice of frequency plays only a supporting role and that the primary driving force behind the effectiveness of their attention filters is the orthogonality of the DCT kernels. To test this hypothesis, we construct an attention mechanism using randomly initialized orthogonal filters. Integrating this mechanism into ResNet, we create OrthoNet. We compare OrthoNet to FcaNet (and other attention mechanisms) on Birds, MS-COCO, and Places365 and show superior performance. On the ImageNet dataset, our method competes with or surpasses the current state-of-the-art. Our results imply that an optimal choice of filter is elusive and that generalization can be achieved with a sufficiently large number of orthogonal filters. We further investigate other general principles for implementing channel attention, such as its position in the network and channel groupings. Our code is publicly available at https://github.com/hady1011/OrthoNets/
    摘要 设计有效的通道注意机制需要找到一种lossy压缩方法,以便获得优化的特征表示。尽管最近在这个领域的进展不断,但这问题仍然未得到解决。FcaNet,当前领先的通道注意机制,尝试使用Discrete Cosine Transforms(DCTs)来找到这样的信息充足压缩。FcaNet的一个缺点是没有自然的DCT频率选择。为了解决这个问题,FcaNet在ImageNet上进行了实验。我们假设,选择的频率只是支持性的角色,主要的驱动力是DCT核函数的正交性。为了测试这个假设,我们构建了一个使用随机初始化的正交滤波器的注意机制。将这个机制 integrate into ResNet,我们创建了OrthoNet。我们与FcaNet(以及其他注意机制)在Birds、MS-COCO和Places356上进行比较,并显示了超越性。在ImageNet dataset上,我们的方法与当前领先的状态相竞争。我们的结果表明,优化的筛选器是极其困难的,但通过一个足够大的正交滤波器数量,可以实现总体的泛化。我们进一步调查了其他实现通道注意的一般原则,如其在网络中的位置和通道分组。我们的代码可以在https://github.com/hady1011/OrthoNets/上获取。
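The central construction — replacing DCT bases with fixed, randomly initialized orthogonal filters inside an SE-style channel attention block — can be sketched as follows. Obtaining the filters from a QR decomposition, the squeeze-and-excitation layout and the reduction factor are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class OrthoChannelAttention(nn.Module):
    """SE-style channel attention whose spatial compression uses fixed, randomly
    initialized orthogonal filters (one per channel) instead of DCT bases."""
    def __init__(self, channels, height, width, reduction=16):
        super().__init__()
        # Random matrix -> QR -> orthonormal rows, one (height x width) filter
        # per channel; assumes height * width >= channels so rows stay orthonormal.
        rand = torch.randn(height * width, height * width)
        q, _ = torch.linalg.qr(rand)
        filters = q[:channels].reshape(channels, height, width)
        self.register_buffer("filters", filters)           # fixed, not trained
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, C, H, W)
        desc = (x * self.filters.unsqueeze(0)).sum(dim=(2, 3))   # per-channel scalar
        weights = self.fc(desc)                                   # channel attention
        return x * weights.unsqueeze(-1).unsqueeze(-1)

attn = OrthoChannelAttention(channels=64, height=8, width=8)
y = attn(torch.randn(2, 64, 8, 8))
```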

Forest aboveground biomass estimation using GEDI and earth observation data through attention-based deep learning

  • paper_url: http://arxiv.org/abs/2311.03067
  • repo_url: None
  • paper_authors: Wenquan Dong, Edward T. A. Mitchard, Hao Yu, Steven Hancock, Casey M. Ryan
  • for: 本研究旨在利用卫星数据估算森林地上生物量（AGB），以支撑气候变化背景下的碳核算。
  • methods: 本研究使用开放获取的卫星数据，包括GEDI LiDAR数据、C波段Sentinel-1 SAR数据、ALOS-2 PALSAR-2数据和Sentinel-2多光谱数据，并采用基于注意力的深度学习模型（AU）进行AGB估算。
  • results: 与传统的RF算法相比，AU模型的AGB估算精度明显更高：AU的R2为0.66、RMSE为43.66 Mg/ha、偏差为0.14 Mg/ha，而RF的R2为0.62、RMSE为45.87 Mg/ha、偏差为1.09 Mg/ha。不过，深度学习方法的优势并非在所有测试模型中都成立：ResNet101仅取得R2 0.50、RMSE 52.93 Mg/ha、偏差0.99 Mg/ha，UNet的R2为0.65、RMSE为44.28 Mg/ha，但偏差较大（1.84 Mg/ha）。此外，为考察AU在不含空间信息时的表现，使用全连接（FC）层去除卫星数据中的空间信息；AU-FC取得居中的R2 0.64、RMSE 44.92 Mg/ha、偏差-0.56 Mg/ha，优于RF，但不及使用空间信息的AU模型。
    Abstract Accurate quantification of forest aboveground biomass (AGB) is critical for understanding carbon accounting in the context of climate change. In this study, we presented a novel attention-based deep learning approach for forest AGB estimation, primarily utilizing openly accessible EO data, including: GEDI LiDAR data, C-band Sentinel-1 SAR data, ALOS-2 PALSAR-2 data, and Sentinel-2 multispectral data. The attention UNet (AU) model achieved markedly higher accuracy for biomass estimation compared to the conventional RF algorithm. Specifically, the AU model attained an R2 of 0.66, RMSE of 43.66 Mg ha-1, and bias of 0.14 Mg ha-1, while RF resulted in lower scores of R2 0.62, RMSE 45.87 Mg ha-1, and bias 1.09 Mg ha-1. However, the superiority of the deep learning approach was not uniformly observed across all tested models. ResNet101 only achieved an R2 of 0.50, an RMSE of 52.93 Mg ha-1, and a bias of 0.99 Mg ha-1, while the UNet reported an R2 of 0.65, an RMSE of 44.28 Mg ha-1, and a substantial bias of 1.84 Mg ha-1. Moreover, to explore the performance of AU in the absence of spatial information, fully connected (FC) layers were employed to eliminate spatial information from the remote sensing data. AU-FC achieved intermediate R2 of 0.64, RMSE of 44.92 Mgha-1, and bias of -0.56 Mg ha-1, outperforming RF but underperforming AU model using spatial information. We also generated 10m forest AGB maps across Guangdong for the year 2019 using AU and compared it with that produced by RF. The AGB distributions from both models showed strong agreement with similar mean values; the mean forest AGB estimated by AU was 102.18 Mg ha-1 while that of RF was 104.84 Mg ha-1. Additionally, it was observed that the AGB map generated by AU provided superior spatial information. Overall, this research substantiates the feasibility of employing deep learning for biomass estimation based on satellite data.
    摘要 “精确量化森林上空生物质量(AGB)是对于气候变化的理解 critical。在本研究中,我们提出了一种新的注意力基于深度学习方法来估算森林AGB,主要使用开放 accessible 的 Earth observation(EO)数据,包括:GEDI LiDAR数据、C-band Sentinel-1 SAR数据、ALOS-2 PALSAR-2数据和Sentinel-2多spectral数据。我们的注意力UNet(AU)模型在比较传统RF算法时表现出了明显的高准确性,具体来说,AU模型在R2、RMSE和偏差方面均有所提高,具体来说,AU模型的R2为0.66,RMSE为43.66 Mg ha-1,偏差为0.14 Mg ha-1,而RF的R2为0.62,RMSE为45.87 Mg ha-1,偏差为1.09 Mg ha-1。然而,深度学习方法的优势不是所有模型中均能实现。ResNet101只有R2为0.50,RMSE为52.93 Mg ha-1,偏差为0.99 Mg ha-1,而UNet的R2为0.65,RMSE为44.28 Mg ha-1,偏差为1.84 Mg ha-1。此外,为了探讨AU的表现在没有空间信息的情况下,我们运用了全连接(FC)层来消除 remote sensing 数据中的空间信息。AU-FC获得了中位R2值0.64,RMSE值44.92 Mg ha-1,偏差值-0.56 Mg ha-1,比RF表现更好,但比AU模型使用空间信息的表现下降。我们还使用AU生成了2019年在广东省的10米森林AGB地图,并与RF生成的地图进行比较。AU的AGB分布和RF的AGB分布都有相似的平均值,AU的AGB估算值为102.18 Mg ha-1,RF的AGB估算值为104.84 Mg ha-1。此外,AU生成的AGB地图提供了更好的空间信息。总之,这项研究证明了深度学习可以实现基于卫星数据的生物质量估算。”

AnyText: Multilingual Visual Text Generation And Editing

  • paper_url: http://arxiv.org/abs/2311.03054
  • repo_url: None
  • paper_authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie
  • for: 这个论文主要 targets 的问题是如何使用扩散模型生成高质量的文本图像,尤其是在文本区域上。
  • methods: 该论文提出了一种基于扩散模型的多语言视觉文本生成和编辑模型,称为AnyText,它可以准确地渲染文本在图像中,并且可以在多种语言中生成文本。
  • results: 该论文通过对多种语言的文本图像进行评测,得出了与其他方法相比的显著性能优势。此外,该论文还提供了一个大规模的多语言文本图像集合(AnyWord-3M)和一个评测标准(AnyText-benchmark),以便进一步推动文本生成技术的发展。
    Abstract Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
    摘要 Diffusion模型基于Text-to-Image技术在最近几年内具有很高的成就。虽然目前的图像生成技术非常高级,可以生成高质量的图像,但是当注意力集中在生成图像中的文本区域时,仍然可以发现问题。为解决这个问题,我们介绍了AnyText,一种基于扩散的多语言视觉文本生成和编辑模型,它专注于在图像中准确和一致地生成文本。AnyText包括一个扩散管道,其中有两个主要元素:一个辅助隐藏模块和一个文本嵌入模块。前者使用文本字形、位置和遮盖图像作为输入,生成文本生成或编辑的隐藏特征。后者使用一个OCR模型将字roke数据编码为嵌入,这些嵌入与图像标签的嵌入结合生成文本,以便文本与背景融合。我们在训练时使用文本扩散损失和文本感知损失,以进一步提高文本准确性。AnyText可以在多种语言中写字,据我们所知,这是首次对多语言视觉文本生成进行了研究。此外,AnyText可以与现有的扩散模型集成,以提供更高质量的文本生成和编辑功能。经过广泛的评估实验,我们的方法在所有其他方法之上减分了较大的差距。此外,我们还提供了首个大规模的多语言文本图像集,AnyWord-3M,包含300万个图像文本对,其中每个对包含多种语言的OCR注解。基于AnyWord-3M集,我们提出了AnyText-benchmark,用于评估视觉文本生成准确性和质量。我们的项目将在https://github.com/tyxsspa/AnyText上开源,以促进和提高文本生成技术的发展。

MixUp-MIL: A Study on Linear & Multilinear Interpolation-Based Data Augmentation for Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2311.03052
  • repo_url: None
  • paper_authors: Michael Gadermayr, Lukas Koller, Maximilian Tschuchnig, Lea Maria Stangassinger, Christina Kreutzer, Sebastien Couillard-Despres, Gertie Janneke Oostingh, Anton Hittmair
  • for: 本研究旨在 investigate linear and multilinear interpolation between feature vectors, a data augmentation technique, 以提高分类网络和多例学习的泛化性性能。
  • methods: 本研究使用了多例学习方法,并对10个不同的数据集配置和两种特征提取方法(批处理和自动提取)进行了研究。
  • results: 研究发现了Extraordinarily high variability in the effect of the method, 并发现了一些有趣的方向,提出了一些新的研究方向。
    Abstract For classifying digital whole slide images in the absence of pixel-level annotation, multiple instance learning methods are typically applied. Due to their generic applicability, such methods are currently of very high interest in the research community; however, the issue of data augmentation in this context is rarely explored. Here we investigate linear and multilinear interpolation between feature vectors, a data augmentation technique that has proved capable of improving the generalization performance of classification networks and of multiple instance learning. So far, however, experiments have been performed on only two rather small data sets and one specific feature extraction approach, and a strong dependence on the data set has been identified. Here we conduct a large study incorporating 10 different data set configurations, two different feature extraction approaches (supervised and self-supervised), stain normalization and two multiple instance learning architectures. The results showed an extraordinarily high variability in the effect of the method. We identified several interesting aspects that shed light on this variability and point to novel promising fields of research.
    摘要 在缺乏像素级标注的情况下对数字全切片图像进行分类时，通常采用多示例学习方法。由于其通用性，此类方法目前在研究界受到高度关注，但其中的数据增强问题却很少被探讨。我们在此研究特征向量之间的线性与多线性插值，这是一种已被证明能够提升分类网络以及多示例学习泛化性能的数据增强技术。然而，迄今为止的实验仅在两个较小的数据集和一种特定的特征提取方法上进行，并且发现其效果对数据集有很强的依赖性。我们在此开展了一项大规模研究，涵盖10种不同的数据集配置、两种特征提取方法（有监督与自监督）、染色归一化以及两种多示例学习架构。结果显示该方法的效果存在极高的差异性。我们总结了若干有助于解释这一现象的有趣方面，并指出了新的有前景的研究方向。
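The interpolation itself is a one-liner once two slides are represented as bags of feature vectors; a hedged sketch of linear (mixup-style) interpolation between bags and their labels is shown below. Truncating both bags to a common length and the Beta-distributed mixing coefficient are assumptions of this sketch, and the multilinear variant studied in the paper is not shown.

```python
import numpy as np

def mixup_features(feats_a, feats_b, label_a, label_b, alpha=0.4, rng=None):
    """Linear interpolation between two bags of patch-level feature vectors and
    their slide labels, as a MIL data-augmentation step. Bags are truncated to a
    common length here for simplicity -- an assumption of this sketch."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    n = min(len(feats_a), len(feats_b))
    mixed_feats = lam * feats_a[:n] + (1.0 - lam) * feats_b[:n]
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_feats, mixed_label

# Two whole-slide images represented as bags of 512-d patch features.
rng = np.random.default_rng(0)
bag_a = rng.standard_normal((800, 512))
bag_b = rng.standard_normal((650, 512))
feats, label = mixup_features(bag_a, bag_b, label_a=1.0, label_b=0.0, rng=rng)
```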

COLA: COarse-LAbel multi-source LiDAR semantic segmentation for autonomous driving

  • paper_url: http://arxiv.org/abs/2311.03017
  • repo_url: None
  • paper_authors: Jules Sanchez, Jean-Emmanuel Deschaud, François Goulette
  • for: 这paper是为了提高LiDAR semantic segmentation的自动驾驶而写的。
  • methods: 这paper使用了多源训练方法,利用了多个数据集在训练时使用。
  • results: 这paper实现了对域泛化、源到源分类和预训练等三个子领域的系统改进,并在这些领域中达到了最高的性能 (+10%、+5.3% 和 +12% 分别).
    Abstract LiDAR semantic segmentation for autonomous driving has been a growing field of interest in the past few years. Datasets and methods have appeared and expanded very quickly, but methods have not been updated to exploit this new availability of data and continue to rely on the same classical datasets. Different ways of performing LiDAR semantic segmentation training and inference can be divided into several subfields, which include the following: domain generalization, the ability to segment data coming from unseen domains; source-to-source segmentation, the ability to segment data coming from the training domain; and pre-training, the ability to create re-usable geometric primitives. In this work, we aim to improve results in all of these subfields with the novel approach of multi-source training. Multi-source training relies on the availability of various datasets at training time and uses them together rather than relying on only one dataset. To overcome the common obstacles found in multi-source training, we introduce coarse labels and call the newly created multi-source dataset COLA. We propose three applications of this new dataset that display systematic improvement over single-source strategies: COLA-DG for domain generalization (up to +10%), COLA-S2S for source-to-source segmentation (up to +5.3%), and COLA-PT for pre-training (up to +12%).
    摘要 近年来，面向自动驾驶的LiDAR语义分割受到越来越多的关注。数据集和方法迅速涌现并扩展，但现有方法并未更新以利用这些新的数据，仍然依赖同样的经典数据集。LiDAR语义分割的训练与推理可以划分为若干子领域，包括：领域泛化（分割来自未见领域数据的能力）、源到源分割（分割来自训练领域数据的能力）以及预训练（构建可复用几何基元的能力）。在这项工作中，我们旨在通过新的多源训练方法提升上述所有子领域的结果。多源训练依赖于训练时可用的多个数据集，并将它们联合使用，而不是只依赖单一数据集。为克服多源训练中常见的障碍，我们引入了粗粒度标签，并据此构建了新的多源数据集COLA。我们提出了该数据集的三种应用，相比单源策略均取得了系统性的提升：COLA-DG用于领域泛化（最高+10%）、COLA-S2S用于源到源分割（最高+5.3%）、COLA-PT用于预训练（最高+12%）。

Exploring the Capability of Text-to-Image Diffusion Models with Structural Edge Guidance for Multi-Spectral Satellite Image Inpainting

  • paper_url: http://arxiv.org/abs/2311.03008
  • repo_url: None
  • paper_authors: Mikolaj Czerkawski, Christos Tachtatzis
  • for: 这个论文研究了卫星图像数据中的文本到图像填充模型的实用性。
  • methods: 论文提出了一种基于StableDiffusion和ControlNet的新填充框架,以及一种RGB到多spectral射频(MSI)转换方法。
  • results: 实验结果表明,通过StableDiffusion进行填充可能会出现不 DESirable的artefacts,而self-supervised internal inpainting的简单实现可以 achieve higher quality of synthesis。
    Abstract The paper investigates the utility of text-to-image inpainting models for satellite image data. Two technical challenges of injecting structural guiding signals into the generative process as well as translating the inpainted RGB pixels to a wider set of MSI bands are addressed by introducing a novel inpainting framework based on StableDiffusion and ControlNet as well as a novel method for RGB-to-MSI translation. The results on a wider set of data suggest that the inpainting synthesized via StableDiffusion suffers from undesired artefacts and that a simple alternative of self-supervised internal inpainting achieves higher quality of synthesis.
    摘要 文章研究文本到图像填充模型在卫星图像数据中的可用性。两个技术挑战:在生成过程中插入结构导向信号以及将RGB像素翻译到更广泛的MSI频谱上——通过介绍一种基于StableDiffusion和ControlNet的新填充框架以及一种RGB-to-MSI翻译方法。研究结果表明,通过StableDiffusion进行填充会产生不良artefacts,而自动内部填充的简单方法可以达到更高质量的synthesis。

Zero-Shot Enhancement of Low-Light Image Based on Retinex Decomposition

  • paper_url: http://arxiv.org/abs/2311.02995
  • repo_url: None
  • paper_authors: Wenchao Li, Bangshu Xiong, Qiaofeng Ou, Xiaoyun Long, Jinhao Zhu, Jiabao Chen, Shuyuan Wen
  • for: Zero-shot low-light image enhancement
  • methods: Learning-based Retinex decomposition with an N-Net network, a noise loss term, RI-Net, a texture loss term, and a segmented smoothing loss
  • results: Improved generalization, validated on a homemade real-life low-light dataset and on advanced vision tasks such as face detection, target recognition, and instance segmentation, with performance competitive with current state-of-the-art methods; code is available at https://github.com/liwenchao0615/ZERRINNet
    Abstract Two difficulties make low-light image enhancement a challenging task. First, it needs to consider not only luminance restoration but also image contrast, image denoising and color distortion simultaneously. Second, the effectiveness of existing low-light enhancement methods depends on paired or unpaired training data, and they generalize poorly. To solve these problems, we propose a new learning-based Retinex decomposition method for zero-shot low-light enhancement, called ZERRINNet. To this end, we first design the N-Net network, together with a noise loss term, to denoise the original low-light image by estimating its noise. Moreover, RI-Net is used to estimate the reflectance and illumination components, and to address color distortion and contrast we use a texture loss term and a segmented smoothing loss to constrain these two components. Finally, our method is a zero-reference enhancement method that is not affected by the training data of paired and unpaired datasets, so its generalization performance is greatly improved. We validate it on a homemade real-life low-light dataset and additionally on advanced vision tasks, such as face detection, target recognition, and instance segmentation. We conducted comparative experiments on a large number of public datasets, and the results show that the performance of our method is competitive with current state-of-the-art methods. The code is available at: https://github.com/liwenchao0615/ZERRINNet
    摘要 两个问题使低光照图像增强成为一项困难任务:首先,它需要同时考虑照度恢复、图像对比度、雷达噪声和色偏移问题。其次,现有的低光照增强方法的效果取决于对照或无照训练数据的学习,而且对泛化性表现不佳。为解决这些困难问题,我们在本文提出了一种新的学习基于Retinex分解的零参数低光照增强方法,称为ZERRINNet。为此,我们首先设计了N-Net网络,并与噪声损失项一起使用来降噪原始低光照图像。此外,我们使用RI-Net来估计反射组件和照明组件,以解决颜色扭曲和对比度问题。最后,我们的方法是一种零参考增强方法,不受训练数据的对照或无照数据的影响,因此我们的泛化性得到了大幅提高。在文章中,我们有效地验证了我们的方法,使用自己制作的真实生活低光照数据,以及高级视觉任务,如人脸检测、目标识别和实例分割。我们对大量公共数据进行了比较实验,结果显示了我们的方法与当前状态艺技术的竞争力。代码可以在:https://github.com/liwenchao0615/ZERRINNet 获取。
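The Retinex model that underlies the method factorizes an image into reflectance and illumination, I = R ⊙ L. The sketch below uses a classical, non-learned approximation (illumination estimated by a heavy Gaussian blur) purely to illustrate the decomposition and a gamma-based recombination; it stands in for, and is not, the paper's learned N-Net/RI-Net pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-6):
    """Classical single-scale Retinex split of an image into illumination and
    reflectance, I = R * L. The illumination is approximated by a heavy Gaussian
    blur -- a stand-in for the learned RI-Net, not the paper's method."""
    illumination = gaussian_filter(img, sigma=sigma) + eps
    reflectance = img / illumination
    return reflectance, illumination

def enhance(img, gamma=0.5):
    """Brighten by compressing the illumination with a gamma curve and recombining."""
    r, l = retinex_decompose(img)
    return np.clip(r * np.power(l, gamma), 0.0, 1.0)

low_light = np.random.default_rng(0).random((64, 64)) * 0.2   # toy dark image
bright = enhance(low_light)
```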

NEURO HAND: A weakly supervised Hierarchical Attention Network for neuroimaging abnormality Detection

  • paper_url: http://arxiv.org/abs/2311.02992
  • repo_url: None
  • paper_authors: David A. Wood
  • for: 这个论文是用于检测临床神经成像数据中的异常的。
  • methods: 这个方法使用了层次注意力网络,适用于非体积数据(即高分辨率MRI扫描序列),并可以从二分类评估级别标签进行训练。
  • results: 该方法可以提高分类精度,并提供可解释性,可以通过粗略的扫描水平和镜像级别异常Localization,或者给出不同扫描和序列的重要性分数,使其适用于自动化Radiology部门的检测系统。
    Abstract Clinical neuroimaging data is naturally hierarchical. Different magnetic resonance imaging (MRI) sequences within a series, different slices covering the head, and different regions within each slice all confer different information. In this work we present a hierarchical attention network for abnormality detection using MRI scans obtained in a clinical hospital setting. The proposed network is suitable for non-volumetric data (i.e. stacks of high-resolution MRI slices), and can be trained from binary examination-level labels. We show that this hierarchical approach leads to improved classification, while providing interpretability through either coarse inter- and intra-slice abnormality localisation, or giving importance scores for different slices and sequences, making our model suitable for use as an automated triaging system in radiology departments.
    摘要 临床神经成像数据自然归于层次结构。不同的磁共振成像(MRI)序列内一系列、不同的脑部slice覆盖头部、和每个slice中的不同区域都提供不同的信息。在这种工作中,我们提出了一种层次注意力网络用于使用MRI扫描图像进行异常检测。该提案的网络适用于非材料数据(即高分辨率MRI扫描图像的栈),可以从二进制检查级别标签进行训练。我们表明,这种层次方法可以提高分类性能,同时提供可读性通过每个slice和每个序列的重要性分数或者粗略的脑部异常定位。因此,我们的模型适用于辐射部门中的自动检测系统。

Diffusion-based Radiotherapy Dose Prediction Guided by Inter-slice Aware Structure Encoding

  • paper_url: http://arxiv.org/abs/2311.02991
  • repo_url: None
  • paper_authors: Zhenghao Feng, Lu Wen, Jianghong Xiao, Yuanyuan Xu, Xi Wu, Jiliu Zhou, Xingchen Peng, Yan Wang
  • for: 这篇论文的目的是为了提高放射治疗规划中的剂量分布预测,并且解决现有方法的过滤问题。
  • methods: 这篇论文提出了一个扩散模型基本的方法(DiffDose),它包括一个前进过程和一个反向过程。在前进过程中,DiffDose将剂量分布图transform为纯 Gaussian 噪声,并且同时训练一个噪声预测器来估计附加的噪声。在反向过程中,它逐步除去附加的噪声,最终输出预测的剂量分布图。
  • results: 这篇论文的结果显示,DiffDose 方法可以很好地解决现有方法的过滤问题,并且可以提高放射治疗规划中的剂量分布预测精度。
    Abstract Deep learning (DL) has successfully automated dose distribution prediction in radiotherapy planning, enhancing both efficiency and quality. However, existing methods suffer from the over-smoothing problem for their commonly used L1 or L2 loss with posterior average calculations. To alleviate this limitation, we propose a diffusion model-based method (DiffDose) for predicting the radiotherapy dose distribution of cancer patients. Specifically, the DiffDose model contains a forward process and a reverse process. In the forward process, DiffDose transforms dose distribution maps into pure Gaussian noise by gradually adding small noise and a noise predictor is simultaneously trained to estimate the noise added at each timestep. In the reverse process, it removes the noise from the pure Gaussian noise in multiple steps with the well-trained noise predictor and finally outputs the predicted dose distribution maps...
    摘要 深度学习(DL)已成功地自动预测辐射治疗规划中的剂量分布,提高了效率和质量。然而,现有方法受到L1或L2损失函数中的平均 posterior 计算的限制。为了解决这些限制,我们提出了基于扩散模型的剂量分布预测方法(DiffDose)。具体来说,DiffDose模型包括一个前向过程和一个反向过程。在前向过程中,DiffDose将剂量分布图像转化为纯 Gaussian 噪声,逐步添加小噪声,并同时训练噪声预测器来估计添加的噪声。在反向过程中,它将纯 Gaussian 噪声中的噪声除掉,并在多个步骤中使用已经良好训练的噪声预测器来除掉噪声,最终输出预测的剂量分布图像。
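The forward process described above can be sketched with the standard closed-form noising step used by diffusion models: sample a timestep, mix the clean dose map with Gaussian noise according to a schedule, and regress a network onto the added noise. The linear beta schedule, tensor shapes and the commented loss are assumptions of this sketch, not the paper's exact formulation or conditioning.

```python
import torch

def make_noise_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM-style linear schedule (an assumption of this sketch)."""
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)   # cumulative alpha-bar values

def forward_diffuse(x0, t, alphas_bar):
    """Add Gaussian noise to a clean dose map x0 at timestep t (closed form of the
    gradual forward process) and return the noisy map plus the noise to regress."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return xt, noise

# Training step sketch: the noise predictor eps_theta is regressed onto the added noise.
alphas_bar = make_noise_schedule()
dose = torch.rand(4, 1, 128, 128)                 # batch of (toy) dose distribution maps
t = torch.randint(0, 1000, (4,))
xt, eps = forward_diffuse(dose, t, alphas_bar)
# loss = ((eps_theta(xt, t, anatomy_conditioning) - eps) ** 2).mean()   # noise-prediction loss
```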

Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination

  • paper_url: http://arxiv.org/abs/2311.02960
  • repo_url: None
  • paper_authors: Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, Qing Qu
  • for: 本研究目的是探讨深度学习网络中层次特征学习的机制。
  • methods: 本研究使用深度线性网络来探讨输入数据的转化。
  • results: 研究发现，深度线性网络的每一层都会以几何速率压缩类内特征，并以线性速率增强类间判别。这一特征演化规律在深度网络中可以被定量刻画，并且对迁移学习等实际应用具有重要意义。
    Abstract Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximate low-rank: Each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at \url{https://github.com/Heimine/PNC_DLN}.
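A simple way to probe the reported pattern on one's own network is to compute, at every layer, a ratio of within-class scatter to between-class scatter of the features. The metric below is a generic proxy for the within-class compression and between-class discrimination measures discussed in the abstract, not the paper's exact definitions.

```python
import numpy as np

def within_between_ratio(features, labels):
    """Layer-wise diagnostic: average within-class variance divided by the variance
    of the class means (smaller = more within-class compression relative to
    between-class discrimination). A simple proxy, not the paper's exact metrics."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        within += ((fc - mu_c) ** 2).sum(axis=1).mean()
        between += ((mu_c - global_mean) ** 2).sum()
    return (within / len(classes)) / (between / len(classes) + 1e-12)

# Track the ratio across layers of a trained network: according to the abstract it
# should shrink with depth for near-orthogonal inputs and minimum-norm weights.
rng = np.random.default_rng(0)
feats = rng.standard_normal((300, 64)) + np.repeat(np.eye(3, 64) * 5, 100, axis=0)
labels = np.repeat(np.arange(3), 100)
print(within_between_ratio(feats, labels))
```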

Multi-view learning for automatic classification of multi-wavelength auroral images

  • paper_url: http://arxiv.org/abs/2311.02947
  • repo_url: None
  • paper_authors: Qiuju Yang, Hang Su, Lili Liu, Yixuan Wang, Ze-Jun Hu
  • for: auroral classification, polar research
  • methods: lightweight feature extraction backbone (LCTNet), multi-scale reconstructed feature module (MSRM), lightweight attention feature enhancement module (LAFE)
  • results: state-of-the-art classification accuracy, superior results in terms of accuracy and computational efficiency compared to existing multi-view methods
    Abstract Auroral classification plays a crucial role in polar research. However, current auroral classification studies are predominantly based on images taken at a single wavelength, typically 557.7 nm. Images obtained at other wavelengths have been comparatively overlooked, and the integration of information from multiple wavelengths remains an underexplored area. This limitation results in low classification rates for complex auroral patterns. Furthermore, these studies, whether employing traditional machine learning or deep learning approaches, have not achieved a satisfactory trade-off between accuracy and speed. To address these challenges, this paper proposes a lightweight auroral multi-wavelength fusion classification network, MLCNet, based on a multi-view approach. Firstly, we develop a lightweight feature extraction backbone, called LCTNet, to improve the classification rate and cope with the increasing amount of auroral observation data. Secondly, considering the existence of multi-scale spatial structures in auroras, we design a novel multi-scale reconstructed feature module named MSRM. Finally, to highlight the discriminative information between auroral classes, we propose a lightweight attention feature enhancement module called LAFE. The proposed method is validated using observational data from the Arctic Yellow River Station during 2003-2004. Experimental results demonstrate that the fusion of multi-wavelength information effectively improves the auroral classification performance. In particular, our approach achieves state-of-the-art classification accuracy compared to previous auroral classification studies, and superior results in terms of accuracy and computational efficiency compared to existing multi-view methods.
    摘要 极光分类对极地研究起到关键作用,但现有的极光分类研究主要基于单一波长的图像,通常为557.7纳米。其他波长的图像尚未得到足够的关注,而多波长信息的集成仍然是一个未发掘的领域。这种局限性导致复杂的极光图像分类率较低。此外,这些研究,无论使用传统机器学习还是深度学习方法,都没有实现满意的准确率和速度协调。为解决这些挑战,本文提出了一种轻量级的极光多波长融合分类网络(MLCNet),基于多视图方法。首先,我们开发了一种轻量级的特征提取背bone(LCTNet),以提高分类率并处理逐渐增长的极光观测数据量。其次,因为极光存在多尺度空间结构,我们设计了一种新的多尺度重构特征模块(MSRM)。最后,为强调极光类别之间的区别信息,我们提出了一种轻量级的注意力特征增强模块(LAFE)。我们的方法在2003-2004年由北极黄河站的观测数据进行验证,实验结果表明,将多波长信息融合分类效果显著提高了极光分类性能。特别是,我们的方法与过去的极光分类研究相比,实现了状态机器学习的最佳分类率,并在计算效率和多视图方法之间具有优势。

Truly Scale-Equivariant Deep Nets with Fourier Layers

  • paper_url: http://arxiv.org/abs/2311.02922
  • repo_url: https://github.com/ashiq24/scale_equivarinat_fourier_layer
  • paper_authors: Md Ashiqur Rahman, Raymond A. Yeh
  • for: 这篇论文旨在提出一种具有尺度等变性的深度网络，以便在图像分割等任务中取得更好的效果。
  • methods: 该论文使用 Fourier 层来实现尺度等变性，并在离散域中考虑了抗锯齿处理。
  • results: 该模型在 MNIST-scale 和 STL-10 数据集上取得了具有竞争力的分类性能，同时保持零等变误差。
    Abstract In computer vision, models must be able to adapt to changes in image resolution to effectively carry out tasks such as image segmentation; this is known as scale-equivariance. Recent works have made progress in developing scale-equivariant convolutional neural networks, e.g., through weight-sharing and kernel resizing. However, these networks are not truly scale-equivariant in practice. Specifically, they do not consider anti-aliasing as they formulate the down-scaling operation in the continuous domain. To address this shortcoming, we directly formulate down-scaling in the discrete domain with consideration of anti-aliasing. We then propose a novel architecture based on Fourier layers to achieve truly scale-equivariant deep nets, i.e., absolute zero equivariance-error. Following prior works, we test this model on MNIST-scale and STL-10 datasets. Our proposed model achieves competitive classification performance while maintaining zero equivariance-error.
    摘要 在计算机视觉中,模型需要适应图像分辨率的变化才能有效完成图像分割等任务,这被称为缩放等变性。近期的研究已经在开发缩放等变的卷积神经网络方面取得进展,例如通过权重共享和核尺寸缩放。然而,这些网络在实践中并不是真正缩放等变的。具体来说,它们在连续域中表述下采样操作,没有考虑抗锯齿处理。为解决这一不足,我们直接在离散域中表述下采样操作,并考虑抗锯齿处理。我们随后提出一种基于傅里叶层的新架构,以实现真正缩放等变的深度网络,即绝对零等变误差。沿用先前工作的设置,我们在 MNIST-scale 和 STL-10 数据集上测试了该模型。我们提出的模型在保持零等变误差的同时取得了具有竞争力的分类性能。
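
The key observation is that anti-aliased down-scaling can be formulated directly in the discrete domain. Below is a minimal numpy sketch of band-limited downscaling via spectrum cropping; it only illustrates the anti-aliasing idea and is not the paper's Fourier-layer architecture.

```python
import numpy as np

def fourier_downscale(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Anti-aliased downscaling by cropping the centered 2D spectrum."""
    h, w = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    # keep only the low-frequency block that fits the target resolution
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    cropped = spec[top:top + out_h, left:left + out_w]
    small = np.fft.ifft2(np.fft.ifftshift(cropped)).real  # residual imaginary part dropped
    # rescale so the mean intensity is preserved despite the change in sample count
    return small * (out_h * out_w) / (h * w)

if __name__ == "__main__":
    x = np.random.rand(64, 64)
    y = fourier_downscale(x, 32, 32)
    print(y.shape, abs(x.mean() - y.mean()) < 1e-6)  # (32, 32) True
```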

Benchmarking Deep Facial Expression Recognition: An Extensive Protocol with Balanced Dataset in the Wild

  • paper_url: http://arxiv.org/abs/2311.02910
  • repo_url: None
  • paper_authors: Gianmarco Ipinze Tutuianu, Yang Liu, Ari Alamäki, Janne Kauttonen
  • for: 这篇论文旨在为人计算机交互中的表情识别(FER)技术提供实用的研究和推荐。
  • methods: 这篇论文使用了23种常见的网络架构,并按照一种统一的协议进行评估。具体来说,研究人员在不同的输入分辨率、类别均衡管理和预训练策略下进行了多种设置的研究,以描述对应的性能贡献。
  • results: 经过在三个大规模FER数据集上的广泛实验和实际的交叉验证,研究人员得出了一些关于深度FER方法在真实场景中部署的建议,并讨论了表情识别应用中的伦理规则、隐私问题和法规。
    Abstract Facial expression recognition (FER) is a crucial part of human-computer interaction. Existing FER methods achieve high accuracy and generalization based on different open-source deep models and training approaches. However, the performance of these methods is not always good when encountering practical settings, which are seldom explored. In this paper, we collected a new in-the-wild facial expression dataset for cross-domain validation. Twenty-three commonly used network architectures were implemented and evaluated following a uniform protocol. Moreover, various setups, in terms of input resolutions, class balance management, and pre-trained strategies, were verified to show the corresponding performance contribution. Based on extensive experiments on three large-scale FER datasets and our practical cross-validation, we ranked network architectures and summarized a set of recommendations on deploying deep FER methods in real scenarios. In addition, potential ethical rules, privacy issues, and regulations were discussed in practical FER applications such as marketing, education, and entertainment business.
    摘要 面部表达识别(FER)是人机交互的关键部分。现有的FER方法在不同的开源深度学习模型和训练方法上达到了高准确率和泛化。然而,这些方法在实际场景中的性能不总是好的,这些场景通常被忽略。在这篇论文中,我们收集了一个新的在野 facial expression 数据集,用于跨领域验证。我们实现了23种常用的网络架构,并按照一个固定的协议进行评估。此外,我们还对输入分辨率、类别平衡管理和预训练策略进行了不同的设置,以显示它们对性能的贡献。基于大量的实验和我们的实际核心验证,我们对深度FER方法的部署在实际场景中进行了排名和总结,并提出了一些应用中的建议。此外,我们还讨论了实际应用中的伦理规则、隐私问题和法规。

Human as Points: Explicit Point-based 3D Human Reconstruction from Single-view RGB Images

  • paper_url: http://arxiv.org/abs/2311.02892
  • repo_url: https://github.com/yztang4/hap
  • paper_authors: Yingzhi Tang, Qijian Zhang, Junhui Hou, Yebin Liu
  • for: This paper aims to improve the performance of single-view human reconstruction by proposing an explicit point-based framework called HaP, which leverages point clouds as the intermediate representation of the target geometric structure.
  • methods: The proposed HaP framework uses fully-explicit point cloud estimation, manipulation, generation, and refinement in the 3D geometric space, rather than implicit learning processes that can be ambiguous and less controllable. The framework also includes dedicated designs of specialized learning components and processing procedures.
  • results: The authors report quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results, demonstrating the effectiveness of the proposed framework. The results suggest a paradigm rollback to fully-explicit and geometry-centric algorithm design, which enables the use of various powerful point cloud modeling architectures and processing techniques.
    Abstract The latest trends in the research field of single-view human reconstruction devote to learning deep implicit functions constrained by explicit body shape priors. Despite the remarkable performance improvements compared with traditional processing pipelines, existing learning approaches still show different aspects of limitations in terms of flexibility, generalizability, robustness, and/or representation capability. To comprehensively address the above issues, in this paper, we investigate an explicit point-based human reconstruction framework called HaP, which adopts point clouds as the intermediate representation of the target geometric structure. Technically, our approach is featured by fully-explicit point cloud estimation, manipulation, generation, and refinement in the 3D geometric space, instead of an implicit learning process that can be ambiguous and less controllable. The overall workflow is carefully organized with dedicated designs of the corresponding specialized learning components as well as processing procedures. Extensive experiments demonstrate that our framework achieves quantitative performance improvements of 20% to 40% over current state-of-the-art methods, and better qualitative results. Our promising results may indicate a paradigm rollback to the fully-explicit and geometry-centric algorithm design, which enables to exploit various powerful point cloud modeling architectures and processing techniques. We will make our code and data publicly available at https://github.com/yztang4/HaP.
    摘要 单视图人体重建领域的最新研究趋势是学习受显式人体形状先验约束的深度隐式函数。尽管与传统处理流程相比性能有了显著提升,现有的学习方法在灵活性、泛化能力、鲁棒性和表示能力等方面仍存在不同程度的局限。为了全面解决上述问题,本文研究了一种名为 HaP 的显式基于点的人体重建框架,它采用点云作为目标几何结构的中间表示。在技术上,我们的方法以在三维几何空间中进行完全显式的点云估计、操作、生成和细化为特征,而非可能存在歧义且较难控制的隐式学习过程。整个工作流程经过精心组织,并配有专门设计的学习组件和处理流程。大量实验表明,我们的框架相比当前最先进的方法取得了20%到40%的量化性能提升,并获得了更好的定性结果。这些令人鼓舞的结果可能预示着向完全显式、以几何为中心的算法设计的范式回归,从而能够利用各种强大的点云建模架构和处理技术。我们将在 https://github.com/yztang4/HaP 公开代码和数据。

Stacked Autoencoder Based Feature Extraction and Superpixel Generation for Multifrequency PolSAR Image Classification

  • paper_url: http://arxiv.org/abs/2311.02887
  • repo_url: None
  • paper_authors: Tushar Gadhiya, Sumanth Tangirala, Anil K. Roy
  • for: 本研究提出了一种多频度 polarimetric synthetic aperture radar(PolSAR)图像分类算法。
  • methods: 使用PolSAR分解算法从每个频段提取33个特征,然后使用两层自编码器在保留有用信息的同时降低输入特征向量的维度。接着,使用SLIC算法生成超像素,并结合像素和超像素信息构建鲁棒的特征表示。最后,使用softmax分类器完成分类任务。
  • results: 在Flevoland数据集上进行了实验,发现所提方法优于文献中已有的方法。
    Abstract In this paper, we propose a classification algorithm for multifrequency Polarimetric Synthetic Aperture Radar (PolSAR) images. Using PolSAR decomposition algorithms, 33 features are extracted from each frequency band of the given image. Then, a two-layer autoencoder is used to reduce the dimensionality of the input feature vector while retaining useful features of the input. This reduced-dimensional feature vector is then applied to generate superpixels using the simple linear iterative clustering (SLIC) algorithm. Next, a robust feature representation is constructed using both pixel as well as superpixel information. Finally, a softmax classifier is used to perform the classification task. The advantage of using superpixels is that they preserve spatial information between neighbouring PolSAR pixels and therefore minimise the effect of speckle noise during classification. Experiments have been conducted on the Flevoland dataset and the proposed method was found to be superior to other methods available in the literature.
    摘要 在这篇论文中,我们提出了一种多频极化合成孔径雷达(PolSAR)图像的分类算法。使用PolSAR分解算法从图像的每个频段中提取33个特征。然后,使用两层自编码器在保留输入有用信息的同时降低输入特征向量的维度。将降维后的特征向量输入simple linear iterative clustering(SLIC)算法以生成超像素。接下来,结合像素和超像素信息构建鲁棒的特征表示。最后,使用softmax分类器完成分类任务。使用超像素的优点在于它保留了相邻PolSAR像素之间的空间信息,从而在分类过程中减小了斑点噪声的影响。我们在Flevoland数据集上进行了实验,发现所提方法优于文献中已有的其他方法。
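
A compact sketch of the pipeline's shape is given below, assuming 33-D per-pixel feature vectors, a two-layer autoencoder for dimensionality reduction, SLIC superpixels computed on a false-color rendering, and superpixel-averaged features concatenated with pixel features before a softmax classifier. All dimensions and hyperparameters are illustrative, not the paper's.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic

class TwoLayerAE(nn.Module):
    """Two-layer autoencoder compressing 33-D PolSAR features to a small code."""
    def __init__(self, in_dim=33, hid=20, code=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, code))
        self.dec = nn.Sequential(nn.Linear(code, hid), nn.ReLU(), nn.Linear(hid, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def superpixel_features(feats_hw_c, rgb_like, n_segments=500):
    """Average encoded per-pixel features inside each SLIC superpixel, then
    concatenate pixel-level and superpixel-level information."""
    segs = slic(rgb_like, n_segments=n_segments, compactness=10, start_label=0)
    pooled = np.zeros_like(feats_hw_c)
    for s in np.unique(segs):
        mask = segs == s
        pooled[mask] = feats_hw_c[mask].mean(axis=0)
    return np.concatenate([feats_hw_c, pooled], axis=-1), segs
# a linear layer + softmax over the concatenated features would complete the classifier
```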

Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box

  • paper_url: http://arxiv.org/abs/2311.02877
  • repo_url: None
  • paper_authors: Hao Zhang, Cong Xu, Shuaijie Zhang
  • for: This paper aims to improve the bounding box regression process in object detection by proposing a new loss function called Inner-IoU loss.
  • methods: The paper analyzes the BBR model and proposes using different scales of auxiliary bounding boxes to calculate losses, as well as introducing a scaling factor ratio to control the scale size of the auxiliary bounding boxes.
  • results: The proposed Inner-IoU loss function enhances the detection performance of object detection models, demonstrating its effectiveness and generalization ability.
    Abstract With the rapid development of detectors, the Bounding Box Regression (BBR) loss function has constantly been updated and optimized. However, existing IoU-based BBR losses still focus on accelerating convergence by adding new loss terms, ignoring the limitations of the IoU loss term itself. Although theoretically IoU loss can effectively describe the state of bounding box regression, in practical applications, it cannot adjust itself according to different detectors and detection tasks, and does not have strong generalization. Based on the above, we first analyzed the BBR model and concluded that distinguishing different regression samples and using different scales of auxiliary bounding boxes to calculate losses can effectively accelerate the bounding box regression process. For high IoU samples, using smaller auxiliary bounding boxes to calculate losses can accelerate convergence, while larger auxiliary bounding boxes are suitable for low IoU samples. Then, we propose Inner-IoU loss, which calculates IoU loss through auxiliary bounding boxes. For different datasets and detectors, we introduce a scaling factor ratio to control the scale size of the auxiliary bounding boxes for calculating losses. Finally, we integrate Inner-IoU into existing IoU-based loss functions for simulation and comparative experiments. The experimental results demonstrate a further enhancement in detection performance with the utilization of the method proposed in this paper, verifying the effectiveness and generalization ability of Inner-IoU loss.
    摘要 随着检测器的快速发展,边界框回归(BBR)损失函数不断更新和优化。然而,现有基于IoU的BBR仍然侧重于通过添加新的损失项来加速收敛,忽视了IoU损失项本身的局限性。虽然理论上IoU损失可以有效描述边界框回归的状态,但在实际应用中,它无法根据不同的检测器和检测任务进行自适应调整,泛化能力不强。基于以上分析,我们首先分析了BBR模型,并得出结论:区分不同的回归样本,并使用不同尺度的辅助边界框来计算损失,可以有效加速边界框回归过程。对于高IoU样本,使用较小的辅助边界框计算损失可以加速收敛;而较大的辅助边界框更适合低IoU样本。随后,我们提出了Inner-IoU损失,通过辅助边界框来计算IoU损失。针对不同的数据集和检测器,我们引入缩放因子比例来控制用于计算损失的辅助边界框的尺度大小。最后,我们将Inner-IoU集成到现有的基于IoU的损失函数中进行仿真和对比实验。实验结果表明,使用本文提出的方法可以进一步提高检测性能,验证了Inner-IoU损失的有效性和泛化能力。
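
A minimal sketch of the auxiliary-box idea follows: both predicted and ground-truth boxes are rescaled about their centers by a ratio, and the IoU loss is computed on these auxiliary boxes (ratio < 1 shrinks them, ratio > 1 enlarges them). The (cx, cy, w, h) box format and the exact way the ratio enters are assumptions based on the abstract.

```python
import torch

def inner_iou_loss(pred, target, ratio=0.7, eps=1e-7):
    """IoU loss computed on auxiliary boxes scaled about the box centers.

    pred, target: (N, 4) boxes in (cx, cy, w, h) format.
    ratio < 1 shrinks the auxiliary boxes (assumed useful for high-IoU samples),
    ratio > 1 enlarges them (assumed useful for low-IoU samples).
    """
    def to_corners(b):
        cx, cy, w, h = b.unbind(-1)
        w, h = w * ratio, h * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    px1, py1, px2, py2 = to_corners(pred)
    tx1, ty1, tx2, ty2 = to_corners(target)

    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()

if __name__ == "__main__":
    p = torch.tensor([[10.0, 10.0, 8.0, 6.0]])
    t = torch.tensor([[11.0, 10.0, 8.0, 6.0]])
    print(inner_iou_loss(p, t, ratio=0.7))
```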

Dynamic Neural Fields for Learning Atlases of 4D Fetal MRI Time-series

  • paper_url: http://arxiv.org/abs/2311.02874
  • repo_url: https://github.com/kidrauh/neural-atlasing
  • paper_authors: Zeen Chi, Zhongxiao Cong, Clinton J. Wang, Yingcheng Liu, Esra Abaci Turk, P. Ellen Grant, S. Mazdak Abulnaga, Polina Golland, Neel Dey
  • for: 利用神经场快速构建生物医学影像图谱。
  • methods: 将个体特定图谱的构建表述为学习可变形时空观测的神经场,从而实现胎儿动态BOLD MRI时间序列的个体化图谱构建与运动稳定。
  • results: 对宫内胎儿的动态BOLD MRI时间序列构建出高质量的个体化图谱,收敛速度约为现有方法的5-7倍,但在解剖重叠方面略逊于精心调优的基线。
    Abstract We present a method for fast biomedical image atlas construction using neural fields. Atlases are key to biomedical image analysis tasks, yet conventional and deep network estimation methods remain time-intensive. In this preliminary work, we frame subject-specific atlas building as learning a neural field of deformable spatiotemporal observations. We apply our method to learning subject-specific atlases and motion stabilization of dynamic BOLD MRI time-series of fetuses in utero. Our method yields high-quality atlases of fetal BOLD time-series with $\sim$5-7$\times$ faster convergence compared to existing work. While our method slightly underperforms well-tuned baselines in terms of anatomical overlap, it estimates templates significantly faster, thus enabling rapid processing and stabilization of large databases of 4D dynamic MRI acquisitions. Code is available at https://github.com/Kidrauh/neural-atlasing
    摘要 我们提出了一种使用神经场快速构建生物医学影像图谱的方法。图谱是生物医学影像分析任务的关键,但传统方法和深度网络估计方法的计算仍然耗时。在这项初步工作中,我们将个体特定图谱的构建表述为学习可变形时空观测的神经场。我们将该方法应用于宫内胎儿动态BOLD MRI时间序列的个体化图谱学习和运动稳定。我们的方法能够生成高质量的胎儿BOLD时间序列图谱,收敛速度约为现有工作的5-7倍。虽然我们的方法在解剖重叠方面略逊于精心调优的基线,但其模板估计速度显著更快,从而能够快速处理和稳定大规模的4D动态MRI采集数据库。代码可在 https://github.com/Kidrauh/neural-atlasing 获取。
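
The deformable spatiotemporal observation can be pictured as a coordinate network that maps a space-time query (x, y, z, t) to a displacement applied to an atlas coordinate. The sketch below only captures that shape of model with illustrative layer sizes; the rendering and losses of the released implementation are omitted.

```python
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """MLP mapping a space-time coordinate to a 3D displacement of the atlas."""
    def __init__(self, hidden=128, depth=4):
        super().__init__()
        layers, in_dim = [], 4  # (x, y, z, t)
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers += [nn.Linear(hidden, 3)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, xyzt):
        # xyzt: (N, 4) normalized coordinates; returns (N, 3) displacements
        return self.mlp(xyzt)

if __name__ == "__main__":
    field = DeformationField()
    query = torch.rand(1024, 4) * 2 - 1
    warped = query[:, :3] + field(query)  # atlas coordinates sampled at warped locations
    print(warped.shape)  # torch.Size([1024, 3])
```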

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

  • paper_url: http://arxiv.org/abs/2311.02873
  • repo_url: https://github.com/shiyoung77/ovir-3d
  • paper_authors: Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, Kostas Bekris
  • for: 开发了一种基于文本查询的开放词汇3D物体实例检索方法,无需使用任何3D数据进行训练。
  • methods: 该方法将与文本对齐的2D区域提案通过多视图融合提升到3D空间,依据实例特征与文本查询的相似度返回排序的3D实例分割结果;其中2D区域提案网络可以利用比3D数据集更易获取、规模更大的2D数据集。
  • results: 实验结果表明,该融合过程对大多数室内3D场景可以实时运行,且无需在3D空间进行额外训练。在公开数据集和真实机器人上的实验也显示了该方法在机器人导航和操作等应用中的潜力。
    Abstract This work presents OVIR-3D, a straightforward yet effective method for open-vocabulary 3D object instance retrieval without using any 3D data for training. Given a language query, the proposed method is able to return a ranked set of 3D object instance segments based on the feature similarity of the instance and the text query. This is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space, where the 2D region proposal network could leverage 2D datasets, which are more accessible and typically larger than 3D datasets. The proposed fusion process is efficient as it can be performed in real-time for most indoor 3D scenes and does not require additional training in 3D space. Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.
    摘要 这项工作介绍了OVIR-3D,一种简单而有效的方法,可在不使用任何3D训练数据的情况下进行开放词汇的3D物体实例检索。给定一个语言查询,所提方法能够根据实例与文本查询之间的特征相似度,返回一组排序的3D物体实例分割。这是通过将与文本对齐的2D区域提案多视图融合到3D空间中实现的,其中2D区域提案网络可以利用2D数据集,这些数据集通常比3D数据集更易获取且规模更大。所提出的融合过程非常高效,对大多数室内3D场景可以实时运行,且无需在3D空间进行额外训练。在公开数据集和真实机器人上的实验证明了该方法的有效性及其在机器人导航和操作中的应用潜力。
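
A hypothetical numpy sketch of the fusion and retrieval logic is shown below: 3D points are projected into each posed view, text-aligned 2D region features are accumulated onto the points they cover, and points are ranked by cosine similarity with the query embedding. Camera conventions, data layouts, and function names are assumptions, not the released code.

```python
import numpy as np

def project(points_w, K, T_cw):
    """Project Nx3 world points with intrinsics K and world-to-camera pose T_cw (4x4)."""
    p_c = T_cw[:3, :3] @ points_w.T + T_cw[:3, 3:4]            # (3, N)
    valid = p_c[2] > 1e-6
    uv = (K @ p_c)[:2] / np.clip(p_c[2], 1e-6, None)           # (2, N)
    return uv.T, valid

def fuse_region_features(points_w, views, feat_dim=512):
    """Accumulate per-view region features onto the 3D points they cover.

    views: list of (K, T_cw, region_masks (R,H,W) bool, region_feats (R,feat_dim)).
    """
    acc = np.zeros((len(points_w), feat_dim))
    cnt = np.zeros(len(points_w))
    for K, T_cw, region_masks, region_feats in views:
        uv, valid = project(points_w, K, T_cw)
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        h, w = region_masks.shape[1:]
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        for mask, feat in zip(region_masks, region_feats):      # one 2D region proposal each
            hit = inside.copy()
            hit[inside] = mask[v[inside], u[inside]]
            acc[hit] += feat
            cnt[hit] += 1
    return acc / np.clip(cnt[:, None], 1, None)

def rank_by_text(point_feats, text_emb):
    """Rank points by cosine similarity between fused features and a text embedding."""
    sim = point_feats @ text_emb / (
        np.linalg.norm(point_feats, axis=1) * np.linalg.norm(text_emb) + 1e-8)
    return np.argsort(-sim)
```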

FocusTune: Tuning Visual Localization through Focus-Guided Sampling

  • paper_url: http://arxiv.org/abs/2311.02872
  • repo_url: https://github.com/sontung/focus-tune
  • paper_authors: Son Tung Nguyen, Alejandro Fontan, Michael Milford, Tobias Fischer
  • for: 提高视觉定位算法的性能
  • methods: 使用聚焦引导的采样技术,利用关键几何约束,引导场景坐标回归模型关注对3D点三角测量至关重要的区域
  • results: 与现有最先进模型的性能相当或更优,同时保持ACE模型的低存储和低计算需求,例如在 Cambridge Landmarks 数据集上将单一模型和集成模型的平移误差分别从25厘米降至19厘米、从17厘米降至15厘米,提升了其在移动机器人和增强现实等领域应用的可行性
    Abstract We propose FocusTune, a focus-guided sampling technique to improve the performance of visual localization algorithms. FocusTune directs a scene coordinate regression model towards regions critical for 3D point triangulation by exploiting key geometric constraints. Specifically, rather than uniformly sampling points across the image for training the scene coordinate regression model, we instead re-project 3D scene coordinates onto the 2D image plane and sample within a local neighborhood of the re-projected points. While our proposed sampling strategy is generally applicable, we showcase FocusTune by integrating it with the recently introduced Accelerated Coordinate Encoding (ACE) model. Our results demonstrate that FocusTune both improves or matches state-of-the-art performance whilst keeping ACE's appealing low storage and compute requirements, for example reducing translation error from 25 to 19 and 17 to 15 cm for single and ensemble models, respectively, on the Cambridge Landmarks dataset. This combination of high performance and low compute and storage requirements is particularly promising for applications in areas like mobile robotics and augmented reality. We made our code available at \url{https://github.com/sontung/focus-tune}.
    摘要 我们提出了FocusTune,一种用于提升视觉定位算法性能的聚焦引导采样技术。FocusTune利用关键的几何约束,引导场景坐标回归模型关注对3D点三角测量至关重要的区域。具体来说,我们不是在整幅图像上均匀采样用于训练场景坐标回归模型的点,而是将3D场景坐标重投影到2D图像平面上,并在重投影点的局部邻域内进行采样。虽然我们提出的采样策略具有普遍适用性,但我们以最近提出的加速坐标编码(ACE)模型为例来展示FocusTune。结果表明,FocusTune在保持ACE低存储和低计算需求的同时,达到或超过了现有最先进方法的性能;例如在 Cambridge Landmarks 数据集上,单一模型和集成模型的平移误差分别从25厘米降至19厘米、从17厘米降至15厘米。这种高性能与低计算、低存储需求的组合对移动机器人和增强现实等应用尤其有前景。我们的代码可在 https://github.com/sontung/focus-tune 获取。
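
Below is a small numpy sketch of focus-guided sampling under assumed pinhole conventions: ground-truth 3D scene coordinates are re-projected into the image, and training pixels are drawn only from a local neighborhood around the re-projected points rather than uniformly across the image. Radius, sample count, and function names are illustrative assumptions.

```python
import numpy as np

def focus_guided_samples(scene_xyz, K, T_cw, img_hw, radius=4, n_samples=2048, rng=None):
    """Sample pixel locations near re-projected 3D scene coordinates."""
    rng = rng or np.random.default_rng(0)
    h, w = img_hw
    p_c = T_cw[:3, :3] @ scene_xyz.T + T_cw[:3, 3:4]
    z = p_c[2]
    uv = (K @ p_c)[:2] / np.clip(z, 1e-6, None)
    u, v = uv[0].astype(int), uv[1].astype(int)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # build a binary "focus" mask by dilating the re-projected points by `radius`
    mask = np.zeros((h, w), dtype=bool)
    mask[v[ok], u[ok]] = True
    ys, xs = np.nonzero(mask)
    focus = np.zeros((h, w), dtype=bool)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            focus[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)] = True

    cand_y, cand_x = np.nonzero(focus)
    pick = rng.choice(len(cand_y), size=min(n_samples, len(cand_y)), replace=False)
    return np.stack([cand_x[pick], cand_y[pick]], axis=1)  # (n, 2) pixel coords (u, v)
```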

Neural-based Compression Scheme for Solar Image Data

  • paper_url: http://arxiv.org/abs/2311.02855
  • repo_url: None
  • paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Jeremy A. Grajeda, Piyush M. Mehta, Nasser M. Nasrabadi, Laura E. Boucheron, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: The paper is written for the purpose of proposing a neural network-based lossy compression method for data-intensive imagery missions, specifically for NASA's SDO mission.
  • methods: The proposed method uses an adversarially trained neural network with local and non-local attention modules to capture the local and global structure of the image, resulting in a better trade-off in rate-distortion (RD) compared to conventional hand-engineered codecs. The RD variational autoencoder is jointly trained with a channel-dependent entropy model as a shared prior between the analysis and synthesis transforms to make the entropy coding of the latent code more effective.
  • results: The proposed algorithm outperforms currently-in-use and state-of-the-art codecs such as JPEG and JPEG-2000 in terms of RD performance when compressing extreme-ultraviolet (EUV) data. The algorithm is able to achieve consistent segmentations of coronal holes (CH) in the compressed images, even at a compression rate of $\sim0.1$ bits per pixel.
    Abstract Studying the solar system and especially the Sun relies on the data gathered daily from space missions. These missions are data-intensive and compressing this data to make them efficiently transferable to the ground station is a twofold decision to make. Stronger compression methods, by distorting the data, can increase data throughput at the cost of accuracy which could affect scientific analysis of the data. On the other hand, preserving subtle details in the compressed data requires a high amount of data to be transferred, reducing the desired gains from compression. In this work, we propose a neural network-based lossy compression method to be used in NASA's data-intensive imagery missions. We chose NASA's SDO mission which transmits 1.4 terabytes of data each day as a proof of concept for the proposed algorithm. In this work, we propose an adversarially trained neural network, equipped with local and non-local attention modules to capture both the local and global structure of the image resulting in a better trade-off in rate-distortion (RD) compared to conventional hand-engineered codecs. The RD variational autoencoder used in this work is jointly trained with a channel-dependent entropy model as a shared prior between the analysis and synthesis transforms to make the entropy coding of the latent code more effective. Our neural image compression algorithm outperforms currently-in-use and state-of-the-art codecs such as JPEG and JPEG-2000 in terms of the RD performance when compressing extreme-ultraviolet (EUV) data. As a proof of concept for use of this algorithm in SDO data analysis, we have performed coronal hole (CH) detection using our compressed images, and generated consistent segmentations, even at a compression rate of $\sim0.1$ bits per pixel (compared to 8 bits per pixel on the original data) using EUV data from SDO.
    摘要 研究太阳系尤其是太阳依赖于空间任务每天采集的数据。这些任务数据量庞大,为了将数据高效地传回地面站,需要在压缩方式上做出双重权衡:更强的压缩方法会使数据失真,以牺牲精度为代价提高数据吞吐量,这可能影响数据的科学分析;另一方面,要在压缩数据中保留细微的细节,则需要传输大量数据,从而削弱压缩带来的收益。在这项工作中,我们提出了一种基于神经网络的有损压缩方法,用于NASA的数据密集型成像任务,并以每天传输1.4TB数据的NASA SDO任务作为概念验证。我们提出了一种经过对抗训练的神经网络,配备局部和非局部注意力模块,以捕捉图像的局部和全局结构,从而在率失真(RD)权衡上优于传统手工设计的编解码器。本工作中使用的RD变分自编码器与一个通道相关的熵模型联合训练,该熵模型作为分析变换与合成变换之间的共享先验,使潜在编码的熵编码更加有效。在压缩极紫外(EUV)数据时,我们的神经图像压缩算法在RD性能上优于目前使用的以及最先进的编解码器,如JPEG和JPEG-2000。作为该算法用于SDO数据分析的概念验证,我们使用压缩后的图像进行了日冕洞(CH)检测,即使在约0.1比特/像素的压缩率下(原始数据为8比特/像素),仍能得到一致的分割结果。
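
The training objective balances reconstruction distortion against the bit-rate estimated by the entropy model. The hedged sketch below shows only that rate-distortion loss; the adversarial term, attention modules, and the channel-dependent entropy model itself are not shown, and the likelihood tensor is assumed to come from such a model.

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(x, x_hat, latent_likelihoods, lam=0.01):
    """Combine reconstruction distortion with estimated bits-per-pixel.

    latent_likelihoods: per-element probabilities of the quantized latents
    under the (channel-dependent) entropy model, values in (0, 1].
    """
    distortion = F.mse_loss(x_hat, x)
    num_pixels = x.shape[0] * x.shape[-2] * x.shape[-1]
    rate_bpp = -torch.log2(latent_likelihoods).sum() / num_pixels
    return distortion + lam * rate_bpp, distortion.detach(), rate_bpp.detach()
```

Sweeping the trade-off weight `lam` traces out different operating points on the rate-distortion curve, which is how codecs are typically compared against JPEG and JPEG-2000.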

Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video

  • paper_url: http://arxiv.org/abs/2311.02848
  • repo_url: https://github.com/yanqinJiang/Consistent4D
  • paper_authors: Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, Yao Yao
  • for: This paper proposes a novel approach for generating 4D dynamic objects from uncalibrated monocular videos.
  • methods: 360-degree dynamic object reconstruction is cast as a 4D generation problem, eliminating the need for multi-view data collection and camera calibration. An object-level 3D-aware image diffusion model serves as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF); a Cascade DyNeRF provides stable convergence and temporal continuity under a supervision signal that is discrete along the time axis, and an Interpolation-driven Consistency Loss enforces spatial and temporal consistency.
  • results: Consistent4D performs competitively with prior art, opening up new possibilities for 4D dynamic object generation from monocular videos, and also shows advantages for conventional text-to-3D generation tasks. The project page is https://consistent4d.github.io/.
    Abstract In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/.
    摘要 在这篇论文中,我们提出了一种新的方法,即Consistent4D,用于从单视图视频中生成4D动态对象。我们Uniquely将360度动态对象重建问题作为4D生成问题来处理,这意味着不需要繁琐的多视图数据收集和摄像头卡利布ر。我们通过利用物体层3D意识图像扩散模型作为主要超视图信号来培训动态神经辐射场(DyNeRF)。我们提出了一种升级的DyNeRF来实现稳定的整合和时间连续性,并引入了一种 interpolate-driven 一致损失来保证空间和时间一致性。我们通过对DyNeRF和预训练视频 interpolate模型生成的帧进行对比来优化这种损失函数。我们的项目页面是https://consistent4d.github.io/.Here's the translation in Traditional Chinese:在这篇论文中,我们提出了一种新的方法,即Consistent4D,用于从单视角影像中生成4D动态物件。我们Uniquely将360度动态物件重建问题作为4D生成问题来处理,这意味着不需要繁琐的多视角数据收集和摄像头卡利布。我们通过利用物体层3D意识图像扩散模型作为主要超视射信号来培训动态神经辐射场(DyNeRF)。我们提出了一种升级的DyNeRF来实现稳定的整合和时间连续性,并引入了一种 interpolate-driven 一致损失来保证空间和时间一致性。我们通过对DyNeRF和预训练影像 interpolate模型生成的帧进行比较来优化这种损失函数。我们的项目页面是https://consistent4d.github.io/.

Leveraging sinusoidal representation networks to predict fMRI signals from EEG

  • paper_url: http://arxiv.org/abs/2311.04234
  • repo_url: None
  • paper_authors: Yamin Li, Ange Lou, Catie Chang
  • for: 这篇论文的目的是利用多通道EEG预测fMRI信号,以弥补EEG空间分辨率的不足并拓展fMRI的应用范围。
  • methods: 论文提出了一种基于正弦表示网络(SIREN)的新架构,从EEG中学习脑动态的频率信息,从而减少特征工程,并使用编码器-解码器结构从特定脑区重建fMRI信号。
  • results: 在8名参与者的同步EEG-fMRI数据集上的实验结果表明,该模型表现出色,并超越了近期的最先进模型。这些结果表明在深度神经网络中使用周期激活函数来建模功能神经影像数据具有潜力。
    Abstract In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective, imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model, and indicates the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.
    摘要 现代神经科学中,功能核磁共振成像(fMRI)已成为非侵入式窗口,提供整个大脑活动的动态图像。然而,fMRI受到血液干扰和高成本、固定性和金属设备不兼容等限制。电enzephalography(EEG)可以直接记录 cortical 电动力谱高时间分辨率,但是它的空间分辨率较有限,无法回归深部脑结构信息。能够从 EEG 获取 fMRI 信息,将可以实现成本下降,扫描更广泛的脑区域。此外,跨模态模型可以促进 fMRI 信号的解释。然而,由于 EEG 和 fMRI 都是高维度和易受损的,目前是困难的模型 fMRI 从 EEG。为解决这个挑战,我们提出了一种新的建议,可以直接从多通道 EEG 预测 fMRI 信号,不需要显式的特征工程。我们的模型实现了声律表示网络(SIREN)来学习 brain 动力学中的频率信息,该信息作为 EEG 输入,并由后续的编码器-解码器组合来有效地重建 fMRI 信号。我们使用了 8 名参与者的同时 EEG-fMRI 数据集进行评估,并研究了其在预测深部 fMRI 信号方面的潜力。结果表明,我们的模型超过了当前状态的最佳模型,并表明了在深度神经网络中使用律动函数可以有效地模型功能神经成像数据。

Flexible Multi-Generator Model with Fused Spatiotemporal Graph for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.02835
  • repo_url: None
  • paper_authors: Peiyuan Zhu, Fengxia Han, Hao Deng
  • for: 轨迹预测在车载雷达系统中起着关键作用,有助于自动驾驶中的精确跟踪和决策。
  • methods: 我们提出了一种轨迹预测框架,能够捕捉行人之间社交交互的变化并建模行人轨迹的不连通流形。该框架基于融合时空图来更好地建模场景中行人之间的复杂交互,并采用多生成器架构,其中包含一个灵活的生成器选择网络,用于在多个生成器上学习分布。
  • results: 我们的框架在多个具有挑战性的数据集上,相比若干基线方法取得了最先进的性能。
    Abstract Trajectory prediction plays a vital role in automotive radar systems, facilitating precise tracking and decision-making in autonomous driving. Generative adversarial networks with the ability to learn a distribution over future trajectories tend to predict out-of-distribution samples, which typically occurs when the distribution of forthcoming paths comprises a blend of various manifolds that may be disconnected. To address this issue, we propose a trajectory prediction framework, which can capture the social interaction variations and model disconnected manifolds of pedestrian trajectories. Our framework is based on a fused spatiotemporal graph to better model the complex interactions of pedestrians in a scene, and a multi-generator architecture that incorporates a flexible generator selector network on generated trajectories to learn a distribution over multiple generators. We show that our framework achieves state-of-the-art performance compared with several baselines on different challenging datasets.
    摘要 轨迹预测在车载雷达系统中起着至关重要的作用,有助于自动驾驶中的精确跟踪和决策。能够学习未来轨迹分布的生成对抗网络往往会预测出分布外的样本,这通常发生在未来路径的分布由多个可能彼此不连通的流形混合而成时。为了解决这一问题,我们提出了一种轨迹预测框架,能够捕捉社交交互的变化并建模行人轨迹的不连通流形。我们的框架基于融合时空图,以更好地建模场景中行人之间的复杂交互;并采用多生成器架构,其中包含一个作用于生成轨迹的灵活生成器选择网络,用于在多个生成器上学习分布。实验表明,我们的框架在多个具有挑战性的数据集上相比若干基线方法取得了最先进的性能。
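
Below is a hedged PyTorch sketch of the multi-generator idea: several trajectory generators share a fused spatiotemporal-graph feature, and a selector network outputs a distribution over generators from which one is sampled per pedestrian. Layer sizes, the number of generators, and the output parameterization are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MultiGeneratorHead(nn.Module):
    """Multiple trajectory generators with a learned selector distribution."""
    def __init__(self, feat_dim=64, horizon=12, n_generators=4):
        super().__init__()
        self.generators = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, horizon * 2))
            for _ in range(n_generators)
        ])
        self.selector = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                      nn.Linear(64, n_generators))

    def forward(self, h):
        # h: (B, feat_dim) fused spatiotemporal-graph feature per pedestrian
        weights = torch.softmax(self.selector(h), dim=-1)             # (B, K)
        trajs = torch.stack([g(h) for g in self.generators], dim=1)   # (B, K, horizon*2)
        idx = torch.multinomial(weights, 1).squeeze(-1)               # sample one generator
        return trajs[torch.arange(h.size(0)), idx], weights

if __name__ == "__main__":
    head = MultiGeneratorHead()
    traj, w = head(torch.randn(5, 64))
    print(traj.shape, w.shape)  # torch.Size([5, 24]) torch.Size([5, 4])
```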

SemanticTopoLoop: Semantic Loop Closure With 3D Topological Graph Based on Quadric-Level Object Map

  • paper_url: http://arxiv.org/abs/2311.02831
  • repo_url: None
  • paper_authors: Zhenzhong Cao
  • for: Improving the accuracy and robustness of SLAM systems in real-world scenarios
  • methods: An object-level data association method based on multi-level verification and a semantic loop closure method based on a quadric-level object map topology
  • results: Achieving high-precision loop closure over a wide field of view, and outperforming existing state-of-the-art methods in terms of precision, recall, and localization accuracy metrics.
    Abstract Loop closure, as one of the crucial components in SLAM, plays an essential role in correcting the accumulated errors. Traditional appearance-based methods, such as bag-of-words models, are often limited by local 2D features and the volume of training data, making them less versatile and robust in real-world scenarios, leading to missed detections or false positives detections in loop closure. To address these issues, we first propose a object-level data association method based on multi-level verification, which can associate 2D semantic features of current frame with 3D objects landmarks of map. Next, taking advantage of these association relations, we introduce a semantic loop closure method based on quadric-level object map topology, which represents scenes through the topological graph of objects and achieves accurate loop closure at a wide field of view by comparing differences in the topological graphs. Finally, we integrate these two methods into a complete object-aware SLAM system. Qualitative experiments and ablation studies demonstrate the effectiveness and robustness of the proposed object-level data association algorithm. Quantitative experiments show that our semantic loop closure method outperforms existing state-of-the-art methods in terms of precision, recall and localization accuracy metrics.
    摘要 回环检测作为SLAM中的关键组成部分,在校正累积误差方面起着至关重要的作用。传统的基于外观的方法(如词袋模型)往往受限于局部2D特征和训练数据的规模,在真实场景中的通用性和鲁棒性不足,导致回环检测出现漏检或误检。为了解决这些问题,我们首先提出了一种基于多级验证的对象级数据关联方法,能够将当前帧的2D语义特征与地图中的3D对象路标相关联。接着,利用这些关联关系,我们引入了一种基于二次曲面级对象地图拓扑的语义回环检测方法,通过对象的拓扑图来表示场景,并通过比较拓扑图之间的差异,在宽视野下实现精确的回环检测。最后,我们将这两种方法集成到一个完整的对象感知SLAM系统中。定性实验和消融研究证明了所提对象级数据关联算法的有效性和鲁棒性;定量实验表明,我们的语义回环检测方法在精度、召回率和定位精度等指标上优于现有的最先进方法。

InstructPix2NeRF: Instructed 3D Portrait Editing from a Single Image

  • paper_url: http://arxiv.org/abs/2311.02826
  • repo_url: https://github.com/mybabyyh/instructpix2nerf
  • paper_authors: Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, Jun Zhu
  • for: This paper aims to solve the problem of human-instructed 3D-aware portrait editing for open-world images, which has been under-explored due to the lack of labeled human face 3D datasets and effective architectures.
  • methods: The proposed method, InstructPix2NeRF, is an end-to-end diffusion-based framework that enables instructed 3D-aware portrait editing from a single open-world image with human instructions. It uses a conditional latent 3D diffusion process to lift 2D editing to 3D space and learn the correlation between the paired images’ difference and the instructions via triplet data.
  • results: The proposed method achieves effective and multi-semantic editing through one single pass with the portrait identity well-preserved. Additionally, an identity consistency module is proposed to increase the multi-view 3D identity consistency. Extensive experiments show the effectiveness of the method and its superiority against strong baselines quantitatively and qualitatively.
    Abstract With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we could even achieve multi-semantic editing through one single pass with the portrait identity well-preserved. Besides, we further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority against strong baselines quantitatively and qualitatively.
    摘要 随着神经辐射场(NeRF)在3D感知人像编辑中的成功,许多工作在质量和3D一致性方面都取得了可喜的成果。然而,这些方法在处理自然语言编辑指令时严重依赖逐提示的优化。由于缺乏带标注的人脸3D数据集和有效的架构,端到端地对开放世界人像进行人类指令驱动的3D感知编辑仍然缺乏探索。为了解决这一问题,我们提出了一种端到端的基于扩散的框架,称为InstructPix2NeRF,它能够根据人类指令对单张开放世界图像进行3D感知的人像编辑。其核心是一个条件潜在3D扩散过程,通过三元组数据学习成对图像之间的差异与指令的相关性,将2D编辑提升到3D空间。借助我们提出的token位置随机化策略,我们甚至可以在单次前向传播中实现多语义编辑,同时很好地保留人像身份。此外,我们还提出了一个身份一致性模块,直接将提取的身份信号调制到扩散过程中,从而提升多视图3D身份一致性。大量实验验证了我们方法的有效性,并在定量和定性上展示了其相对于强基线的优越性。

Efficient, Self-Supervised Human Pose Estimation with Inductive Prior Tuning

  • paper_url: http://arxiv.org/abs/2311.02815
  • repo_url: https://github.com/princetonvisualai/hpe-inductive-prior-tuning
  • paper_authors: Nobline Yoo, Olga Russakovsky
  • for: 本研究旨在提升自监督人体姿态估计(HPE)的性能。
  • methods: 本研究将HPE任务重新表述为重建问题,从而能够利用大量未标注的视觉数据,尽管目前精度仍有差距。
  • results: 研究人员分析了重建质量与姿态估计精度之间的关系,开发了一个新的模型流程,其性能超过了启发本工作的基线,且仅使用不到三分之一的训练数据;此外还提出了一个适用于自监督设置的新评价指标,用于衡量预测的身体部位长度比例的一致性。
    Abstract The goal of 2D human pose estimation (HPE) is to localize anatomical landmarks, given an image of a person in a pose. SOTA techniques make use of thousands of labeled figures (finetuning transformers or training deep CNNs), acquired using labor-intensive crowdsourcing. On the other hand, self-supervised methods re-frame the HPE task as a reconstruction problem, enabling them to leverage the vast amount of unlabeled visual data, though at the present cost of accuracy. In this work, we explore ways to improve self-supervised HPE. We (1) analyze the relationship between reconstruction quality and pose estimation accuracy, (2) develop a model pipeline that outperforms the baseline which inspired our work, using less than one-third the amount of training data, and (3) offer a new metric suitable for self-supervised settings that measures the consistency of predicted body part length proportions. We show that a combination of well-engineered reconstruction losses and inductive priors can help coordinate pose learning alongside reconstruction in a self-supervised paradigm.
    摘要 2D人体姿态估计(HPE)的目标是在给定人体姿势图像的情况下定位解剖关键点。当前最先进的技术依赖于数千张带标注的人体图像(微调Transformer或训练深度CNN),这些标注通过劳动密集的众包获得。另一方面,自监督方法将HPE任务重新表述为重建问题,从而能够利用海量的未标注视觉数据,但目前精度有所欠缺。在这项工作中,我们探索了改进自监督HPE的途径。我们(1)分析了重建质量与姿态估计精度之间的关系,(2)开发了一个模型流程,其性能超过了启发本工作的基线,且仅使用不到三分之一的训练数据,(3)提出了一个适用于自监督设置的新指标,用于衡量预测的身体部位长度比例的一致性。我们表明,精心设计的重建损失与归纳先验相结合,能够在自监督范式中协调姿态学习与重建。
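
The proposed metric measures the consistency of predicted body-part length proportions. One plausible numpy implementation is sketched below, where the limb definitions (COCO-style joint indices) and the coefficient-of-variation aggregation are assumptions rather than the paper's exact formula.

```python
import numpy as np

# hypothetical limb definitions as (joint_a, joint_b) index pairs (COCO-style)
LIMBS = [(5, 7), (7, 9),      # left upper arm, left forearm
         (6, 8), (8, 10),     # right upper arm, right forearm
         (11, 13), (13, 15),  # left thigh, left shin
         (12, 14), (14, 16)]  # right thigh, right shin

def limb_proportion_consistency(keypoints):
    """keypoints: (T, J, 2) predicted 2D joints over T frames.

    Returns the mean coefficient of variation of normalized limb lengths;
    lower values mean more consistent body-part proportions across frames.
    """
    lengths = np.stack([
        np.linalg.norm(keypoints[:, a] - keypoints[:, b], axis=-1) for a, b in LIMBS
    ], axis=1)                                                     # (T, n_limbs)
    props = lengths / (lengths.sum(axis=1, keepdims=True) + 1e-8)  # scale-invariant proportions
    cv = props.std(axis=0) / (props.mean(axis=0) + 1e-8)
    return float(cv.mean())

if __name__ == "__main__":
    preds = np.random.rand(50, 17, 2) * 100
    print(limb_proportion_consistency(preds))
```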

Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.02803
  • repo_url: None
  • paper_authors: Hai Phan, Cindy Le, Vu Le, Yihui He, Anh Totti Nguyen
  • for: 本研究旨在提高face identification的精度和效率,提出了一种基于Vision Transformers(ViTs)的新方法。
  • methods: 本研究提出了一种双图像视觉Transformer,利用交叉注意力在图像块(patch)级别比较两张图像。
  • results: 在200万对图像上训练后,该模型在分布外数据上达到了与DeepFace-EMD相当的准确率,但推理速度比DeepFace-EMD快两倍以上,并通过人类研究展示了模型基于交叉注意力可视化的可解释性。
    Abstract Most face identification approaches employ a Siamese neural network to compare two images at the image embedding level. Yet, this technique can be subject to occlusion (e.g. faces with masks or sunglasses) and out-of-distribution data. DeepFace-EMD (Phan et al. 2022) reaches state-of-the-art accuracy on out-of-distribution data by first comparing two images at the image level, and then at the patch level. Yet, its later patch-wise re-ranking stage admits a large $O(n^3 \log n)$ time complexity (for $n$ patches in an image) due to the optimal transport optimization. In this paper, we propose a novel, 2-image Vision Transformers (ViTs) that compares two images at the patch level using cross-attention. After training on 2M pairs of images on CASIA Webface (Yi et al. 2014), our model performs at a comparable accuracy as DeepFace-EMD on out-of-distribution data, yet at an inference speed more than twice as fast as DeepFace-EMD (Phan et al. 2022). In addition, via a human study, our model shows promising explainability through the visualization of cross-attention. We believe our work can inspire more explorations in using ViTs for face identification.
    摘要 大多数人脸识别方法采用孪生神经网络在图像嵌入层面比较两张图像。然而,这种技术容易受到遮挡(例如戴口罩或太阳镜的人脸)和分布外数据的影响。DeepFace-EMD(Phan et al. 2022)通过先在图像层面、再在图像块层面比较两张图像,在分布外数据上达到了最先进的准确率。但其后续的逐块重排序阶段由于最优传输优化,具有较高的 $O(n^3 \log n)$ 时间复杂度(其中 n 为图像中的块数)。在本文中,我们提出了一种新的双图像视觉Transformer(ViT),利用交叉注意力在图像块层面比较两张图像。在CASIA Webface(Yi et al. 2014)的200万对图像上训练后,我们的模型在分布外数据上达到了与DeepFace-EMD相当的准确率,而推理速度是DeepFace-EMD(Phan et al. 2022)的两倍以上。此外,通过人类研究,我们的模型借助交叉注意力的可视化展示了良好的可解释性。我们相信这项工作能够激发更多利用ViT进行人脸识别的探索。
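
Below is a minimal sketch of comparing two face images at the patch level with cross-attention, using patch embeddings and torch's built-in multi-head attention. It illustrates the mechanism (and how the attention weights could be visualized for interpretability) rather than the authors' released model; the embedding dimension and patch count are assumptions.

```python
import torch
import torch.nn as nn

class PatchCrossAttention(nn.Module):
    """Score a pair of images by cross-attending between their patch embeddings."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, patches_a, patches_b):
        # patches_a, patches_b: (B, N, dim) patch embeddings from a shared ViT backbone
        attended, attn_weights = self.attn(patches_a, patches_b, patches_b)
        pair_score = self.score(attended.mean(dim=1)).squeeze(-1)  # (B,) similarity logit
        return pair_score, attn_weights                            # weights can be visualized

if __name__ == "__main__":
    m = PatchCrossAttention()
    a, b = torch.randn(2, 196, 384), torch.randn(2, 196, 384)
    score, w = m(a, b)
    print(score.shape, w.shape)  # torch.Size([2]) torch.Size([2, 196, 196])
```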