cs.CV - 2023-10-04

ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements

  • paper_url: http://arxiv.org/abs/2310.03140
  • repo_url: https://github.com/bryanbocao/vifit
  • paper_authors: Bryan Bo Cao, Abrar Alali, Hansi Liu, Nicholas Meegan, Marco Gruteser, Kristin Dana, Ashwin Ashok, Shubham Jain
  • for: This work aims to improve human tracking in camera-based IoT applications such as security surveillance, smart-city traffic safety enhancement, and vehicle-to-pedestrian communication.
  • methods: A transformer-based model reconstructs vision bounding-box trajectories from phone data (IMU and Wi-Fi Fine Time Measurements), leveraging the transformer's ability to model long-term time series.
  • results: Compared with the prior state-of-the-art X-Translator model, ViFiT reconstructs vision bounding-box trajectories better, achieving an MRFR of 0.65 and a frame reduction rate of 97.76%.
    Abstract Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, vehicle to pedestrian communication and so on. In the computer vision domain, tracking is usually achieved by first detecting subjects with bounding boxes, then associating detected bounding boxes across video frames. For many IoT systems, images captured by cameras are usually sent over the network to be processed at a different site that has more powerful computing resources than edge devices. However, sending entire frames through the network causes significant bandwidth consumption that may exceed the system bandwidth constraints. To tackle this problem, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages the transformer's ability to better model long-term time series data. ViFiT is evaluated on the Vi-Fi Dataset, a large-scale multimodal dataset in 5 diverse real world scenes, including indoor and outdoor environments. To fill the gap of proper metrics of jointly capturing the system characteristics of both tracking quality and video bandwidth reduction, we propose a novel evaluation framework dubbed Minimum Required Frames (MRF) and Minimum Required Frames Ratio (MRFR). ViFiT achieves an MRFR of 0.65, outperforming the state-of-the-art cross-modal reconstruction approach based on an LSTM Encoder-Decoder architecture, X-Translator, which scores 0.98, and corresponding to a high frame reduction rate of 97.76%.
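The abstract introduces the MRF and MRFR metrics only at a high level. The sketch below is one plausible reading, not the paper's exact protocol: for each track, find the smallest number of camera frames the reconstructor needs before the recovered boxes meet an IoU threshold, then divide by the track length. The IoU threshold and the `reconstruct_with_frames` callable are assumptions made for illustration.

```python
# Hypothetical sketch of the MRF / MRFR idea described in the abstract.
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def minimum_required_frames(gt_boxes, reconstruct_with_frames, iou_thresh=0.5):
    """Smallest k such that reconstructing from k visual frames meets the IoU threshold."""
    n = len(gt_boxes)
    for k in range(1, n + 1):
        pred_boxes = reconstruct_with_frames(k)  # model output given k camera frames
        mean_iou = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / n
        if mean_iou >= iou_thresh:
            return k
    return n  # falls back to using every frame

def mrfr(tracks, reconstructors, iou_thresh=0.5):
    """Minimum Required Frames Ratio averaged over tracks (lower is better)."""
    ratios = [
        minimum_required_frames(gt, rec, iou_thresh) / len(gt)
        for gt, rec in zip(tracks, reconstructors)
    ]
    return sum(ratios) / len(ratios)
```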

Shielding the Unseen: Privacy Protection through Poisoning NeRF with Spatial Deformation

  • paper_url: http://arxiv.org/abs/2310.03125
  • repo_url: None
  • paper_authors: Yihan Wu, Brandon Y. Feng, Heng Huang
  • for: Safeguarding user privacy against the generative capabilities of Neural Radiance Fields (NeRF) models.
  • methods: A novel poisoning attack that introduces imperceptible changes to the observed views so that NeRF can no longer reconstruct the 3D scene accurately, implemented as a bi-level optimization algorithm incorporating a Projected Gradient Descent (PGD)-based spatial deformation.
  • results: Extensive tests on two common NeRF benchmark datasets covering 29 real-world scenes with high-quality images show that the privacy-preserving method significantly degrades NeRF performance and adapts to different perturbation strengths and NeRF architectures; the study highlights NeRF's potential privacy risks and the need to consider them when developing robust 3D scene reconstruction algorithms.
    Abstract In this paper, we introduce an innovative method of safeguarding user privacy against the generative capabilities of Neural Radiance Fields (NeRF) models. Our novel poisoning attack method induces changes to observed views that are imperceptible to the human eye, yet potent enough to disrupt NeRF's ability to accurately reconstruct a 3D scene. To achieve this, we devise a bi-level optimization algorithm incorporating a Projected Gradient Descent (PGD)-based spatial deformation. We extensively test our approach on two common NeRF benchmark datasets consisting of 29 real-world scenes with high-quality images. Our results compellingly demonstrate that our privacy-preserving method significantly impairs NeRF's performance across these benchmark datasets. Additionally, we show that our method is adaptable and versatile, functioning across various perturbation strengths and NeRF architectures. This work offers valuable insights into NeRF's vulnerabilities and emphasizes the need to account for such potential privacy risks when developing robust 3D scene reconstruction algorithms. Our study contributes to the larger conversation surrounding responsible AI and generative machine learning, aiming to protect user privacy and respect creative ownership in the digital age.
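The abstract describes a bi-level optimization with a PGD-based spatial deformation but gives no further detail. The sketch below shows only a generic inner PGD loop on a displacement field that warps a training view, under stated assumptions: the sign-ascent update, the epsilon-ball projection in normalized grid units, and the stand-in `loss_fn` (e.g. the reconstruction error of a surrogate scene model) are choices made here, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def pgd_spatial_deformation(image, loss_fn, steps=10, eps=0.01, step_size=0.0025):
    """One plausible inner loop of a PGD-style spatial deformation (a sketch).

    image:   (1, C, H, W) training view in [0, 1]
    loss_fn: callable mapping a warped image to a scalar objective we ascend;
             this stand-in is an assumption, not the paper's bi-level objective.
    eps:     maximum displacement, in normalized [-1, 1] grid units.
    """
    _, _, h, w = image.shape
    # Identity sampling grid in normalized coordinates, shape (1, H, W, 2).
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)

    flow = torch.zeros_like(base_grid, requires_grad=True)  # learned displacement field
    for _ in range(steps):
        warped = F.grid_sample(image, base_grid + flow, align_corners=True)
        loss = loss_fn(warped)
        grad, = torch.autograd.grad(loss, flow)
        with torch.no_grad():
            flow += step_size * grad.sign()   # ascend the objective
            flow.clamp_(-eps, eps)            # project back onto the eps-ball
    return F.grid_sample(image, base_grid + flow, align_corners=True).detach()
```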

Blind CT Image Quality Assessment Using DDPM-derived Content and Transformer-based Evaluator

  • paper_url: http://arxiv.org/abs/2310.03118
  • repo_url: None
  • paper_authors: Yongyi Shi, Wenjun Xia, Ge Wang, Xuanqin Mou
  • for: Lowering the radiation dose per view and using sparse views per scan are common CT acquisition modes, but they often introduce noise and streak artifacts that degrade image quality.
  • methods: A new blind image quality assessment (BIQA) method that mimics the internal generative mechanism (IGM) of the human visual system (HVS).
  • results: The method took second place in the MICCAI 2023 low-dose computed tomography perceptual image quality assessment grand challenge, and leveraging the DDPM-derived primary content further improves performance on the challenge dataset.
    Abstract Lowering radiation dose per view and utilizing sparse views per scan are two common CT scan modes, albeit often leading to distorted images characterized by noise and streak artifacts. Blind image quality assessment (BIQA) strives to evaluate perceptual quality in alignment with what radiologists perceive, which plays an important role in advancing low-dose CT reconstruction techniques. An intriguing direction involves developing BIQA methods that mimic the operational characteristic of the human visual system (HVS). The internal generative mechanism (IGM) theory reveals that the HVS actively deduces primary content to enhance comprehension. In this study, we introduce an innovative BIQA metric that emulates the active inference process of IGM. Initially, an active inference module, implemented as a denoising diffusion probabilistic model (DDPM), is constructed to anticipate the primary content. Then, the dissimilarity map is derived by assessing the interrelation between the distorted image and its primary content. Subsequently, the distorted image and dissimilarity map are combined into a multi-channel image, which is inputted into a transformer-based image quality evaluator. Remarkably, by exclusively utilizing this transformer-based quality evaluator, we won the second place in the MICCAI 2023 low-dose computed tomography perceptual image quality assessment grand challenge. Leveraging the DDPM-derived primary content, our approach further improves the performance on the challenge dataset.
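The abstract states that the distorted CT image is compared with the DDPM-anticipated primary content to obtain a dissimilarity map, and the three are combined into a multi-channel input for a transformer-based evaluator. A hedged sketch of that assembly step follows; using a plain absolute difference of min-max-normalized images as the dissimilarity measure is an assumption.

```python
import torch

def build_quality_input(distorted, primary):
    """Assemble the multi-channel input described in the abstract (a sketch).

    distorted, primary: (B, 1, H, W) CT slices; `primary` would come from the
    DDPM-based active-inference module. The absolute-difference dissimilarity
    is an assumption -- the paper only states that the map captures the
    interrelation between the two images.
    """
    def normalize(x):
        flat = x.flatten(1)
        lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
        hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
        return (x - lo) / (hi - lo + 1e-8)

    d, p = normalize(distorted), normalize(primary)
    dissimilarity = (d - p).abs()
    # Channels: distorted slice, anticipated primary content, dissimilarity map.
    return torch.cat([d, p, dissimilarity], dim=1)   # (B, 3, H, W)

# The (B, 3, H, W) tensor can then be fed to a transformer-based quality
# evaluator, e.g. a ViT-style regressor predicting a perceptual quality score.
```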

Reinforcement Learning-based Mixture of Vision Transformers for Video Violence Recognition

  • paper_url: http://arxiv.org/abs/2310.03108
  • repo_url: None
  • paper_authors: Hamid Mohammadi, Ehsan Nazerfard, Tahereh Firoozi
  • for: A deep-learning-based video violence recognition system that is both accurate and scalable.
  • methods: A transformer-based Mixture of Experts (MoE) video violence recognition system that intelligently combines large vision transformers with efficient transformer architectures, retaining the advantages of the vision transformer architecture while reducing the computational cost of using large vision transformers.
  • results: Experiments show the MoE architecture outperforms CNN-based models, reaching 92.4% accuracy on the RWF dataset.
    Abstract Video violence recognition based on deep learning concerns accurate yet scalable human violence recognition. Currently, most state-of-the-art video violence recognition studies use CNN-based models to represent and categorize videos. However, recent studies suggest that pre-trained transformers are more accurate than CNN-based models on various video analysis benchmarks. Yet these models are not thoroughly evaluated for video violence recognition. This paper introduces a novel transformer-based Mixture of Experts (MoE) video violence recognition system. Through an intelligent combination of large vision transformers and efficient transformer architectures, the proposed system not only takes advantage of the vision transformer architecture but also reduces the cost of utilizing large vision transformers. The proposed architecture maximizes violence recognition system accuracy while actively reducing computational costs through a reinforcement learning-based router. The empirical results show the proposed MoE architecture's superiority over CNN-based models by achieving 92.4% accuracy on the RWF dataset.

Creating an Atlas of Normal Tissue for Pruning WSI Patching Through Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.03106
  • repo_url: None
  • paper_authors: Peyman Nejat, Areej Alsaafin, Ghazal Alabtah, Nneka Comfere, Aaron Mangold, Dennis Murphree, Patricija Zot, Saba Yasir, Joaquin J. Garcia, H. R. Tizhoosh
  • for: Improving patch selection from whole slide images (WSIs) in computational pathology so that the selected patches are more representative for downstream tasks.
  • methods: An "atlas of normal tissue" is built from WSIs of normal tissue samples and used to eliminate the confounding, redundant normal histology, increasing the representativeness of the selected patches.
  • results: A normal atlas built from 107 normal skin WSIs was validated with 553 cutaneous squamous cell carcinoma WSIs and an external set of 451 breast WSIs; the number of selected patches was reduced by 30%-50% while maintaining the same indexing and search performance.
    Abstract Patching gigapixel whole slide images (WSIs) is an important task in computational pathology. Some methods have been proposed to select a subset of patches as WSI representation for downstream tasks. While most of the computational pathology tasks are designed to classify or detect the presence of pathological lesions in each WSI, the confounding role and redundant nature of normal histology in tissue samples are generally overlooked in WSI representations. In this paper, we propose and validate the concept of an "atlas of normal tissue" solely using samples of WSIs obtained from normal tissue biopsies. Such atlases can be employed to eliminate normal fragments of tissue samples and hence increase the representativeness of the collection of patches. We tested our proposed method by establishing a normal atlas using 107 normal skin WSIs and demonstrated how established indexes and search engines like Yottixel can be improved. We used 553 WSIs of cutaneous squamous cell carcinoma (cSCC) to show the advantage. We also validated our method applied to an external dataset of 451 breast WSIs. The number of selected WSI patches was reduced by 30% to 50% after utilizing the proposed normal atlas while maintaining the same indexing and search performance in leave-one-patient-out validation for both datasets. We show that the proposed normal atlas shows promise for unsupervised selection of the most representative patches of the abnormal/malignant WSI lesions.
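A hedged sketch of how a normal-tissue atlas could prune WSI patches: embed each candidate patch, measure its distance to the nearest atlas (normal) embedding, and discard patches that look too much like normal tissue. The embedding model, the cosine distance, the `normal_threshold` and the pruning cap are assumptions, not the paper's Yottixel-based pipeline.

```python
import numpy as np

def prune_with_normal_atlas(patch_embeddings, atlas_embeddings,
                            keep_ratio_min=0.5, normal_threshold=0.15):
    """Drop patches whose embeddings lie close to the normal-tissue atlas.

    patch_embeddings: (N, D) L2-normalized embeddings of candidate WSI patches
    atlas_embeddings: (M, D) L2-normalized embeddings built from normal WSIs
    A patch is treated as 'normal-looking' when its cosine distance to the
    nearest atlas entry falls below `normal_threshold` (an assumed value).
    """
    similarity = patch_embeddings @ atlas_embeddings.T     # (N, M)
    nearest_dist = 1.0 - similarity.max(axis=1)            # (N,)

    keep = nearest_dist > normal_threshold
    # Safety net: never prune more than half of the patches (assumption).
    if keep.sum() < keep_ratio_min * len(keep):
        order = np.argsort(-nearest_dist)                  # most abnormal first
        keep = np.zeros(len(keep), dtype=bool)
        keep[order[: int(keep_ratio_min * len(keep))]] = True
    return keep  # boolean mask over the N candidate patches
```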

Privacy-preserving Multi-biometric Indexing based on Frequent Binary Patterns

  • paper_url: http://arxiv.org/abs/2310.03091
  • repo_url: None
  • paper_authors: Daile Osorio-Roig, Lazaro J. Gonzalez-Soler, Christian Rathgeb, Christoph Busch
  • for: An efficient, privacy-preserving multi-biometric identification system that improves both the computational efficiency and the security of biometric identification.
  • methods: Multi-biometric classifiers and deep neural network (DNN)-based embedding extractors are combined with a multi-biometric binning strategy based on frequent binary patterns to optimize the efficiency and security of the identification system.
  • results: Experiments show the proposed system reduces the computational workload to approximately 57% (indexing three biometric characteristics) and 53% (indexing two biometric characteristics) while improving the security and performance of the baseline biometric system.
    Abstract The development of large-scale identification systems that ensure the privacy protection of enrolled subjects represents a major challenge. Biometric deployments that provide interoperability and usability by including efficient multi-biometric solutions are a recent requirement. In the context of privacy protection, several template protection schemes have been proposed in the past. However, these schemes seem inadequate for indexing (workload reduction) in biometric identification systems. More specifically, they have been used in identification systems that perform exhaustive searches, leading to a degradation of computational efficiency. To overcome these limitations, we propose an efficient privacy-preserving multi-biometric identification system that retrieves protected deep cancelable templates and is agnostic with respect to biometric characteristics and biometric template protection schemes. To this end, a multi-biometric binning scheme is designed to exploit the low intra-class variation properties contained in the frequent binary patterns extracted from different types of biometric characteristics. Experimental results reported on publicly available databases using state-of-the-art Deep Neural Network (DNN)-based embedding extractors show that the protected multi-biometric identification system can reduce the computational workload to approximately 57\% (indexing up to three types of biometric characteristics) and 53% (indexing up to two types of biometric characteristics), while simultaneously improving the biometric performance of the baseline biometric system at the high-security thresholds. The source code of the proposed multi-biometric indexing approach together with the composed multi-biometric dataset, will be made available to the research community once the article is accepted.
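A minimal sketch of binning with binary patterns, assuming sign-binarized DNN embeddings and a placeholder bit selection: enrolled templates are indexed by the pattern on a few selected bits, so a probe is compared only against the subjects in its bin. How the paper actually mines frequent, low-intra-class-variation patterns and fuses several biometric characteristics is not reproduced here.

```python
import numpy as np
from collections import defaultdict

def binarize(embeddings):
    """Sign-binarize DNN embeddings into binary codes (an assumed binarization)."""
    return (embeddings > 0).astype(np.uint8)               # (N, D) -> {0, 1}

def build_bins(codes, bit_positions):
    """Index enrolled templates by the binary pattern on the selected bits."""
    bins = defaultdict(list)
    for subject_id, code in enumerate(codes):
        bins[tuple(code[bit_positions])].append(subject_id)
    return bins

def retrieve_candidates(bins, probe_code, bit_positions):
    """Only templates sharing the probe's bin pattern need an exhaustive comparison."""
    return bins.get(tuple(probe_code[bit_positions]), [])

# Toy usage: `bit_positions` stands in for the frequent, stable binary patterns
# the paper mines from the embeddings (an assumption in this sketch).
rng = np.random.default_rng(0)
codes = binarize(rng.standard_normal((1000, 128)))
bit_positions = np.arange(8)                               # placeholder selection
bins = build_bins(codes, bit_positions)
candidates = retrieve_candidates(bins, codes[0], bit_positions)
print(f"search reduced from {len(codes)} to {len(candidates)} comparisons")
```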

Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.03020
  • repo_url: None
  • paper_authors: Jianglong Ye, Peng Wang, Kejie Li, Yichun Shi, Heng Wang
  • for: Zero-shot novel view synthesis: generating high-quality novel views from a single image while preserving 3D structural consistency.
  • methods: A generative framework with two stages: transforming observed regions to the novel view, then hallucinating unseen regions. A scene representation transformer and a view-conditioned diffusion model handle the two stages, with epipolar-guided attention and multi-view attention enforcing 3D consistency.
  • results: Qualitative and quantitative evaluations show the proposed mechanisms generate 3D-consistent zero-shot view synthesis results more effectively than existing methods. Project page: https://jianglongye.com/consistent123/
    Abstract Zero-shot novel view synthesis (NVS) from a single image is an essential problem in 3D object understanding. While recent approaches that leverage pre-trained generative models can synthesize high-quality novel views from in-the-wild inputs, they still struggle to maintain 3D consistency across different views. In this paper, we present Consistent-1-to-3, a generative framework that significantly mitigates this issue. Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions. We design a scene representation transformer and view-conditioned diffusion model for performing these two stages respectively. Inside the models, to enforce 3D consistency, we propose to employ epipolar-guided attention to incorporate geometry constraints, and multi-view attention to better aggregate multi-view information. Finally, we design a hierarchy generation paradigm to generate long sequences of consistent views, allowing a full 360 observation of the provided object image. Qualitative and quantitative evaluations over multiple datasets demonstrate the effectiveness of the proposed mechanisms against state-of-the-art approaches. Our project page is at https://jianglongye.com/consistent123/

Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

  • paper_url: http://arxiv.org/abs/2310.03015
  • repo_url: None
  • paper_authors: Yifan Jiang, Hao Tang, Jen-Hao Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao
  • for: novel view synthesis from a single image
  • methods: crafted timestep sampling strategy, superior 3D feature extractor, enhanced training scheme
  • results: reduced training time from 10 days to less than 1 day, significantly accelerating the training process
    Abstract The task of novel view synthesis aims to generate unseen perspectives of an object or scene from a limited set of input images. Nevertheless, synthesizing novel views from a single image still remains a significant challenge in the realm of computer vision. Previous approaches tackle this problem by adopting mesh prediction, multi-plain image construction, or more advanced techniques such as neural radiance fields. Recently, a pre-trained diffusion model that is specifically designed for 2D image synthesis has demonstrated its capability in producing photorealistic novel views, if sufficiently optimized on a 3D finetuning task. Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs. To tackle this issue, we propose Efficient-3DiM, a simple but effective framework to learn a single-image novel-view synthesizer. Motivated by our in-depth analysis of the inference process of diffusion models, we propose several pragmatic strategies to reduce the training overhead to a manageable scale, including a crafted timestep sampling strategy, a superior 3D feature extractor, and an enhanced training scheme. When combined, our framework is able to reduce the total training time from 10 days to less than 1 day, significantly accelerating the training process under the same computational platform (one instance with 8 Nvidia A100 GPUs). Comprehensive experiments are conducted to demonstrate the efficiency and generalizability of our proposed method.

Towards Domain-Specific Features Disentanglement for Domain Generalization

  • paper_url: http://arxiv.org/abs/2310.03007
  • repo_url: None
  • paper_authors: Hao Chen, Qi Zhang, Zenan Huang, Haobo Wang, Junbo Zhao
  • for: Tackling the distributional-shift problem faced by modern machine learning algorithms; domain generalization (DG) seeks patterns that hold across disparate distributions.
  • methods: CDDG, a novel contrastive-based disentanglement method that uses disentangled features to exploit the overlooked domain-specific features, thereby facilitating the extraction of the desired cross-domain category features for DG tasks.
  • results: Extensive experiments on multiple benchmark datasets show the method clearly outperforms other state-of-the-art approaches, and visualization evaluations confirm it achieves effective feature disentanglement.
    Abstract Distributional shift between domains poses great challenges to modern machine learning algorithms. The domain generalization (DG) signifies a popular line targeting this issue, where these methods intend to uncover universal patterns across disparate distributions. Noted, the crucial challenge behind DG is the existence of irrelevant domain features, and most prior works overlook this information. Motivated by this, we propose a novel contrastive-based disentanglement method CDDG, to effectively utilize the disentangled features to exploit the over-looked domain-specific features, and thus facilitating the extraction of the desired cross-domain category features for DG tasks. Specifically, CDDG learns to decouple inherent mutually exclusive features by leveraging them in the latent space, thus making the learning discriminative. Extensive experiments conducted on various benchmark datasets demonstrate the superiority of our method compared to other state-of-the-art approaches. Furthermore, visualization evaluations confirm the potential of our method in achieving effective feature disentanglement.

COOLer: Class-Incremental Learning for Appearance-Based Multiple Object Tracking

  • paper_url: http://arxiv.org/abs/2310.03006
  • repo_url: https://github.com/BoSmallEar/COOLer
  • paper_authors: Zhizheng Liu, Mattia Segu, Fisher Yu
  • for: Sequential learning over multiple tasks, enabling a model to incrementally learn new tasks without access to the training data of preceding tasks.
  • methods: COOLer, a contrastive- and continual-learning-based multiple object tracker that incrementally learns to track new categories while preserving the re-identification features of previous categories, trained on a combination of available ground-truth labels and pseudo-labels generated by the past tracker; a contrastive class-incremental instance representation learning technique further disentangles instance representations.
  • results: Experiments show COOLer continually learns to track new categories while effectively mitigating catastrophic forgetting of previous categories. Code is available at https://github.com/BoSmallEar/COOLer.
    Abstract Continual learning allows a model to learn multiple tasks sequentially while retaining the old knowledge without the training data of the preceding tasks. This paper extends the scope of continual learning research to class-incremental learning for multiple object tracking (MOT), which is desirable to accommodate the continuously evolving needs of autonomous systems. Previous solutions for continual learning of object detectors do not address the data association stage of appearance-based trackers, leading to catastrophic forgetting of previous classes' re-identification features. We introduce COOLer, a COntrastive- and cOntinual-Learning-based tracker, which incrementally learns to track new categories while preserving past knowledge by training on a combination of currently available ground truth labels and pseudo-labels generated by the past tracker. To further exacerbate the disentanglement of instance representations, we introduce a novel contrastive class-incremental instance representation learning technique. Finally, we propose a practical evaluation protocol for continual learning for MOT and conduct experiments on the BDD100K and SHIFT datasets. Experimental results demonstrate that COOLer continually learns while effectively addressing catastrophic forgetting of both tracking and detection. The code is available at https://github.com/BoSmallEar/COOLer.
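The abstract says COOLer trains on a combination of ground-truth labels for newly introduced classes and pseudo-labels produced by the past tracker for old classes. A small sketch of that label-merging step follows; the dictionary layout and the 0.7 confidence threshold are assumptions.

```python
def merge_labels(new_class_gt, old_tracker_outputs, score_thresh=0.7):
    """Combine ground truth for new classes with pseudo-labels for old ones.

    new_class_gt:        list of dicts {"box": ..., "category": ...} for classes
                         introduced in the current learning step (annotated).
    old_tracker_outputs: list of dicts {"box": ..., "category": ..., "score": ...}
                         predicted by the frozen tracker from the previous step.
    The 0.7 confidence threshold is an assumed value, not the paper's setting.
    """
    pseudo_labels = [
        {"box": det["box"], "category": det["category"], "is_pseudo": True}
        for det in old_tracker_outputs
        if det["score"] >= score_thresh
    ]
    gt_labels = [dict(ann, is_pseudo=False) for ann in new_class_gt]
    return gt_labels + pseudo_labels
```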

Reversing Deep Face Embeddings with Probable Privacy Protection

  • paper_url: http://arxiv.org/abs/2310.03005
  • repo_url: None
  • paper_authors: Daile Osorio-Roig, Paul A. Gerlitz, Christian Rathgeb, Christoph Busch
  • for: Evaluate the effectiveness of soft-biometric privacy-enhancement approaches in protecting face embeddings, and assess the vulnerability of state-of-the-art face embedding extractors to attacks that attempt to reconstruct the original face images.
  • methods: A well-known state-of-the-art face image reconstruction approach is applied to privacy-protected face embeddings, analysing the transformation complexity used for privacy protection.
  • results: Biometric privacy-enhanced face embeddings can be reconstructed with an accuracy of up to approximately 98%, depending on the complexity of the protection algorithm, suggesting that soft-biometric privacy enhancement alone may not be sufficient to ensure complete privacy protection for face embeddings.
    Abstract Generally, privacy-enhancing face recognition systems are designed to offer permanent protection of face embeddings. Recently, so-called soft-biometric privacy-enhancement approaches have been introduced with the aim of canceling soft-biometric attributes. These methods limit the amount of soft-biometric information (gender or skin-colour) that can be inferred from face embeddings. Previous work has underlined the need for research into rigorous evaluations and standardised evaluation protocols when assessing privacy protection capabilities. Motivated by this fact, this paper explores to what extent the non-invertibility requirement can be met by methods that claim to provide soft-biometric privacy protection. Additionally, a detailed vulnerability assessment of state-of-the-art face embedding extractors is analysed in terms of the transformation complexity used for privacy protection. In this context, a well-known state-of-the-art face image reconstruction approach has been evaluated on protected face embeddings to break soft biometric privacy protection. Experimental results show that biometric privacy-enhanced face embeddings can be reconstructed with an accuracy of up to approximately 98%, depending on the complexity of the protection algorithm.

Optimizing Key-Selection for Face-based One-Time Biometrics via Morphing

  • paper_url: http://arxiv.org/abs/2310.02997
  • repo_url: None
  • paper_authors: Daile Osorio-Roig, Mahdi Ghafourian, Christian Rathgeb, Ruben Vera-Rodriguez, Christoph Busch, Julian Fierrez
  • for: Improving the security of face recognition systems against adversarial attacks.
  • methods: Different key-selection strategies are proposed to improve the security of a cancelable scheme operating at the signal level.
  • results: Experiments show that certain key-selection strategies can completely block the adversarial attack at the most secure threshold, while at the most practical threshold the attack success chance can be reduced to approximately 5.0%.
    Abstract Nowadays, facial recognition systems are still vulnerable to adversarial attacks. These attacks vary from simple perturbations of the input image to modifying the parameters of the recognition model to impersonate an authorised subject. So-called privacy-enhancing facial recognition systems have been mostly developed to provide protection of stored biometric reference data, i.e. templates. In the literature, privacy-enhancing facial recognition approaches have focused solely on conventional security threats at the template level, ignoring the growing concern related to adversarial attacks. Up to now, few works have provided mechanisms to protect face recognition against adversarial attacks while maintaining high security at the template level. In this paper, we propose different key selection strategies to improve the security of a competitive cancelable scheme operating at the signal level. Experimental results show that certain strategies based on signal-level key selection can lead to complete blocking of the adversarial attack based on an iterative optimization for the most secure threshold, while for the most practical threshold, the attack success chance can be decreased to approximately 5.0%.

Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma

  • paper_url: http://arxiv.org/abs/2310.02972
  • repo_url: https://github.com/astarakee/segrap2023
  • paper_authors: Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu
  • for: Improving target segmentation in head-and-neck CT images to support more accurate radiotherapy treatment planning.
  • methods: A fully automatic framework with two models, one segmenting 45 organs at risk (OARs) and one segmenting two gross tumor volumes (GTVs); image volumes are preprocessed by harmonizing intensity distributions and automatically cropping around the target regions.
  • results: The method took second place in each task during the validation phase of the SegRap 2023 challenge. The framework is available at https://github.com/Astarakee/segrap2023.
    Abstract Target segmentation in CT images of Head&Neck (H&N) region is challenging due to low contrast between adjacent soft tissue. The SegRap 2023 challenge has been focused on benchmarking the segmentation algorithms of Nasopharyngeal Carcinoma (NPC) which would be employed as auto-contouring tools for radiation treatment planning purposes. We propose a fully-automatic framework and develop two models for a) segmentation of 45 Organs at Risk (OARs) and b) two Gross Tumor Volumes (GTVs). To this end, we preprocess the image volumes by harmonizing the intensity distributions and then automatically cropping the volumes around the target regions. The preprocessed volumes were employed to train a standard 3D U-Net model for each task, separately. Our method took second place for each of the tasks in the validation phase of the challenge. The proposed framework is available at https://github.com/Astarakee/segrap2023
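A minimal sketch of the preprocessing the abstract mentions, harmonizing intensity distributions and automatically cropping around the target region before training the 3D U-Nets. Percentile clipping, z-score normalization and the foreground threshold are assumptions standing in for the authors' actual harmonization and cropping.

```python
import numpy as np

def harmonize_and_crop(volume, low_pct=0.5, high_pct=99.5, margin=8):
    """Intensity harmonization + automatic cropping of a CT volume (D, H, W)."""
    lo, hi = np.percentile(volume, [low_pct, high_pct])
    vol = np.clip(volume, lo, hi).astype(np.float32)
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)        # z-score normalization

    # Crop around the body / target region: everything above a low threshold.
    foreground = vol > vol.min() + 0.1 * (vol.max() - vol.min())
    coords = np.argwhere(foreground)
    lo_idx = np.maximum(coords.min(axis=0) - margin, 0)
    hi_idx = np.minimum(coords.max(axis=0) + margin + 1, vol.shape)
    slices = tuple(slice(a, b) for a, b in zip(lo_idx, hi_idx))
    return vol[slices], slices  # cropped volume + where it came from
```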

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.02960
  • repo_url: https://github.com/yangcaoai/CoDA_NeurIPS2023
  • paper_authors: Yang Cao, Yihan Zeng, Hang Xu, Dan Xu
  • for: Open-vocabulary 3D object detection (OV-3DDet): detecting novel objects in a 3D scene from an arbitrary list of categories.
  • methods: A framework that localizes and classifies novel objects simultaneously: 3D box geometry priors from limited base categories and 2D semantic open-vocabulary priors are combined to generate pseudo box labels for novel objects, and a cross-modal alignment module based on the discovered novel boxes aligns the feature spaces of the point cloud and the image/text modalities.
  • results: On two challenging datasets (SUN-RGBD and ScanNet), the method localizes and classifies novel objects effectively, improving mAP by 80% over the best-performing alternative. Code and pre-trained models are released on the project page.
    Abstract Open-vocabulary 3D Object Detection (OV-3DDet) aims to detect objects from an arbitrary list of categories within a 3D scene, which remains seldom explored in the literature. There are primarily two fundamental problems in OV-3DDet, i.e., localizing and classifying novel objects. This paper aims at addressing the two problems simultaneously via a unified framework, under the condition of limited base categories. To localize novel 3D objects, we propose an effective 3D Novel Object Discovery strategy, which utilizes both the 3D box geometry priors and 2D semantic open-vocabulary priors to generate pseudo box labels of the novel objects. To classify novel object boxes, we further develop a cross-modal alignment module based on discovered novel boxes, to align feature spaces between 3D point cloud and image/text modalities. Specifically, the alignment process contains a class-agnostic and a class-discriminative alignment, incorporating not only the base objects with annotations but also the increasingly discovered novel objects, resulting in an iteratively enhanced alignment. The novel box discovery and crossmodal alignment are jointly learned to collaboratively benefit each other. The novel object discovery can directly impact the cross-modal alignment, while a better feature alignment can, in turn, boost the localization capability, leading to a unified OV-3DDet framework, named CoDA, for simultaneous novel object localization and classification. Extensive experiments on two challenging datasets (i.e., SUN-RGBD and ScanNet) demonstrate the effectiveness of our method and also show a significant mAP improvement upon the best-performing alternative method by 80%. Codes and pre-trained models are released on the project page.

Adaptive Landmark Color for AUV Docking in Visually Dynamic Environments

  • paper_url: http://arxiv.org/abs/2310.02944
  • repo_url: None
  • paper_authors: Corey Knutson, Zhipeng Cao, Junaed Sattar
  • for: Extending the mission time of autonomous underwater vehicles (AUVs) by providing a docking station where the AUV can recharge and receive updated mission information.
  • methods: Adaptive-color LED markers and dynamic color filtering improve landmark visibility across varying water conditions; both the AUV and the docking station use cameras to determine the water background color and compute the required marker color, with no communication needed between AUV and docking station.
  • results: Pool and lake experiments show the method performs 10 times better than static color-thresholding methods as the background color varies. In clear water, the docking station can be detected at a range of 5 meters with few false positives.
    Abstract Autonomous Underwater Vehicles (AUVs) conduct missions underwater without the need for human intervention. A docking station (DS) can extend mission times of an AUV by providing a location for the AUV to recharge its batteries and receive updated mission information. Various methods for locating and tracking a DS exist, but most rely on expensive acoustic sensors, or are vision-based, which is significantly affected by water quality. In this paper, we present a vision-based method that utilizes adaptive color LED markers and dynamic color filtering to maximize landmark visibility in varying water conditions. Both AUV and DS utilize cameras to determine the water background color in order to calculate the desired marker color. No communication between AUV and DS is needed to determine marker color. Experiments conducted in a pool and lake show our method performs 10 times better than static color thresholding methods as background color varies. DS detection is possible at a range of 5 meters in clear water with minimal false positives.
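A small sketch of the adaptive-marker-color idea: estimate the dominant water background color from a camera frame and pick the complementary hue as the LED color, so the AUV and the docking station can independently arrive at the same color without communicating. Using the mean background color and a 180-degree hue shift at full saturation is an assumption; the paper does not specify its color model.

```python
import colorsys
import numpy as np

def desired_marker_color(frame_rgb):
    """Pick an LED color that contrasts with the water background.

    frame_rgb: (H, W, 3) uint8 camera image dominated by open water.
    """
    mean_rgb = frame_rgb.reshape(-1, 3).mean(axis=0) / 255.0
    h, s, v = colorsys.rgb_to_hsv(*mean_rgb)
    opposite = colorsys.hsv_to_rgb((h + 0.5) % 1.0, 1.0, 1.0)  # complementary hue
    return tuple(int(round(c * 255)) for c in opposite)

# Because the AUV and the docking station observe roughly the same water color,
# both sides arrive at (approximately) the same marker color without any
# communication between them.
greenish_water = np.full((120, 160, 3), (20, 140, 120), dtype=np.uint8)
print(desired_marker_color(greenish_water))   # a reddish/magenta LED color
```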

Graph data modelling for outcome prediction in oropharyngeal cancer patients

  • paper_url: http://arxiv.org/abs/2310.02931
  • repo_url: None
  • paper_authors: Nithya Bhasker, Stefan Leger, Alexander Zwanenburg, Chethan Babu Reddy, Sebastian Bodenstedt, Steffen Löck, Stefanie Speidel
  • for: Predicting binary outcomes for oropharyngeal cancer (OPC) patients using computed tomography (CT)-based radiomic features.
  • methods: A patient hypergraph network (PHGN) trained in an inductive learning setup.
  • results: The model obtains good results and, compared with GNN and baseline linear models, the PHGN also performs better when extended to time-to-event analyses.
    Abstract Graph neural networks (GNNs) are becoming increasingly popular in the medical domain for the tasks of disease classification and outcome prediction. Since patient data is not readily available as a graph, most existing methods either manually define a patient graph, or learn a latent graph based on pairwise similarities between the patients. There are also hypergraph neural network (HGNN)-based methods that were introduced recently to exploit potential higher order associations between the patients by representing them as a hypergraph. In this work, we propose a patient hypergraph network (PHGN), which has been investigated in an inductive learning setup for binary outcome prediction in oropharyngeal cancer (OPC) patients using computed tomography (CT)-based radiomic features for the first time. Additionally, the proposed model was extended to perform time-to-event analyses, and compared with GNN and baseline linear models.

Computationally Efficient Quadratic Neural Networks

  • paper_url: http://arxiv.org/abs/2310.02901
  • repo_url: None
  • paper_authors: Mathew Mithra Noel, Venkataraman Muthiah-Nakarajan
  • for: Proposing higher-order artificial neurons whose outputs are computed by applying an activation function to a higher-order multinomial function of the inputs.
  • methods: Higher-order (quadratic) neurons whose output is an activation function applied to a higher-order polynomial of the inputs, giving decision boundaries that are general hyper-quadric surfaces rather than hyperplanes.
  • results: Higher-order neurons substantially increase learning capability; a reduced-parameter quadratic neural network needing only $n$ additional parameters per neuron is presented, and a final layer of quadratic neurons achieves higher accuracy on benchmark classification datasets with significantly fewer hidden-layer neurons.
    Abstract Higher order artificial neurons whose outputs are computed by applying an activation function to a higher order multinomial function of the inputs have been considered in the past, but did not gain acceptance due to the extra parameters and computational cost. However, higher order neurons have significantly greater learning capabilities since the decision boundaries of higher order neurons can be complex surfaces instead of just hyperplanes. The boundary of a single quadratic neuron can be a general hyper-quadric surface allowing it to learn many nonlinearly separable datasets. Since quadratic forms can be represented by symmetric matrices, only $\frac{n(n+1)}{2}$ additional parameters are needed instead of $n^2$. A quadratic Logistic regression model is first presented. Solutions to the XOR problem with a single quadratic neuron are considered. The complete vectorized equations for both forward and backward propagation in feedforward networks composed of quadratic neurons are derived. A reduced parameter quadratic neural network model with just $ n $ additional parameters per neuron that provides a compromise between learning ability and computational cost is presented. Comparison on benchmark classification datasets are used to demonstrate that a final layer of quadratic neurons enables networks to achieve higher accuracy with significantly fewer hidden layer neurons. In particular this paper shows that any dataset composed of $C$ bounded clusters can be separated with only a single layer of $C$ quadratic neurons.
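The abstract states that a quadratic neuron applies an activation to a quadratic form x^T W x + w^T x + b with symmetric W (hence n(n+1)/2 extra parameters) and that a single such neuron solves XOR. The worked sketch below verifies that claim with hand-picked weights; the logistic activation and the 0.5 decision threshold are choices made for this illustration.

```python
import numpy as np

def quadratic_neuron(x, W, w, b):
    """sigma(x^T W x + w^T x + b) with symmetric W (n(n+1)/2 free parameters)."""
    pre_activation = x @ W @ x + w @ x + b
    return 1.0 / (1.0 + np.exp(-pre_activation))        # logistic activation

# XOR with one quadratic neuron: x1 + x2 - 2*x1*x2 equals XOR on {0,1}^2,
# so W encodes the -2*x1*x2 cross term and w the linear part. The -0.5 bias
# centres the decision at sigma(.) = 0.5 (a choice made for this sketch).
W = np.array([[0.0, -1.0],
              [-1.0, 0.0]])          # x^T W x = -2 * x1 * x2
w = np.array([1.0, 1.0])
b = -0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    out = quadratic_neuron(np.array(x, dtype=float), W, w, b)
    print(x, int(out > 0.5))         # prints 0, 1, 1, 0 -- the XOR truth table
```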

Human-centric Behavior Description in Videos: New Benchmark and Model

  • paper_url: http://arxiv.org/abs/2310.02894
  • repo_url: None
  • paper_authors: Lingru Zhou, Yiqi Gao, Manqing Zhang, Peng Wu, Peng Wang, Yanning Zhang
  • for: Providing fine-grained descriptions of each individual's behavior to improve person identification and behavior analysis in video surveillance.
  • methods: A human-centric video captioning approach that captures each individual's behavior through fine-grained person-level descriptions.
  • results: The method achieves state-of-the-art results in describing each individual's behavior and can accurately associate individuals with their behaviors.
    Abstract In the domain of video surveillance, describing the behavior of each individual within the video is becoming increasingly essential, especially in complex scenarios with multiple individuals present. This is because describing each individual's behavior provides more detailed situational analysis, enabling accurate assessment and response to potential risks, ensuring the safety and harmony of public places. Currently, video-level captioning datasets cannot provide fine-grained descriptions for each individual's specific behavior. However, mere descriptions at the video-level fail to provide an in-depth interpretation of individual behaviors, making it challenging to accurately determine the specific identity of each individual. To address this challenge, we construct a human-centric video surveillance captioning dataset, which provides detailed descriptions of the dynamic behaviors of 7,820 individuals. Specifically, we have labeled several aspects of each person, such as location, clothing, and interactions with other elements in the scene, and these people are distributed across 1,012 videos. Based on this dataset, we can link individuals to their respective behaviors, allowing for further analysis of each person's behavior in surveillance videos. Besides the dataset, we propose a novel video captioning approach that can describe individual behavior in detail on a person-level basis, achieving state-of-the-art results. To facilitate further research in this field, we intend to release our dataset and code.

A Grammatical Compositional Model for Video Action Detection

  • paper_url: http://arxiv.org/abs/2310.02887
  • repo_url: None
  • paper_authors: Zhijun Zhang, Xu Zou, Jiahuan Zhou, Sheng Zhong, Ying Wu
  • for: Improving the accuracy of human action classification in videos by analysing the interactions between actors and objects.
  • methods: A Grammatical Compositional Model (GCM) based on typical And-Or graphs, combining the compositionality of grammar models with the rich feature expressiveness of deep neural networks.
  • results: Experiments show the GCM performs strongly on the AVA dataset and the Something-Else task, and an inference parsing procedure improves interpretability.
    Abstract Analysis of human actions in videos demands understanding complex human dynamics, as well as the interaction between actors and context. However, these interaction relationships usually exhibit large intra-class variations from diverse human poses or object manipulations, and fine-grained inter-class differences between similar actions. Thus the performance of existing methods is severely limited. Motivated by the observation that interactive actions can be decomposed into actor dynamics and participating objects or humans, we propose to investigate the composite property of them. In this paper, we present a novel Grammatical Compositional Model (GCM) for action detection based on typical And-Or graphs. Our model exploits the intrinsic structures and latent relationships of actions in a hierarchical manner to harness both the compositionality of grammar models and the capability of expressing rich features of DNNs. The proposed model can be readily embodied into a neural network module for efficient optimization in an end-to-end manner. Extensive experiments are conducted on the AVA dataset and the Something-Else task to demonstrate the superiority of our model, meanwhile the interpretability is enhanced through an inference parsing procedure.

Multi-Resolution Fusion for Fully Automatic Cephalometric Landmark Detection

  • paper_url: http://arxiv.org/abs/2310.02855
  • repo_url: None
  • paper_authors: Dongqian Guo, Wencheng Han
  • for: Accurate detection of cephalometric landmarks in lateral skull X-ray images.
  • methods: An image pyramid provides multiple resolutions as input to a series of models with different receptive fields, seeking the optimal feature combination for each landmark; several data augmentation techniques improve robustness across devices and measurement alternatives.
  • results: In the 2023 Cephalometric Landmark Detection in Lateral X-ray Images challenge, the method achieved a Mean Radial Error (MRE) of 1.62 mm and a 2.0 mm Success Detection Rate (SDR) of 74.18%.
    Abstract Cephalometric landmark detection on lateral skull X-ray images plays a crucial role in the diagnosis of certain dental diseases. Accurate and effective identification of these landmarks presents a significant challenge. Based on extensive data observations and quantitative analyses, we discovered that visual features from different receptive fields affect the detection accuracy of various landmarks differently. As a result, we employed an image pyramid structure, integrating multiple resolutions as input to train a series of models with different receptive fields, aiming to achieve the optimal feature combination for each landmark. Moreover, we applied several data augmentation techniques during training to enhance the model's robustness across various devices and measurement alternatives. We implemented this method in the Cephalometric Landmark Detection in Lateral X-ray Images 2023 Challenge and achieved a Mean Radial Error (MRE) of 1.62 mm and a Success Detection Rate (SDR) 2.0mm of 74.18% in the final testing phase.
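The reported numbers use the standard cephalometric evaluation metrics: Mean Radial Error (MRE), the mean Euclidean distance between predicted and ground-truth landmarks, and the Success Detection Rate (SDR), the percentage of landmarks within a given radius (here 2.0 mm). A short sketch, assuming coordinates already converted to millimetres:

```python
import numpy as np

def mre_and_sdr(pred_mm, gt_mm, sdr_radius_mm=2.0):
    """Mean Radial Error and Success Detection Rate for landmark predictions.

    pred_mm, gt_mm: (N_images, N_landmarks, 2) coordinates already scaled to
    millimetres (pixel spacing handled upstream -- an assumption here).
    """
    radial_error = np.linalg.norm(pred_mm - gt_mm, axis=-1)   # (N_images, N_landmarks)
    mre = radial_error.mean()
    sdr = (radial_error <= sdr_radius_mm).mean() * 100.0      # percentage
    return mre, sdr

# Toy check with a 0.5 mm systematic offset on every landmark:
gt = np.zeros((10, 19, 2))
pred = gt + 0.5
print(mre_and_sdr(pred, gt))   # MRE ~= 0.707 mm, SDR@2mm = 100.0
```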

Magicremover: Tuning-free Text-guided Image inpainting with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.02848
  • repo_url: None
  • paper_authors: Siyuan Yang, Lu Zhang, Liqian Ma, Yu Liu, JingJing Fu, You He
  • for: Filling missing pixel regions of an image with visually coherent and semantically plausible content.
  • methods: Tuning-free, text-guided image inpainting built on powerful diffusion models: an attention guidance strategy constrains the diffusion sampling process so that instructed areas are erased and occluded content is restored, and a classifier optimization algorithm improves denoising stability within fewer sampling steps.
  • results: Quantitative evaluation and a user study show MagicRemover clearly outperforms state-of-the-art methods for high-quality image inpainting. Code will be released at https://github.com/exisas/Magicremover.
    Abstract Image inpainting aims to fill in the missing pixels with visually coherent and semantically plausible content. Despite the great progress brought from deep generative models, this task still suffers from i. the difficulties in large-scale realistic data collection and costly model training; and ii. the intrinsic limitations in the traditionally user-defined binary masks on objects with unclear boundaries or transparent texture. In this paper, we propose MagicRemover, a tuning-free method that leverages the powerful diffusion models for text-guided image inpainting. We introduce an attention guidance strategy to constrain the sampling process of diffusion models, enabling the erasing of instructed areas and the restoration of occluded content. We further propose a classifier optimization algorithm to facilitate the denoising stability within less sampling steps. Extensive comparisons are conducted among our MagicRemover and state-of-the-art methods including quantitative evaluation and user study, demonstrating the significant improvement of MagicRemover on high-quality image inpainting. We will release our code at https://github.com/exisas/Magicremover.

Delving into CLIP latent space for Video Anomaly Recognition

  • paper_url: http://arxiv.org/abs/2310.02835
  • repo_url: https://github.com/luca-zanella-dvl/AnomalyCLIP
  • paper_authors: Luca Zanella, Benedetta Liberatori, Willi Menapace, Fabio Poiesi, Yiming Wang, Elisa Ricci
  • for: Detecting and recognising anomalies in surveillance videos at the frame level using only video-level supervision.
  • methods: AnomalyCLIP, a novel method that combines the Large Language and Vision (LLV) model CLIP with multiple instance learning for joint video anomaly detection and classification.
  • results: On three major anomaly detection benchmarks (ShanghaiTech, UCF-Crime, and XD-Violence), AnomalyCLIP outperforms the baselines at recognising video anomalies.
    Abstract We tackle the complex problem of detecting and recognising anomalies in surveillance videos at the frame level, utilising only video-level supervision. We introduce the novel method AnomalyCLIP, the first to combine Large Language and Vision (LLV) models, such as CLIP, with multiple instance learning for joint video anomaly detection and classification. Our approach specifically involves manipulating the latent CLIP feature space to identify the normal event subspace, which in turn allows us to effectively learn text-driven directions for abnormal events. When anomalous frames are projected onto these directions, they exhibit a large feature magnitude if they belong to a particular class. We also introduce a computationally efficient Transformer architecture to model short- and long-term temporal dependencies between frames, ultimately producing the final anomaly score and class prediction probabilities. We compare AnomalyCLIP against state-of-the-art methods considering three major anomaly detection benchmarks, i.e. ShanghaiTech, UCF-Crime, and XD-Violence, and empirically show that it outperforms baselines in recognising video anomalies.

All Sizes Matter: Improving Volumetric Brain Segmentation on Small Lesions

  • paper_url: http://arxiv.org/abs/2310.02829
  • repo_url: None
  • paper_authors: Ayhan Can Erdur, Daniel Scholz, Josef A. Buchner, Stephanie E. Combs, Daniel Rueckert, Jan C. Peeken
  • for: Improving the detection and segmentation of brain metastases, especially small lesions, so that the multiple malignant lesions in the brain can be mapped more accurately for stereotactic radiosurgery.
  • methods: An ensemble of neural networks combining a blob loss, a subtraction sequence between the T1 and T1 contrast-enhanced images, and models trained specifically on small lesions.
  • results: Experiments show the blob loss and the subtraction sequence improve segmentation, whereas including the specialized small-lesion models in the ensemble degrades it; domain-knowledge-inspired postprocessing steps further increase accuracy.
    Abstract Brain metastases (BMs) are the most frequently occurring brain tumors. The treatment of patients having multiple BMs with stereotactic radiosurgery necessitates accurate localization of the metastases. Neural networks can assist in this time-consuming and costly task that is typically performed by human experts. Particularly challenging is the detection of small lesions since they are often underrepresented in existing approaches. Yet, lesion detection is equally important for all sizes. In this work, we develop an ensemble of neural networks explicitly focused on detecting and segmenting small BMs. To accomplish this task, we trained several neural networks focusing on individual aspects of the BM segmentation problem: We use blob loss that specifically addresses the imbalance of lesion instances in terms of size and texture and is, therefore, not biased towards larger lesions. In addition, a model using a subtraction sequence between the T1 and T1 contrast-enhanced sequence focuses on low-contrast lesions. Furthermore, we train additional models only on small lesions. Our experiments demonstrate the utility of the additional blob loss and the subtraction sequence. However, including the specialized small lesion models in the ensemble deteriorates segmentation results. We also find domain-knowledge-inspired postprocessing steps to drastically increase our performance in most experiments. Our approach enables us to submit a competitive challenge entry to the ASNR-MICCAI BraTS Brain Metastasis Challenge 2023.
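The abstract credits a blob loss that weights lesion instances equally regardless of size, plus a T1ce - T1 subtraction sequence as an additional input. Below is a simplified, hedged sketch of an instance-wise (blob-style) Dice loss: connected components of the ground truth are scored separately and averaged, so small metastases are not drowned out by large ones. The masking of other lesions and the smoothing constant are assumptions rather than the exact published blob loss.

```python
import numpy as np
from scipy import ndimage

def blob_style_dice_loss(pred_prob, gt_mask, smooth=1e-5):
    """Instance-wise Dice loss: one term per connected lesion, then averaged.

    pred_prob: (D, H, W) predicted foreground probabilities
    gt_mask:   (D, H, W) binary ground truth
    """
    labeled, n_lesions = ndimage.label(gt_mask)
    if n_lesions == 0:
        return 0.0
    losses = []
    for lesion_id in range(1, n_lesions + 1):
        lesion = (labeled == lesion_id).astype(np.float32)
        # Mask out *other* lesions so each term only judges this instance.
        other = np.logical_and(gt_mask > 0, labeled != lesion_id)
        pred = np.where(other, 0.0, pred_prob)
        inter = (pred * lesion).sum()
        dice = (2.0 * inter + smooth) / (pred.sum() + lesion.sum() + smooth)
        losses.append(1.0 - dice)
    return float(np.mean(losses))

# The T1ce - T1 "subtraction sequence" mentioned in the abstract can simply be
# stacked as an extra input channel: np.stack([t1, t1ce, t1ce - t1], axis=0).
```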

CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity

  • paper_url: http://arxiv.org/abs/2310.02815
  • repo_url: https://github.com/MasterHow/CoBEV
  • paper_authors: Hao Shi, Chengshan Pang, Jiaming Zhang, Kailun Yang, Yuhao Wu, Huajian Ni, Yining Lin, Rainer Stiefelhagen, Kaiwei Wang
  • for: Improving the accuracy and robustness of roadside camera-based 3D object detection, extending the perception range beyond vision-centric vehicles and enhancing road safety.
  • methods: Complementary-BEV (CoBEV), a novel end-to-end BEV detection framework that integrates depth and height to construct robust BEV representations, fused laterally with the newly proposed two-stage complementary feature selection (CFS) module; an integrated BEV feature distillation framework further improves detection accuracy.
  • results: Extensive experiments on the public DAIR-V2X-I and Rope3D benchmarks and the private Supremind-Road dataset show that CoBEV not only sets a new state of the art in accuracy but is also significantly more robust in long-distance scenarios and under camera disturbance, and generalizes much better across scenes and camera parameters; the vehicle AP score on DAIR-V2X-I reaches 80% in easy mode.
    Abstract Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems, which extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. While previous studies have limitations in using only depth or height information, we find both depth and height matter and they are in fact complementary. The depth feature encompasses precise geometric cues, whereas the height feature is primarily focused on distinguishing between various categories of height intervals, essentially providing semantic context. This insight motivates the development of Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distribution and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also seamlessly integrated to further enhance the detection accuracy from the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public 3D detection benchmarks of roadside camera-based DAIR-V2X-I and Rope3D, as well as the private Supremind-Road dataset, demonstrating that CoBEV not only achieves the accuracy of the new state-of-the-art, but also significantly advances the robustness of previous methods in challenging long-distance scenarios and noisy camera disturbance, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in terms of easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
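To make the depth branch concrete, the toy below shows the standard way a per-pixel depth distribution lifts camera features into a 3D frustum (Lift-Splat style); CoBEV additionally predicts a height distribution and fuses both branches with its CFS module, which is omitted here. Shapes and names are illustrative assumptions, not the paper's configuration.

```python
import torch

B, C, H, W, D = 2, 64, 32, 88, 48            # batch, channels, feature map size, depth bins
feat = torch.randn(B, C, H, W)                # per-pixel camera features
depth_logits = torch.randn(B, D, H, W)        # predicted per-pixel depth distribution
depth_prob = depth_logits.softmax(dim=1)      # categorical distribution over depth bins

# Outer product: every pixel's feature vector is spread along its depth hypotheses,
# producing a (B, C, D, H, W) frustum that can be splatted/pooled into a BEV grid.
frustum = feat.unsqueeze(2) * depth_prob.unsqueeze(1)
print(frustum.shape)                          # torch.Size([2, 64, 48, 32, 88])
```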

Tracking Anything in Heart All at Once

  • paper_url: http://arxiv.org/abs/2310.02792
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Chengkang Shen, Hao Zhu, You Zhou, Yu Liu, Si Yi, Lili Dong, Weipeng Zhao, David J. Brady, Xun Cao, Zhan Ma, Yi Lin
  • for: Improving the accuracy and efficiency of myocardial motion tracking in cardiac imaging, toward early detection and prevention of cardiovascular diseases (CVDs).
  • methods: The Neural Cardiac Motion Field (NeuralCMF) uses an implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart; it needs no paired datasets, and its optimization is self-supervised through physics priors in both space and time, making it compatible with 2D and 3D echocardiogram inputs (a toy INR sketch follows this entry).
  • results: Experimental validation on three representative datasets demonstrates the robustness and innovative nature of NeuralCMF, with significant advantages over existing state-of-the-art methods in cardiac imaging and motion tracking.
    Abstract Myocardial motion tracking stands as an essential clinical tool in the prevention and detection of Cardiovascular Diseases (CVDs), the foremost cause of death globally. However, current techniques suffer incomplete and inaccurate motion estimation of the myocardium both in spatial and temporal dimensions, hindering the early identification of myocardial dysfunction. In addressing these challenges, this paper introduces the Neural Cardiac Motion Field (NeuralCMF). NeuralCMF leverages the implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart. This approach offers memory-efficient storage and continuous capability to query the precise shape and motion of the myocardium throughout the cardiac cycle at any specific point. Notably, NeuralCMF operates without the need for paired datasets, and its optimization is self-supervised through the physics knowledge priors both in space and time dimensions, ensuring compatibility with both 2D and 3D echocardiogram video inputs. Experimental validations across three representative datasets support the robustness and innovative nature of the NeuralCMF, marking significant advantages over existing state-of-the-arts in cardiac imaging and motion tracking.
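A minimal sketch of querying an implicit neural representation at continuous (x, y, z, t) coordinates for an intensity value plus 6D forward/backward motion is given below; the network size, input encoding, and training losses are assumptions for illustration and not the NeuralCMF architecture.

```python
import torch
import torch.nn as nn

class CardiacMotionField(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 6),            # 1 intensity + 3D forward + 3D backward motion
        )

    def forward(self, xyzt: torch.Tensor):
        out = self.net(xyzt)
        intensity, fwd, bwd = out[..., :1], out[..., 1:4], out[..., 4:7]
        return intensity, fwd, bwd

field = CardiacMotionField()
query = torch.rand(1024, 4)                      # continuous (x, y, z, t) sample points
intensity, motion_fwd, motion_bwd = field(query)
print(intensity.shape, motion_fwd.shape)         # torch.Size([1024, 1]) torch.Size([1024, 3])
```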

LROC-PANGU-GAN: Closing the Simulation Gap in Learning Crater Segmentation with Planetary Simulators

  • paper_url: http://arxiv.org/abs/2310.02781
  • repo_url: None
  • paper_authors: Jaewon La, Jaime Phadke, Matt Hutton, Marius Schwinning, Gabriele De Canio, Florian Renk, Lars Kunze, Matthew Gadd
  • for: Probes landing on foreign planetary bodies must robustly identify and avoid hazards such as steep cliffs or deep craters, which pose significant risks to landing and operation.
  • methods: A deep learning pipeline in which a CycleGAN is trained to synthesize LROC-like images from simulated PANGU renderings, so that crater segmentation models can be trained on realistic imagery while keeping the simulator's perfect labels.
  • results: Training the downstream crater segmentation network on the synthesized images improves segmentation performance on a test set of real LROC images compared with using simulated PANGU images alone.
    Abstract It is critical for probes landing on foreign planetary bodies to be able to robustly identify and avoid hazards - as, for example, steep cliffs or deep craters can pose significant risks to a probe's landing and operational success. Recent applications of deep learning to this problem show promising results. These models are, however, often learned with explicit supervision over annotated datasets. These human-labelled crater databases, such as from the Lunar Reconnaissance Orbiter Camera (LROC), may lack in consistency and quality, undermining model performance - as incomplete and/or inaccurate labels introduce noise into the supervisory signal, which encourages the model to learn incorrect associations and results in the model making unreliable predictions. Physics-based simulators, such as the Planet and Asteroid Natural Scene Generation Utility, have, in contrast, perfect ground truth, as the internal state that they use to render scenes is known with exactness. However, they introduce a serious simulation-to-real domain gap - because of fundamental differences between the simulated environment and the real-world arising from modelling assumptions, unaccounted for physical interactions, environmental variability, etc. Therefore, models trained on their outputs suffer when deployed in the face of realism they have not encountered in their training data distributions. In this paper, we therefore introduce a system to close this "realism" gap while retaining label fidelity. We train a CycleGAN model to synthesise LROC from Planet and Asteroid Natural Scene Generation Utility (PANGU) images. We show that these improve the training of a downstream crater segmentation network, with segmentation performance on a test set of real LROC images improved as compared to using only simulated PANGU images.

Dynamic Shuffle: An Efficient Channel Mixture Method

  • paper_url: http://arxiv.org/abs/2310.02776
  • repo_url: None
  • paper_authors: Kaijun Gong, Zhuowen Yin, Yushu Li, Kailing Guo, Xiangmin Xu
  • for: Improving the performance of ShuffleNet-style architectures by reducing data-dependent redundancy in channel mixing.
  • methods: A dynamic shuffle module generates data-dependent permutation matrices; channels are divided into groups, two small shared permutation matrices per group are combined through Kronecker products and cross-group shuffling, and softmax, orthogonal regularization, and binarization make the generation learnable (see the sketch after this entry).
  • results: On the image classification benchmarks CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet, the method significantly improves ShuffleNet performance with negligible extra computation, and the proposed static-dynamic-shuffle variant can serve as a lightweight replacement for ordinary pointwise convolution.
    Abstract The redundancy of Convolutional neural networks not only depends on weights but also depends on inputs. Shuffling is an efficient operation for mixing channel information but the shuffle order is usually pre-defined. To reduce the data-dependent redundancy, we devise a dynamic shuffle module to generate data-dependent permutation matrices for shuffling. Since the dimension of the permutation matrix is proportional to the square of the number of input channels, to make the generation process efficient, we divide the channels into groups and generate two shared small permutation matrices for each group, and utilize Kronecker product and cross group shuffle to obtain the final permutation matrices. To make the generation process learnable, based on theoretical analysis, softmax, orthogonal regularization, and binarization are employed to asymptotically approximate the permutation matrix. Dynamic shuffle adaptively mixes channel information with negligible extra computation and memory occupancy. Experimental results on image classification benchmark datasets CIFAR-10, CIFAR-100, Tiny ImageNet and ImageNet have shown that our method significantly increases ShuffleNets' performance. Combining the dynamically generated matrix with a learnable static matrix, we further propose static-dynamic-shuffle and show that it can serve as a lightweight replacement of ordinary pointwise convolution.
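The Kronecker-product construction of a large channel permutation from two small per-group matrices can be illustrated as follows. In the paper the small matrices are generated from the input and relaxed with softmax, orthogonal regularization, and binarization; the sketch below uses fixed random permutations purely to show the mechanics.

```python
import torch

def random_permutation_matrix(n: int) -> torch.Tensor:
    return torch.eye(n)[torch.randperm(n)]

g1, g2 = 4, 8                                   # group sizes; channels = g1 * g2
P1 = random_permutation_matrix(g1)              # (4, 4)
P2 = random_permutation_matrix(g2)              # (8, 8)
P = torch.kron(P1, P2)                          # (32, 32) permutation, still orthogonal

x = torch.randn(2, g1 * g2, 16, 16)             # (batch, channels, H, W)
x_shuffled = torch.einsum('oc,bchw->bohw', P, x)

# Sanity check: a permutation only reorders channels, so sorted channel norms match.
norms = x.flatten(2).norm(dim=2)
norms_shuffled = x_shuffled.flatten(2).norm(dim=2)
assert torch.allclose(norms.sort(dim=1).values, norms_shuffled.sort(dim=1).values, atol=1e-5)
```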

SHOT: Suppressing the Hessian along the Optimization Trajectory for Gradient-Based Meta-Learning

  • paper_url: http://arxiv.org/abs/2310.02751
  • repo_url: None
  • paper_authors: JunHoo Lee, Jayeon Yoo, Nojun Kwak
  • for: Examining the hypothesis that gradient-based meta-learning (GBML) implicitly suppresses the Hessian along the inner-loop optimization trajectory, and proposing SHOT (Suppressing the Hessian along the Optimization Trajectory) to exploit this explicitly.
  • methods: SHOT minimizes the distance between the parameters of the target and reference models during the inner loop, adds little computation over the baseline, and is agnostic to the GBML algorithm and architecture (a minimal sketch follows this entry).
  • results: Experiments on standard few-shot learning tasks confirm the hypothesis empirically and show that SHOT outperforms the corresponding baselines.
    Abstract In this paper, we hypothesize that gradient-based meta-learning (GBML) implicitly suppresses the Hessian along the optimization trajectory in the inner loop. Based on this hypothesis, we introduce an algorithm called SHOT (Suppressing the Hessian along the Optimization Trajectory) that minimizes the distance between the parameters of the target and reference models to suppress the Hessian in the inner loop. Despite dealing with high-order terms, SHOT does not increase the computational complexity of the baseline model much. It is agnostic to both the algorithm and architecture used in GBML, making it highly versatile and applicable to any GBML baseline. To validate the effectiveness of SHOT, we conduct empirical tests on standard few-shot learning tasks and qualitatively analyze its dynamics. We confirm our hypothesis empirically and demonstrate that SHOT outperforms the corresponding baseline. Code is available at: https://github.com/JunHoo-Lee/SHOT
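A minimal sketch of the kind of inner-loop penalty SHOT motivates: adapt a model on a support set while penalizing the squared distance between its parameters and a frozen reference (here simply the pre-adaptation weights). Learning rates, step counts, and the penalty weight are illustrative assumptions, not the paper's settings.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(8, 3)                        # stand-in for a meta-learned backbone
reference = copy.deepcopy(model)               # frozen reference (pre-adaptation weights)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.05)
lam = 0.1                                      # strength of the suppression penalty

x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
for _ in range(5):                             # inner-loop adaptation on the support set
    task_loss = F.cross_entropy(model(x), y)
    dist = sum((p - q.detach()).pow(2).sum()
               for p, q in zip(model.parameters(), reference.parameters()))
    loss = task_loss + lam * dist
    inner_opt.zero_grad()
    loss.backward()
    inner_opt.step()
```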

Condition numbers in multiview geometry, instability in relative pose estimation, and RANSAC

  • paper_url: http://arxiv.org/abs/2310.02719
  • repo_url: None
  • paper_authors: Hongyi Fan, Joe Kileel, Benjamin Kimia
  • for: Providing a general framework for analyzing the numerical conditioning of minimal problems in multiview geometry, using tools from computational algebra and Riemannian geometry.
  • methods: The framework characterizes the instabilities of the 5- and 7-point minimal problems, both in terms of world scenes that lead to infinite condition number and directly in terms of ill-conditioned image data, and yields computational tests for assessing the condition number before solving the minimal problem.
  • results: Relative pose estimation with standard 5- or 7-point RANSAC can fail even without outliers and with enough data, due to the intrinsic instability of these problems; synthetic and real experiments suggest that RANSAC serves not only to remove outliers but also to select well-conditioned image data, as the theory predicts.
    Abstract In this paper we introduce a general framework for analyzing the numerical conditioning of minimal problems in multiple view geometry, using tools from computational algebra and Riemannian geometry. Special motivation comes from the fact that relative pose estimation, based on standard 5-point or 7-point Random Sample Consensus (RANSAC) algorithms, can fail even when no outliers are present and there is enough data to support a hypothesis. We argue that these cases arise due to the intrinsic instability of the 5- and 7-point minimal problems. We apply our framework to characterize the instabilities, both in terms of the world scenes that lead to infinite condition number, and directly in terms of ill-conditioned image data. The approach produces computational tests for assessing the condition number before solving the minimal problem. Lastly synthetic and real data experiments suggest that RANSAC serves not only to remove outliers, but also to select for well-conditioned image data, as predicted by our theory.
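The conditioning tests the framework motivates boil down to examining the singular values of a local Jacobian; a toy version is sketched below, with a random Jacobian standing in for the constraint equations of an actual 5- or 7-point problem.

```python
import numpy as np

def condition_number(jacobian: np.ndarray) -> float:
    s = np.linalg.svd(jacobian, compute_uv=False)
    if s[-1] < 1e-15:                          # rank-deficient: effectively infinite conditioning
        return np.inf
    return float(s[0] / s[-1])

rng = np.random.default_rng(0)
J = rng.standard_normal((9, 5))                # e.g. 9 constraint rows, 5 unknowns (toy)
print(f"condition number: {condition_number(J):.2e}")
```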

Understanding Pan-Sharpening via Generalized Inverse

  • paper_url: http://arxiv.org/abs/2310.02718
  • repo_url: None
  • paper_authors: Shiqi Liu, Yutong Bai, Xinyang Han, Alan Yuille
  • for: Casting pan-sharpening, which fuses a panchromatic image with a multispectral image to obtain a high-spatial, high-spectral image, as a simple matrix equation and understanding it through generalized inverse theory.
  • methods: The solution-existence condition and the acquisition of spectral and spatial resolution are discussed, a down-sampling enhancement method is introduced for better estimating the spatial and spectral down-sampling matrices, and two general inverse matrix formulations are derived that correspond to the component substitution and multi-resolution analysis families; Gram Schmidt Adaptive (GSA) is shown to follow the component-substitution form.
  • results: In both synthetic and real experiments the proposed methods are qualitatively better and sharper than other methods, and the down-sampling enhancement yields clearly better results quantitatively and qualitatively on real data.
    Abstract Pan-sharpening algorithm utilizes panchromatic image and multispectral image to obtain a high spatial and high spectral image. However, the optimizations of the algorithms are designed with different standards. We adopt the simple matrix equation to describe the Pan-sharpening problem. The solution existence condition and the acquirement of spectral and spatial resolution are discussed. A down-sampling enhancement method was introduced for better acquiring the spatial and spectral down-sample matrices. By the generalized inverse theory, we derived two forms of general inverse matrix formulations that can correspond to the two prominent classes of Pan-sharpening methods, that is, component substitution and multi-resolution analysis methods. Specifically, the Gram Schmidt Adaptive(GSA) was proved to follow the general inverse matrix formulation of component substitution. A model prior to the general inverse matrix of the spectral function was rendered. The theoretical errors are analyzed. Synthetic experiments and real data experiments are implemented. The proposed methods are better and sharper than other methods qualitatively in both synthetic and real experiments. The down-sample enhancement effect is shown of better results both quantitatively and qualitatively in real experiments. The generalized inverse matrix theory help us better understand the Pan-sharpening.
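A 1-D toy of the matrix view of pan-sharpening: stack the spatial down-sampling of each band and the spectral averaging into a single observation operator and invert it with a Moore-Penrose pseudoinverse. Sizes and operators are illustrative assumptions; the paper derives its generalized-inverse formulations analytically for the component-substitution and multi-resolution-analysis families rather than via a numerical pinv.

```python
import numpy as np

rng = np.random.default_rng(0)
bands, hr, lr = 4, 16, 8                        # spectral bands, high-res and low-res widths (1-D toy)
x_true = rng.random(bands * hr)                 # unknown high-resolution multispectral signal

D = np.kron(np.eye(bands), np.repeat(np.eye(lr), 2, axis=1) / 2.0)  # 2x spatial downsample per band
S = np.kron(np.ones((1, bands)) / bands, np.eye(hr))                # spectral average -> panchromatic
H = np.vstack([D, S])                           # stacked observation operator
y = H @ x_true                                  # observed low-res MS + high-res PAN

x_hat = np.linalg.pinv(H) @ y                   # generalized-inverse reconstruction
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```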

GETAvatar: Generative Textured Meshes for Animatable Human Avatars

  • paper_url: http://arxiv.org/abs/2310.02714
  • repo_url: https://github.com/magic-research/GETAvatar
  • paper_authors: Xuanmeng Zhang, Jianfeng Zhang, Rohan Chacko, Hongyi Xu, Guoxian Song, Yi Yang, Jiashi Feng
  • for: 3D-aware full-body human generation: creating animatable human avatars with high-quality textures and geometries.
  • methods: GETAvatar, a generative model that directly produces explicit textured 3D meshes for animatable avatars; an articulated 3D human representation with explicit surface modeling is enriched with realistic surface details learned from 2D normal maps of 3D scan data, and a rasterization-based renderer enables efficient high-resolution surface rendering.
  • results: State-of-the-art appearance and geometry quality in 3D-aware human generation, generating 512x512 images at 17 FPS and 1024x1024 images at 14 FPS, a 2x improvement over previous methods.
    Abstract We study the problem of 3D-aware full-body human generation, aiming at creating animatable human avatars with high-quality textures and geometries. Generally, two challenges remain in this field: i) existing methods struggle to generate geometries with rich realistic details such as the wrinkles of garments; ii) they typically utilize volumetric radiance fields and neural renderers in the synthesis process, making high-resolution rendering non-trivial. To overcome these problems, we propose GETAvatar, a Generative model that directly generates Explicit Textured 3D meshes for animatable human Avatar, with photo-realistic appearance and fine geometric details. Specifically, we first design an articulated 3D human representation with explicit surface modeling, and enrich the generated humans with realistic surface details by learning from the 2D normal maps of 3D scan data. Second, with the explicit mesh representation, we can use a rasterization-based renderer to perform surface rendering, allowing us to achieve high-resolution image generation efficiently. Extensive experiments demonstrate that GETAvatar achieves state-of-the-art performance on 3D-aware human generation both in appearance and geometry quality. Notably, GETAvatar can generate images at 512x512 resolution with 17FPS and 1024x1024 resolution with 14FPS, improving upon previous methods by 2x. Our code and models will be available.

Multi-Dimension-Embedding-Aware Modality Fusion Transformer for Psychiatric Disorder Classification

  • paper_url: http://arxiv.org/abs/2310.02690
  • repo_url: None
  • paper_authors: Guoxin Wang, Xuyang Cao, Shan An, Fengmei Fan, Chao Zhang, Jinsong Wang, Feng Yu, Zhiren Wang
  • for: Diagnosing psychiatric disorders (schizophrenia and bipolar disorder) by combining deep learning with neuroimaging.
  • methods: A multi-dimension-embedding-aware modality fusion transformer (MFFormer) takes 2D time series from resting-state fMRI (rs-fMRI) and 3D volumes from T1-weighted structural MRI (T1w sMRI); a fusion transformer module (FTM) promotes intra-modality attention and cross-modal information fusion, and a dimension-up/dimension-down strategy aligns feature maps of different dimensionality.
  • results: On a private dataset and the public OpenfMRI data, MFFormer diagnoses schizophrenia and bipolar disorder better than single-modality or other multi-modality MRI baselines.
    Abstract Deep learning approaches, together with neuroimaging techniques, play an important role in psychiatric disorders classification. Previous studies on psychiatric disorders diagnosis mainly focus on using functional connectivity matrices of resting-state functional magnetic resonance imaging (rs-fMRI) as input, which still needs to fully utilize the rich temporal information of the time series of rs-fMRI data. In this work, we proposed a multi-dimension-embedding-aware modality fusion transformer (MFFormer) for schizophrenia and bipolar disorder classification using rs-fMRI and T1 weighted structural MRI (T1w sMRI). Concretely, to fully utilize the temporal information of rs-fMRI and spatial information of sMRI, we constructed a deep learning architecture that takes as input 2D time series of rs-fMRI and 3D volumes T1w. Furthermore, to promote intra-modality attention and information fusion across different modalities, a fusion transformer module (FTM) is designed through extensive self-attention of hybrid feature maps of multi-modality. In addition, a dimension-up and dimension-down strategy is suggested to properly align feature maps of multi-dimensional from different modalities. Experimental results on our private and public OpenfMRI datasets show that our proposed MFFormer performs better than that using a single modality or multi-modality MRI on schizophrenia and bipolar disorder diagnosis.

PostRainBench: A comprehensive benchmark and a new model for precipitation forecasting

  • paper_url: http://arxiv.org/abs/2310.02676
  • repo_url: https://github.com/yyyujintang/PostRainBench
  • paper_authors: Yujin Tang, Jiaming Zhou, Xiang Pan, Zeying Gong, Junwei Liang
  • for: Improving the accuracy of precipitation forecasting, especially for extreme rainfall events.
  • methods: AI-based post-processing is coupled with traditional Numerical Weather Prediction (NWP); the work introduces PostRainBench, a multi-variable NWP post-processing benchmark of three datasets, and CAMT, a Channel Attention Enhanced Multi-task Learning framework with a specially designed weighted loss function (a simplified weighted-loss sketch follows this entry).
  • results: CAMT outperforms state-of-the-art methods by 6.3%, 4.7%, and 26.8% in rain CSI on the three datasets, and is the first deep-learning method to beat NWP under extreme precipitation, improving heavy-rain CSI over NWP by 15.6%, 17.4%, and 31.8%.
    Abstract Accurate precipitation forecasting is a vital challenge of both scientific and societal importance. Data-driven approaches have emerged as a widely used solution for addressing this challenge. However, solely relying on data-driven approaches has limitations in modeling the underlying physics, making accurate predictions difficult. Coupling AI-based post-processing techniques with traditional Numerical Weather Prediction (NWP) methods offers a more effective solution for improving forecasting accuracy. Despite previous post-processing efforts, accurately predicting heavy rainfall remains challenging due to the imbalanced precipitation data across locations and complex relationships between multiple meteorological variables. To address these limitations, we introduce the PostRainBench, a comprehensive multi-variable NWP post-processing benchmark consisting of three datasets for NWP post-processing-based precipitation forecasting. We propose CAMT, a simple yet effective Channel Attention Enhanced Multi-task Learning framework with a specially designed weighted loss function. Its flexible design allows for easy plug-and-play integration with various backbones. Extensive experimental results on the proposed benchmark show that our method outperforms state-of-the-art methods by 6.3%, 4.7%, and 26.8% in rain CSI on the three datasets respectively. Most notably, our model is the first deep learning-based method to outperform traditional Numerical Weather Prediction (NWP) approaches in extreme precipitation conditions. It shows improvements of 15.6%, 17.4%, and 31.8% over NWP predictions in heavy rain CSI on respective datasets. These results highlight the potential impact of our model in reducing the severe consequences of extreme weather events.
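A minimal sketch of the general idea behind a weighted loss for imbalanced rain intensities: rare heavy-rain classes get larger weights so they are not ignored during training. The class layout and weights are assumptions for illustration, not the CAMT loss.

```python
import torch
import torch.nn as nn

# classes: 0 = no rain, 1 = light, 2 = moderate, 3 = heavy (rarest, highest weight)
class_weights = torch.tensor([0.2, 0.6, 1.0, 3.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 4, 32, 32)             # per-pixel rain-class logits (B, C, H, W)
target = torch.randint(0, 4, (8, 32, 32))      # per-pixel ground-truth rain class
loss = criterion(logits, target)
print(loss.item())
```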

MedPrompt: Cross-Modal Prompting for Multi-Task Medical Image Translation

  • paper_url: http://arxiv.org/abs/2310.02663
  • repo_url: None
  • paper_authors: Xuhang Chen, Chi-Man Pun, Shuqiang Wang
  • for: Cross-modal medical image translation, i.e. synthesizing missing-modality data for clinical diagnosis.
  • methods: MedPrompt, a multi-task framework that translates between modalities efficiently: a Self-adaptive Prompt Block dynamically guides the translation network toward distinct modalities, a Prompt Extraction Block and a Prompt Fusion Block encode the cross-modal prompt, and a Transformer backbone strengthens global feature extraction across modalities.
  • results: Across five datasets and four modality pairs, the proposed model achieves state-of-the-art visual quality and excellent generalization capability.
    Abstract Cross-modal medical image translation is an essential task for synthesizing missing modality data for clinical diagnosis. However, current learning-based techniques have limitations in capturing cross-modal and global features, restricting their suitability to specific pairs of modalities. This lack of versatility undermines their practical usefulness, particularly considering that the missing modality may vary for different cases. In this study, we present MedPrompt, a multi-task framework that efficiently translates different modalities. Specifically, we propose the Self-adaptive Prompt Block, which dynamically guides the translation network towards distinct modalities. Within this framework, we introduce the Prompt Extraction Block and the Prompt Fusion Block to efficiently encode the cross-modal prompt. To enhance the extraction of global features across diverse modalities, we incorporate the Transformer model. Extensive experimental results involving five datasets and four pairs of modalities demonstrate that our proposed model achieves state-of-the-art visual quality and exhibits excellent generalization capability.

Active Visual Localization for Multi-Agent Collaboration: A Data-Driven Approach

  • paper_url: http://arxiv.org/abs/2310.02650
  • repo_url: None
  • paper_authors: Matthew Hanlon, Boyang Sun, Marc Pollefeys, Hermann Blum
  • for: Using active visual localization to overcome the viewpoint-change challenges that arise when multiple robots, or humans and robots, must localize in a shared map.
  • methods: The work focuses on selecting the optimal viewpoint at a given location, comparing existing approaches from the literature and additional proposed baselines with a new data-driven approach.
  • results: The data-driven approach outperforms existing methods in both controlled simulation experiments and real-world deployment.
    Abstract Rather than having each newly deployed robot create its own map of its surroundings, the growing availability of SLAM-enabled devices provides the option of simply localizing in a map of another robot or device. In cases such as multi-robot or human-robot collaboration, localizing all agents in the same map is even necessary. However, localizing e.g. a ground robot in the map of a drone or head-mounted MR headset presents unique challenges due to viewpoint changes. This work investigates how active visual localization can be used to overcome such challenges of viewpoint changes. Specifically, we focus on the problem of selecting the optimal viewpoint at a given location. We compare existing approaches in the literature with additional proposed baselines and propose a novel data-driven approach. The result demonstrates the superior performance of the data-driven approach when compared to existing methods, both in controlled simulation experiments and real-world deployment.

P2CADNet: An End-to-End Reconstruction Network for Parametric 3D CAD Model from Point Clouds

  • paper_url: http://arxiv.org/abs/2310.02638
  • repo_url: None
  • paper_authors: Zhihao Zong, Fazhi He, Rubin Fan, Yuxin Liu
  • for: Reconstructing feature-based parametric CAD models from point clouds, an important problem in modern industry, via an end-to-end network (P2CADNet).
  • methods: The architecture combines a point cloud feature extractor, a CAD sequence reconstructor, and a parameter optimizer; the sequence reconstructor uses two transformer decoders, one with a target mask and one without, to reconstruct the featured CAD model autoregressively, and a parameter optimizer with cross-attention further refines the CAD feature parameters.
  • results: On a public dataset P2CADNet achieves excellent reconstruction quality and accuracy; to the authors' knowledge it is the first end-to-end network to reconstruct featured CAD models from point clouds and can serve as a baseline for future work, with source code at https://github.com/Blice0415/P2CADNet.
    Abstract Computer Aided Design (CAD), especially the feature-based parametric CAD, plays an important role in modern industry and society. However, the reconstruction of featured CAD model is more challenging than the reconstruction of other CAD models. To this end, this paper proposes an end-to-end network to reconstruct featured CAD model from point cloud (P2CADNet). Initially, the proposed P2CADNet architecture combines a point cloud feature extractor, a CAD sequence reconstructor and a parameter optimizer. Subsequently, in order to reconstruct the featured CAD model in an autoregressive way, the CAD sequence reconstructor applies two transformer decoders, one with target mask and the other without mask. Finally, for predicting parameters more precisely, we design a parameter optimizer with cross-attention mechanism to further refine the CAD feature parameters. We evaluate P2CADNet on the public dataset, and the experimental results show that P2CADNet has excellent reconstruction quality and accuracy. To our best knowledge, P2CADNet is the first end-to-end network to reconstruct featured CAD model from point cloud, and can be regarded as baseline for future works. Therefore, we open the source code at https://github.com/Blice0415/P2CADNet.

Analyzing and Improving OT-based Adversarial Networks

  • paper_url: http://arxiv.org/abs/2310.02611
  • repo_url: None
  • paper_authors: Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
  • for: Generative modeling viewed through Optimal Transport (OT) theory.
  • methods: A unified framework that subsumes OT-based adversarial methods, a comprehensive analysis of the role each component plays in the training dynamics, and a simple but novel method that gradually refines the generated distribution, progressively aligning it with the data distribution.
  • results: The proposed approach achieves an FID of 2.51 on CIFAR-10, outperforming the previously best-performing unified OT-based adversarial approaches.
    Abstract Optimal Transport (OT) problem aims to find a transport plan that bridges two distributions while minimizing a given cost function. OT theory has been widely utilized in generative modeling. In the beginning, OT distance has been used as a measure for assessing the distance between data and generated distributions. Recently, OT transport map between data and prior distributions has been utilized as a generative model. These OT-based generative models share a similar adversarial training objective. In this paper, we begin by unifying these OT-based adversarial methods within a single framework. Then, we elucidate the role of each component in training dynamics through a comprehensive analysis of this unified framework. Moreover, we suggest a simple but novel method that improves the previously best-performing OT-based model. Intuitively, our approach conducts a gradual refinement of the generated distribution, progressively aligning it with the data distribution. Our approach achieves a FID score of 2.51 on CIFAR-10, outperforming unified OT-based adversarial approaches.

SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D

  • paper_url: http://arxiv.org/abs/2310.02596
  • repo_url: https://github.com/wyysf-98/SweetDreamer
  • paper_authors: Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan
  • for: Addressing the multi-view inconsistency that arises when lifting results from 2D diffusion models into 3D for text-to-3D generation.
  • methods: The 2D diffusion model is fine-tuned to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects, aligning its 2D geometric priors with well-defined 3D shapes during lifting; only coarse 3D information is used for the alignment.
  • results: The aligned geometric priors (AGP) resolve most multi-view inconsistencies while retaining the 2D model's ability to generate detailed, diverse, high-quality objects unseen in 3D datasets, and integrate readily into various state-of-the-art pipelines; human evaluation shows an 85+% consistency rate, versus roughly 30% for many previous methods.
    Abstract It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This "coarse" alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality objects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with an 85+% consistency rate by human evaluation, while many previous methods are around 30%. Our project page is https://sweetdreamer3d.github.io/

ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer

  • paper_url: http://arxiv.org/abs/2310.02588
  • repo_url: None
  • paper_authors: Seok-Yong Byun, Wonju Lee
  • for: Explaining the prediction process and debugging prediction errors of Vision Transformer (ViT) models.
  • methods: ViT-ReciproCAM, a gradient-free and attention-free visual explanation method that uses token masking and new layer outputs generated from the target layer's input to exploit the correlation between activated tokens and the network's predictions for target classes (a loose sketch follows this entry).
  • results: ViT-ReciproCAM outperforms the state-of-the-art Relevance method by 4.58% to 5.80% on the Average Drop-Coherence-Complexity (ADCC) metric and produces more localized saliency maps.
    Abstract This paper presents a novel approach to address the challenges of understanding the prediction process and debugging prediction errors in Vision Transformers (ViT), which have demonstrated superior performance in various computer vision tasks such as image classification and object detection. While several visual explainability techniques, such as CAM, Grad-CAM, Score-CAM, and Recipro-CAM, have been extensively researched for Convolutional Neural Networks (CNNs), limited research has been conducted on ViT. Current state-of-the-art solutions for ViT rely on class agnostic Attention-Rollout and Relevance techniques. In this work, we propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which does not require attention matrix and gradient information. ViT-ReciproCAM utilizes token masking and generated new layer outputs from the target layer's input to exploit the correlation between activated tokens and network predictions for target classes. Our proposed method outperforms the state-of-the-art Relevance method in the Average Drop-Coherence-Complexity (ADCC) metric by 4.58% to 5.80% and generates more localized saliency maps. Our experiments demonstrate the effectiveness of ViT-ReciproCAM and showcase its potential for understanding and debugging ViT models. Our proposed method provides an efficient and easy-to-implement alternative for generating visual explanations, without requiring attention and gradient information, which can be beneficial for various applications in the field of computer vision.
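A loose, gradient-free illustration of the token-masking idea: activate one spatial token at a time, run the remaining head, and read off the target-class score as that token's importance. This is a simplification under toy assumptions (random tokens, a linear head standing in for the classifier), not the paper's exact scheme of generating new layer outputs from the target layer's input.

```python
import torch
import torch.nn as nn

def keep_one_token_saliency(tokens: torch.Tensor, head: nn.Module, target: int,
                            grid_h: int, grid_w: int) -> torch.Tensor:
    """tokens: (N, C) spatial tokens of one image; head maps a pooled token vector to class logits."""
    scores = []
    with torch.no_grad():
        for i in range(tokens.shape[0]):
            single = torch.zeros_like(tokens)
            single[i] = tokens[i]                          # activate exactly one spatial token
            logits = head(single.mean(dim=0, keepdim=True))
            scores.append(logits[0, target])
    saliency = torch.stack(scores).reshape(grid_h, grid_w)
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)

tokens = torch.randn(14 * 14, 768)                         # stand-in for one ViT layer's spatial tokens
head = nn.Linear(768, 1000)                                # stand-in for the classification head
cam = keep_one_token_saliency(tokens, head, target=3, grid_h=14, grid_w=14)
print(cam.shape)                                           # torch.Size([14, 14])
```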

A Prototype-Based Neural Network for Image Anomaly Detection and Localization

  • paper_url: http://arxiv.org/abs/2310.02576
  • repo_url: https://github.com/98chao/ProtoAD
  • paper_authors: Chao Huang, Zhao Kang, Hong Wu
  • for: Proposing ProtoAD, a prototype-based neural network for fast and efficient image anomaly detection and localization.
  • methods: Patch features of normal images are extracted by a deep network pre-trained on natural images, prototypes of the normal patch features are learned by non-parametric clustering, and the localization network appends L2 feature normalization, a 1x1 convolution whose kernels are the prototypes, channel max-pooling, and a subtraction operation, so anomaly detection and localization run end-to-end without an extra training phase (a sketch follows this entry).
  • results: On the industrial anomaly detection datasets MVTec AD and BTAD, ProtoAD achieves performance competitive with state-of-the-art methods at a higher inference speed.
    Abstract Image anomaly detection and localization perform not only image-level anomaly classification but also locate pixel-level anomaly regions. Recently, it has received much research attention due to its wide application in various fields. This paper proposes ProtoAD, a prototype-based neural network for image anomaly detection and localization. First, the patch features of normal images are extracted by a deep network pre-trained on nature images. Then, the prototypes of the normal patch features are learned by non-parametric clustering. Finally, we construct an image anomaly localization network (ProtoAD) by appending the feature extraction network with $L2$ feature normalization, a $1\times1$ convolutional layer, a channel max-pooling, and a subtraction operation. We use the prototypes as the kernels of the $1\times1$ convolutional layer; therefore, our neural network does not need a training phase and can conduct anomaly detection and localization in an end-to-end manner. Extensive experiments on two challenging industrial anomaly detection datasets, MVTec AD and BTAD, demonstrate that ProtoAD achieves competitive performance compared to the state-of-the-art methods with a higher inference speed. The source code is available at: https://github.com/98chao/ProtoAD.
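The inference pipeline described in the abstract maps onto a few tensor operations, sketched below: L2-normalize patch features, correlate them with prototype kernels via a 1x1 convolution (cosine similarity), take the channel-wise max, and subtract from one so that low similarity to every normal prototype yields a high anomaly score. The feature extractor, prototype count, and sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

num_prototypes, channels = 50, 256
prototypes = F.normalize(torch.randn(num_prototypes, channels), dim=1)   # from clustering normal patches

features = torch.randn(1, channels, 28, 28)                              # patch features of a test image
features = F.normalize(features, dim=1)                                   # L2 feature normalization

kernels = prototypes.view(num_prototypes, channels, 1, 1)                 # prototypes as 1x1 conv kernels
similarity = F.conv2d(features, kernels)                                   # (1, num_prototypes, 28, 28)
anomaly_map = 1.0 - similarity.max(dim=1).values                           # channel max-pool + subtraction
print(anomaly_map.shape)                                                    # torch.Size([1, 28, 28])
```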

AdaMerging: Adaptive Model Merging for Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2310.02575
  • repo_url: https://github.com/EnnengYang/AdaMerging
  • paper_authors: Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, Dacheng Tao
  • for: Merging multiple task-specific fine-tuned models into a single multi-task model without retraining on the original training data, while avoiding the performance drop of naive task arithmetic.
  • methods: Adaptive Model Merging (AdaMerging), an automatic, unsupervised scheme that learns task-wise or layer-wise merging coefficients by minimizing the entropy of predictions on unlabeled test samples from the multi-task setup (a toy sketch follows this entry).
  • results: Across eight tasks AdaMerging improves on the state-of-the-art task arithmetic merging scheme by 11%, generalizes better to unseen downstream tasks, and is markedly more robust to test-time distribution shifts.
    Abstract Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase.
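A minimal sketch of task-wise adaptive merging: the merged weights are the pre-trained weights plus learnable coefficients times each task vector, and the coefficients are tuned by minimizing prediction entropy on an unlabeled test batch. Model size, data, and optimizer settings are toy assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

pretrained = nn.Linear(16, 10)
task_models = [nn.Linear(16, 10) for _ in range(3)]            # stand-ins for fine-tuned per-task models
task_vectors = [
    {n: p.detach() - q.detach() for (n, p), (_, q) in zip(m.named_parameters(),
                                                          pretrained.named_parameters())}
    for m in task_models
]

lambdas = torch.full((len(task_vectors),), 0.3, requires_grad=True)   # task-wise merging coefficients
opt = torch.optim.Adam([lambdas], lr=1e-2)
unlabeled = torch.randn(64, 16)                                        # unlabeled multi-task test batch

for _ in range(100):
    merged = {n: p.detach() + sum(l * tv[n] for l, tv in zip(lambdas, task_vectors))
              for n, p in pretrained.named_parameters()}
    logits = F.linear(unlabeled, merged['weight'], merged['bias'])
    loss = entropy(logits)                                             # entropy-minimization surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
print(lambdas.detach())
```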

ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks

  • paper_url: http://arxiv.org/abs/2310.02569
  • repo_url: https://github.com/fudandisc/reform-eval
  • paper_authors: Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen, Xuanjing Huang, Zhongyu Wei
  • for: A new way to comprehensively and quantitatively evaluate the diverse capabilities of large vision-language models (LVLMs).
  • methods: Existing task-oriented multi-modal benchmarks are re-formulated into unified LVLM-compatible formats, leveraging their annotations and avoiding the manual effort of building new benchmarks while sidestepping the difficulty of automatically scoring free-form text output.
  • results: Extensive experiments on the resulting ReForm-Eval benchmark analyze the strengths and weaknesses of existing LVLMs and identify the underlying factors; the benchmark and evaluation framework will be open-sourced.
    Abstract Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from the strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges to automatically assess the free-form text output of LVLMs. To effectively leverage the annotations available in existing benchmarks and reduce the manual effort required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Based on ReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths and weaknesses of existing LVLMs, and identify the underlying factors. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.

Generalization in diffusion models arises from geometry-adaptive harmonic representation

  • paper_url: http://arxiv.org/abs/2310.02557
  • repo_url: None
  • paper_authors: Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, Stéphane Mallat
  • for: Showing that denoising deep neural networks (DNNs) can learn high-dimensional densities despite the curse of dimensionality, and asking whether they learn the "true" continuous density rather than memorizing the training set.
  • methods: Two denoising DNNs are trained on non-overlapping subsets of a dataset and their learned score functions are compared; the denoiser is analyzed as a shrinkage operation in a basis adapted to the underlying image (see the sketch after this entry).
  • results: The two networks learn nearly the same score function, hence the same density, from a surprisingly small number of training images; the adapted bases exhibit geometry-adaptive harmonic structure, arising even for image classes where such a basis is suboptimal, and denoising is near-optimal on regular image classes whose optimal basis is known to be geometry-adaptive and harmonic.
    Abstract High-quality samples generated with score-based reverse diffusion algorithms provide evidence that deep neural networks (DNN) trained for denoising can learn high-dimensional densities, despite the curse of dimensionality. However, recent reports of memorization of the training set raise the question of whether these networks are learning the "true" continuous density of the data. Here, we show that two denoising DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, with a surprisingly small number of training images. This strong generalization demonstrates an alignment of powerful inductive biases in the DNN architecture and/or training algorithm with properties of the data distribution. We analyze these, demonstrating that the denoiser performs a shrinkage operation in a basis adapted to the underlying image. Examination of these bases reveals oscillating harmonic structures along contours and in homogeneous image regions. We show that trained denoisers are inductively biased towards these geometry-adaptive harmonic representations by demonstrating that they arise even when the network is trained on image classes such as low-dimensional manifolds, for which the harmonic basis is suboptimal. Additionally, we show that the denoising performance of the networks is near-optimal when trained on regular image classes for which the optimal basis is known to be geometry-adaptive and harmonic.
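The "shrinkage in an adapted basis" view can be probed numerically: for a denoiser f, the SVD of its Jacobian at a noisy input gives the locally adapted basis (singular vectors) and the shrinkage factors (singular values). The tiny untrained MLP below stands in for a trained denoiser purely to show the mechanics; it is an assumption-laden illustration, not the paper's analysis code.

```python
import torch
import torch.nn as nn

d = 64                                               # flattened image dimension (toy)
denoiser = nn.Sequential(nn.Linear(d, 128), nn.Softplus(), nn.Linear(128, d))

y = torch.randn(d)                                   # a noisy input
J = torch.autograd.functional.jacobian(denoiser, y)  # (d, d) local linearization of the denoiser
U, S, Vh = torch.linalg.svd(J)

print("largest shrinkage factors:", S[:5])           # values near 1: preserved (signal) directions,
                                                     # values near 0: suppressed (noise) directions
```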

SlowFormer: Universal Adversarial Patch for Attack on Compute and Energy Efficiency of Inference Efficient Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.02544
  • repo_url: https://github.com/UCDvision/SlowFormer
  • paper_authors: KL Navaneet, Soroush Abbasi Koohpayegani, Essam Sleiman, Hamed Pirsiavash
  • for: Studying the robustness of input-adaptive, compute-efficient vision transformers against attacks that target their computation and energy consumption at inference time.
  • methods: A universal adversarial patch is optimized so that, when pasted on any image, it increases the model's compute and power consumption; experiments cover three different efficient vision transformer methods, and standard adversarial training is evaluated as a defense.
  • results: In some cases a patch occupying only 8% of the image area drives the computation to its maximum possible level, while adversarial training reduces some of the attack's success.
    Abstract Recently, there has been a lot of progress in reducing the computation of deep models at inference time. These methods can reduce both the computational needs and power usage of deep models. Some of these approaches adaptively scale the compute based on the input instance. We show that such models can be vulnerable to a universal adversarial patch attack, where the attacker optimizes for a patch that when pasted on any image, can increase the compute and power consumption of the model. We run experiments with three different efficient vision transformer methods showing that in some cases, the attacker can increase the computation to the maximum possible level by simply pasting a patch that occupies only 8% of the image area. We also show that a standard adversarial training defense method can reduce some of the attack's success. We believe adaptive efficient methods will be necessary for the future to lower the power usage of deep models, so we hope our paper encourages the community to study the robustness of these methods and develop better defense methods for the proposed attack.

ShaSTA-Fuse: Camera-LiDAR Sensor Fusion to Model Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2310.02532
  • repo_url: None
  • paper_authors: Tara Sadjadpour, Rares Ambrus, Jeannette Bohg
  • for: Developing a camera-LiDAR 3D multi-object tracking (MOT) framework so that an autonomous mobile agent can perceive and safely navigate a scene.
  • methods: Building on the LiDAR-only ShaSTA, which models shape and spatio-temporal affinities, a camera-LiDAR fusion technique generates a rich sensory signal incorporating depth and distant-object information to improve affinity estimation for data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence refinement; a first-of-its-kind multimodal sequential track confidence refinement fuses 2D and 3D detections.
  • results: State-of-the-art performance among multimodal 3D MOT algorithms on the nuScenes benchmark using CenterPoint detections, with ablations showing that the camera particularly helps small, distant objects that suffer from LiDAR's depth-sensing limits and sparsity.
    Abstract 3D multi-object tracking (MOT) is essential for an autonomous mobile agent to safely navigate a scene. In order to maximize the perception capabilities of the autonomous agent, we aim to develop a 3D MOT framework that fuses camera and LiDAR sensor information. Building on our prior LiDAR-only work, ShaSTA, which models shape and spatio-temporal affinities for 3D MOT, we propose a novel camera-LiDAR fusion approach for learning affinities. At its core, this work proposes a fusion technique that generates a rich sensory signal incorporating information about depth and distant objects to enhance affinity estimation for improved data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence score refinement. Our main contributions include a novel fusion approach for combining camera and LiDAR sensory signals to learn affinities, and a first-of-its-kind multimodal sequential track confidence refinement technique that fuses 2D and 3D detections. Additionally, we perform an ablative analysis on each fusion step to demonstrate the added benefits of incorporating the camera sensor, particularly for small, distant objects that tend to suffer from the depth-sensing limits and sparsity of LiDAR sensors. In sum, our technique achieves state-of-the-art performance on the nuScenes benchmark amongst multimodal 3D MOT algorithms using CenterPoint detections.

On the Cognition of Visual Question Answering Models and Human Intelligence: A Comparative Study

  • paper_url: http://arxiv.org/abs/2310.02528
  • repo_url: None
  • paper_authors: Liben Chen, Long Chen, Tian Ellison-Chen, Zhuoyuan Xu
  • for: Studying the relationship between human cognition and Visual Question Answering (VQA) models, to understand how well such models mimic human cognition.
  • methods: A survey records the human thinking process, and the models' outputs and attention maps are compared with those of humans.
  • results: Although the VQA models resemble human cognition in architecture and perform similarly to humans at the recognition level, they still struggle with cognitive inference; the analysis of the human thinking procedure can direct future research toward adding more cognitive capacity to model features and architectures.
    Abstract Visual Question Answering (VQA) is a challenging task that requires cross-modal understanding and reasoning of visual image and natural language question. To inspect the association of VQA models to human cognition, we designed a survey to record human thinking process and analyzed VQA models by comparing the outputs and attention maps with those of humans. We found that although the VQA models resemble human cognition in architecture and performs similarly with human on the recognition-level, they still struggle with cognitive inferences. The analysis of human thinking procedure serves to direct future research and introduce more cognitive capacity into modeling features and architectures.

A Spatio-Temporal Attention-Based Method for Detecting Student Classroom Behaviors

  • paper_url: http://arxiv.org/abs/2310.02523
  • repo_url: None
  • paper_authors: Fan Yang
  • for: Improving teaching efficiency by accurately detecting student classroom behaviors from classroom video.
  • methods: The SlowFast network generates motion and environmental feature maps from video; a spatio-temporal attention module (information aggregation, compression, and stimulation) produces attention maps over the time, channel, and spatial dimensions, on which multi-label behavior classification is performed; an improved focal loss assigns more weight to tail classes to handle the long-tailed data (a sketch follows this entry).
  • results: On the self-built student classroom behavior dataset STSCB, behavior classification accuracy improves by 8.94% over the SlowFast baseline.
    Abstract Accurately detecting student behavior from classroom videos is beneficial for analyzing their classroom status and improving teaching efficiency. However, low accuracy in student classroom behavior detection is a prevalent issue. To address this issue, we propose a Spatio-Temporal Attention-Based Method for Detecting Student Classroom Behaviors (BDSTA). Firstly, the SlowFast network is used to generate motion and environmental information feature maps from the video. Then, the spatio-temporal attention module is applied to the feature maps, including information aggregation, compression and stimulation processes. Subsequently, attention maps in the time, channel and space dimensions are obtained, and multi-label behavior classification is performed based on these attention maps. To solve the long-tail data problem that exists in student classroom behavior datasets, we use an improved focal loss function to assign more weight to the tail class data during training. Experimental results are conducted on a self-made student classroom behavior dataset named STSCB. Compared with the SlowFast model, the average accuracy of student behavior classification detection improves by 8.94\% using BDSTA.
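A minimal sketch of a class-weighted, multi-label focal loss of the kind used to up-weight rare (tail) behaviors; the alpha weights, gamma, and class list are illustrative assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def weighted_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                        alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """logits, targets: (batch, num_classes); targets are multi-hot in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-bce)                        # probability assigned to the true label
    focal = alpha * (1.0 - p_t) ** gamma * bce   # down-weight easy examples, up-weight rare/hard ones
    return focal.mean()

# six behaviors: hand-raising, reading, writing, using a phone, bowing the head, leaning over the table
alpha = torch.tensor([1.0, 0.8, 0.9, 2.0, 1.5, 1.8])   # larger weights for tail classes (assumed values)
logits = torch.randn(4, 6)
targets = torch.randint(0, 2, (4, 6)).float()
print(weighted_focal_loss(logits, targets, alpha).item())
```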

SCB-Dataset3: A Benchmark for Detecting Student Classroom Behavior

  • paper_url: http://arxiv.org/abs/2310.02522
  • repo_url: https://github.com/whiffe/scb-dataset
  • paper_authors: Fan Yang, Tao Wang
  • for: Providing a reliable, publicly available dataset of student classroom behaviors so that future research can detect behaviors and improve teaching effectiveness.
  • methods: Deep-learning detectors (YOLOv5, YOLOv7, and YOLOv8) are evaluated on the proposed SCB-dataset3, which comprises 5686 images with 45578 labels covering six behaviors: hand-raising, reading, writing, using a phone, bowing the head, and leaning over the table.
  • results: The detectors achieve a mean average precision (mAP) of up to 80.3% on SCB-dataset3; the dataset is available at https://github.com/Whiffe/SCB-dataset.
    Abstract The use of deep learning methods to automatically detect students' classroom behavior is a promising approach for analyzing their class performance and improving teaching effectiveness. However, the lack of publicly available datasets on student behavior poses a challenge for researchers in this field. To address this issue, we propose the Student Classroom Behavior dataset (SCB-dataset3), which represents real-life scenarios. Our dataset comprises 5686 images with 45578 labels, focusing on six behaviors: hand-raising, reading, writing, using a phone, bowing the head, and leaning over the table. We evaluated the dataset using the YOLOv5, YOLOv7, and YOLOv8 algorithms, achieving a mean average precision (mAP) of up to 80.3%. We believe that our dataset can serve as a robust foundation for future research in student behavior detection and contribute to advancements in this field. Our SCB-dataset3 is available for download at: https://github.com/Whiffe/SCB-dataset