cs.CV - 2023-10-07

DISCOVER: Making Vision Networks Interpretable via Competition and Dissection

  • paper_url: http://arxiv.org/abs/2310.04929
  • repo_url: None
  • paper_authors: Konstantinos P. Panousis, Sotirios Chatzis
  • for: The paper contributes to post-hoc interpretability, specifically Network Dissection: it presents a framework that makes it easier to discover the individual functionality of each neuron in a network trained on a vision task, with discovery performed via textual description generation.
  • methods: The approach leverages recent multimodal vision-text models and network layers built on the novel concept of stochastic local competition between linear units. Only a small subset of neurons in each layer is activated per input, yielding extremely high activation sparsity (as low as roughly 4% active units). The inferred (sparse) activation patterns let neurons activate on, and specialize to, inputs with specific characteristics (a minimal sketch of such a competition layer follows the abstract).
  • results: The method yields vision networks that retain or improve classification performance while realizing a principled framework for text-based description and examination of the generated neuronal representations; descriptions are generated only for the very few active neurons, facilitating direct investigation of the network's decision process.
    Abstract Modern deep networks are highly complex and their inferential outcome is very hard to interpret. This is a serious obstacle to their transparent deployment in safety-critical or bias-aware applications. This work contributes to post-hoc interpretability, and specifically Network Dissection. Our goal is to present a framework that makes it easier to discover the individual functionality of each neuron in a network trained on a vision task; discovery is performed in terms of textual description generation. To achieve this objective, we leverage: (i) recent advances in multimodal vision-text models and (ii) network layers founded upon the novel concept of stochastic local competition between linear units. In this setting, only a small subset of layer neurons are activated for a given input, leading to extremely high activation sparsity (as low as only $\approx 4\%$). Crucially, our proposed method infers (sparse) neuron activation patterns that enable the neurons to activate/specialize to inputs with specific characteristics, diversifying their individual functionality. This capacity of our method supercharges the potential of dissection processes: human understandable descriptions are generated only for the very few active neurons, thus facilitating the direct investigation of the network's decision process. As we experimentally show, our approach: (i) yields Vision Networks that retain or improve classification performance, and (ii) realizes a principled framework for text-based description and examination of the generated neuronal representations.
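
The stochastic local competition described above can be pictured as a linear layer whose units are grouped into small blocks, with only one unit per block allowed to fire for a given input. Below is a minimal PyTorch sketch of that idea, assuming blocks of 32 units and Gumbel-softmax sampling of the winner; the block size, the sampling scheme, and the absence of any regularization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticLWTA(nn.Module):
    """Linear layer with stochastic local winner-take-all competition.

    Units are split into blocks of size `units_per_block`; per input, a single
    winner per block is kept (sampled via Gumbel-softmax during training) and
    all other units output zero, giving activation sparsity of 1/units_per_block.
    """
    def __init__(self, in_features, out_features, units_per_block=32, tau=0.67):
        super().__init__()
        assert out_features % units_per_block == 0
        self.linear = nn.Linear(in_features, out_features)
        self.U = units_per_block
        self.tau = tau

    def forward(self, x):
        h = self.linear(x)                             # (B, out_features)
        B, F_out = h.shape
        blocks = h.view(B, F_out // self.U, self.U)    # (B, K, U)
        if self.training:
            # Sample a one-hot winner per block (differentiable via Gumbel-softmax).
            mask = F.gumbel_softmax(blocks, tau=self.tau, hard=True, dim=-1)
        else:
            # Deterministic winner at test time.
            mask = F.one_hot(blocks.argmax(dim=-1), self.U).to(h.dtype)
        return (blocks * mask).view(B, F_out)          # sparse activations

# Quick check: the fraction of active units equals 1 / units_per_block (~3% here).
layer = StochasticLWTA(64, 128, units_per_block=32)
out = layer(torch.randn(4, 64))
print((out != 0).float().mean())
```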

DynamicBEV: Leveraging Dynamic Queries and Temporal Context for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.05989
  • repo_url: None
  • paper_authors: Jiawei Yao, Yingxin Lai
  • for: The work aims to improve the accuracy and efficiency of query-based 3D object detection from BEV images by adapting to complex spatial-temporal relationships in the scene.
  • methods: The paper proposes DynamicBEV, which replaces the conventional static queries with dynamic queries. It combines K-means clustering with Top-K Attention to aggregate information from both local and distant features (a minimal sketch of Top-K aggregation follows the abstract), adds a Lightweight Temporal Fusion Module (LTFM) for efficient temporal context integration with reduced computation, and uses a custom Diversity Loss to keep feature representations balanced across scenarios.
  • results: Experiments on the nuScenes dataset validate the effectiveness of DynamicBEV and establish a new state of the art for query-based BEV object detection.
    Abstract 3D object detection is crucial for applications like autonomous driving and robotics. While query-based 3D object detection for BEV (Bird's Eye View) images has seen significant advancements, most existing methods follow the paradigm of static queries. Such a paradigm is incapable of adapting to complex spatial-temporal relationships in the scene. To solve this problem, we introduce a new paradigm in DynamicBEV, a novel approach that employs dynamic queries for BEV-based 3D object detection. In contrast to static queries, the proposed dynamic queries exploit K-means clustering and Top-K Attention in a creative way to aggregate information more effectively from both local and distant features, which enables DynamicBEV to adapt iteratively to complex scenes. To further boost efficiency, DynamicBEV incorporates a Lightweight Temporal Fusion Module (LTFM), designed for efficient temporal context integration with a significant computation reduction. Additionally, a custom-designed Diversity Loss ensures a balanced feature representation across scenarios. Extensive experiments on the nuScenes dataset validate the effectiveness of DynamicBEV, establishing a new state-of-the-art and heralding a paradigm-level breakthrough in query-based BEV object detection.
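
The Top-K Attention aggregation referred to above can be illustrated as a query attending only to its K highest-scoring candidate features (for instance, cluster centroids obtained by running k-means over BEV features), which keeps the update cheap while still reaching distant information. The snippet below is only a schematic sketch under assumed shapes; the paper's clustering, iterative update, and temporal fusion are not reproduced.

```python
import torch
import torch.nn.functional as F

def topk_attention(query, feats, k=8):
    """Aggregate features for each query using attention restricted to the
    top-k highest-scoring features (a sparse stand-in for full attention).

    query: (Q, C) dynamic query embeddings
    feats: (N, C) candidate features (e.g. k-means centroids of BEV features)
    """
    scores = query @ feats.t() / feats.shape[-1] ** 0.5       # (Q, N)
    top_val, top_idx = scores.topk(k, dim=-1)                 # (Q, k)
    attn = F.softmax(top_val, dim=-1)                         # softmax over kept scores
    gathered = feats[top_idx]                                 # (Q, k, C)
    return (attn.unsqueeze(-1) * gathered).sum(dim=1)         # (Q, C)

# Toy usage: 100 dynamic queries aggregating from 4096 clustered BEV features.
queries = torch.randn(100, 256)
centroids = torch.randn(4096, 256)   # assumed to come from k-means over BEV features
updated = queries + topk_attention(queries, centroids, k=16)  # residual query update
print(updated.shape)  # torch.Size([100, 256])
```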

$H$-RANSAC, an algorithmic variant for Homography image transform from featureless point sets: application to video-based football analytics

  • paper_url: http://arxiv.org/abs/2310.04912
  • repo_url: https://github.com/gnousias/h-ransac
  • paper_authors: George Nousias, Konstantinos Delibasis, Ilias Maglogiannis
  • for: The paper addresses image matching, specifically estimating the homography between two images from point sets without descriptive local features, with application to video-based football analytics.
  • methods: A generalized RANSAC variant, H-RANSAC, works on points that may optionally be labelled in two classes. Before each iteration it rejects implausible point selections based on the type of quadrilateral the sampled points form (convex or concave, self-intersecting or not; a minimal sketch of this check follows the abstract), and a similar post-hoc criterion rejects implausible homographies at the end of each iteration; the expected maximum number of iterations is derived for different success probabilities.
  • results: Tested on a large dataset of images captured by 12 cameras during real football matches, the method outperforms state-of-the-art RANSAC pipelines combined with classic and deep-learning salient-point detection in terms of average reprojection error and the number of successfully processed frame pairs.
    Abstract Estimating the homography matrix between two images has various applications like image stitching or image mosaicing and spatial information retrieval from multiple camera views, but has been proved to be a complicated problem, especially in cases of radically different camera poses and zoom factors. Many relevant approaches have been proposed, utilizing direct feature based, or deep learning methodologies. In this paper, we propose a generalized RANSAC algorithm, H-RANSAC, to retrieve homography image transformations from sets of points without descriptive local feature vectors and point pairing. We allow the points to be optionally labelled in two classes. We propose a robust criterion that rejects implausible point selection before each iteration of RANSAC, based on the type of the quadrilaterals formed by random point pair selection (convex or concave and (non)-self-intersecting). A similar post-hoc criterion that rejects implausible homography transformations is included at the end of each iteration. The expected maximum iterations of $H$-RANSAC are derived for different probabilities of success, according to the number of points per image and per class, and the percentage of outliers. The proposed methodology is tested on a large dataset of images acquired by 12 cameras during real football matches, where radically different views at each timestamp are to be matched. Comparisons with state-of-the-art implementations of RANSAC combined with classic and deep learning image salient point detection indicate the superiority of the proposed $H$-RANSAC, in terms of average reprojection error and number of successfully processed pairs of frames, rendering it the method of choice in cases of image homography alignment with few tens of points, while local features are not available, or not descriptive enough. The implementation of $H$-RANSAC is available at https://github.com/gnousias/H-RANSAC.
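
The pre-iteration plausibility test hinges on classifying the quadrilateral formed by the four sampled points as convex or concave and as self-intersecting or not. A minimal implementation of such a check, using cross-product signs for convexity and segment-intersection tests for self-intersection, is sketched below; which configurations H-RANSAC actually rejects is defined in the paper and not reproduced here.

```python
import numpy as np

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def is_convex(quad):
    """True if the 4 points, taken in the given order, form a convex polygon."""
    signs = [np.sign(_cross(quad[i], quad[(i + 1) % 4], quad[(i + 2) % 4]))
             for i in range(4)]
    signs = [s for s in signs if s != 0]
    return len(signs) > 0 and all(s == signs[0] for s in signs)

def _segments_intersect(p1, p2, p3, p4):
    d1, d2 = _cross(p3, p4, p1), _cross(p3, p4, p2)
    d3, d4 = _cross(p1, p2, p3), _cross(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def is_self_intersecting(quad):
    """True if either pair of non-adjacent edges of the quadrilateral crosses."""
    return (_segments_intersect(quad[0], quad[1], quad[2], quad[3]) or
            _segments_intersect(quad[1], quad[2], quad[3], quad[0]))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]          # convex, simple
bowtie = [(0, 0), (1, 1), (1, 0), (0, 1)]          # self-intersecting
print(is_convex(square), is_self_intersecting(square))  # True False
print(is_convex(bowtie), is_self_intersecting(bowtie))  # False True
```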

WAIT: Feature Warping for Animation to Illustration video Translation using GANs

  • paper_url: http://arxiv.org/abs/2310.04901
  • repo_url: https://github.com/giddyyupp/wait
  • paper_authors: Samet Hicsonmez, Nermin Samet, Fidan Samet, Oguz Bakir, Emre Akbas, Pinar Duygulu
  • for: The work explores a new video-to-video translation setting: stylizing animation movies with the style of the original children's book illustrations.
  • methods: A new video stylization problem is introduced in which the style comes from an unordered set of images rather than a video sequence or a single style image. This is challenging because temporal consistency cannot be borrowed from a style video, and obtaining consistent styles across frames from multiple images is harder than from a single image.
  • results: A new generator network with feature warping layers provides temporal coherency without the optical-flow or temporal-predictor sub-networks used by prior methods (a minimal warping sketch follows the abstract). Effectiveness is shown qualitatively and quantitatively on three datasets; code and pretrained models are available on GitHub.
    Abstract In this paper, we explore a new domain for video-to-video translation. Motivated by the availability of animation movies that are adopted from illustrated books for children, we aim to stylize these videos with the style of the original illustrations. Current state-of-the-art video-to-video translation models rely on having a video sequence or a single style image to stylize an input video. We introduce a new problem for video stylizing where an unordered set of images are used. This is a challenging task for two reasons: i) we do not have the advantage of temporal consistency as in video sequences; ii) it is more difficult to obtain consistent styles for video frames from a set of unordered images compared to using a single image. Most of the video-to-video translation methods are built on an image-to-image translation model, and integrate additional networks such as optical flow, or temporal predictors to capture temporal relations. These additional networks make the model training and inference complicated and slow down the process. To ensure temporal coherency in video-to-video style transfer, we propose a new generator network with feature warping layers which overcomes the limitations of the previous methods. We show the effectiveness of our method on three datasets both qualitatively and quantitatively. Code and pretrained models are available at https://github.com/giddyyupp/wait.
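
A feature warping layer of the kind mentioned above can be sketched as resampling a feature map along a per-pixel offset field, e.g. with torch.nn.functional.grid_sample. The snippet below only illustrates that operation under assumed tensor shapes; how WAIT predicts the offsets inside its generator is not reproduced.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map by a per-pixel offset field.

    feat: (B, C, H, W) features, e.g. from a previous frame or layer
    flow: (B, 2, H, W) offsets in pixels (dx, dy)
    """
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # (B, 2, H, W)
    # Normalize to [-1, 1] for grid_sample (x over W, y over H).
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

feat = torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 2, 32, 32)   # zero offsets: output equals input
print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))  # True
```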

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

  • paper_url: http://arxiv.org/abs/2310.04900
  • repo_url: https://github.com/ninatu/howtocaption
  • paper_authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne
  • for: The paper aims to improve text-video models by replacing noisy ASR-based supervision with better-aligned video descriptions.
  • methods: A large language model is prompted with longer spans of ASR subtitles to generate plausible, human-style video descriptions, and is further prompted to produce timestamps for each caption so that the captions can be aligned to the video without human supervision (a minimal prompting sketch follows the abstract).
  • results: Applied to the subtitles of HowTo100M, the method produces the new large-scale HowToCaption dataset; the resulting captions significantly improve text-video retrieval across benchmarks and disentangle textual narration from the audio, boosting text-video-audio tasks.
    Abstract Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos. Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture context beyond a single sentence. To align the captions to the video temporally, we prompt the LLM to generate timestamps for each produced caption based on the subtitles. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for text-video retrieval but also lead to a disentangling of textual narration from the audio, boosting performance in text-video-audio tasks.
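
The prompting step can be pictured as feeding the LLM a block of timestamped ASR subtitles and asking it to return cleaned-up visual captions together with timestamps taken from those subtitles. The sketch below only illustrates how such a prompt might be assembled and its output parsed; the prompt wording, the llm() call, and the output format are hypothetical placeholders, not the released HowToCaption pipeline.

```python
from typing import List, Tuple

def build_prompt(subtitles: List[Tuple[float, float, str]]) -> str:
    """Assemble ASR subtitles (start, end, text) into one long prompt so the
    LLM sees context beyond a single sentence."""
    lines = [f"[{s:.0f}s-{e:.0f}s] {t}" for s, e, t in subtitles]
    return (
        "The following are automatic speech recognition subtitles of an "
        "instructional video. Write short visual captions describing what is "
        "shown, one per line, each prefixed with a start and end time in "
        "seconds taken from the subtitles:\n" + "\n".join(lines)
    )

def parse_captions(llm_output: str) -> List[Tuple[float, float, str]]:
    """Parse lines of the assumed form '12-17: caption text'."""
    captions = []
    for line in llm_output.splitlines():
        if ":" not in line:
            continue
        span, text = line.split(":", 1)
        try:
            start, end = (float(x) for x in span.replace("s", "").split("-"))
        except ValueError:
            continue
        captions.append((start, end, text.strip()))
    return captions

subs = [(0, 5, "so first we chop the onions"), (5, 12, "then fry them until golden")]
prompt = build_prompt(subs)
# response = llm(prompt)   # hypothetical LLM call, not a real API
response = "0-5: A person chops onions.\n5-12: Onions frying in a pan."
print(parse_captions(response))
```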

Machine Learning for Automated Mitral Regurgitation Detection from Cardiac Imaging

  • paper_url: http://arxiv.org/abs/2310.04871
  • repo_url: None
  • paper_authors: Ke Xiao, Erik Learned-Miller, Evangelos Kalogerakis, James Priest, Madalina Fiterau
  • for: Automated detection of mitral regurgitation (MR) from cardiac imaging.
  • methods: A semi-supervised model, CUSSP, operates on cardiac imaging slices of the 4-chamber view; it combines standard computer vision techniques and contrastive models to learn from large amounts of unlabeled data, together with specialized classifiers, forming the first automated MR classification system.
  • results: On a test set of 179 labeled sequences (154 non-MR, 25 MR), CUSSP attains an F1 score of 0.69 and a ROC-AUC of 0.88, setting the first benchmark result for this new task.
    Abstract Mitral regurgitation (MR) is a heart valve disease with potentially fatal consequences that can only be forestalled through timely diagnosis and treatment. Traditional diagnosis methods are expensive, labor-intensive and require clinical expertise, posing a barrier to screening for MR. To overcome this impediment, we propose a new semi-supervised model for MR classification called CUSSP. CUSSP operates on cardiac imaging slices of the 4-chamber view of the heart. It uses standard computer vision techniques and contrastive models to learn from large amounts of unlabeled data, in conjunction with specialized classifiers to establish the first ever automated MR classification system. Evaluated on a test set of 179 labeled -- 154 non-MR and 25 MR -- sequences, CUSSP attains an F1 score of 0.69 and a ROC-AUC score of 0.88, setting the first benchmark result for this new task.

Exploiting Facial Relationships and Feature Aggregation for Multi-Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2310.04845
  • repo_url: None
  • paper_authors: Chenhao Lin, Fangbin Yi, Hang Wang, Qian Li, Deng Jingyi, Chao Shen
  • for: Detection of multi-face forgeries.
  • methods: A framework with two modules: a facial relationships learning module that generates distinguishable local features for each face within an image, and a global feature aggregation module that leverages mutual constraints between global and local information to improve detection accuracy.
  • results: State-of-the-art multi-face forgery detection performance on two publicly available multi-face forgery datasets.
    Abstract Face forgery techniques have emerged as a forefront concern, and numerous detection approaches have been proposed to address this challenge. However, existing methods predominantly concentrate on single-face manipulation detection, leaving the more intricate and realistic realm of multi-face forgeries relatively unexplored. This paper proposes a novel framework explicitly tailored for multi-face forgery detection, filling a critical gap in the current research. The framework mainly involves two modules: (i) a facial relationships learning module, which generates distinguishable local features for each face within images, (ii) a global feature aggregation module that leverages the mutual constraints between global and local information to enhance forgery detection accuracy. Our experimental results on two publicly available multi-face forgery datasets demonstrate that the proposed approach achieves state-of-the-art performance in multi-face forgery detection scenarios.

Extract-Transform-Load for Video Streams

  • paper_url: http://arxiv.org/abs/2310.04830
  • repo_url: https://github.com/ferdiko/vetl
  • paper_authors: Ferdinand Kossmann, Ziniu Wu, Eugenie Lai, Nesime Tatbul, Lei Cao, Tim Kraska, Samuel Madden
  • for: The paper addresses the prohibitive cost of storing and querying video at scale by treating large-scale video analytics as a data warehousing problem, defined as Video Extract-Transform-Load (V-ETL).
  • methods: The proposed system, Skyscraper, executes arbitrary video ingestion pipelines and adaptively tunes them, e.g., adjusting sampling rates and resolutions to the ingested content, to reduce cost at minimal or no quality degradation (a toy sketch of such knob selection follows the abstract). It is provisioned with cheap on-premises compute and combines buffering with cloud bursting to handle workload peaks caused by expensive processing configurations.
  • results: In experiments, Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptations of current state-of-the-art systems while providing throughput robustness guarantees that those systems lack.
    Abstract Social media, self-driving cars, and traffic cameras produce video streams at large scales and cheap cost. However, storing and querying video at such scales is prohibitively expensive. We propose to treat large-scale video analytics as a data warehousing problem: Video is a format that is easy to produce but needs to be transformed into an application-specific format that is easy to query. Analogously, we define the problem of Video Extract-Transform-Load (V-ETL). V-ETL systems need to reduce the cost of running a user-defined V-ETL job while also giving throughput guarantees to keep up with the rate at which data is produced. We find that no current system sufficiently fulfills both needs and therefore propose Skyscraper, a system tailored to V-ETL. Skyscraper can execute arbitrary video ingestion pipelines and adaptively tunes them to reduce cost at minimal or no quality degradation, e.g., by adjusting sampling rates and resolutions to the ingested content. Skyscraper can hereby be provisioned with cheap on-premises compute and uses a combination of buffering and cloud bursting to deal with peaks in workload caused by expensive processing configurations. In our experiments, we find that Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptions of current SOTA systems, while at the same time giving robustness guarantees that these systems are lacking.
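
The adaptive knob tuning described above can be caricatured as choosing the highest-quality (sampling rate, resolution) setting whose measured throughput still keeps up with the incoming frame rate, and falling back to cloud bursting otherwise. The toy sketch below, with made-up knob settings and throughputs, only illustrates that idea; Skyscraper's actual optimizer, quality model, and buffering logic are not reproduced.

```python
# Toy sketch of knob selection for a V-ETL ingestion pipeline. All numbers
# (sampling rates, resolution scales, measured throughputs) are illustrative.
KNOBS = [
    # (frames sampled per second, resolution scale, measured throughput in fps)
    (30, 1.0, 18.0),
    (15, 1.0, 33.0),
    (15, 0.5, 60.0),
    (5,  0.5, 150.0),
]

def choose_knob(arrival_fps: float):
    """Return the first (highest-quality) knob setting whose throughput keeps
    up with the frames it would sample, or None if none can keep up."""
    for fps, scale, throughput in KNOBS:
        if throughput >= min(arrival_fps, fps):
            return fps, scale
    return None

setting = choose_knob(arrival_fps=30.0)
if setting is None:
    print("burst remaining work to the cloud")
else:
    print("sample at", setting[0], "fps, resolution scale", setting[1])
```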

How to effectively train an ensemble of Faster R-CNN object detectors to quantify uncertainty

  • paper_url: http://arxiv.org/abs/2310.04829
  • repo_url: https://github.com/akola-mbey-denis/efficientensemble
  • paper_authors: Denis Mbey Akola, Gianni Franchi
  • for: The paper presents a new way to train two-stage object detection ensembles, specifically Faster R-CNN models, to estimate uncertainty.
  • methods: The authors propose training a single Region Proposal Network (RPN) shared by multiple Fast R-CNN prediction heads to build a robust deep ensemble for uncertainty estimation in object detection.
  • results: Experiments show this approach is much faster than the naive method of fully training all $n$ ensemble members. Uncertainty is quantified with the ensemble model's Expected Calibration Error (ECE; a minimal ECE sketch follows the abstract), and the model is compared against Gaussian YOLOv3.
    Abstract This paper presents a new approach for training two-stage object detection ensemble models, more specifically, Faster R-CNN models to estimate uncertainty. We propose that training one Region Proposal Network (RPN)~\cite{https://doi.org/10.48550/arxiv.1506.01497} and multiple Fast R-CNN prediction heads is all you need to build a robust deep ensemble network for estimating uncertainty in object detection. We present this approach and provide experiments to show that this approach is much faster than the naive method of fully training all $n$ models in an ensemble. We also estimate the uncertainty by measuring this ensemble model's Expected Calibration Error (ECE). We then further compare the performance of this model with that of Gaussian YOLOv3, a variant of YOLOv3 that models uncertainty using predicted bounding box coordinates. The source code is released at \url{https://github.com/Akola-Mbey-Denis/EfficientEnsemble}
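
Expected Calibration Error, used above to quantify the ensemble's uncertainty, bins predictions by confidence and averages the gap between per-bin confidence and accuracy. The sketch below assumes plain per-prediction confidences and correctness flags; how detections are matched to ground truth before computing ECE is the paper's choice and is not shown.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_b (|B_b|/N) * |acc(B_b) - conf(B_b)| over confidence bins.

    confidences: (N,) predicted confidence of the chosen class, in [0, 1]
    correct:     (N,) 1 if the prediction was correct, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()     # accuracy inside the bin
        conf = confidences[mask].mean()  # average confidence inside the bin
        ece += mask.mean() * abs(acc - conf)
    return ece

conf = np.array([0.95, 0.9, 0.8, 0.55, 0.6, 0.99])
hit = np.array([1, 1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hit, n_bins=5))
```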

Comparative study of multi-person tracking methods

  • paper_url: http://arxiv.org/abs/2310.04825
  • repo_url: None
  • paper_authors: Denis Mbey Akola
  • for: The paper studies two tracking algorithms, SORT and Tracktor++, which were ranked in the first positions on the MOT Challenge leaderboard (https://motchallenge.net), to uncover the techniques they use and provide insights that could improve MOT tracking pipelines.
  • methods: The popular tracking-by-detection approach is adopted: a pedestrian detection model is trained on the MOT17Det dataset (https://motchallenge.net/data/MOT17Det/), and a re-identification model trained on the MOT17 dataset (https://motchallenge.net/data/MOT17/) is used with Tracktor++ to reduce false re-identification alarms.
  • results: Experimental results show that Tracktor++ is a better multi-person tracking algorithm than SORT. Ablation studies quantify the contributions of the re-identification (RE-ID) network and of motion to Tracktor++'s results, and recommendations for future research are provided.
    Abstract This paper presents a study of two tracking algorithms (SORT~\cite{7533003} and Tracktor++~\cite{2019}) that were ranked in the first positions on the MOT Challenge leaderboard (The MOTChallenge web page: https://motchallenge.net ). The purpose of this study is to discover the techniques used and to provide useful insights about these algorithms in the tracking pipeline that could improve the performance of MOT tracking algorithms. To this end, we adopted the popular tracking-by-detection approach. We trained our own Pedestrian Detection model using the MOT17Det dataset (MOT17Det : https://motchallenge.net/data/MOT17Det/ ). We also used a re-identification model trained on the MOT17 dataset (MOT17 : https://motchallenge.net/data/MOT17/ ) for Tracktor++ to reduce the false re-identification alarms. We then present experimental results which show that Tracktor++ is a better multi-person tracking algorithm than SORT. We also performed ablation studies to discover the contribution of the re-identification (RE-ID) network and motion to the results of Tracktor++. We finally conclude by providing some recommendations for future research.

Combining UPerNet and ConvNeXt for Contrails Identification to reduce Global Warming

  • paper_url: http://arxiv.org/abs/2310.04808
  • repo_url: https://github.com/biluko/2023gric
  • paper_authors: Zhenkuan Wang
  • for: This study focuses on aircraft contrail detection in global satellite images to improve contrail models and mitigate their impact on climate change.
  • methods: An innovative data preprocessing technique for NOAA GOES-16 satellite images is developed, using brightness temperature data from the infrared channel to create false-color images, enhancing model perception. The model is based on the UPerNet architecture, implemented with the MMsegmentation library and two ConvNeXt configurations for improved performance; class imbalance is handled with positive-class-weighted cross-entropy (a minimal weighting sketch follows the abstract).
  • results: The approach achieves exceptional results, boasting a high Dice coefficient score and placing in the top 5% of participating teams.
    Abstract Semantic segmentation is a critical tool in computer vision, applied in various domains like autonomous driving and medical imaging. This study focuses on aircraft contrail detection in global satellite images to improve contrail models and mitigate their impact on climate change. An innovative data preprocessing technique for NOAA GOES-16 satellite images is developed, using brightness temperature data from the infrared channel to create false-color images, enhancing model perception. To tackle class imbalance, the training dataset exclusively includes images with positive contrail labels. The model selection is based on the UPerNet architecture, implemented using the MMsegmentation library, with the integration of two ConvNeXt configurations for improved performance. Cross-entropy loss with positive class weights enhances contrail recognition. Fine-tuning employs the AdamW optimizer with a learning rate of $2.5 \times 10^{-4}$. During inference, a multi-model prediction fusion strategy and a contrail determination threshold of 0.75 yield a binary prediction mask. RLE encoding is used for efficient prediction result organization. The approach achieves exceptional results, boasting a high Dice coefficient score, placing it in the top 5\% of participating teams. This underscores the innovative nature of the segmentation model and its potential for enhanced contrail recognition in satellite imagery. For further exploration, the code and models are available on GitHub: \url{https://github.com/biluko/2023GRIC.git}.
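
The class-imbalance handling mentioned above, cross-entropy with positive class weights, together with the Dice coefficient used for scoring and the 0.75 contrail threshold, can be written in a few lines of PyTorch. The weight value below is a placeholder assumption; the competition pipeline uses its own settings.

```python
import torch
import torch.nn as nn

# Cross-entropy with a higher weight on the (rare) contrail class. The weight
# value 4.0 is only a placeholder; the actual pipeline tunes its own.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 4.0]))

logits = torch.randn(2, 2, 256, 256)            # (batch, classes, H, W)
target = torch.randint(0, 2, (2, 256, 256))     # 0 = background, 1 = contrail
loss = criterion(logits, target)

def dice_coefficient(pred_mask, true_mask, eps=1e-6):
    """Dice = 2|A∩B| / (|A| + |B|) for binary masks."""
    inter = (pred_mask * true_mask).sum()
    return (2 * inter + eps) / (pred_mask.sum() + true_mask.sum() + eps)

pred = (logits.softmax(dim=1)[:, 1] > 0.75).float()   # 0.75 contrail threshold
print(loss.item(), dice_coefficient(pred, target.float()).item())
```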

Fully Sparse Long Range 3D Object Detection Using Range Experts and Multimodal Virtual Points

  • paper_url: http://arxiv.org/abs/2310.04800
  • repo_url: None
  • paper_authors: Ajinkya Khoche, Laura Pereira Sánchez, Nazre Batool, Sina Sharif Mansouri, Patric Jensfelt
  • for: Improve the safety and efficiency of self-driving vehicles by accurately perceiving and reacting to objects, obstacles, and potential hazards at long range.
  • methods: The method combines two LiDAR-based 3D detection networks, one specializing in near-to-mid-range objects and one in long-range detection, and uses Multimodal Virtual Points (MVP), an image-based depth completion algorithm, to enrich the sparse LiDAR data with virtual points. To train under scarce long-range labels, the loss is weighted by each labelled object's distance from the ego vehicle (a minimal weighting sketch follows the abstract).
  • results: The combined model, RangeFSD, achieves state-of-the-art performance on the Argoverse2 (AV2) dataset, with improvements at long range.
    Abstract 3D object detection at long-range is crucial for ensuring the safety and efficiency of self-driving cars, allowing them to accurately perceive and react to objects, obstacles, and potential hazards from a distance. But most current state-of-the-art LiDAR based methods are limited by the sparsity of range sensors, which generates a form of domain gap between points closer to and farther away from the ego vehicle. Another related problem is the label imbalance for faraway objects, which inhibits the performance of Deep Neural Networks at long-range. Although image features could be beneficial for long-range detections, and some recently proposed multimodal methods incorporate image features, they do not scale well computationally at long ranges or are limited by depth estimation accuracy. To address the above limitations, we propose to combine two LiDAR based 3D detection networks, one specializing at near to mid-range objects, and one at long-range 3D detection. To train a detector at long range under a scarce label regime, we further propose to weigh the loss according to the labelled objects' distance from ego vehicle. To mitigate the LiDAR sparsity issue, we leverage Multimodal Virtual Points (MVP), an image based depth completion algorithm, to enrich our data with virtual points. Our method, combining two range experts trained with MVP, which we refer to as RangeFSD, achieves state-of-the-art performance on the Argoverse2 (AV2) dataset, with improvements at long range. The code will be released soon.
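
The distance-based loss weighting mentioned above, which counteracts the scarcity of faraway labels, can be sketched as scaling each labelled object's loss by a monotone function of its range. The weighting function below (a linear ramp capped at 4x, with a 50 m reference range) is an assumption for illustration, not the paper's exact scheme.

```python
import torch

def distance_weights(gt_centers, ref=50.0, max_w=4.0):
    """Weight each labelled object's loss by its distance from the ego vehicle.

    gt_centers: (N, 3) box centers in the ego frame (x, y, z), in metres.
    Returns weights in [1, max_w]; the linear ramp and 50 m reference range
    are illustrative choices only.
    """
    dist = torch.linalg.norm(gt_centers[:, :2], dim=1)   # planar range
    return torch.clamp(1.0 + dist / ref, max=max_w)

centers = torch.tensor([[10.0, 0.0, 0.5], [80.0, 30.0, 0.4], [150.0, -5.0, 0.6]])
per_object_loss = torch.tensor([0.8, 1.1, 1.3])          # e.g. box regression losses
weighted = (distance_weights(centers) * per_object_loss).mean()
print(distance_weights(centers), weighted.item())
```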

HI-SLAM: Monocular Real-time Dense Mapping with Hybrid Implicit Fields

  • paper_url: http://arxiv.org/abs/2310.04787
  • repo_url: None
  • paper_authors: Wei Zhang, Tiecheng Sun, Sen Wang, Qing Cheng, Norbert Haala
  • for: The paper presents a neural-field-based real-time monocular mapping framework for accurate and dense simultaneous localization and mapping (SLAM).
  • methods: Dense SLAM with parallel tracking and global optimization is integrated with neural implicit fields: a map is built incrementally from the latest SLAM estimates using multi-resolution grid encoding and a signed distance function (SDF) representation, and is kept globally consistent through online loop closing with an efficient Sim(3)-based pose graph bundle adjustment (PGBA). Learned monocular depth priors are incorporated, with a joint depth and scale adjustment (JDSA) module resolving their scale ambiguity.
  • results: Across synthetic and real-world datasets, the approach outperforms existing methods in accuracy and map completeness while preserving real-time performance.
    Abstract In this letter, we present a neural field-based real-time monocular mapping framework for accurate and dense Simultaneous Localization and Mapping (SLAM). Recent neural mapping frameworks show promising results, but rely on RGB-D or pose inputs, or cannot run in real-time. To address these limitations, our approach integrates dense-SLAM with neural implicit fields. Specifically, our dense SLAM approach runs parallel tracking and global optimization, while a neural field-based map is constructed incrementally based on the latest SLAM estimates. For the efficient construction of neural fields, we employ multi-resolution grid encoding and signed distance function (SDF) representation. This allows us to keep the map always up-to-date and adapt instantly to global updates via loop closing. For global consistency, we propose an efficient Sim(3)-based pose graph bundle adjustment (PGBA) approach to run online loop closing and mitigate the pose and scale drift. To enhance depth accuracy further, we incorporate learned monocular depth priors. We propose a novel joint depth and scale adjustment (JDSA) module to solve the scale ambiguity inherent in depth priors. Extensive evaluations across synthetic and real-world datasets validate that our approach outperforms existing methods in accuracy and map completeness while preserving real-time performance.

IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers

  • paper_url: http://arxiv.org/abs/2310.04780
  • repo_url: https://github.com/hzlsaber/IPMix
  • paper_authors: Zhenglin Huang, Xianan Bao, Na Zhang, Qingqi Zhang, Xiaomei Tu, Biao Wu, Xi Yang
  • for: Improve the trade-off between robustness and clean accuracy when training convolutional neural network classifiers.
  • methods: IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent, label-preserving technique that increases training-data diversity with limited computational overhead; it introduces structural complexity at different levels to generate more diverse images and adopts random mixing for multi-scale information fusion (a schematic mixing sketch follows the abstract).
  • results: IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C, and significantly improves other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
    Abstract Data augmentation has been proven effective for training high-accuracy convolutional neural network classifiers by preventing overfitting. However, building deep neural networks in real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift. While prior methods have proposed that there is a trade-off between accuracy and robustness, we propose IPMix, a simple data augmentation approach to improve robustness without hurting clean accuracy. IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent and label-preserving technique to increase the diversity of training data with limited computational overhead. To further improve the robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts the random mixing method for multi-scale information fusion. Experiments demonstrate that IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also significantly improves the other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
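
The three augmentation levels can be pictured as blending a whole augmented copy (image level), pasting an augmented patch (patch level), and mixing individual pixels through a random mask (pixel level), all while keeping the original label. The numpy sketch below is schematic: the augmentation chain, mixing ratios, and sources of structural complexity used by IPMix are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Stand-in for a label-preserving augmentation (e.g. a color/geometric op)."""
    return np.clip(img * rng.uniform(0.7, 1.3) + rng.uniform(-0.1, 0.1), 0, 1)

def ipmix_like(img, alpha=0.4):
    """Mix an image with augmented copies of itself at image, patch and pixel
    level; the label stays unchanged because no foreign image is introduced."""
    h, w, _ = img.shape
    out = img.copy()
    # image level: convex blend with an augmented copy
    lam = rng.beta(alpha, alpha)
    out = lam * out + (1 - lam) * augment(img)
    # patch level: paste an augmented rectangular patch
    ph, pw = rng.integers(h // 8, h // 2), rng.integers(w // 8, w // 2)
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out[y:y + ph, x:x + pw] = augment(img)[y:y + ph, x:x + pw]
    # pixel level: per-pixel random mask
    mask = rng.random((h, w, 1)) < 0.1
    out = np.where(mask, augment(img), out)
    return out.astype(np.float32)

img = rng.random((32, 32, 3)).astype(np.float32)
mixed = ipmix_like(img)          # the label of `img` is kept as-is
print(mixed.shape, mixed.min() >= 0, mixed.max() <= 1)
```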

TransCC: Transformer Network for Coronary Artery CCTA Segmentation

  • paper_url: http://arxiv.org/abs/2310.04779
  • repo_url: None
  • paper_authors: Chenchu Xu, Meng Li, Xue Wu
  • for: Improve the accuracy of coronary artery segmentation in coronary computed tomography angiography (CCTA) images for early detection and treatment of coronary heart disease (CHD).
  • methods: TransCC combines a Transformer with convolutional neural networks to address two challenges of coronary segmentation: the damage to local target structures caused by fixed-size patch embedding, and the need for both global and local features. A Feature Interaction Extraction (FIE) module captures patch characteristics without losing semantic information, and a Multilayer Enhanced Perceptron (MEP) strengthens attention to local spatial information as a complement to self-attention.
  • results: TransCC outperforms existing methods with an average Dice coefficient of 0.730 and an average Intersection over Union (IoU) of 0.582, demonstrating its effectiveness for CCTA segmentation.
    Abstract The accurate segmentation of Coronary Computed Tomography Angiography (CCTA) images holds substantial clinical value for the early detection and treatment of Coronary Heart Disease (CHD). The Transformer, utilizing a self-attention mechanism, has demonstrated commendable performance in the realm of medical image processing. However, challenges persist in coronary segmentation tasks due to (1) the damage to target local structures caused by fixed-size image patch embedding, and (2) the critical role of both global and local features in medical image segmentation tasks. To address these challenges, we propose a deep learning framework, TransCC, that effectively amalgamates the Transformer and convolutional neural networks for CCTA segmentation. Firstly, we introduce a Feature Interaction Extraction (FIE) module designed to capture the characteristics of image patches, thereby circumventing the loss of semantic information inherent in the original method. Secondly, we devise a Multilayer Enhanced Perceptron (MEP) to augment attention to local information within spatial dimensions, serving as a complement to the self-attention mechanism. Experimental results indicate that TransCC outperforms existing methods in segmentation performance, boasting an average Dice coefficient of 0.730 and an average Intersection over Union (IoU) of 0.582. These results underscore the effectiveness of TransCC in CCTA image segmentation.

1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report:A Concise Pipeline for Egocentric Hand Pose Reconstruction

  • paper_url: http://arxiv.org/abs/2310.04769
  • repo_url: None
  • paper_authors: Zhishan Zhou, Zhi Lv, Shihao Zhou, Minqiang Zou, Tong Wu, Mochen Yu, Yao Tang, Jiajun Liang
  • for: Egocentric 3D Hand Pose Estimation challenge
  • methods: ViT backbones and simple regressor for 3D keypoints prediction, non-model method for merging multi-view results
  • results: 12.21mm MPJPE on test dataset, first place in challenge
    Abstract This report introduces our work for the Egocentric 3D Hand Pose Estimation workshop. Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image. In the competition, we adopt ViT-based backbones and a simple regressor for 3D keypoints prediction, which provides strong model baselines. We noticed that hand-object occlusions and self-occlusions lead to performance degradation, and thus proposed a non-model method to merge multi-view results in the post-process stage. Moreover, we utilized test-time augmentation and model ensembling to make further improvements. We also found that public datasets and a rational preprocess are beneficial. Our method achieved 12.21mm MPJPE on the test dataset, taking first place in the Egocentric 3D Hand Pose Estimation challenge.

CAD Models to Real-World Images: A Practical Approach to Unsupervised Domain Adaptation in Industrial Object Classification

  • paper_url: http://arxiv.org/abs/2310.04757
  • repo_url: https://github.com/dritter-bht/synthnet-transfer-learning
  • paper_authors: Dennis Ritter, Mike Hemberger, Marc Hönig, Volker Stopp, Erik Rodner, Kristian Hildebrand
  • for: The paper systematically analyzes unsupervised domain adaptation pipelines for object classification in a challenging industrial setting, where only category-labeled CAD models are available but classification must be performed on real-world images.
  • methods: A domain adaptation pipeline whose most important design choices for this CAD-to-real setting are identified and evaluated.
  • results: The pipeline achieves state-of-the-art performance on the VisDA benchmark and drastically improves recognition on a new open industrial dataset of 102 mechanical parts; the paper concludes with guidelines for practitioners applying state-of-the-art unsupervised domain adaptation in practice.
    Abstract In this paper, we systematically analyze unsupervised domain adaptation pipelines for object classification in a challenging industrial setting. In contrast to standard natural object benchmarks existing in the field, our results highlight the most important design choices when only category-labeled CAD models are available but classification needs to be done with real-world images. Our domain adaptation pipeline achieves SoTA performance on the VisDA benchmark, but more importantly, drastically improves recognition performance on our new open industrial dataset comprised of 102 mechanical parts. We conclude with a set of guidelines that are relevant for practitioners needing to apply state-of-the-art unsupervised domain adaptation in practice. Our code is available at https://github.com/dritter-bht/synthnet-transfer-learning.

Balancing stability and plasticity in continual learning: the readout-decomposition of activation change (RDAC) framework

  • paper_url: http://arxiv.org/abs/2310.04741
  • repo_url: None
  • paper_authors: Daniel Anthes, Sushrut Thorat, Peter König, Tim C. Kietzmann
  • for: The work aims to dissect the stability-plasticity trade-off in continual learning (CL) and its relation to catastrophic forgetting, offering insights into the behavior of CL algorithms.
  • methods: The Readout-Decomposition of Activation Change (RDAC) framework relates learning-induced activation changes in the range of prior readouts to stability and changes in the null space to plasticity (a minimal projection sketch follows the abstract). It is used to analyze the regularization algorithms Synaptic Intelligence (SI), Elastic Weight Consolidation (EWC), and Learning without Forgetting (LwF), as well as the replay-based algorithms Gradient Episodic Memory (GEM) and data replay, on split-CIFAR-110 with deep non-linear networks. For one-hidden-layer linear networks, a gradient decomposition algorithm is derived that restricts activation change only in the range of the prior readouts, maintaining high stability without further sacrificing plasticity.
  • results: GEM and data replay preserved both stability and plasticity, whereas SI, EWC, and LwF traded plasticity for stability, which was linked to their restricting activation change in the null space of the prior readout. The derived gradient decomposition algorithm maintained stability without significant plasticity loss.
    Abstract Continual learning (CL) algorithms strive to acquire new knowledge while preserving prior information. However, this stability-plasticity trade-off remains a central challenge. This paper introduces a framework that dissects this trade-off, offering valuable insights into CL algorithms. The Readout-Decomposition of Activation Change (RDAC) framework first addresses the stability-plasticity dilemma and its relation to catastrophic forgetting. It relates learning-induced activation changes in the range of prior readouts to the degree of stability and changes in the null space to the degree of plasticity. In deep non-linear networks tackling split-CIFAR-110 tasks, the framework clarifies the stability-plasticity trade-offs of the popular regularization algorithms Synaptic intelligence (SI), Elastic-weight consolidation (EWC), and learning without Forgetting (LwF), and replay-based algorithms Gradient episodic memory (GEM), and data replay. GEM and data replay preserved stability and plasticity, while SI, EWC, and LwF traded off plasticity for stability. The inability of the regularization algorithms to maintain plasticity was linked to them restricting the change of activations in the null space of the prior readout. Additionally, for one-hidden-layer linear neural networks, we derived a gradient decomposition algorithm to restrict activation change only in the range of the prior readouts, to maintain high stability while not further sacrificing plasticity. Results demonstrate that the algorithm maintained stability without significant plasticity loss. The RDAC framework informs the behavior of existing CL algorithms and paves the way for novel CL approaches. Finally, it sheds light on the connection between learning-induced activation/representation changes and the stability-plasticity dilemma, also offering insights into representational drift in biological systems.
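
The readout decomposition itself is a projection: with a linear readout W (outputs = W h), an activation change Δh splits into a component in the row space of W, which alters the prior readouts and therefore bears on stability, and a component in the null space of W, which the prior readouts cannot see and which reflects plasticity. The numpy sketch below illustrates only this split; the paper's gradient decomposition algorithm and its analysis are not reproduced.

```python
import numpy as np

def readout_decomposition(W, delta_h):
    """Split an activation change into range- and null-space parts of readout W.

    W:       (n_out, n_hidden) prior readout weights (outputs = W @ h)
    delta_h: (n_hidden,) learning-induced activation change
    """
    # Orthogonal projector onto the row space of W.
    P = W.T @ np.linalg.pinv(W @ W.T) @ W
    in_range = P @ delta_h        # alters prior readouts; suppressing it preserves stability
    in_null = delta_h - in_range  # invisible to prior readouts; free capacity (plasticity)
    return in_range, in_null

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 100))    # 10 prior readout units, 100 hidden units
delta_h = rng.standard_normal(100)
r, n = readout_decomposition(W, delta_h)

print(np.allclose(W @ n, 0))          # null-space part leaves prior readouts unchanged
print(np.allclose(r + n, delta_h))    # the two parts reconstruct the change
```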

Activate and Reject: Towards Safe Domain Generalization under Category Shift

  • paper_url: http://arxiv.org/abs/2310.04724
  • repo_url: None
  • paper_authors: Chaoqi Chen, Luyao Tang, Leitian Tao, Hong-Yu Zhou, Yue Huang, Xiaoguang Han, Yizhou Yu
  • for: The work addresses Domain Generalization under Category Shift (DGCS): simultaneously detecting unknown-class samples and classifying known-class samples in target domains, so that deep networks remain reliable in the open world where novel domains and object classes occur.
  • methods: The Activate and Reject (ART) framework reshapes the decision boundary during training by optimizing the unknown-class probability and smoothing the overall output to mitigate overconfidence; at test time, a step-wise online adaptation method predicts labels using cross-domain nearest neighbors and class prototypes, without updating network parameters or relying on threshold-based mechanisms.
  • results: ART consistently improves the generalization capability of deep networks across vision tasks: for image classification it improves the H-score by 6.1% on average over the previous best method, and for object detection and semantic segmentation it establishes new benchmarks with competitive performance.
    Abstract Albeit the notable performance on in-domain test points, it is non-trivial for deep neural networks to attain satisfactory accuracy when deploying in the open world, where novel domains and object classes often occur. In this paper, we study a practical problem of Domain Generalization under Category Shift (DGCS), which aims to simultaneously detect unknown-class samples and classify known-class samples in the target domains. Compared to prior DG works, we face two new challenges: 1) how to learn the concept of ``unknown'' during training with only source known-class samples, and 2) how to adapt the source-trained model to unseen environments for safe model deployment. To this end, we propose a novel Activate and Reject (ART) framework to reshape the model's decision boundary to accommodate unknown classes and conduct post hoc modification to further discriminate known and unknown classes using unlabeled test data. Specifically, during training, we promote the response to the unknown by optimizing the unknown probability and then smoothing the overall output to mitigate the overconfidence issue. At test time, we introduce a step-wise online adaptation method that predicts the label by virtue of the cross-domain nearest neighbor and class prototype information without updating the network's parameters or using threshold-based mechanisms. Experiments reveal that ART consistently improves the generalization capability of deep networks on different vision tasks. For image classification, ART improves the H-score by 6.1% on average compared to the previous best method. For object detection and semantic segmentation, we establish new benchmarks and achieve competitive performance.

Memory-Constrained Semantic Segmentation for Ultra-High Resolution UAV Imagery

  • paper_url: http://arxiv.org/abs/2310.04721
  • repo_url: None
  • paper_authors: Qi Li, Jiaxin Cai, Yuanlong Yu, Jason Gu, Jia Pan, Wenxi Liu
  • for: Efficient and effective semantic segmentation of ultra-high resolution UAV imagery on computational devices with strict GPU memory limits, where simply downscaling the images tends to lose small, thin, and curvilinear regions.
  • methods: A GPU-memory-efficient framework performs local inference without accessing context beyond local patches: a novel spatial-guided high-resolution query module predicts pixel-wise segmentation by querying nearest latent embeddings under the guidance of high-resolution information, and an efficient memory-based interaction scheme corrects potential semantic bias by associating cross-image contextual semantics.
  • results: Comprehensive experiments on public benchmarks show superior performance under both small and large GPU-memory usage limits.
    Abstract Amidst the swift advancements in photography and sensor technologies, high-definition cameras have become commonplace in the deployment of Unmanned Aerial Vehicles (UAVs) for diverse operational purposes. Within the domain of UAV imagery analysis, the segmentation of ultra-high resolution images emerges as a substantial and intricate challenge, especially when grappling with the constraints imposed by GPU memory-restricted computational devices. This paper delves into the intricate problem of achieving efficient and effective segmentation of ultra-high resolution UAV imagery, while operating under stringent GPU memory limitation. The strategy of existing approaches is to downscale the images to achieve computationally efficient segmentation. However, this strategy tends to overlook smaller, thinner, and curvilinear regions. To address this problem, we propose a GPU memory-efficient and effective framework for local inference without accessing the context beyond local patches. In particular, we introduce a novel spatial-guided high-resolution query module, which predicts pixel-wise segmentation results with high quality only by querying nearest latent embeddings with the guidance of high-resolution information. Additionally, we present an efficient memory-based interaction scheme to correct potential semantic bias of the underlying high-resolution information by associating cross-image contextual semantics. For evaluation of our approach, we perform comprehensive experiments over public benchmarks and achieve superior performance under both conditions of small and large GPU memory usage limitations. We will release the model and codes in the future.

A Comprehensive Survey on Deep Neural Image Deblurring

  • paper_url: http://arxiv.org/abs/2310.04719
  • repo_url: None
  • paper_authors: Sajjad Amrollahi Biyouki, Hoon Hwangbo
  • for: Image deblurring, i.e., removing the degradations that cause blurriness to improve image quality for better texture and object visualization.
  • methods: The survey reviews recent deep neural architectures for both blind and non-blind image deblurring, outlining the most popular network structures, their strengths and novelties, common performance metrics, and widely used datasets.
  • results: Deep neural networks have brought a major breakthrough to image deblurring over traditional prior-based optimization; the survey also discusses current challenges and research gaps and suggests potential directions for future work.
    Abstract Image deblurring tries to eliminate degradation elements of an image causing blurriness and improve the quality of an image for better texture and object visualization. Traditionally, prior-based optimization approaches predominated in image deblurring, but deep neural networks recently brought a major breakthrough in the field. In this paper, we comprehensively review the recent progress of the deep neural architectures in both blind and non-blind image deblurring. We outline the most popular deep neural network structures used in deblurring applications, describe their strengths and novelties, summarize performance metrics, and introduce broadly used datasets. In addition, we discuss the current challenges and research gaps in this domain and suggest potential research directions for future works.

Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API

  • paper_url: http://arxiv.org/abs/2310.04716
  • repo_url: None
  • paper_authors: Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu
  • for: Automating diverse AI tasks by connecting Large Language Models (LLMs), acting as dispatchers, to domain-specific models or APIs, here targeting a generic executor for UI task automation.
  • methods: A multimodal model grounds natural language instructions in given UI screenshots, using a visual encoder and a language decoder that predicts geometric coordinates as a token sequence (a minimal coordinate-tokenization sketch follows the abstract), trained with an innovative Reinforcement Learning (RL) based algorithm.
  • results: The model outperforms the state-of-the-art methods by a clear margin, showing its potential as a generic UI task automation API.
    Abstract Recent popularity of Large Language Models (LLMs) has opened countless possibilities in automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast numbers of domain-specific models/APIs, they still struggle to comprehensively cover super diverse automation demands in the interaction between human and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm to predict geometric coordinates in a sequence of tokens using a language decoder. We further propose an innovative Reinforcement Learning (RL) based algorithm to supervise the tokens in such sequence jointly with visually semantic metrics, which effectively strengthens the spatial decoding capability of the pixel-to-sequence paradigm. Extensive experiments demonstrate our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin and shows the potential as a generic UI task automation API.
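
In the pixel-to-sequence paradigm referred to above, the decoder emits geometric coordinates as discrete tokens. One minimal way to picture this is to quantize normalized coordinates into a fixed number of bins and map each bin to a token id, as sketched below; the bin count, token layout, and decoding rule are assumptions, not the paper's vocabulary.

```python
NUM_BINS = 1000   # assumed size of the coordinate vocabulary

def coords_to_tokens(box, width, height, num_bins=NUM_BINS):
    """Quantize a box (x1, y1, x2, y2) in pixels into coordinate token ids."""
    x1, y1, x2, y2 = box
    norm = [x1 / width, y1 / height, x2 / width, y2 / height]
    return [min(int(v * num_bins), num_bins - 1) for v in norm]

def tokens_to_coords(tokens, width, height, num_bins=NUM_BINS):
    """Map token ids back to pixel coordinates (bin centers)."""
    scale = [width, height, width, height]
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scale)]

box = (120, 36, 480, 90)                    # a UI element on a 1280x720 screenshot
tokens = coords_to_tokens(box, 1280, 720)
print(tokens)                               # [93, 50, 375, 125]
print(tokens_to_coords(tokens, 1280, 720))  # roughly recovers the box
```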

Generalized Robust Test-Time Adaptation in Continuous Dynamic Scenarios

  • paper_url: http://arxiv.org/abs/2310.04714
  • repo_url: https://github.com/bit-da/rotta
  • paper_authors: Shuang Li, Longhui Yuan, Binhui Xie, Tao Yang
  • for: The study addresses Practical Test-Time Adaptation (PTTA), where continual covariate shift and continual label shift occur simultaneously in the test data stream, i.e., data and label distributions change concurrently and continually over time.
  • methods: The proposed Generalized Robust Test-Time Adaptation (GRoTTA) method combines Robust Parameter Adaptation (learning from a uniform label distribution and recalibrating batch normalization for stability), source-knowledge regularization with a teacher-student model to alleviate continual covariate shift, and Bias-Guided Output Adaptation, which exploits latent structure in the feature space to refine predictions under imbalanced label distributions.
  • results: Experiments show that GRoTTA outperforms existing competitors by a large margin under the PTTA setting, making it well suited for real-world deployment.
    Abstract Test-time adaptation (TTA) adapts the pre-trained models to test distributions during the inference phase exclusively employing unlabeled test data streams, which holds great value for the deployment of models in real-world applications. Numerous studies have achieved promising performance on simplistic test streams, characterized by independently and uniformly sampled test data originating from a fixed target data distribution. However, these methods frequently prove ineffective in practical scenarios, where both continual covariate shift and continual label shift occur simultaneously, i.e., data and label distributions change concurrently and continually over time. In this study, a more challenging Practical Test-Time Adaptation (PTTA) setup is introduced, which takes into account the concurrent presence of continual covariate shift and continual label shift, and we propose a Generalized Robust Test-Time Adaptation (GRoTTA) method to effectively address the difficult problem. We start by steadily adapting the model through Robust Parameter Adaptation to make balanced predictions for test samples. To be specific, firstly, the effects of continual label shift are eliminated by enforcing the model to learn from a uniform label distribution and introducing recalibration of batch normalization to ensure stability. Secondly, the continual covariate shift is alleviated by employing a source knowledge regularization with the teacher-student model to update parameters. Considering the potential information in the test stream, we further refine the balanced predictions by Bias-Guided Output Adaptation, which exploits latent structure in the feature space and is adaptive to the imbalanced label distribution. Extensive experiments demonstrate GRoTTA outperforms the existing competitors by a large margin under PTTA setting, rendering it highly conducive for adoption in real-world applications.

UFD-PRiME: Unsupervised Joint Learning of Optical Flow and Stereo Depth through Pixel-Level Rigid Motion Estimation

  • paper_url: http://arxiv.org/abs/2310.04712
  • repo_url: None
  • paper_authors: Shuai Yuan, Carlo Tomasi
  • for: Proposes joint training of optical flow and stereo disparity models to improve the accuracy and level of detail of the estimated optical flow.
  • methods: A two-network architecture: the first network estimates flow and disparity jointly; the second uses the first network's optical flow as pseudo-labels and its disparities to estimate pixel-level 3D rigid motion and reconstruct optical flow; a final stage fuses the outputs of the two networks.
  • results: Achieves 7.36% optical flow error on the KITTI-2015 benchmark, a large improvement over the previous state-of-the-art error of 9.38%, with slightly better or comparable stereo depth results.
    Abstract Both optical flow and stereo disparities are image matches and can therefore benefit from joint training. Depth and 3D motion provide geometric rather than photometric information and can further improve optical flow. Accordingly, we design a first network that estimates flow and disparity jointly and is trained without supervision. A second network, trained with optical flow from the first as pseudo-labels, takes disparities from the first network, estimates 3D rigid motion at every pixel, and reconstructs optical flow again. A final stage fuses the outputs from the two networks. In contrast with previous methods that only consider camera motion, our method also estimates the rigid motions of dynamic objects, which are of key interest in applications. This leads to better optical flow with visibly more detailed occlusions and object boundaries as a result. Our unsupervised pipeline achieves 7.36% optical flow error on the KITTI-2015 benchmark and outperforms the previous state-of-the-art 9.38% by a wide margin. It also achieves slightly better or comparable stereo depth results. Code will be made available.
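The following sketch illustrates the core geometric step of reconstructing optical flow from a per-pixel rigid motion, as described in the abstract: back-project with depth and intrinsics, apply (R, t), and re-project. Variable names and the toy calibration values are assumptions.

```python
import numpy as np

# Rough sketch of one geometric step described above: given per-pixel depth, camera
# intrinsics K, and a rigid motion (R, t) assigned to a pixel, optical flow can be
# reconstructed by back-projecting the pixel to 3D, moving it rigidly, and
# re-projecting. The single-motion simplification is an assumption for illustration.
def flow_from_rigid_motion(u, v, depth, K, R, t):
    K_inv = np.linalg.inv(K)
    p3d = depth * (K_inv @ np.array([u, v, 1.0]))      # back-project to camera space
    p3d_moved = R @ p3d + t                            # apply the rigid motion
    proj = K @ p3d_moved
    u2, v2 = proj[0] / proj[2], proj[1] / proj[2]      # re-project to the image plane
    return u2 - u, v2 - v                              # flow vector for this pixel

if __name__ == "__main__":
    K = np.array([[721.5, 0, 609.5], [0, 721.5, 172.8], [0, 0, 1.0]])  # toy intrinsics
    R, t = np.eye(3), np.array([0.1, 0.0, 0.0])        # small lateral translation
    print(flow_from_rigid_motion(640.0, 180.0, depth=12.0, K=K, R=R, t=t))
```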

Multi-scale MRI reconstruction via dilated ensemble networks

  • paper_url: http://arxiv.org/abs/2310.04705
  • repo_url: None
  • paper_authors: Wendi Ma, Marlon Bran Lorenzana, Wei Dai, Hongfu Sun, Shekhar S. Chandra
  • for: Proposes an efficient multi-scale reconstruction network to improve the quality of MRI reconstruction.
  • methods: Uses dilated convolutions with parallel multi-branch processing and dense residual connections to preserve resolution while increasing scale, within a deep cascade global architecture; a complex-valued variant with complex convolutions exploits phase information.
  • results: The real-valued model outperforms common reconstruction architectures and a state-of-the-art multi-scale network while being three times more efficient; the complex-valued version yields better qualitative results when more phase information is present.
    Abstract As aliasing artefacts are highly structural and non-local, many MRI reconstruction networks use pooling to enlarge filter coverage and incorporate global context. However, this inadvertently impedes fine detail recovery as downsampling creates a resolution bottleneck. Moreover, real and imaginary features are commonly split into separate channels, discarding phase information particularly important to high frequency textures. In this work, we introduce an efficient multi-scale reconstruction network using dilated convolutions to preserve resolution and experiment with a complex-valued version using complex convolutions. Inspired by parallel dilated filters, multiple receptive fields are processed simultaneously with branches that see both large structural artefacts and fine local features. We also adopt dense residual connections for feature aggregation to efficiently increase scale and the deep cascade global architecture to reduce overfitting. The real-valued version of this model outperformed common reconstruction architectures as well as a state-of-the-art multi-scale network whilst being three times more efficient. The complex-valued network yielded better qualitative results when more phase information was present.
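A minimal sketch of the parallel dilated-convolution idea is shown below: branches with different dilation rates process the same full-resolution features and are fused with a residual connection. Channel counts and dilation rates are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the parallel-dilated idea: several branches with different
# dilation rates see different receptive fields at full resolution, and their outputs
# are aggregated with a residual connection. Sizes are illustrative assumptions.
class ParallelDilatedBlock(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [self.act(branch(x)) for branch in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))   # residual aggregation

if __name__ == "__main__":
    y = ParallelDilatedBlock()(torch.randn(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64]) -- resolution preserved
```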

Tree-GPT: Modular Large Language Model Expert System for Forest Remote Sensing Image Understanding and Interactive Analysis

  • paper_url: http://arxiv.org/abs/2310.04698
  • repo_url: None
  • paper_authors: Siqi Du, Shengjun Tang, Weixi Wang, Xiaoming Li, Renzhong Guo
  • for: Aims to improve the efficiency of forestry remote sensing data analysis by integrating Large Language Models (LLMs) into the data workflow.
  • methods: Proposes a modular LLM expert system, Tree-GPT, that integrates an image understanding module (prompt-guided Segment Anything Model for tree segmentation), a domain knowledge base, and toolchains, enabling the LLM to understand images, retrieve accurate knowledge, generate code, and analyze data in a local environment.
  • results: Tested on search, visualization, and machine learning analysis tasks, the prototype performs well, demonstrating the potential for dynamic use of LLMs in forestry research and environmental sciences.
    Abstract This paper introduces a novel framework, Tree-GPT, which incorporates Large Language Models (LLMs) into the forestry remote sensing data workflow, thereby enhancing the efficiency of data analysis. Currently, LLMs are unable to extract or comprehend information from images and may generate inaccurate text due to a lack of domain knowledge, limiting their use in forestry data analysis. To address this issue, we propose a modular LLM expert system, Tree-GPT, that integrates image understanding modules, domain knowledge bases, and toolchains. This empowers LLMs with the ability to comprehend images, acquire accurate knowledge, generate code, and perform data analysis in a local environment. Specifically, the image understanding module extracts structured information from forest remote sensing images by utilizing automatic or interactive generation of prompts to guide the Segment Anything Model (SAM) in generating and selecting optimal tree segmentation results. The system then calculates tree structural parameters based on these results and stores them in a database. Upon receiving a specific natural language instruction, the LLM generates code based on a thought chain to accomplish the analysis task. The code is then executed by an LLM agent in a local environment. For ecological parameter calculations, the system retrieves the corresponding knowledge from the knowledge base and inputs it into the LLM to guide the generation of accurate code. We tested this system on several tasks, including Search, Visualization, and Machine Learning Analysis. The prototype system performed well, demonstrating the potential for dynamic usage of LLMs in forestry research and environmental sciences.

SeeDS: Semantic Separable Diffusion Synthesizer for Zero-shot Food Detection

  • paper_url: http://arxiv.org/abs/2310.04689
  • repo_url: https://github.com/lancezpf/seeds
  • paper_authors: Pengfei Zhou, Weiqing Min, Yang Zhang, Jiajun Song, Ying Jin, Shuqiang Jiang
  • for: zero-shot food detection (ZSFD)
  • methods: semantic separable diffusion synthesizer (SeeDS) framework, including semantic separable synthesizing module (S$^3$M) and region feature denoising diffusion model (RFDDM)
  • results: state-of-the-art ZSFD performance on two food datasets (ZSFooD and UECFOOD-256), and effectiveness on general ZSD datasets (PASCAL VOC and MS COCO)
    Abstract Food detection is becoming a fundamental task in food computing that supports various multimedia applications, including food recommendation and dietary monitoring. To deal with real-world scenarios, food detection needs to localize and recognize novel food objects that are not seen during training, demanding Zero-Shot Detection (ZSD). However, the complexity of semantic attributes and intra-class feature diversity poses challenges for ZSD methods in distinguishing fine-grained food classes. To tackle this, we propose the Semantic Separable Diffusion Synthesizer (SeeDS) framework for Zero-Shot Food Detection (ZSFD). SeeDS consists of two modules: a Semantic Separable Synthesizing Module (S$^3$M) and a Region Feature Denoising Diffusion Model (RFDDM). The S$^3$M learns the disentangled semantic representation for complex food attributes from ingredients and cuisines, and synthesizes discriminative food features via enhanced semantic information. The RFDDM utilizes a novel diffusion model to generate diversified region features and enhances ZSFD via fine-grained synthesized features. Extensive experiments show the state-of-the-art ZSFD performance of our proposed method on two food datasets, ZSFooD and UECFOOD-256. Moreover, SeeDS also maintains effectiveness on general ZSD datasets, PASCAL VOC and MS COCO. The code and dataset can be found at https://github.com/LanceZPF/SeeDS.
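For readers unfamiliar with the diffusion component, the sketch below shows a generic DDPM-style reverse loop of the kind RFDDM would use to synthesize class-conditioned region features; the noise schedule, feature dimension, and `eps_model` interface are assumptions rather than the paper's design.

```python
import torch

# Generic DDPM-style reverse-diffusion loop, included only to illustrate the kind of
# region-feature denoising RFDDM performs; the noise schedule, feature dimension, and
# `eps_model` interface are assumptions, not the paper's actual design.
@torch.no_grad()
def sample_region_features(eps_model, class_embed, feat_dim=1024, steps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 2e-2, steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, feat_dim, device=device)          # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.tensor([t], device=device), class_embed)  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                              # synthesized region feature
```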

PatchProto Networks for Few-shot Visual Anomaly Classification

  • paper_url: http://arxiv.org/abs/2310.04688
  • repo_url: None
  • paper_authors: Jian Wang, Yue Zhuo
  • for: Addresses the practical problem that anomaly samples are extremely scarce, i.e., few-shot learning (FSL) for visual anomaly classification.
  • methods: Proposes PatchProto networks, which extract CNN features only from the defective regions of interest and use them as prototypes for few-shot learning.
  • results: On the MVTec-AD dataset, PatchProto networks significantly improve few-shot anomaly classification accuracy compared with a basic few-shot classifier.
    Abstract The visual anomaly diagnosis can automatically analyze the defective products, which has been widely applied in industrial quality inspection. The anomaly classification can classify the defective products into different categories. However, the anomaly samples are hard to access in practice, which impedes the training of canonical machine learning models. This paper studies a practical issue that anomaly samples for training are extremely scarce, i.e., few-shot learning (FSL). Utilizing the sufficient normal samples, we propose PatchProto networks for few-shot anomaly classification. Different from classical FSL methods, PatchProto networks only extract CNN features of defective regions of interest, which serves as the prototypes for few-shot learning. Compared with basic few-shot classifier, the experiment results on MVTec-AD dataset show PatchProto networks significantly improve the few-shot anomaly classification accuracy.
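The sketch below illustrates the prototype-based classification at the heart of the approach: support features (assumed here to be CNN features pooled over defective regions of interest) are averaged into per-class prototypes and a query is assigned to the nearest one. The feature extractor and ROI pooling are omitted.

```python
import torch
import torch.nn.functional as F

# Minimal prototypical-classification sketch in the spirit of the method above:
# support features (assumed to be CNN features pooled over defective regions of
# interest) are averaged into one prototype per defect class, and a query is assigned
# to the nearest prototype.
def classify_by_prototypes(support_feats, support_labels, query_feat):
    classes = sorted(set(support_labels.tolist()))
    protos = torch.stack([support_feats[support_labels == c].mean(dim=0) for c in classes])
    dists = torch.cdist(query_feat.unsqueeze(0), protos).squeeze(0)   # Euclidean distance
    probs = F.softmax(-dists, dim=0)                                  # closer -> higher score
    return classes[int(probs.argmax())], probs

if __name__ == "__main__":
    feats = torch.randn(10, 256)                 # toy 5-shot support set, 2 defect classes
    labels = torch.tensor([0] * 5 + [1] * 5)
    print(classify_by_prototypes(feats, labels, torch.randn(256)))
```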

High Visual-Fidelity Learned Video Compression

  • paper_url: http://arxiv.org/abs/2310.04679
  • repo_url: None
  • paper_authors: Meng Li, Yibo Shi, Jing Wang, Yunqi Huang
  • for: Improving perceptual quality for video applications; proposes a High Visual-Fidelity learned Video Compression framework (HVFVC).
  • methods: Introduces a confidence-based feature reconstruction method to address poor reconstruction in newly-emerged regions, and a periodic compensation loss to mitigate checkerboard artifacts caused by deconvolution and optimization.
  • results: HVFVC achieves excellent perceptual quality, significantly outperforming the latest VVC standard while requiring only 50% of the bitrate.
    Abstract With the growing demand for video applications, many advanced learned video compression methods have been developed, outperforming traditional methods in terms of objective quality metrics such as PSNR. Existing methods primarily focus on objective quality but tend to overlook perceptual quality. Directly incorporating perceptual loss into a learned video compression framework is nontrivial and raises several perceptual quality issues that need to be addressed. In this paper, we investigated these issues in learned video compression and propose a novel High Visual-Fidelity Learned Video Compression framework (HVFVC). Specifically, we design a novel confidence-based feature reconstruction method to address the issue of poor reconstruction in newly-emerged regions, which significantly improves the visual quality of the reconstruction. Furthermore, we present a periodic compensation loss to mitigate the checkerboard artifacts related to deconvolution operation and optimization. Extensive experiments have shown that the proposed HVFVC achieves excellent perceptual quality, outperforming the latest VVC standard with only 50% required bitrate.
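One possible reading of the confidence-based feature reconstruction is sketched below: a small head predicts a confidence map that gates between temporally warped reference features and newly synthesized features for newly-emerged regions. This architecture is an assumption for illustration only, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Rough sketch of confidence-based feature reconstruction: a small head predicts a
# per-pixel confidence map that gates between temporally warped reference features
# (reliable in well-predicted areas) and newly synthesized features (needed in
# newly-emerged regions). The architecture is an assumption for illustration.
class ConfidenceFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conf_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, warped_feat, synth_feat):
        conf = self.conf_head(torch.cat([warped_feat, synth_feat], dim=1))  # in [0, 1]
        return conf * warped_feat + (1.0 - conf) * synth_feat, conf

if __name__ == "__main__":
    fused, conf = ConfidenceFusion()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
    print(fused.shape, conf.shape)
```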

AG-CRC: Anatomy-Guided Colorectal Cancer Segmentation in CT with Imperfect Anatomical Knowledge

  • paper_url: http://arxiv.org/abs/2310.04677
  • repo_url: https://github.com/rongzhao-zhang/ag-crc
  • paper_authors: Rongzhao Zhang, Zhian Bai, Ruoying Yu, Wenrao Pang, Lingyun Wang, Lifeng Zhu, Xiaofan Zhang, Huan Zhang, Weiguo Hu
  • for: Accurate colorectal cancer (CRC) segmentation from computed tomography (CT) by exploiting multi-organ segmentation (MOS) masks produced by existing deep learning models, together with tailored learning strategies.
  • methods: An Anatomy-Guided segmentation framework that obtains MOS masks, derives a more robust organ-of-interest (OOI) mask, samples training patches with an anatomy-guided heuristic gain function, adds a self-supervised scheme inspired by the topology of tubular organs, and applies a masked loss to focus learning on the essential region.
  • results: On two CRC segmentation datasets the method improves the Dice score by 5% to 9% over state-of-the-art medical image segmentation models, and ablation studies further confirm the effectiveness of each proposed component.
    Abstract When delineating lesions from medical images, a human expert can always keep in mind the anatomical structure behind the voxels. However, although high-quality (though not perfect) anatomical information can be retrieved from computed tomography (CT) scans with modern deep learning algorithms, it is still an open problem how these automatically generated organ masks can assist in addressing challenging lesion segmentation tasks, such as the segmentation of colorectal cancer (CRC). In this paper, we develop a novel Anatomy-Guided segmentation framework to exploit the auto-generated organ masks to aid CRC segmentation from CT, namely AG-CRC. First, we obtain multi-organ segmentation (MOS) masks with existing MOS models (e.g., TotalSegmentor) and further derive a more robust organ of interest (OOI) mask that may cover most of the colon-rectum and CRC voxels. Then, we propose an anatomy-guided training patch sampling strategy by optimizing a heuristic gain function that considers both the proximity of important regions (e.g., the tumor or organs of interest) and sample diversity. Third, we design a novel self-supervised learning scheme inspired by the topology of tubular organs like the colon to boost the model performance further. Finally, we employ a masked loss scheme to guide the model to focus solely on the essential learning region. We extensively evaluate the proposed method on two CRC segmentation datasets, where substantial performance improvement (5% to 9% in Dice) is achieved over current state-of-the-art medical image segmentation models, and the ablation studies further evidence the efficacy of every proposed component.
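The toy sketch below mimics the anatomy-guided patch sampling: candidate patch centers are scored by a heuristic gain that rewards proximity to tumor/OOI voxels and penalizes redundancy with already-selected patches. The gain weights and candidate budget are illustrative assumptions, not the paper's heuristic.

```python
import numpy as np

# Toy sketch of anatomy-guided patch sampling: candidate patch centers are scored by a
# heuristic gain rewarding tumor / organ-of-interest coverage inside the patch and
# penalizing redundancy with already-chosen patches. Weights are illustrative.
def sample_patches(ooi_mask, tumor_mask, patch=64, n_patches=8, w_tumor=2.0, w_ooi=1.0, w_div=0.5):
    rng = np.random.default_rng(0)
    chosen = []
    for _ in range(n_patches):
        best, best_gain = None, -np.inf
        for _ in range(200):                                   # random candidate centers
            z, y, x = [rng.integers(patch // 2, s - patch // 2) for s in ooi_mask.shape]
            sl = tuple(slice(c - patch // 2, c + patch // 2) for c in (z, y, x))
            gain = w_tumor * tumor_mask[sl].mean() + w_ooi * ooi_mask[sl].mean()
            if chosen:                                         # diversity term
                d = min(np.linalg.norm(np.array((z, y, x)) - np.array(c)) for c in chosen)
                gain += w_div * d / patch
            if gain > best_gain:
                best, best_gain = (z, y, x), gain
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    ooi = np.zeros((96, 96, 96))
    ooi[20:80, 20:80, 20:80] = 1.0                             # toy organ-of-interest mask
    tumor = np.zeros_like(ooi)
    tumor[45:55, 45:55, 45:55] = 1.0                           # toy tumor mask
    print(sample_patches(ooi, tumor, patch=32, n_patches=2))
```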

EasyPhoto: Your Smart AI Photo Generator

  • paper_url: http://arxiv.org/abs/2310.04672
  • repo_url: https://github.com/aigc-apps/sd-webui-EasyPhoto
  • paper_authors: Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, Jun Huang
  • for: Introduces EasyPhoto, a Stable Diffusion WebUI plugin that lets users generate AI portraits from 5-20 relevant images of a specific user ID.
  • methods: Built on the Gradio library and Stable Diffusion models; a LoRA model is fine-tuned on the user images to create a digital doppelganger, which is then used to generate AI photos from arbitrary templates.
  • results: Supports modifying multiple persons and different photo styles, and can generate template images with the strong SDXL model to deliver more diverse and satisfactory results.
    Abstract Stable Diffusion web UI (SD-WebUI) is a comprehensive project that provides a browser interface based on Gradio library for Stable Diffusion models. In this paper, We propose a novel WebUI plugin called EasyPhoto, which enables the generation of AI portraits. By training a digital doppelganger of a specific user ID using 5 to 20 relevant images, the finetuned model (according to the trained LoRA model) allows for the generation of AI photos using arbitrary templates. Our current implementation supports the modification of multiple persons and different photo styles. Furthermore, we allow users to generate fantastic template image with the strong SDXL model, enhancing EasyPhoto's capabilities to deliver more diverse and satisfactory results. The source code for EasyPhoto is available at: https://github.com/aigc-apps/sd-webui-EasyPhoto. We also support a webui-free version by using diffusers: https://github.com/aigc-apps/EasyPhoto. We are continuously enhancing our efforts to expand the EasyPhoto pipeline, making it suitable for any identification (not limited to just the face), and we enthusiastically welcome any intriguing ideas or suggestions.
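A minimal, hedged sketch of the webui-free usage path (via diffusers) might look like the following; the base model ID, LoRA path, and trigger prompt are placeholder assumptions, and the actual training and template workflow live in the EasyPhoto repositories linked above.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hedged sketch of the webui-free path: load a Stable Diffusion pipeline with diffusers
# and attach a user-specific LoRA (the "digital doppelganger"). The base model ID,
# LoRA path, and prompt are placeholder assumptions, not EasyPhoto's actual defaults.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./easyphoto_user_id_lora")          # LoRA fine-tuned on 5-20 user images

image = pipe(
    "portrait photo of <user_id> person, studio lighting",  # hypothetical trigger token
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("ai_portrait.png")
```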

Visual Abductive Reasoning Meets Driving Hazard Prediction: Problem Formulation and Dataset

  • paper_url: http://arxiv.org/abs/2310.04671
  • repo_url: https://github.com/dhpr-dataset/dhpr-dataset
  • paper_authors: Korawat Charoenpitaks, Van-Quang Nguyen, Masanori Suganuma, Masahiro Takahashi, Ryoma Niihara, Takayuki Okatani
  • for: Predicting hazards that drivers may encounter while driving.
  • methods: Uses a single input image captured by a car dashcam. Unlike previous driving hazard prediction work based on computational simulation or video anomaly detection, the task requires high-level visual abductive reasoning, i.e., predicting future events from a static image under uncertain observations.
  • results: Creates the DHPR (Driving Hazard Prediction and Reasoning) dataset of 15K dashcam images, each annotated by human annotators with the car speed, a hypothesized description of an accident that could occur a few seconds later, and the visual entities present in the scene. Baseline methods leave clear room for improvement, indicating an open research area.
    Abstract This paper addresses the problem of predicting hazards that drivers may encounter while driving a car. We formulate it as a task of anticipating impending accidents using a single input image captured by car dashcams. Unlike existing approaches to driving hazard prediction that rely on computational simulations or anomaly detection from videos, this study focuses on high-level inference from static images. The problem needs predicting and reasoning about future events based on uncertain observations, which falls under visual abductive reasoning. To enable research in this understudied area, a new dataset named the DHPR (Driving Hazard Prediction and Reasoning) dataset is created. The dataset consists of 15K dashcam images of street scenes, and each image is associated with a tuple containing car speed, a hypothesized hazard description, and visual entities present in the scene. These are annotated by human annotators, who identify risky scenes and provide descriptions of potential accidents that could occur a few seconds later. We present several baseline methods and evaluate their performance on our dataset, identifying remaining issues and discussing future directions. This study contributes to the field by introducing a novel problem formulation and dataset, enabling researchers to explore the potential of multi-modal AI for driving hazard prediction.
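A hypothetical record layout for one DHPR annotation, reflecting the tuple described in the abstract (image, car speed, hazard description, visual entities), is sketched below; the field names and types are assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record layout for one DHPR annotation, reflecting the tuple described in
# the abstract (dashcam image, car speed, hazard description, visual entities).
# Field names and types are assumptions for illustration, not the released schema.
@dataclass
class DHPRSample:
    image_path: str                      # dashcam street-scene image
    speed_kmh: float                     # car speed at capture time
    hazard_description: str              # hypothesized accident a few seconds later
    entities: List[Tuple[str, Tuple[float, float, float, float]]]  # (name, bbox)

sample = DHPRSample(
    image_path="images/000123.jpg",
    speed_kmh=42.0,
    hazard_description="The cyclist on the right may swerve into the lane.",
    entities=[("cyclist", (812.0, 340.0, 890.0, 520.0))],
)
print(sample.hazard_description)
```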

Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition

  • paper_url: http://arxiv.org/abs/2310.04664
  • repo_url: None
  • paper_authors: Jie Zhu, Yuan Zong, Jingang Shi, Cheng Lu, Hongli Chang, Wenming Zheng
  • for: Micro-expression recognition (MER); proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O).
  • methods: Represents each micro-expression (ME) with a dynamic, reduced-size 3O sequence consisting of onset, occurring, and offset frames, where the occurring frame is randomly extracted from the original sequence without accurate frame spotting; multiple 3O candidates are generated per ME sample, and their emotional expressiveness is measured and calibrated so that its distribution aligns with that of macro-expressions (MaMs) over time, which facilitates learning ME-discriminative features.
  • results: Experiments on three widely used ME databases (CASME II, SMIC, and SAMM) demonstrate the effectiveness and the superior flexibility and reliability of LTR3O compared with recent state-of-the-art MER methods.
    Abstract This paper focuses on the research of micro-expression recognition (MER) and proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O). The LTR3O method introduces a dynamic and reduced-size sequence structure known as 3O, which consists of onset, occurring, and offset frames, for representing micro-expressions (MEs). This structure facilitates the subsequent learning of ME-discriminative features. A noteworthy advantage of the 3O structure is its flexibility, as the occurring frame is randomly extracted from the original ME sequence without the need for accurate frame spotting methods. Based on the 3O structures, LTR3O generates multiple 3O representation candidates for each ME sample and incorporates well-designed modules to measure and calibrate their emotional expressiveness. This calibration process ensures that the distribution of these candidates aligns with that of macro-expressions (MaMs) over time. Consequently, the visibility of MEs can be implicitly enhanced, facilitating the reliable learning of more discriminative features for MER. Extensive experiments were conducted to evaluate the performance of LTR3O using three widely-used ME databases: CASME II, SMIC, and SAMM. The experimental results demonstrate the effectiveness and superior performance of LTR3O, particularly in terms of its flexibility and reliability, when compared to recent state-of-the-art MER methods.
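The sketch below shows how 3O candidates can be assembled as described: onset and offset frames anchor the clip while the occurring frame is drawn at random from the interior, so no precise frame spotting is required. The number of candidates is an illustrative assumption.

```python
import random
from typing import List, Sequence, Tuple

# Toy sketch of building 3O (onset-occurring-offset) candidates: the onset and offset
# frames anchor the clip, while the occurring frame is drawn at random from the
# interior, so no precise apex/frame-spotting step is needed.
def make_3o_candidates(frames: Sequence, num_candidates: int = 5, seed: int = 0) -> List[Tuple]:
    rng = random.Random(seed)
    onset, offset = frames[0], frames[-1]
    interior = list(range(1, len(frames) - 1))
    return [(onset, frames[rng.choice(interior)], offset) for _ in range(num_candidates)]

if __name__ == "__main__":
    fake_sequence = [f"frame_{i:02d}" for i in range(12)]   # stand-in for ME frames
    for cand in make_3o_candidates(fake_sequence, num_candidates=3):
        print(cand)
```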

VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

  • paper_url: http://arxiv.org/abs/2310.04655
  • repo_url: https://github.com/ericyinyzy/VLAttack
  • paper_authors: Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma
  • for: Investigates the adversarial robustness of pre-trained vision-language (VL) models on multimodal tasks.
  • methods: Proposes a new, practical task: crafting image and text perturbations with pre-trained VL models to attack black-box fine-tuned models on downstream tasks. At the image level, a block-wise similarity attack (BSA) strategy learns perturbations that disrupt universal representations; at the text level, an existing text attack is applied independently of the image-modal attack; at the multimodal level, an iterative cross-search attack (ICSA) periodically updates the adversarial image-text pairs starting from the single-modal outputs.
  • results: Attacking three widely used pre-trained VL models on six tasks across eight datasets, VLAttack achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a significant blind spot in the deployment of pre-trained VL models.
    Abstract Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we aim to investigate a new yet practical task to craft image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLAttack to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. Besides, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting with the outputs from the single-modal level. We conduct extensive experiments to attack three widely-used VL pretrained models for six tasks on eight datasets. Experimental results show that the proposed VLAttack framework achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, which reveals a significant blind spot in the deployment of pre-trained VL models. Codes will be released soon.
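At a very high level, the iterative cross-search can be pictured as the alternation sketched below, which refines the image perturbation and then searches over text perturbations until the black-box model is fooled; the BSA image attack, the text attack candidates, and the victim model's decision are left as placeholder callables rather than reproduced from the paper.

```python
from typing import Callable, List, Tuple

# Highly simplified driver for the iterative cross-search idea: alternate between
# refining the image perturbation and trying candidate text perturbations, stopping
# when the (black-box) target model is fooled. The three callables are placeholders
# for the paper's BSA image attack, text attack, and victim-model query.
def iterative_cross_search(image, text, image_attack: Callable, text_candidates: List[str],
                           is_fooled: Callable, max_rounds: int = 5) -> Tuple[object, str, bool]:
    adv_image, adv_text = image, text
    for _ in range(max_rounds):
        adv_image = image_attack(adv_image, adv_text)     # refine image w.r.t. current text
        if is_fooled(adv_image, adv_text):
            return adv_image, adv_text, True
        for cand in text_candidates:                      # cross-search over text perturbations
            if is_fooled(adv_image, cand):
                return adv_image, cand, True
    return adv_image, adv_text, False
```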

X-Transfer: A Transfer Learning-Based Framework for Robust GAN-Generated Fake Image Detection

  • paper_url: http://arxiv.org/abs/2310.04639
  • repo_url: None
  • paper_authors: Lei Zhang, Hao Chen, Shu Hu, Bin Zhu, Xi Wu, Jinrong Hu, Xin Wang
  • for: Proposes a new detection algorithm for GAN-generated fake images, addressing security concerns such as face replacement and fake accounts across many domains.
  • methods: X-Transfer strengthens transfer learning with two sibling neural networks that use interleaved parallel gradient transmission, which mitigates excessive knowledge forgetting; training combines a cross-entropy loss with an AUC loss that approximates the AUC metric via WMW statistics to keep it differentiable and to cope with imbalanced data.
  • results: Extensive experiments on multiple facial image datasets show the model outperforms the general transfer-learning approach, with a best accuracy of 99.04% (an improvement of roughly 10%); it also performs well on non-face datasets, demonstrating its generality and broader applicability.
    Abstract Generative adversarial networks (GANs) have remarkably advanced in diverse domains, especially image generation and editing. However, the misuse of GANs for generating deceptive images raises significant security concerns, including face replacement and fake accounts, which have gained widespread attention. Consequently, there is an urgent need for effective detection methods to distinguish between real and fake images. Some of the current research centers around the application of transfer learning. Nevertheless, it encounters challenges such as knowledge forgetting from the original dataset and inadequate performance when dealing with imbalanced data during training. To alleviate the above issues, this paper introduces a novel GAN-generated image detection algorithm called X-Transfer. This model enhances transfer learning by utilizing two sibling neural networks that employ interleaved parallel gradient transmission. This approach also effectively mitigates the problem of excessive knowledge forgetting. In addition, we combine AUC loss term and cross-entropy loss to enhance the model's performance comprehensively. The AUC loss approximates the AUC metric using WMW statistics, ensuring differentiability and improving the performance of traditional AUC evaluation. We carry out comprehensive experiments on multiple facial image datasets. The results show that our model outperforms the general transferring approach, and the best accuracy achieves 99.04%, which is increased by approximately 10%. Furthermore, we demonstrate excellent performance on non-face datasets, validating its generality and broader application prospects.
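The differentiable AUC surrogate based on WMW statistics can be sketched as below, following the classic pairwise hinge-style formulation; the margin and exponent are illustrative hyperparameter assumptions, and in the paper this term is combined with cross-entropy.

```python
import torch

# Minimal PyTorch sketch of a WMW-based differentiable AUC surrogate of the kind the
# paper combines with cross-entropy; the margin `gamma` and exponent `p` are
# illustrative hyperparameter assumptions.
def wmw_auc_loss(scores: torch.Tensor, labels: torch.Tensor, gamma: float = 0.3, p: float = 2.0):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_tensor(0.0)
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)      # all positive-negative score pairs
    penalty = torch.clamp(gamma - diff, min=0.0) ** p
    return penalty.mean()                           # 0 when every pair is separated by >= gamma

if __name__ == "__main__":
    s = torch.tensor([0.9, 0.2, 0.7, 0.4], requires_grad=True)
    y = torch.tensor([1, 0, 1, 0])
    loss = wmw_auc_loss(s, y)
    loss.backward()
    print(float(loss), s.grad)
```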

Metadata-Conditioned Generative Models to Synthesize Anatomically-Plausible 3D Brain MRIs

  • paper_url: http://arxiv.org/abs/2310.04630
  • repo_url: None
  • paper_authors: Wei Peng, Tomas Bosschieter, Jiahong Ouyang, Robert Paul, Ehsan Adeli, Qingyu Zhao, Kilian M. Pohl
  • for: Enriching data diversity for neuroimaging studies by using a generative AI model to synthesize age- and sex-conditioned brain MRIs, enabling better study of brain structure and its changes.
  • methods: Proposes a new generative model, BrainSynth, that synthesizes metadata-conditioned (e.g., age- and sex-specific) T1-weighted MRIs with state-of-the-art visual quality, together with a novel evaluation procedure that quantifies anatomical plausibility, i.e., how well the synthetic MRIs capture macrostructural properties of brain regions and encode the effects of age and sex.
  • results: More than half of the brain regions in the synthetic MRIs are anatomically accurate (small effect size between real and synthetic MRIs), plausibility varies with the geometric complexity of cortical regions, and the synthetic MRIs significantly improve the training of a convolutional neural network that identifies accelerated aging effects in an independent study.
    Abstract Generative AI models hold great potential in creating synthetic brain MRIs that advance neuroimaging studies by, for example, enriching data diversity. However, the mainstay of AI research only focuses on optimizing the visual quality (such as signal-to-noise ratio) of the synthetic MRIs while lacking insights into their relevance to neuroscience. To gain these insights with respect to T1-weighted MRIs, we first propose a new generative model, BrainSynth, to synthesize metadata-conditioned (e.g., age- and sex-specific) MRIs that achieve state-of-the-art visual quality. We then extend our evaluation with a novel procedure to quantify anatomical plausibility, i.e., how well the synthetic MRIs capture macrostructural properties of brain regions, and how accurately they encode the effects of age and sex. Results indicate that more than half of the brain regions in our synthetic MRIs are anatomically accurate, i.e., with a small effect size between real and synthetic MRIs. Moreover, the anatomical plausibility varies across cortical regions according to their geometric complexity. As is, our synthetic MRIs can significantly improve the training of a Convolutional Neural Network to identify accelerated aging effects in an independent study. These results highlight the opportunities of using generative AI to aid neuroimaging research and point to areas for further improvement.
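The kind of anatomical-plausibility check described above can be sketched with a per-region effect size between real and synthetic measurements; the sketch below uses Cohen's d on regional volumes, with a conventional 0.2 small-effect threshold as an assumption.

```python
import numpy as np

# Small sketch of a per-region anatomical-plausibility check: compare a macrostructural
# measurement (e.g., regional volume) between real and synthetic MRIs with Cohen's d
# and call the region "anatomically accurate" when the effect size is small.
# The 0.2 threshold is a conventional assumption, not the paper's criterion.
def cohens_d(real: np.ndarray, synth: np.ndarray) -> float:
    n1, n2 = len(real), len(synth)
    pooled_sd = np.sqrt(((n1 - 1) * real.var(ddof=1) + (n2 - 1) * synth.var(ddof=1)) / (n1 + n2 - 2))
    return float((real.mean() - synth.mean()) / pooled_sd)

def plausible_regions(real_volumes: dict, synth_volumes: dict, threshold: float = 0.2):
    return {region: abs(cohens_d(real_volumes[region], synth_volumes[region])) < threshold
            for region in real_volumes}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = {"hippocampus": rng.normal(3.5, 0.3, 100)}      # toy regional volumes (mL)
    synth = {"hippocampus": rng.normal(3.45, 0.32, 100)}
    print(plausible_regions(real, synth))
```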