cs.CV - 2023-11-12

Augmented Bridge Matching

  • paper_url: http://arxiv.org/abs/2311.06978
  • repo_url: None
  • paper_authors: Valentin De Bortoli, Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Weilie Nie
  • for: This paper studies flow and bridge matching, a new class of processes that encompasses diffusion models. Their added flexibility lies in learning stochastic (and deterministic) processes between two given distributions, so they extend beyond generative modeling to arbitrary transfer tasks.
  • methods: The paper analyzes flow and bridge matching processes, which interpolate between two given distributions, and highlights that while they preserve the marginal distributions, they do not necessarily preserve the coupling information unless additional, stronger optimality conditions are met.
  • results: Augmenting the velocity field (or drift) with the information of the initial sample point recovers the coupling; the process loses the Markov property but preserves the coupling between the two distributions. Experiments on learning mixtures of image translation tasks demonstrate the effectiveness of this augmentation.
    Abstract Flow and bridge matching are a novel class of processes which encompass diffusion models. One of the main aspects of their increased flexibility is that these models can interpolate between arbitrary data distributions i.e. they generalize beyond generative modeling and can be applied to learning stochastic (and deterministic) processes of arbitrary transfer tasks between two given distributions. In this paper, we highlight that while flow and bridge matching processes preserve the information of the marginal distributions, they do \emph{not} necessarily preserve the coupling information unless additional, stronger optimality conditions are met. This can be problematic if one aims at preserving the original empirical pairing. We show that a simple modification of the matching process recovers this coupling by augmenting the velocity field (or drift) with the information of the initial sample point. Doing so, we lose the Markovian property of the process but preserve the coupling information between distributions. We illustrate the efficiency of our augmentation in learning mixture of image translation tasks.
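To make the augmentation concrete, here is a minimal sketch of a bridge-matching training step in PyTorch, assuming a Brownian-bridge reference process and flat, vector-valued data; the network architecture and function names are illustrative, not the authors' implementation. The only change relative to standard bridge matching is that the drift network also receives the initial sample x0.

```python
import torch
import torch.nn as nn

class AugmentedDrift(nn.Module):
    """Drift network that, unlike standard bridge matching, also sees x0."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, x0):
        return self.net(torch.cat([x_t, x0, t], dim=-1))

def augmented_bridge_matching_loss(model, x0, x1, sigma=1.0):
    """One training step on a paired batch (x0, x1) of shape (B, dim)."""
    b = x0.shape[0]
    t = torch.rand(b, 1).clamp(1e-3, 1 - 1e-3)
    # Sample the Brownian bridge between the paired endpoints at time t.
    mean = (1 - t) * x0 + t * x1
    x_t = mean + sigma * torch.sqrt(t * (1 - t)) * torch.randn_like(x0)
    # Conditional drift of the bridge toward x1.
    target = (x1 - x_t) / (1 - t)
    pred = model(x_t, t, x0)   # conditioning on x0 is the augmentation
    return ((pred - target) ** 2).mean()
```

At sampling time the drift is evaluated with the x0 that started the trajectory, which is what preserves the empirical pairing at the cost of the Markov property.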

CD-COCO: A Versatile Complex Distorted COCO Database for Scene-Context-Aware Computer Vision

  • paper_url: http://arxiv.org/abs/2311.06976
  • repo_url: https://github.com/aymanbegh/cd-coco
  • paper_authors: Ayman Beghdadi, Azeddine Beghdadi, Malik Mallem, Lotfi Beji, Faouzi Alaya Cheikh
  • for: Improving the robustness of computer vision tasks to varying image acquisition conditions by artificially augmenting the training database, rather than redesigning deep learning models to be distortion-robust.
  • methods: Applies local and global photo-realistic distortions to images, guided by scene context, object semantics, and object depth information so that the distortions remain photo-realistic.
  • results: Delivers a versatile distorted-image database that improves the robustness of computer vision tasks such as object detection, scene segmentation, and distortion-type classification.
    Abstract The recent development of deep learning methods applied to vision has enabled their increasing integration into real-world applications to perform complex Computer Vision (CV) tasks. However, image acquisition conditions have a major impact on the performance of high-level image processing. A possible solution to overcome these limitations is to artificially augment the training databases or to design deep learning models that are robust to signal distortions. We opt here for the first solution by enriching the database with complex and realistic distortions which were ignored until now in the existing databases. To this end, we built a new versatile database derived from the well-known MS-COCO database to which we applied local and global photo-realistic distortions. These new local distortions are generated by considering the scene context of the images, which guarantees a high level of photo-realism. Distortions are generated by exploiting the depth information of the objects in the scene as well as their semantics. This guarantees a high level of photo-realism and allows exploring real scenarios ignored in conventional databases dedicated to various CV applications. Our versatile database offers an efficient solution to improve the robustness of various CV tasks such as Object Detection (OD), scene segmentation, and distortion-type classification methods. The image database, scene classification index, and distortion generation codes are publicly available \footnote{\url{https://github.com/Aymanbegh/CD-COCO}}
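As a rough illustration of how a depth map can guide a local, scene-aware distortion, the sketch below applies a depth-dependent defocus-style blur; the function, parameters, and blur model are hypothetical simplifications, not the released distortion-generation code (see the linked repository for that).

```python
import cv2
import numpy as np

def depth_aware_blur(image, depth, focus_depth=0.3, max_kernel=15):
    """Illustrative depth-guided local distortion (defocus-style blur).

    `image` is an HxWx3 uint8 array, `depth` an HxW float array in [0, 1].
    Pixels far from `focus_depth` receive a stronger Gaussian blur, a
    simplified stand-in for the scene-aware distortions described above.
    """
    out = image.copy()
    # Quantize the depth error into a few blur levels to keep the loop cheap.
    error = np.abs(depth - focus_depth)
    levels = np.clip((error * 4).astype(int), 0, 3)
    for lvl in range(1, 4):
        k = 2 * lvl * (max_kernel // 6) + 1      # odd Gaussian kernel size per level
        blurred = cv2.GaussianBlur(image, (k, k), 0)
        mask = levels == lvl
        out[mask] = blurred[mask]
    return out
```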

Adaptive recurrent vision performs zero-shot computation scaling to unseen difficulty levels

  • paper_url: http://arxiv.org/abs/2311.06964
  • repo_url: None
  • paper_authors: Vijay Veerabadran, Srinivas Ravishankar, Yuan Tang, Ritik Raina, Virginia R. de Sa
  • for: investigate the adaptive computation of recurrent neural networks in visual reasoning tasks, and whether it can enable zero-shot generalization to novel difficulty levels.
  • methods: combine convolutional recurrent neural networks (ConvRNNs) with a learnable halting mechanism based on Graves (2016), and explore various implementations of adaptive ConvRNNs (AdRNNs) such as tying weights across layers and biologically inspired recurrent networks with lateral connections and gating.
  • results: AdRNNs learn to dynamically halt processing early or late to solve easier or harder problems, and zero-shot generalize to more difficult problem settings not shown during training by dynamically increasing the number of recurrent iterations at test time.
    Abstract Humans solving algorithmic (or) reasoning problems typically exhibit solution times that grow as a function of problem difficulty. Adaptive recurrent neural networks have been shown to exhibit this property for various language-processing tasks. However, little work has been performed to assess whether such adaptive computation can also enable vision models to extrapolate solutions beyond their training distribution's difficulty level, with prior work focusing on very simple tasks. In this study, we investigate a critical functional role of such adaptive processing using recurrent neural networks: to dynamically scale computational resources conditional on input requirements that allow for zero-shot generalization to novel difficulty levels not seen during training using two challenging visual reasoning tasks: PathFinder and Mazes. We combine convolutional recurrent neural networks (ConvRNNs) with a learnable halting mechanism based on Graves (2016). We explore various implementations of such adaptive ConvRNNs (AdRNNs) ranging from tying weights across layers to more sophisticated biologically inspired recurrent networks that possess lateral connections and gating. We show that 1) AdRNNs learn to dynamically halt processing early (or late) to solve easier (or harder) problems, 2) these RNNs zero-shot generalize to more difficult problem settings not shown during training by dynamically increasing the number of recurrent iterations at test time. Our study provides modeling evidence supporting the hypothesis that recurrent processing enables the functional advantage of adaptively allocating compute resources conditional on input requirements and hence allowing generalization to harder difficulty levels of a visual reasoning problem without training.
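A minimal sketch of the halting idea in a convolutional recurrent network, assuming a PyTorch setup; it follows the spirit of Graves (2016) adaptive computation time but omits the ponder cost and other details of the paper's AdRNNs. Raising `max_steps` at test time is what permits extra recurrent iterations on harder, unseen difficulty levels.

```python
import torch
import torch.nn as nn

class AdaptiveConvRNN(nn.Module):
    """Illustrative ConvRNN with a learned halting mechanism (simplified sketch)."""

    def __init__(self, channels=32, max_steps=20, eps=0.01):
        super().__init__()
        self.cell = nn.Conv2d(2 * channels, channels, 3, padding=1)   # recurrent update
        self.halt = nn.Conv2d(channels, 1, 1)                         # halting logits
        self.readout = nn.Conv2d(channels, 1, 1)
        self.max_steps, self.eps = max_steps, eps

    def forward(self, x, max_steps=None):
        # x: (B, channels, H, W) encoded input; more steps can be allowed at test time.
        steps = max_steps or self.max_steps
        h = torch.zeros_like(x)
        remainder = torch.ones(x.shape[0], device=x.device)
        out = 0.0
        for _ in range(steps):
            h = torch.tanh(self.cell(torch.cat([x, h], dim=1)))
            p = torch.sigmoid(self.halt(h)).mean(dim=(1, 2, 3))       # halting prob per sample
            use = torch.minimum(p, remainder)
            out = out + use.view(-1, 1, 1, 1) * self.readout(h)
            remainder = remainder - use
            if (remainder < self.eps).all():                          # every sample has halted
                break
        out = out + remainder.view(-1, 1, 1, 1) * self.readout(h)     # spend leftover halting mass
        return out
```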

SegReg: Segmenting OARs by Registering MR Images and CT Annotations

  • paper_url: http://arxiv.org/abs/2311.06956
  • repo_url: https://github.com/steve-zeyu-zhang/SegReg
  • paper_authors: Zeyu Zhang, Xuyin Qi, Bowen Zhang, Biao Wu, Hien Le, Bora Jeong, Minh-Son To, Richard Hartley
  • for: Improving organ-at-risk (OAR) segmentation for radiotherapy treatment planning of head and neck tumors, to make planning faster and more accurate.
  • methods: SegReg registers MRI to CT scans using Elastic Symmetric Normalization, so that OAR segmentation can exploit both modalities.
  • results: Outperforms the CT-only baseline by 16.78% in mDSC and 18.77% in mIoU, effectively combining the geometric accuracy of CT with the superior soft-tissue contrast of MRI and making accurate automated OAR segmentation practical for clinical use.
    Abstract Organ at risk (OAR) segmentation is a critical process in radiotherapy treatment planning such as head and neck tumors. Nevertheless, in clinical practice, radiation oncologists predominantly perform OAR segmentations manually on CT scans. This manual process is highly time-consuming and expensive, limiting the number of patients who can receive timely radiotherapy. Additionally, CT scans offer lower soft-tissue contrast compared to MRI. Despite MRI providing superior soft-tissue visualization, its time-consuming nature makes it infeasible for real-time treatment planning. To address these challenges, we propose a method called SegReg, which utilizes Elastic Symmetric Normalization for registering MRI to perform OAR segmentation. SegReg outperforms the CT-only baseline by 16.78% in mDSC and 18.77% in mIoU, showing that it effectively combines the geometric accuracy of CT with the superior soft-tissue contrast of MRI, making accurate automated OAR segmentation for clinical practice become possible.
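The registration step can be illustrated with the ANTsPy library's symmetric normalization (SyN) transform; this is an assumption about tooling for illustration only (file names are placeholders), and the authors' actual pipeline is in the linked repository.

```python
import ants

# Hypothetical file paths; in practice these come from the planning dataset.
ct = ants.image_read("patient01_ct.nii.gz")
mr = ants.image_read("patient01_mr.nii.gz")

# Deformable symmetric normalization (SyN): register the MR image to the CT frame.
reg = ants.registration(fixed=ct, moving=mr, type_of_transform="SyN")

# Warp the MR image (and, via the same transform, any MR-space maps) into CT space,
# where the CT-based OAR annotations live.
mr_in_ct_space = ants.apply_transforms(
    fixed=ct, moving=mr, transformlist=reg["fwdtransforms"]
)
ants.image_write(mr_in_ct_space, "patient01_mr_in_ct.nii.gz")
```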

Video-based sympathetic arousal assessment via peripheral blood flow estimation

  • paper_url: http://arxiv.org/abs/2311.06930
  • repo_url: None
  • paper_authors: Bjoern Braun, Daniel McDuff, Tadas Baltrusaitis, Christian Holz
  • for: Electrodermal activity (EDA) is the standard marker of sympathetic arousal, but traditional EDA measurement requires electrodes in steady contact with the skin. This paper asks whether sympathetic arousal can instead be inferred by optically measuring peripheral blood flow on the face or hand.
  • methods: Uses an RGB camera to measure peripheral blood flow on the face or palm, together with a self-recorded dataset of 21 participants containing synchronized face and palm videos and gold-standard EDA and photoplethysmography (PPG) signals as reference.
  • results: Peripheral sympathetic responses inferred from video or PPG alone correlate closely with ground-truth EDA (median correlations of 0.57 to 0.63), and sympathetic arousal is best inferred from the forehead, finger, or palm.
    Abstract Electrodermal activity (EDA) is considered a standard marker of sympathetic activity. However, traditional EDA measurement requires electrodes in steady contact with the skin. Can sympathetic arousal be measured using only an optical sensor, such as an RGB camera? This paper presents a novel approach to infer sympathetic arousal by measuring the peripheral blood flow on the face or hand optically. We contribute a self-recorded dataset of 21 participants, comprising synchronized videos of participants' faces and palms and gold-standard EDA and photoplethysmography (PPG) signals. Our results show that we can measure peripheral sympathetic responses that closely correlate with the ground truth EDA. We obtain median correlations of 0.57 to 0.63 between our inferred signals and the ground truth EDA using only videos of the participants' palms or foreheads or PPG signals from the foreheads or fingers. We also show that sympathetic arousal is best inferred from the forehead, finger, or palm.
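A crude sketch of extracting a peripheral blood-flow signal from a fixed region of interest in a video, which could then be compared against ground-truth EDA; the ROI, function names, and the green-channel mean are simplifying assumptions rather than the paper's method.

```python
import cv2
import numpy as np
from scipy.stats import pearsonr

def roi_blood_flow_signal(video_path, roi):
    """Spatial mean of the green channel inside a fixed ROI, frame by frame.

    `roi` is a hypothetical (x, y, w, h) box over the forehead or palm.
    """
    x, y, w, h = roi
    cap = cv2.VideoCapture(video_path)
    values = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        patch = frame[y:y + h, x:x + w, 1]     # green channel of the ROI (BGR frame)
        values.append(patch.mean())
    cap.release()
    return np.asarray(values)

# signal = roi_blood_flow_signal("palm.mp4", roi=(100, 80, 64, 64))
# r, _ = pearsonr(signal_resampled, eda_resampled)   # compare to ground-truth EDA
```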

Setting a Baseline for long-shot real-time Player and Ball detection in Soccer Videos

  • paper_url: http://arxiv.org/abs/2311.06892
  • repo_url: https://github.com/kmouts/soccernet_v3_h250
  • paper_authors: Konstantinos Moutselos, Ilias Maglogiannis
  • for: Existing open datasets for player and ball detection in soccer videos are insufficient; this work delivers an edited part of SoccerNet v3 in YOLO normalized annotation format for training and evaluation.
  • methods: Uses the edited SoccerNet v3 data and provides the code and metrics so they can serve as a benchmark in future comparisons.
  • results: The YOLOv8n model outperforms FootAndBall in long-shot, real-time detection of the ball and players on football fields.
    Abstract Players and ball detection are among the first required steps on a football analytics platform. Until recently, the existing open datasets on which the evaluations of most models were based, were not sufficient. In this work, we point out their weaknesses, and with the advent of the SoccerNet v3, we propose and deliver to the community an edited part of its dataset, in YOLO normalized annotation format for training and evaluation. The code of the methods and metrics are provided so that they can be used as a benchmark in future comparisons. The recent YOLO8n model proves better than FootAndBall in long-shot real-time detection of the ball and players on football fields.
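With the dataset in YOLO normalized annotation format, training and evaluating a YOLOv8n baseline follows the standard Ultralytics API; the dataset YAML name and hyperparameters below are placeholders, not the authors' settings.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                   # nano model, suited to real-time long shots

model.train(
    data="soccernet_v3_h250.yaml",           # hypothetical YAML pointing at the YOLO-format labels
    imgsz=1280,                              # long-shot frames benefit from higher resolution
    epochs=100,
    batch=16,
)

metrics = model.val()                        # per-class mAP for player and ball
print(metrics.box.map50)
```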

Concept-wise Fine-tuning Matters in Preventing Negative Transfer

  • paper_url: http://arxiv.org/abs/2311.06868
  • repo_url: None
  • paper_authors: Yunqiao Yang, Long-Kai Huang, Ying Wei
  • for: Improving fine-tuning of pre-trained models by mitigating negative transfer caused by rare features and spuriously correlated features.
  • methods: Proposes Concept-wise fine-tuning (Concept-Tuning), which refines feature representations at the patch level, with each patch encoding a concept; it maximizes the mutual information between same-category examples with regard to a slice of rare features (a patch) and applies front-door adjustment via attention over channels and feature slices.
  • results: Concept-Tuning consistently improves prior state-of-the-art fine-tuning methods by up to 4.76% across eleven datasets, diverse pre-training strategies (supervised and self-supervised), various network architectures, and target-dataset sample sizes.
    Abstract A multitude of prevalent pre-trained models mark a major milestone in the development of artificial intelligence, while fine-tuning has been a common practice that enables pretrained models to figure prominently in a wide array of target datasets. Our empirical results reveal that off-the-shelf finetuning techniques are far from adequate to mitigate negative transfer caused by two types of underperforming features in a pre-trained model, including rare features and spuriously correlated features. Rooted in structural causal models of predictions after fine-tuning, we propose a Concept-wise fine-tuning (Concept-Tuning) approach which refines feature representations in the level of patches with each patch encoding a concept. Concept-Tuning minimizes the negative impacts of rare features and spuriously correlated features by (1) maximizing the mutual information between examples in the same category with regard to a slice of rare features (a patch) and (2) applying front-door adjustment via attention neural networks in channels and feature slices (patches). The proposed Concept-Tuning consistently and significantly (by up to 4.76%) improves prior state-of-the-art fine-tuning methods on eleven datasets, diverse pre-training strategies (supervised and self-supervised ones), various network architectures, and sample sizes in a target dataset.
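As a rough stand-in for the mutual-information term, the sketch below computes a supervised InfoNCE loss over the features of a single patch (concept) slice; this is an illustrative approximation under our own assumptions, not the Concept-Tuning objective, and it omits the front-door adjustment via attention.

```python
import torch
import torch.nn.functional as F

def patch_supcon_loss(patch_feats, labels, temperature=0.1):
    """Supervised InfoNCE over one feature patch.

    `patch_feats` is (B, D): the features of one patch for each example in the
    batch; `labels` is (B,) class labels.  Same-class pairs act as positives.
    """
    z = F.normalize(patch_feats, dim=1)
    sim = z @ z.t() / temperature                          # (B, B) similarities
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                             # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask_pos.sum(1).clamp(min=1)
    # Average log-probability of retrieving a same-class example per anchor.
    return -(mask_pos * log_prob).sum(1).div(pos_count).mean()
```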

Contrastive Learning of View-Invariant Representations for Facial Expressions Recognition

  • paper_url: http://arxiv.org/abs/2311.06852
  • repo_url: None
  • paper_authors: Shuvendu Roy, Ali Etemad
  • for: Improving facial expression recognition (FER) on images captured from non-frontal viewing angles that differ substantially from those seen during training.
  • methods: ViewFX, a contrastive-learning framework for view-invariant FER: a self-supervised contrastive loss brings different views of the same subject and expression together, a supervised contrastive loss pushes each expression's view-invariant features away from other expressions, and a Barlow Twins loss reduces redundancy and correlation in the learned representations.
  • results: Sets a new state of the art on two public multi-view FER datasets (KDEF and DDCF), with considerably less sensitivity to challenging viewing angles and to the number of output labels used for training.
    Abstract Although there has been much progress in the area of facial expression recognition (FER), most existing methods suffer when presented with images that have been captured from viewing angles that are non-frontal and substantially different from those used in the training process. In this paper, we propose ViewFX, a novel view-invariant FER framework based on contrastive learning, capable of accurately classifying facial expressions regardless of the input viewing angles during inference. ViewFX learns view-invariant features of expression using a proposed self-supervised contrastive loss which brings together different views of the same subject with a particular expression in the embedding space. We also introduce a supervised contrastive loss to push the learnt view-invariant features of each expression away from other expressions. Since facial expressions are often distinguished with very subtle differences in the learned feature space, we incorporate the Barlow twins loss to reduce the redundancy and correlations of the representations in the learned representations. The proposed method is a substantial extension of our previously proposed CL-MEx, which only had a self-supervised loss. We test the proposed framework on two public multi-view facial expression recognition datasets, KDEF and DDCF. The experiments demonstrate that our approach outperforms previous works in the area and sets a new state-of-the-art for both datasets while showing considerably less sensitivity to challenging angles and the number of output labels used for training. We also perform detailed sensitivity and ablation experiments to evaluate the impact of different components of our model as well as its sensitivity to different parameters.
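The redundancy-reduction component can be illustrated with a generic Barlow Twins loss over two views of the same faces; this is standard Barlow Twins code offered as an illustration, not the paper's exact formulation or hyperparameters.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Redundancy-reduction term used alongside the contrastive losses.

    `z_a`, `z_b` are (B, D) embeddings of two views of the same subjects.
    """
    b, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.t() @ z_b) / b                               # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # pull the diagonal toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + lam * off_diag
```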

Sampler Scheduler for Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.06845
  • repo_url: https://github.com/carzit/sd-webui-samplers-scheduler
  • paper_authors: Zitong Cheng
  • for: Demonstrating the feasibility of using multiple samplers (ODE/SDE) within a single sampling process of a diffusion-based generative model, instead of committing to one sampler for every step.
  • methods: Analyzes and generalizes the update formulas of the mainstream ODE/SDE samplers so that different samplers can be assigned to different steps of the same sampling process.
  • results: The multi-sampler schedule improves sampling efficiency and quality: at NFE = 24 on CIFAR-10, the ODE Sampler Scheduler reaches an FID of 1.91 versus 2.02 for DPM++ 2M, 1.97 for DPM2, and 11.90 for Heun, and combining SDE in the early steps with ODE in the later steps reaches 1.899.
    Abstract Diffusion modeling (DM) has high-quality generative performance, and the sampling problem is an important part of the DM performance. Thanks to efficient differential equation solvers, the sampling speed can be reduced while higher sampling quality is guaranteed. However, currently, there is a contradiction in samplers for diffusion-based generative models: the mainstream sampler choices are diverse, each with its own characteristics in terms of performance. However, only a single sampler algorithm can be specified on all sampling steps in the generative process. This often makes one torn between sampler choices; in other words, it makes it difficult to fully utilize the advantages of each sampler. In this paper, we propose the feasibility of using different samplers (ODE/SDE) on different sampling steps of the same sampling process based on analyzing and generalizing the updating formulas of each mainstream sampler, and experimentally demonstrate that such a multi-sampler scheduling improves the sampling results to some extent. In particular, we also verify that the combination of using SDE in the early sampling steps and ODE in the later sampling steps solves the inherent problems previously caused by using both singly. We show that our design changes improve the sampling efficiency and quality in previous work. For instance, when Number of Function Evaluations (NFE) = 24, the ODE Sampler Scheduler achieves a FID score of 1.91 on the CIFAR-10 dataset, compared to 2.02 for DPM++ 2M, 1.97 for DPM2, and 11.90 for Heun for the same NFE. Meanwhile the Sampler Scheduler with the combined scheduling of SDE and ODE reaches 1.899, compared to 18.63 for Euler a, 3.14 for DPM2 a and 23.14 for DPM++ SDE.
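A minimal sketch of the scheduling idea: run an SDE (Euler-Maruyama) step on the early, high-noise part of the schedule and a probability-flow ODE (Euler) step afterwards. The VE parameterization (sigma(t) = t), the switch point, and all names are illustrative assumptions, not the paper's generalized update formulas.

```python
import torch

def euler_ode_step(x, t, t_next, score_fn):
    """Probability-flow ODE (Euler) step for sigma(t) = t."""
    d = -t * score_fn(x, t)                    # dx/dt = -t * score
    return x + (t_next - t) * d

def euler_maruyama_sde_step(x, t, t_next, score_fn):
    """Euler-Maruyama step of the corresponding reverse-time SDE."""
    dt = t_next - t                            # negative when integrating backwards
    drift = -2 * t * score_fn(x, t)
    noise = (2 * t * abs(dt)) ** 0.5 * torch.randn_like(x)
    return x + drift * dt + noise

def sampler_scheduler(x, sigmas, score_fn, switch_at=0.5):
    """Run SDE steps early and ODE steps late over a decreasing noise schedule."""
    n = len(sigmas) - 1
    for i in range(n):
        step = euler_maruyama_sde_step if i < switch_at * n else euler_ode_step
        x = step(x, sigmas[i], sigmas[i + 1], score_fn)
    return x
```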

Osteoporosis Prediction from Hand and Wrist X-rays using Image Segmentation and Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.06834
  • repo_url: None
  • paper_authors: Hyungeun Lee, Ung Hwang, Seungwon Yu, Chang-Hun Lee, Kijung Yoon
  • for: Predicting osteoporosis from hand and wrist X-ray images, which are widely accessible and affordable, in order to increase screening rates without added cost or time.
  • methods: Segments the ulna, radius, and metacarpal bones with a foundational segmentation model, extracts meaningful representations with self-supervised learning (without explicit labels), and then classifies osteoporosis in a supervised manner.
  • results: On a dataset of 192 individuals cross-referenced against the standard DXA test, the method achieves a classification AUC of 0.83, showing that peripheral-skeleton X-rays can support automated osteoporosis screening.
    Abstract Osteoporosis is a widespread and chronic metabolic bone disease that often remains undiagnosed and untreated due to limited access to bone mineral density (BMD) tests like Dual-energy X-ray absorptiometry (DXA). In response to this challenge, current advancements are pivoting towards detecting osteoporosis by examining alternative indicators from peripheral bone areas, with the goal of increasing screening rates without added expenses or time. In this paper, we present a method to predict osteoporosis using hand and wrist X-ray images, which are both widely accessible and affordable, though their link to DXA-based data is not thoroughly explored. Initially, our method segments the ulnar, radius, and metacarpal bones using a foundational model for image segmentation. Then, we use a self-supervised learning approach to extract meaningful representations without the need for explicit labels, and move on to classify osteoporosis in a supervised manner. Our method is evaluated on a dataset with 192 individuals, cross-referencing their verified osteoporosis conditions against the standard DXA test. With a notable classification score (AUC=0.83), our model represents a pioneering effort in leveraging vision-based techniques for osteoporosis identification from the peripheral skeleton sites.
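The final supervised stage can be sketched as a simple classifier on top of frozen self-supervised embeddings of the segmented bone regions; logistic regression and the split below are stand-ins for the paper's classifier and evaluation protocol.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_osteoporosis_classifier(embeddings, labels, seed=0):
    """Supervised classification on top of frozen self-supervised features.

    `embeddings` is (N, D) features of the segmented bone crops and `labels`
    the DXA-verified osteoporosis status (0/1) for the same N individuals.
    """
    x_tr, x_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.3, stratify=labels, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    scores = clf.predict_proba(x_te)[:, 1]
    return roc_auc_score(y_te, scores)

# auc = evaluate_osteoporosis_classifier(features, dxa_labels)  # paper reports AUC = 0.83
```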

On original and latent space connectivity in deep neural networks

  • paper_url: http://arxiv.org/abs/2311.06816
  • repo_url: None
  • paper_authors: Boyang Gu, Anastasia Borovykh
  • for: Investigate whether inputs from the same class can be connected by a continuous path, in the original or latent representation space, such that every point on the path is mapped by the neural network to the same class; this sheds light on how the network views its input space and how its latent spaces are structured, which matters for explainability and robustness.
  • methods: Uses trained neural network models to probe the connectivity of same-class inputs and the structure of the latent spaces.
  • results: Paths, linear or nonlinear, connecting same-class inputs and mapped entirely to that class exist in all cases studied.
    Abstract We study whether inputs from the same class can be connected by a continuous path, in original or latent representation space, such that all points on the path are mapped by the neural network model to the same class. Understanding how the neural network views its own input space and how the latent spaces are structured has value for explainability and robustness. We show that paths, linear or nonlinear, connecting same-class inputs exist in all cases studied.
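A minimal probe of the question for the simplest case, a straight line in input space between two same-class examples; the paper also studies nonlinear paths and paths in the latent representation space.

```python
import torch

@torch.no_grad()
def path_stays_in_class(model, x_a, x_b, label, steps=50):
    """Check whether the straight line between two same-class inputs is mapped
    entirely to that class by `model` (a classifier returning logits)."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x_a.dim()))
    path = (1 - alphas) * x_a.unsqueeze(0) + alphas * x_b.unsqueeze(0)
    preds = model(path).argmax(dim=1)
    return bool((preds == label).all())
```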

MetaMix: Meta-state Precision Searcher for Mixed-precision Activation Quantization

  • paper_url: http://arxiv.org/abs/2311.06798
  • repo_url: None
  • paper_authors: Han-Byul Kim, Joo Hyung Lee, Sungjoo Yoo, Hong-Seok Kim
  • for: Improving mixed-precision activation quantization by addressing the activation instability encountered while exploring bit selections.
  • methods: MetaMix consists of a bit-selection phase, which alternates mixed-precision-aware weight updates with bit-search training on fixed mixed-precision-aware weights, and a weight-training phase that fine-tunes the weights and step sizes obtained during bit selection, enabling fast training.
  • results: On ImageNet with efficient, hard-to-quantize networks (MobileNet v2 and v3, ResNet-18), the method pushes the accuracy-versus-operations boundary of mixed-precision quantization, outperforming both mixed- and single-precision state-of-the-art methods.
    Abstract Mixed-precision quantization of efficient networks often suffer from activation instability encountered in the exploration of bit selections. To address this problem, we propose a novel method called MetaMix which consists of bit selection and weight training phases. The bit selection phase iterates two steps, (1) the mixed-precision-aware weight update, and (2) the bit-search training with the fixed mixed-precision-aware weights, both of which combined reduce activation instability in mixed-precision quantization and contribute to fast and high-quality bit selection. The weight training phase exploits the weights and step sizes trained in the bit selection phase and fine-tunes them thereby offering fast training. Our experiments with efficient and hard-to-quantize networks, i.e., MobileNet v2 and v3, and ResNet-18 on ImageNet show that our proposed method pushes the boundary of mixed-precision quantization, in terms of accuracy vs. operations, by outperforming both mixed- and single-precision SOTA methods.

Deep Perspective Transformation Based Vehicle Localization on Bird’s Eye View

  • paper_url: http://arxiv.org/abs/2311.06796
  • repo_url: https://github.com/ipm-hpc/perspective-bev-transformer
  • paper_authors: Abtin Mahyar, Hossein Motamednia, Dara Rahmati
  • for: Supplying a self-driving vehicle's navigation system with accurate, comprehensive environmental data for downstream tasks, without the cost and complexity of installing multiple sensors.
  • methods: Transforms perspective-view RGB images into bird's-eye-view maps with the surrounding vehicles segmented, enabling the distances and directions of other cars relative to the ego vehicle to be extracted efficiently and cheaply.
  • results: Introduces a new synthesized dataset with extensive per-frame information about the ego vehicle and its environment, providing a valuable resource for similar downstream tasks.
    Abstract An accurate understanding of a self-driving vehicle's surrounding environment is crucial for its navigation system. To enhance the effectiveness of existing algorithms and facilitate further research, it is essential to provide comprehensive data to the routing system. Traditional approaches rely on installing multiple sensors to simulate the environment, leading to high costs and complexity. In this paper, we propose an alternative solution by generating a top-down representation of the scene, enabling the extraction of distances and directions of other cars relative to the ego vehicle. We introduce a new synthesized dataset that offers extensive information about the ego vehicle and its environment in each frame, providing valuable resources for similar downstream tasks. Additionally, we present an architecture that transforms perspective view RGB images into bird's-eye-view maps with segmented surrounding vehicles. This approach offers an efficient and cost-effective method for capturing crucial environmental information for self-driving cars. Code and dataset are available at https://github.com/IPM-HPC/Perspective-BEV-Transformer.
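For reference, the classical geometry-only baseline for this transformation is inverse perspective mapping (IPM) with a ground-plane homography; the paper instead learns the perspective-to-BEV mapping with a deep network, so the sketch below (with hypothetical calibration points) only illustrates the underlying transformation.

```python
import cv2
import numpy as np

def inverse_perspective_mapping(image, src_pts, bev_size=(400, 600)):
    """Homography-based IPM onto a bird's-eye-view grid.

    `src_pts` are four image points on the ground plane, ordered top-left,
    top-right, bottom-right, bottom-left; `bev_size` is the (width, height)
    of the output BEV image.  Both are placeholder values for illustration.
    """
    w, h = bev_size
    dst_pts = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    H = cv2.getPerspectiveTransform(np.float32(src_pts), dst_pts)
    return cv2.warpPerspective(image, H, (w, h))

# bev = inverse_perspective_mapping(frame, src_pts=[(420, 380), (860, 380), (1180, 700), (100, 700)])
```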

CL-Flow:Strengthening the Normalizing Flows by Contrastive Learning for Better Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.06794
  • repo_url: None
  • paper_authors: Shunfeng Wang, Yueyang Li, Haichi Luo, Chenyang Bi
  • for: Addressing the scarcity of anomalous samples in anomaly detection with a self-supervised approach that combines contrastive learning with 2D-Flow, aiming for more precise detection and faster inference.
  • methods: Proposes an anomaly synthesis method that generates anomalous samples, with surrogate annotations, in line with authentic industrial scenarios, and enhances the 2D-Flow framework with contrastive learning over diverse proxy tasks to fine-tune the network while keeping it lightweight.
  • results: Compared with mainstream unsupervised methods, the approach achieves higher detection accuracy with fewer additional parameters and faster inference, setting new state-of-the-art image-level AUROC of 99.6% on MVTecAD and 96.8% on BTAD.
    Abstract In the anomaly detection field, the scarcity of anomalous samples has directed the current research emphasis towards unsupervised anomaly detection. While these unsupervised anomaly detection methods offer convenience, they also overlook the crucial prior information embedded within anomalous samples. Moreover, among numerous deep learning methods, supervised methods generally exhibit superior performance compared to unsupervised methods. Considering the reasons mentioned above, we propose a self-supervised anomaly detection approach that combines contrastive learning with 2D-Flow to achieve more precise detection outcomes and expedited inference processes. On one hand, we introduce a novel approach to anomaly synthesis, yielding anomalous samples in accordance with authentic industrial scenarios, alongside their surrogate annotations. On the other hand, having obtained a substantial number of anomalous samples, we enhance the 2D-Flow framework by incorporating contrastive learning, leveraging diverse proxy tasks to fine-tune the network. Our approach enables the network to learn more precise mapping relationships from self-generated labels while retaining the lightweight characteristics of the 2D-Flow. Compared to mainstream unsupervised approaches, our self-supervised method demonstrates superior detection accuracy, fewer additional model parameters, and faster inference speed. Furthermore, the entire training and inference process is end-to-end. Our approach showcases new state-of-the-art results, achieving a performance of 99.6\% in image-level AUROC on the MVTecAD dataset and 96.8\% in image-level AUROC on the BTAD dataset.

IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.06792
  • repo_url: None
  • paper_authors: Zhaoyuan Yang, Zhengyang Yu, Zhiwei Xu, Jaskirat Singh, Jing Zhang, Dylan Campbell, Peter Tu, Richard Hartley
  • for: Generating smooth, direct, and realistic image morphs between a given pair of images, including pairs from different classes.
  • methods: IMPUS, a diffusion-based morphing approach with perceptually-uniform sampling: the endpoint text embeddings are optimized, the images are mapped to the latent space with a probability flow ODE, and interpolation is carried out in the locally linear, continuous text embedding space and the Gaussian latent space; an adaptive bottleneck constraint based on a relative perceptual path diversity score balances path diversity against directness, and perceptually-uniform sampling yields visually smooth changes between interpolated images.
  • results: Achieves smooth, direct, and realistic image morphing, suppresses ghosting artifacts, and extends to other image generation tasks.
    Abstract We present a diffusion-based image morphing approach with perceptually-uniform sampling (IMPUS) that produces smooth, direct, and realistic interpolations given an image pair. A latent diffusion model has distinct conditional distributions and data embeddings for each of the two images, especially when they are from different classes. To bridge this gap, we interpolate in the locally linear and continuous text embedding space and Gaussian latent space. We first optimize the endpoint text embeddings and then map the images to the latent space using a probability flow ODE. Unlike existing work that takes an indirect morphing path, we show that the model adaptation yields a direct path and suppresses ghosting artifacts in the interpolated images. To achieve this, we propose an adaptive bottleneck constraint based on a novel relative perceptual path diversity score that automatically controls the bottleneck size and balances the diversity along the path with its directness. We also propose a perceptually-uniform sampling technique that enables visually smooth changes between the interpolated images. Extensive experiments validate that our IMPUS can achieve smooth, direct, and realistic image morphing and be applied to other image generation tasks.
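One ingredient of the latent-space interpolation can be sketched with spherical interpolation (slerp) between two Gaussian latents, which keeps intermediate latents plausible under the prior; this generic helper is an illustration only and omits the text-embedding interpolation, probability-flow-ODE inversion, adaptive bottleneck, and perceptually-uniform sampling described in the paper.

```python
import torch

def slerp(z0, z1, alpha, eps=1e-7):
    """Spherical interpolation between two Gaussian latents z0 and z1."""
    a, b = z0.flatten(), z1.flatten()
    cos = torch.clamp((a @ b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps)
    theta = torch.arccos(cos)
    w0 = torch.sin((1 - alpha) * theta) / torch.sin(theta)
    w1 = torch.sin(alpha * theta) / torch.sin(theta)
    return w0 * z0 + w1 * z1

# morph_latents = [slerp(z_start, z_end, a) for a in torch.linspace(0, 1, 16)]
```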

InfMLLM: A Unified Framework for Visual-Language Tasks

  • paper_url: http://arxiv.org/abs/2311.06791
  • repo_url: https://github.com/mightyzau/infmllm
  • paper_authors: Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi
  • for: Extending large language models (LLMs) to a broader spectrum of vision-language tasks, in particular image captioning, visual question answering (VQA), and visual grounding.
  • methods: A three-stage training scheme: lightweight alignment pretraining, moderate-weight multitask hybrid training, and finally LLM fine-tuning to improve instruction-following; a simple visual adapter module (pool-adapter) controls the number of visual embeddings passed to the LLM while preserving their positional information.
  • results: Preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks such as visual grounding; InfMLLM achieves state-of-the-art performance or performance comparable to recent MLLMs on various benchmarks. Code and models will be open-sourced at: \url{https://github.com/mightyzau/InfMLLM}.
    Abstract Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA,) and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and have evaluated it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at: \url{https://github.com/mightyzau/InfMLLM}.
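One plausible reading of the pool-adapter is sketched below: average-pool the ViT patch-token grid to a fixed smaller grid (so token order still reflects 2D position) and project to the LLM hidden size. Shapes, names, and the pooling choice are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PoolAdapter(nn.Module):
    """Sketch of a pooling adapter for visual tokens (an interpretation, not the released code)."""

    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (B, N, C) with N = H*W patch tokens on a square grid.
        b, n, c = vis_tokens.shape
        side = int(n ** 0.5)                               # assumes a square token grid
        grid = vis_tokens.transpose(1, 2).reshape(b, c, side, side)
        pooled = self.pool(grid)                           # (B, C, out, out), 2D positions kept in order
        tokens = pooled.flatten(2).transpose(1, 2)         # (B, out*out, C)
        return self.proj(tokens)                           # (B, out*out, llm_dim) for the LLM
```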

Explainability of Vision Transformers: A Comprehensive Review and New Perspectives

  • paper_url: http://arxiv.org/abs/2311.06786
  • repo_url: None
  • paper_authors: Rojina Kashefi, Leili Barekatain, Mohammad Sabokrou, Fatemeh Aghaeipoor
  • for: Reviewing how Vision Transformers (ViTs) arrive at their decisions and surveying the explainability methods proposed for them.
  • methods: Presents a taxonomy that organizes existing explainability methods according to their motivations, structures, and application scenarios.
  • results: Provides a comprehensive review of evaluation criteria for comparing explanation results and of explainability tools and frameworks, and highlights essential but unexplored aspects and promising research directions for future work.
    Abstract Transformers have had a significant impact on natural language processing and have recently demonstrated their potential in computer vision. They have shown promising results over convolution neural networks in fundamental computer vision tasks. However, the scientific community has not fully grasped the inner workings of vision transformers, nor the basis for their decision-making, which underscores the importance of explainability methods. Understanding how these models arrive at their decisions not only improves their performance but also builds trust in AI systems. This study explores different explainability methods proposed for visual transformers and presents a taxonomy for organizing them according to their motivations, structures, and application scenarios. In addition, it provides a comprehensive review of evaluation criteria that can be used for comparing explanation results, as well as explainability tools and frameworks. Finally, the paper highlights essential but unexplored aspects that can enhance the explainability of visual transformers, and promising research directions are suggested for future investment.

Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

  • paper_url: http://arxiv.org/abs/2311.06783
  • repo_url: https://github.com/Q-Future/Q-Instruct
  • paper_authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, Geng Xue, Wenxiu Sun, Qiong Yan, Weisi Lin
  • for: Enhancing the low-level visual abilities of multi-modality foundation models, such as perceiving and describing the clarity, color, and brightness of images.
  • methods: Collects large-scale human feedback on low-level vision (the Q-Pathway dataset: 58K detailed feedbacks on 18,973 images with diverse low-level appearance), then uses a GPT-participated conversion to process these feedbacks into 200K instruction-response pairs of diverse formats (Q-Instruct).
  • results: Training with Q-Instruct consistently elevates the low-level perception and understanding abilities of several foundation models.
    Abstract Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm for low-level visual perception and understanding tasks, that can respond to a broad range of natural human instructions in a model. While existing foundation models have shown exciting potentials on low-level visual tasks, their related abilities are still preliminary and need to be improved. In order to enhance these models, we conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision. Each feedback follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g.* clarity, color, brightness of an image), and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to robustly respond to diverse types of questions, we design a GPT-participated conversion to process these feedbacks into diverse-format 200K instruction-response pairs. Experimental results indicate that the **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundational models. We anticipate that our datasets can pave the way for a future that general intelligence can perceive, understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo is published at: https://q-future.github.io/Q-Instruct.