results: Our experimental results show that BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.
Abstract
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest among researchers in developing lightweight fine-tuning techniques to adapt these models to downstream visual tasks. We recognize that current state-of-the-art fine-tuning methods, such as Tip-Adapter, simply consider the covariance between the query image feature and the features of the few-shot support training samples, which only captures linear relations and potentially instigates a deceptive perception of independence. To address this issue, in this work, we innovatively introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. The BDC metric can model all possible relations, providing a robust metric for measuring feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter can freely handle non-linear relations and fully characterize independence, outperforming the current state-of-the-art methods by large margins.
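The BDC metric described above is, in essence, the distance covariance of Székely et al., computed from double-centered pairwise distance matrices. The following minimal NumPy sketch illustrates that computation; the function names and the plain biased estimator are our own choices for illustration, not the paper's actual implementation.

```python
import numpy as np

def bdc_matrix(X):
    """Double-centered Euclidean distance matrix of features X, shape (n, d)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Subtract row means and column means, add back the grand mean.
    return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

def bdc(X, Y):
    """Squared sample distance covariance between feature sets X and Y."""
    A, B = bdc_matrix(X), bdc_matrix(Y)
    n = X.shape[0]
    return (A * B).sum() / (n * n)
```

Unlike a plain covariance, this quantity is zero (in the population limit) only under true independence, which is why it can capture non-linear dependence between query and support features.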
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
results: Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks.
Abstract
We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, with applications in document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism directs the model's attention to the regions of interest generated by the query decoder, improving the model's effectiveness and accelerating decoding in the generative scheme. We also design several pre-training tasks to enhance the model's understanding and local awareness. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards efficient and effective end-to-end document understanding.
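The content-aware token merge idea can be illustrated with a small sketch: keep the visual tokens that receive the highest query-decoder attention and fold the remainder into a single score-weighted summary token. This is a hypothetical reconstruction; the function name, the top-k rule, and the weighted-average merge are our assumptions, not SeRum's actual module.

```python
import numpy as np

def merge_tokens(tokens, scores, keep=16):
    """tokens: (n, d) visual token embeddings; scores: (n,) attention scores."""
    order = np.argsort(scores)[::-1]          # tokens sorted by descending score
    kept = tokens[order[:keep]]               # regions of interest survive intact
    rest = order[keep:]
    if rest.size:
        # Merge low-score background tokens into one score-weighted summary,
        # so global context is retained while the decoder's workload shrinks.
        w = scores[rest] / max(scores[rest].sum(), 1e-8)
        summary = (w[:, None] * tokens[rest]).sum(0, keepdims=True)
        kept = np.concatenate([kept, summary], axis=0)
    return kept
```

Shrinking the token sequence this way is what lets a generative decoder run faster: decoding cost scales with the number of visual tokens it must attend over.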
Business Process Text Sketch Automation Generation Using Large Language Model
paper_authors: Rui Zhu, Quanzhou Hu, Wenxin Li, Honghao Xiao, Chaogang Wang, Zixin Zhou
for: This paper aims to address the challenge of business process document generation in the absence of datasets, and to provide a solution for improving the correctness of data-driven deep learning techniques in this domain.
methods: The authors propose an approach that transforms Conditional Process Trees (CPTs) into Business Process Text Sketches (BPTSs) using Large Language Models (LLMs). They also introduce a divide-and-conquer strategy to break down difficult CPTs into smaller, more manageable parts.
results: The authors report a correct rate of 93.42% using their proposed method, which is 45.17% better than traditional prompting methods. Their approach has the potential to provide a large number of datasets for the process model extraction (PME) domain.
Abstract
Business Process Management (BPM) is gaining increasing attention as it has the potential to cut costs while boosting output and quality. Business process document generation is a crucial stage in BPM. However, due to a shortage of datasets, data-driven deep learning techniques struggle to deliver the expected results. We propose an approach to transform Conditional Process Trees (CPTs) into Business Process Text Sketches (BPTSs) using Large Language Models (LLMs). The traditional prompting approach (Few-shot In-Context Learning) tries to get the correct answer in one go, and it can find the pattern for transforming simple CPTs into BPTSs, but for closed-domain CPTs and CPTs with complex hierarchies, traditional prompts perform weakly, with low correctness. Drawing inspiration from the divide-and-conquer strategy, we instead break a difficult CPT down into a number of basic CPTs and solve each one in turn. We randomly chose 100 process trees with depths ranging from 2 to 5, including CPTs with many nodes, many degrees of selection, and cyclic nesting. Experiments show that our method achieves a correct rate of 93.42%, which is 45.17% better than traditional prompting methods. Our proposed method provides a solution for business process document generation in the absence of datasets, and it also becomes potentially possible to provide a large number of datasets for the process model extraction (PME) domain.
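The divide-and-conquer idea can be sketched as a simple recursion over the process tree: describe each leaf activity with one LLM call, then combine the child descriptions at each internal node with another call. The CPT representation, the prompts, and the `Node` class below are hypothetical illustrations, since the paper's exact format is not given here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hypothetical Conditional Process Tree node: an operator or an activity."""
    label: str
    children: list = field(default_factory=list)

def sketch(node, llm):
    # Divide: a leaf activity is simple enough for a single prompt.
    if not node.children:
        return llm(f"Describe activity: {node.label}")
    # Conquer: solve each child subtree first, then merge the partial
    # sketches under the parent operator (e.g. SEQ, XOR, LOOP).
    parts = [sketch(child, llm) for child in node.children]
    return llm(f"Combine under {node.label}: " + " | ".join(parts))
```

With a real model, `llm` would wrap a chat-completion call carrying the few-shot prompt; for testing, any `str -> str` callable works. Each call then only ever sees a basic CPT, which is the regime where few-shot prompting stays reliable.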