results: The paper shows that the Qwen-VL series performs strongly across tasks such as image captioning, question answering, and visual grounding, and achieves excellent results on zero-shot captioning and visual or document visual question answering.
Abstract
We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
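The repository distributes ready-to-use checkpoints, so a quick way to try the model is through Hugging Face transformers. The sketch below is illustrative only: it assumes the Qwen/Qwen-VL-Chat checkpoint on the Hugging Face Hub and the from_list_format/chat helpers shipped with its custom remote code (trust_remote_code=True), and the image path is a placeholder.

```python
# Minimal sketch: image question answering with Qwen-VL-Chat.
# Assumes the Qwen/Qwen-VL-Chat checkpoint and its custom remote code
# (trust_remote_code=True) expose the from_list_format/chat helpers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build a multimodal prompt: one image plus a text question.
query = tokenizer.from_list_format([
    {"image": "demo.jpg"},  # placeholder image path
    {"text": "Describe this image and locate the main object."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```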
results: The paper shows that Code Llama achieves state-of-the-art performance among open models on several code benchmarks, and in some cases outperforms Llama 2 70B.
Abstract
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.
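Infilling based on surrounding content is one of the headline capabilities of the 7B and 13B variants. The following is a minimal sketch, assuming the codellama/CodeLlama-7b-hf checkpoint on the Hugging Face Hub and the <FILL_ME> infilling convention handled by the transformers Code Llama tokenizer; it illustrates the capability rather than the paper's own evaluation setup.

```python
# Minimal sketch: code infilling with a 7B Code Llama checkpoint.
# Assumes the codellama/CodeLlama-7b-hf weights on the Hugging Face Hub and the
# <FILL_ME> infilling convention supported by the Code Llama tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The model fills in the span between the prefix and the suffix.
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
filling = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(prompt.replace("<FILL_ME>", filling))
```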
Beyond Document Page Classification: Design, Datasets, and Challenges
results: An experimental study on the proposed multi-page document classification datasets finds that existing benchmarks have become irrelevant and need to be updated to evaluate complete documents as they occur in practice. It also calls for more mature evaluation methodologies, covering calibration evaluation, time-memory inference complexity, and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order).
Abstract
This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested ($X$: multi-channel, multi-paged, multi-industry; $Y$: class distributions and label set variety) and in classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.
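One concrete way to act on the call for calibration evaluation is to report expected calibration error (ECE) alongside accuracy. The sketch below is not from the paper; the 10-bin equal-width binning and the toy predictions are assumptions made purely for illustration.

```python
# Illustrative sketch: expected calibration error (ECE) for a document classifier.
# Not from the paper; the 10-bin equal-width scheme and toy inputs are assumptions.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy example: predicted class, its confidence, and the true class per document.
print(expected_calibration_error(
    confidences=[0.95, 0.80, 0.60, 0.99],
    predictions=[1, 0, 2, 1],
    labels=[1, 0, 1, 1],
))
```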