cs.CL - 2023-07-19

Android in the Wild: A Large-Scale Dataset for Android Device Control

paper_url: http://arxiv.org/abs/2307.10088
repo_url: https://github.com/google-research/google-research
paper_authors: Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap
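for: This paper presents a large-scale dataset for device-control research, aimed at studying how device-control systems understand language and reason over visual context.
methods: The dataset contains human demonstrations of device interactions, including the screens and actions, with the corresponding natural language instructions. It comprises 715k episodes and 30k unique instructions across four versions of Android (v10-13) and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions.
results: The paper reports the performance of two agents across the dataset and poses a new challenge: the actions available through the user interface must be inferred from their visual appearance. Instead of simple UI-element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets).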

Abstract
There is a growing interest in device-control systems that can interpret human natural language instructions and execute them on a digital device by directly controlling its user interface. We present a dataset for device-control research, Android in the Wild (AITW), which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context. This dataset poses a new challenge: actions available through the user interface must be inferred from their visual appearance. And, instead of simple UI element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). We organize our dataset to encourage robustness analysis of device-control systems, i.e., how well a system performs in the presence of new task descriptions, new applications, or new platform versions. We develop two agents and report performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.

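Summary
There is growing interest in device-control systems that interpret human natural language instructions and execute them by directly controlling a digital device's user interface. We release Android in the Wild (AITW), a dataset for device-control research that is orders of magnitude larger than existing datasets. It contains human demonstrations of device interactions, including the screens and actions, together with the corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10-13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It includes multi-step tasks that require semantic understanding of language and visual context. The dataset poses a new challenge: the actions available through the user interface must be inferred from their visual appearance. Instead of simple UI-element-based actions, the action space consists of precise gestures (e.g., horizontal scrolls to operate carousel widgets). The dataset is organized to encourage robustness analysis of device-control systems, i.e., how well a system performs with new task descriptions, new applications, or new platform versions. We develop two agents and report their performance across the dataset. The dataset is available at https://github.com/google-research/google-research/tree/master/android_in_the_wild.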
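The gesture-based action space can be illustrated with a minimal sketch. It assumes each action is stored as a touch point and a lift point in normalized screen coordinates; that representation, the function name `classify_gesture`, and the 0.04 tap threshold are illustrative assumptions, not details given in the abstract.

```python
import math


def classify_gesture(touch, lift, tap_threshold=0.04):
    """Label a gesture from its touch and lift points.

    Both points are (x, y) pairs in [0, 1] screen coordinates, with y
    growing downwards. The threshold is a hypothetical value.
    """
    dx = lift[0] - touch[0]
    dy = lift[1] - touch[1]
    # A gesture that barely moves is treated as a tap.
    if math.hypot(dx, dy) <= tap_threshold:
        return "tap"
    # Otherwise name the swipe after the dominant direction of finger motion.
    if abs(dx) >= abs(dy):
        return "swipe_right" if dx > 0 else "swipe_left"
    return "swipe_down" if dy > 0 else "swipe_up"


tap = classify_gesture((0.5, 0.5), (0.5, 0.5))
carousel = classify_gesture((0.8, 0.5), (0.2, 0.5))
```

A horizontal swipe such as `carousel` is the kind of gesture a carousel widget needs, and it cannot be expressed as a click on a single UI element.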

Generating Mathematical Derivations with Large Language Models

paper_url: http://arxiv.org/abs/2307.09998
repo_url: https://github.com/jmeadows17/deriving-equations-with-llms
paper_authors: Jordan Meadows, Marco Valentino, Andre Freitas
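for: This study uses large language models (LLMs) to derive mathematical results, in order to probe the models' limitations and potentially support mathematical discovery.
methods: A symbolic engine generates derivations of equations at scale. GPT models (via in-context learning) and a range of fine-tuned T5 models are then compared to assess the robustness and generalisation of their pre-training strategies.
results: Fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores, but the fine-tuned models are more sensitive to perturbations involving unseen symbols and, to a lesser extent, changes to equation structure. The analysis also highlights common reasoning errors such as incorrect, irrelevant, and redundant equations, and finds that existing metrics do not accurately assess the quality of generated mathematical text.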

Abstract
The derivation of mathematical results in specialised fields, using Large Language Models (LLMs), is an emerging research direction that can help identify models' limitations, and potentially support mathematical discovery. In this paper, we leverage a symbolic engine to generate derivations of equations at scale, and investigate the capabilities of LLMs when deriving goal equations from premises. Specifically, we employ in-context learning for GPT and fine-tune a range of T5 models to compare the robustness and generalisation of pre-training strategies to specialised models. Empirical results show that fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores. However, an in-depth analysis reveals that the fine-tuned models are more sensitive to perturbations involving unseen symbols and (to a lesser extent) changes to equation structure. In addition, we analyse 1.7K equations, and over 200 derivations, to highlight common reasoning errors such as the inclusion of incorrect, irrelevant, and redundant equations. Finally, we explore the suitability of existing metrics for evaluating mathematical derivations and find evidence that, while they can capture general properties such as sensitivity to perturbations, they fail to highlight fine-grained reasoning errors and essential differences between models. Overall, this work demonstrates that training models on synthetic data may improve their math capabilities beyond much larger LLMs, but current metrics are not appropriately assessing the quality of generated mathematical text.

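Summary
Deriving mathematical results in specialised fields with Large Language Models (LLMs) is an emerging research direction that can help identify models' limitations and potentially support mathematical discovery. In this paper, we use a symbolic engine to generate derivations of equations at scale and investigate how well LLMs derive goal equations from premises. Specifically, we apply in-context learning to GPT and fine-tune a range of T5 models to compare the robustness and generalisation of pre-training strategies against specialised models. Empirical results show that fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores. However, in-depth analysis reveals that the fine-tuned models are more sensitive to perturbations involving unseen symbols and, to a lesser extent, changes to equation structure. We also analyse 1.7K equations and over 200 derivations to highlight common reasoning errors such as incorrect, irrelevant, and redundant equations. Finally, we examine whether existing metrics are suitable for evaluating mathematical derivations: they capture general properties such as sensitivity to perturbations, but fail to highlight fine-grained reasoning errors and essential differences between models. Overall, training models on synthetic data may improve their math capabilities beyond much larger LLMs, but current metrics do not appropriately assess the quality of generated mathematical text.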
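The "unseen symbols" perturbation can be sketched as a simultaneous renaming of the symbols in a premise. This is a minimal illustration on plain strings; the function name `rename_symbols` and the example equation are assumptions, and the paper's own symbolic engine is not reproduced here.

```python
import re


def rename_symbols(equation, mapping):
    """Rename whole-word symbols in an equation string in a single pass.

    A single regex pass avoids chained renames (x -> y -> x) when the
    mapping swaps symbols.
    """
    alternatives = "|".join(re.escape(symbol) for symbol in mapping)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], equation)


premise = "E = m*c**2"
# Hypothetical replacement symbols a model is unlikely to have seen in this role.
perturbed = rename_symbols(premise, {"E": "Q", "m": "u", "c": "w"})
swapped = rename_symbols("x + y", {"x": "y", "y": "x"})
```

A model that has learned the derivation rather than the surface symbols should behave the same on `premise` and `perturbed`.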

GUIDO: A Hybrid Approach to Guideline Discovery & Ordering from Natural Language Texts

paper_url: http://arxiv.org/abs/2307.09959
repo_url: https://github.com/nils-freyer/guido
paper_authors: Nils Freyer, Dustin Thewes, Matthias Meinecke
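for: Extracting workflow nets from textual descriptions, in order to simplify guidelines or formalize textual descriptions of processes.
methods: A BERT-based classifier labels each sentence by its relevance to the process model, and dependency parsing then extracts the process model from the sentences classified as relevant.
results: GUIDO extracts process models well (average behavioral similarity score of 0.93) and outperforms a pure rule-based approach, while annotation costs stay low compared with purely machine-learning-based approaches.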

Abstract
Extracting workflow nets from textual descriptions can be used to simplify guidelines or formalize textual descriptions of formal processes like business processes and algorithms. The task of manually extracting processes, however, requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive. Therefore, there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant and irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to the process model extraction task that first, classifies sentences regarding their relevance to the process model, using a BERT-based sentence classifier, and second, extracts a process model from the sentences classified as relevant, using dependency parsing. The presented approach achieves significantly better results than a pure rule-based approach. GUIDO achieves an average behavioral similarity score of $0.93$. Still, in comparison to purely machine-learning-based approaches, the annotation costs stay low.

Summary
Workflow nets extracted from textual descriptions can be used to simplify guidelines or to formalize textual descriptions of formal processes such as business processes and algorithms. However, manually extracting processes requires domain expertise and effort. While automatic process model extraction is desirable, annotating texts with formalized process models is expensive, so there are only a few machine-learning-based extraction approaches. Rule-based approaches, in turn, require domain specificity to work well and can rarely distinguish relevant from irrelevant information in textual descriptions. In this paper, we present GUIDO, a hybrid approach to process model extraction that first classifies sentences by their relevance to the process model using a BERT-based sentence classifier, and then extracts a process model from the sentences classified as relevant using dependency parsing. The approach achieves significantly better results than a pure rule-based approach, with an average behavioral similarity score of $0.93$. Still, compared with purely machine-learning-based approaches, the annotation costs stay low.

Large Language Models can accomplish Business Process Management Tasks

paper_url: http://arxiv.org/abs/2307.09923
repo_url: None
paper_authors: Michael Grohs, Luka Abb, Nourhan Elsayed, Jana-Rebecca Rehse
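for: This study examines how large language models (LLMs) can accomplish text-related tasks in Business Process Management (BPM), a field that aims to improve organizational activities and their outcomes.
methods: A specific LLM is applied to three exemplary BPM tasks: mining imperative process models from textual descriptions, mining declarative process models from textual descriptions, and assessing the suitability of process tasks from textual descriptions for robotic process automation.
results: Without extensive configuration or prompt engineering, LLMs perform comparably to or better than existing solutions. The paper also discusses implications for future BPM research and for practical usage.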

Abstract
Business Process Management (BPM) aims to improve organizational activities and their outcomes by managing the underlying processes. To achieve this, it is often necessary to consider information from various sources, including unstructured textual documents. Therefore, researchers have developed several BPM-specific solutions that extract information from textual documents using Natural Language Processing techniques. These solutions are specific to their respective tasks and cannot accomplish multiple process-related problems as a general-purpose instrument. However, in light of the recent emergence of Large Language Models (LLMs) with remarkable reasoning capabilities, such a general-purpose instrument with multiple applications now appears attainable. In this paper, we illustrate how LLMs can accomplish text-related BPM tasks by applying a specific LLM to three exemplary tasks: mining imperative process models from textual descriptions, mining declarative process models from textual descriptions, and assessing the suitability of process tasks from textual descriptions for robotic process automation. We show that, without extensive configuration or prompt engineering, LLMs perform comparably to or better than existing solutions and discuss implications for future BPM research as well as practical usage.

DAPrompt: Deterministic Assumption Prompt Learning for Event Causality Identification

paper_url: http://arxiv.org/abs/2307.09813
repo_url: None
paper_authors: Wei Xiang, Chuanhong Zhan, Bang Wang
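for: This study addresses Event Causality Identification (ECI): determining whether a causal relation exists between two event mentions.
methods: The authors propose a deterministic assumption prompt learning model, DAPrompt, which draws on the encyclopedia-like knowledge embedded in a pre-trained language model.
results: Experiments on the EventStoryLine corpus and the Causal-TimeBank corpus show significant performance improvements over state-of-the-art algorithms.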

Abstract
Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a deterministic assumption on the existence of causal relation between two events and then evaluate its rationality to either accept or reject the assumption. The design motivation is to try the most utilization of the encyclopedia-like knowledge embedded in a pre-trained language model. In light of such considerations, we propose a deterministic assumption prompt learning model, called DAPrompt, for the ECI task. In particular, we design a simple deterministic assumption template concatenating with the input event pair, which includes two masks as predicted events' tokens. We use the probabilities of predicted events to evaluate the assumption rationality for the final event causality decision. Experiments on the EventStoryLine corpus and Causal-TimeBank corpus validate our design objective in terms of significant performance improvements over the state-of-the-art algorithms.

IncDSI: Incrementally Updatable Document Retrieval

paper_url: http://arxiv.org/abs/2307.10323
repo_url: None
paper_authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
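for: This paper concerns document retrieval with the Differentiable Search Index, a paradigm in which a neural network maps queries directly to the corresponding documents.
methods: IncDSI adds new documents to a trained Differentiable Search Index by formulating the addition as a constrained optimization problem that makes minimal changes to the network parameters, without retraining on the entire dataset.
results: New documents can be added in about 20-50 ms each, and performance is competitive with retraining the model on the whole dataset.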

Abstract
Differentiable Search Index is a recently proposed paradigm for document retrieval, that encodes information about a corpus of documents within the parameters of a neural network and directly maps queries to corresponding documents. These models have achieved state-of-the-art performances for document retrieval across many benchmarks. These kinds of models have a significant limitation: it is not easy to add new documents after a model is trained. We propose IncDSI, a method to add documents in real time (about 20-50ms per document), without retraining the model on the entire dataset (or even parts thereof). Instead we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with re-training the model on the whole dataset and enables the development of document retrieval systems that can be updated with new information in real-time. Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.

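Summary
Differentiable Search Index is a recently proposed paradigm for document retrieval that encodes information about a corpus of documents in the parameters of a neural network and maps queries directly to the corresponding documents. These models achieve state-of-the-art performance on many benchmarks, but they have a significant limitation: it is not easy to add new documents after training. We propose IncDSI, a method that adds documents in real time (about 20-50 ms per document) without retraining the model on the entire dataset, or even parts of it. Instead, we formulate the addition of documents as a constrained optimization problem that makes minimal changes to the network parameters. Although orders of magnitude faster, our approach is competitive with retraining the model on the whole dataset, which makes it possible to build document retrieval systems that are updated with new information in real time. The IncDSI code is available at https://github.com/varshakishore/IncDSI.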
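The idea of adding a document through a small constrained update can be sketched on a toy model in which each document is one vector and retrieval is an argmax over dot products. This is a simplified stand-in, not the IncDSI algorithm: the names `add_document` and `retrieve`, the margin, and the single constraint are assumptions, and the constraints that would protect queries for existing documents are omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(doc_vectors, query_emb):
    """Return the index of the highest-scoring document."""
    scores = [dot(query_emb, w) for w in doc_vectors]
    return scores.index(max(scores))


def add_document(doc_vectors, query_emb, margin=1.0):
    """Append one vector for a new document; existing vectors stay untouched.

    Solves: minimise ||v - v0||^2 subject to q.v >= max_j q.w_j + margin,
    with v0 = q. The solution is the projection of v0 onto that half-space.
    """
    target = max(dot(query_emb, w) for w in doc_vectors) + margin
    new_vec = list(query_emb)
    gap = target - dot(query_emb, new_vec)
    if gap > 0:
        norm_sq = dot(query_emb, query_emb)
        new_vec = [v + gap * q / norm_sq for v, q in zip(new_vec, query_emb)]
    return doc_vectors + [new_vec]


docs = [[1.0, 0.0], [0.0, 1.0]]
query_for_new_doc = [0.6, 0.8]
updated = add_document(docs, query_for_new_doc)
hit = retrieve(updated, query_for_new_doc)
```

Because only one vector is appended and nothing is retrained, the update cost does not depend on the size of the original training set, which is the property the abstract emphasises.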

On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models

paper_url: http://arxiv.org/abs/2307.09793
repo_url: None
paper_authors: Sarah Gao, Andrew Kean Gao
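for: This study identifies families and subgroups among large language models (LLMs) that share common features and trends, so that the growing population of models can be better understood and analyzed.
methods: The study applies hierarchical clustering, n-grams, and term frequency-inverse document frequency (TF-IDF) to the names of Hugging Face LLMs, and builds a web application that rapidly generates several kinds of visualization.
results: The methods successfully identify families of LLMs and accurately cluster them into meaningful subgroups. The Constellation web application generates dendrograms, graphs, word clouds, and scatter plots for exploring the 15,821 models.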

Abstract
Since late 2022, Large Language Models (LLMs) have become very prominent with LLMs like ChatGPT and Bard receiving millions of users. Hundreds of new LLMs are announced each week, many of which are deposited to Hugging Face, a repository of machine learning models and datasets. To date, nearly 16,000 Text Generation models have been uploaded to the site. Given the huge influx of LLMs, it is of interest to know which LLM backbones, settings, training methods, and families are popular or trending. However, there is no comprehensive index of LLMs available. We take advantage of the relatively systematic nomenclature of Hugging Face LLMs to perform hierarchical clustering and identify communities amongst LLMs using n-grams and term frequency-inverse document frequency. Our methods successfully identify families of LLMs and accurately cluster LLMs into meaningful subgroups. We present a public web application to navigate and explore Constellation, our atlas of 15,821 LLMs. Constellation rapidly generates a variety of visualizations, namely dendrograms, graphs, word clouds, and scatter plots. Constellation is available at the following link: https://constellation.sites.stanford.edu/.
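The naming-based similarity can be sketched with character n-grams and TF-IDF weights computed from the standard library. Whether the authors use character or word n-grams is not stated in the abstract, so the trigram choice, the smoothed IDF formula, and the model names below are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a model name into overlapping character n-grams."""
    name = name.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def tfidf_vectors(names, n=3):
    """Build one sparse TF-IDF vector (a dict) per model name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(names)
    return [
        {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1) for g, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    num = sum(w * v.get(g, 0.0) for g, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0


# Hypothetical names in the style of Hugging Face repositories.
names = ["llama-7b-chat", "llama-13b-chat", "bert-base-uncased"]
vecs = tfidf_vectors(names)
same_family = cosine(vecs[0], vecs[1])
other_family = cosine(vecs[0], vecs[2])
```

A pairwise similarity matrix built this way is the kind of input a hierarchical clustering routine would turn into a dendrogram.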

Mood Classification of Bangla Songs Based on Lyrics

paper_url: http://arxiv.org/abs/2307.10314
repo_url: None
paper_authors: Maliha Mahajebin, Mohammad Rifat Ahmmad Rashid, Nafees Mansoor
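for: This study classifies the mood of Bangla songs from their lyrics, to better understand how people relate to music emotionally.
methods: The study applies Natural Language Processing and BERT to the lyrics of 4000 Bangla songs and assigns each song one of four moods: Happy, Sad, Romantic, or Relaxed.
results: Of the 4000 songs, 1513 are labelled sad, 1362 romantic, 886 happy, and the remaining 239 relaxed. The authors report that the four moods are derived accurately from the lyrics, although the abstract gives no accuracy figure.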

Abstract
Music can evoke various emotions, and with the advancement of technology, it has become more accessible to people. Bangla music, which portrays different human emotions, lacks sufficient research. The authors of this article aim to analyze Bangla songs and classify their moods based on the lyrics. To achieve this, this research has compiled a dataset of 4000 Bangla song lyrics, genres, and used Natural Language Processing and the Bert Algorithm to analyze the data. Among the 4000 songs, 1513 songs are represented for the sad mood, 1362 for the romantic mood, 886 for happiness, and the rest 239 are classified as relaxation. By embedding the lyrics of the songs, the authors have classified the songs into four moods: Happy, Sad, Romantic, and Relaxed. This research is crucial as it enables a multi-class classification of songs' moods, making the music more relatable to people's emotions. The article presents the automated result of the four moods accurately derived from the song lyrics.

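Summary
Music can evoke many emotions, and advances in technology have made it more accessible. Bangla music, which portrays a range of human emotions, has not been studied sufficiently. The authors aim to analyze Bangla songs and classify their moods from the lyrics. To this end, the study compiles a dataset of 4000 Bangla song lyrics and genres, and analyzes it with Natural Language Processing and BERT. In total, 1513 songs express sadness, 1362 express romance, 886 express happiness, and the remaining 239 are classified as relaxed. By embedding the lyrics, the authors classify the songs into four moods: Happy, Sad, Romantic, and Relaxed. The work matters because it enables multi-class classification of song moods, making music more relatable to people's emotions. The article presents the automated classification of the four moods derived from the song lyrics.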
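The reported label counts imply a noticeably imbalanced dataset, which a short calculation makes concrete. The counts come from the abstract; the majority-class baseline is a standard reference point added here for context, not a figure from the paper.

```python
# Label counts as reported in the abstract.
counts = {"Sad": 1513, "Romantic": 1362, "Happy": 886, "Relaxed": 239}

total = sum(counts.values())
shares = {mood: n / total for mood, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent mood.
majority_baseline = max(shares.values())
```

Any reported accuracy is easier to interpret against `majority_baseline` (about 37.8%), and the Relaxed class holds just under 6% of the songs.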

CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

paper_url: http://arxiv.org/abs/2307.09705
repo_url: https://github.com/x-plug/cvalues
paper_authors: Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou
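for: Evaluating whether large language models (LLMs) align with human values is increasingly important. This paper presents CValues, the first Chinese human values evaluation benchmark, which measures the alignment of LLMs on both safety and responsibility criteria.
methods: The benchmark uses manually collected adversarial safety prompts and responsibility prompts induced by professional experts, together with multi-choice prompts for automatic evaluation, to provide a comprehensive values evaluation of Chinese LLMs.
results: Most Chinese LLMs perform well on safety, but there is considerable room for improvement on responsibility. Both automatic and human evaluation are important for assessing human values alignment of Chinese LLMs.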

Abstract
With the rapid evolution of large language models (LLMs), there is a growing concern that they may pose risks or have negative social impacts. Therefore, evaluation of human values alignment is becoming increasingly important. Previous work mainly focuses on assessing the performance of LLMs on certain knowledge and reasoning abilities, while neglecting the alignment to human values, especially in a Chinese context. In this paper, we present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs in terms of both safety and responsibility criteria. As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains by professional experts. To provide a comprehensive values evaluation of Chinese LLMs, we not only conduct human evaluation for reliable comparison, but also construct multi-choice prompts for automatic evaluation. Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility. Moreover, both the automatic and human evaluation are important for assessing the human values alignment in different aspects. The benchmark and code is available on ModelScope and Github.

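Summary
With the rapid evolution of large language models (LLMs), concern is growing that they may pose risks or have negative social impacts, so evaluating alignment with human values is becoming increasingly important. Previous work mainly assesses LLMs on certain knowledge and reasoning abilities and neglects alignment with human values, especially in a Chinese context. In this paper, we present CValues, the first Chinese human values evaluation benchmark, which measures the alignment of LLMs on safety and responsibility criteria. We manually collected adversarial safety prompts across 10 scenarios and responsibility prompts from 8 domains induced by professional experts, and constructed multi-choice prompts for automatic evaluation. We find that most Chinese LLMs perform well on safety, but there is considerable room for improvement on responsibility. Both human and automatic evaluation are important for assessing human values alignment. The benchmark and code are available on ModelScope and GitHub.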

Efficient Guided Generation for Large Language Models

paper_url: http://arxiv.org/abs/2307.09702
repo_url: https://github.com/normal-computing/outlines
paper_authors: Brandon T. Willard, Rémi Louf
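for: This paper reformulates neural text generation in terms of transitions between the states of a finite-state machine.
methods: Regular expressions and context-free grammars guide text generation through an index built over the language model's vocabulary, which guarantees the structure of the generated text.
results: The approach significantly outperforms existing solutions and allows domain-specific knowledge and constraints to be enforced.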

Abstract
In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model's vocabulary. The approach is model agnostic, allows one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines

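Summary
In this article, we show how neural text generation can be reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient way of guiding text generation with regular expressions and context-free grammars, by allowing an index to be built over a language model's vocabulary. The approach is model agnostic, allows domain-specific knowledge and constraints to be enforced, and enables reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines.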
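The vocabulary index can be sketched for a hand-written automaton that accepts decimals of the form `[0-9]+\.[0-9]+`. This is a minimal illustration of the idea, not the Outlines API: the state numbering, the function names, and the toy vocabulary are assumptions.

```python
DIGITS = "0123456789"


def step(state, ch):
    """One DFA transition; returns None when the character is not allowed.

    States: 0 = start, 1 = integer digits, 2 = just after the dot,
    3 = fractional digits (accepting).
    """
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None


def walk(state, token):
    """Feed a multi-character token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def build_index(vocab, states=(0, 1, 2, 3)):
    """Map each state to the tokens allowed there and the state they lead to."""
    index = {}
    for state in states:
        index[state] = {}
        for token in vocab:
            nxt = walk(state, token)
            if nxt is not None:
                index[state][token] = nxt
    return index


# Toy vocabulary with multi-character tokens, as a subword tokenizer would have.
vocab = ["1", "23", ".", ".5", "a", "4.2", "7."]
index = build_index(vocab)
```

The index is computed once per pattern. During generation, the sampler only has to look up `index[state]` and mask every token outside it, which is why the per-token overhead stays small.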

Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

paper_url: http://arxiv.org/abs/2307.09701
repo_url: None
paper_authors: Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi
for: The paper aims to address the practical challenges in evaluating and comparing the efficiency of natural language processing (NLP) models, and to provide a standardized and centralized platform for fair and reproducible evaluations.
methods: The paper introduces Pentathlon, a benchmark for holistic and realistic evaluation of NLP model efficiency, which focuses on inference and offers a strictly-controlled hardware platform, a suite of metrics, and a software library for seamless integration.
results: The paper hopes to stimulate algorithmic innovations in building efficient NLP models and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.

Abstract
Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly-controlled hardware platform, and is designed to mirror real-world applications scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enable evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload to make fair and reproducible efficiency comparisons. While initially focused on natural language processing (NLP) models, Pentathlon is designed to allow flexible extension to other fields. We envision Pentathlon will stimulate algorithmic innovations in building efficient models, and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.

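Summary
The rising computational demands of modern natural language processing (NLP) systems have raised the barrier to entry for cutting-edge research and pose serious environmental concerns. Yet progress on model efficiency has been impeded by practical challenges in evaluating and comparing models. For example, hardware is hard to control because access differs across institutions, and improvements in metrics such as FLOPs often fail to translate into progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for the majority of the compute in a model's lifecycle. It offers a strictly controlled hardware platform and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be integrated into any codebase to enable evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the work needed to make fair and reproducible efficiency comparisons. Although it initially focuses on NLP models, it is designed to extend to other fields. We expect Pentathlon to stimulate algorithmic innovations in building efficient models and to foster awareness of the social and environmental implications of developing future-generation NLP models.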
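Two of the listed metrics, latency and throughput, can be measured with the standard library alone. The sketch below is not the Pentathlon library: the function name `measure`, the warm-up count, and the toy workload are assumptions, and memory overhead and energy consumption need instrumentation that is not shown.

```python
import statistics
import time


def measure(fn, inputs, warmup=2):
    """Time fn on each input and report mean latency and overall throughput."""
    # Warm-up calls keep one-off start-up costs out of the measurements.
    for item in inputs[:warmup]:
        fn(item)

    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = max(time.perf_counter() - start, 1e-12)

    return {
        "mean_latency_s": statistics.mean(latencies),
        "throughput_per_s": len(inputs) / elapsed,
    }


# Toy stand-in for a model's inference call.
result = measure(str.upper, ["an example request"] * 50)
```

Numbers from such a script are only comparable when the hardware is held fixed, which is the motivation for a strictly controlled platform.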

Analyzing sports commentary in order to automatically recognize events and extract insights

paper_url: http://arxiv.org/abs/2307.10303
repo_url: https://github.com/yanismiraoui/analyzing-sports-commentary-in-order-to-automatically-recognize-events-and-extract-insights
paper_authors: Yanis Miraoui
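for: This study uses several Natural Language Processing techniques and methods to automatically recognize the main actions in sports events.
methods: The study analyzes live sports commentaries from different sources and classifies the major actions into different categories in order to extract insights.
results: The study also examines whether sentiment analysis can help detect these main actions. The abstract does not report the outcome.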

Abstract
In this paper, we carefully investigate how we can use multiple different Natural Language Processing techniques and methods in order to automatically recognize the main actions in sports events. We aim to extract insights by analyzing live sport commentaries from different sources and by classifying these major actions into different categories. We also study if sentiment analysis could help detect these main actions.

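Summary
In this paper, we carefully investigate how multiple Natural Language Processing techniques and methods can be used to automatically recognize the main actions in sports events. We aim to extract insights by analyzing live sports commentaries from different sources and classifying these major actions into different categories. We also study whether sentiment analysis can help detect these main actions.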

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study

paper_url: http://arxiv.org/abs/2307.09532
repo_url: https://github.com/damithdr/legal-classification
paper_authors: Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov
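for: This study addresses long document classification, a challenge that arises when transformer models are adapted to multiple domains.
methods: The study applies Model Fusing to long document classification and compares the results with the BERT and Longformer architectures.
results: The abstract does not state the outcome of the comparison. It notes that most transformer models are limited to 512 tokens and therefore struggle with long documents.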

Abstract
Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore on employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.

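Summary
Text classification has been studied for years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges, one of which is long document classification. While state-of-the-art transformer models give excellent results in text classification, most are limited in the maximum length of the input sequence and therefore struggle with long documents. In this research, we explore Model Fusing for long document classification and compare the results with the well-known BERT and Longformer architectures.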

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

paper_url: http://arxiv.org/abs/2307.09474
repo_url: None
paper_authors: Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
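for: Improving the human-AI interactivity of multimodal large language models (MLLMs).
methods: Precise referring instructions use diverse reference representations, such as points and boxes, as referring prompts, so that MLLMs can focus on a specific region at a finer granularity.
results: ChatSpot shows promising performance across interaction forms and evaluation tasks, and offers a more flexible and seamless interactive experience.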

Abstract
Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, leading to the limitation of the interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generating. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance.

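Summary
Human-AI interactivity is a critical aspect of multimodal large language models (MLLMs), but existing end-to-end MLLMs only let users interact through language instructions, which limits interactive accuracy and efficiency. In this study, we present precise referring instructions that use diverse reference representations, such as points and boxes, as referring prompts to direct the MLLM to a specific region. This lets MLLMs focus on the region of interest and interact at a finer granularity. Based on precise referring instructions, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interaction, including mouse clicks, drag-and-drop, and drawing boxes, for a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generation, and design a series of evaluation tasks to assess region recognition and interaction. Experimental results show that ChatSpot performs promisingly.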

A comparative analysis of SRGAN models

paper_url: http://arxiv.org/abs/2307.09456
repo_url: None
paper_authors: Fatemeh Rezapoor Nikroo, Ajinkya Deshmukh, Anantha Sharma, Adrian Tam, Kaarthik Kumar, Cleo Norris, Aditya Dangi
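for: The models studied perform single-image super-resolution, increasing image resolution while preserving visual quality.
methods: The study evaluates several state-of-the-art SRGAN models, namely ESRGAN, Real-ESRGAN, and EDSR, on a benchmark of real-world images degraded by a pipeline.
results: The EDSR-BASE model from huggingface performs best on both quantitative metrics and subjective visual quality assessments, with the least compute overhead. EDSR produces images with higher PSNR and SSIM values and yields high-quality OCR results with the Tesseract OCR engine.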

Abstract
In this study, we evaluate the performance of multiple state-of-the-art SRGAN (Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN and EDSR, on a benchmark dataset of real-world images which undergo degradation using a pipeline. Our results show that some models seem to significantly increase the resolution of the input images while preserving their visual quality, this is assessed using Tesseract OCR engine. We observe that EDSR-BASE model from huggingface outperforms the remaining candidate models in terms of both quantitative metrics and subjective visual quality assessments with least compute overhead. Specifically, EDSR generates images with higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and are seen to return high quality OCR results with Tesseract OCR engine. These findings suggest that EDSR is a robust and effective approach for single-image super-resolution and may be particularly well-suited for applications where high-quality visual fidelity is critical and optimized compute.

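Summary
In this study, we evaluate several state-of-the-art SRGAN (Super Resolution Generative Adversarial Network) models, ESRGAN, Real-ESRGAN, and EDSR, on real-world images that have been degraded by a pipeline. Our results show that some models significantly increase the resolution of the input images while preserving their visual quality, as assessed with the Tesseract OCR engine. We find that the EDSR-BASE model from huggingface outperforms the other candidates on both quantitative metrics and subjective visual quality assessments, with the least compute overhead. Specifically, EDSR produces images with higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values and returns high-quality OCR results. These findings suggest that EDSR is a robust and effective approach to single-image super-resolution, suited to applications that need high visual fidelity and optimized compute.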
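PSNR, one of the two quantitative metrics named above, follows directly from the mean squared error: PSNR = 10 log10(MAX^2 / MSE). The sketch below works on flat lists of pixel values and assumes 8-bit images (MAX = 255); the function name and the sample pixels are illustrative.

```python
import math


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities.
    """
    if len(reference) != len(output):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values: each pixel is off by one intensity level.
close_match = psnr([10, 20], [11, 19])
worst_case = psnr([0, 0, 0, 0], [255, 255, 255, 255])
```

Higher is better: a one-level error per pixel gives roughly 48 dB, while the maximally wrong image gives 0 dB.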

Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers

paper_url: http://arxiv.org/abs/2307.09455
repo_url: None
paper_authors: Jaeyoung Kim, Kyuheon Jung, Dongbin Na, Sion Jang, Eunbin Park, Sungchul Choi
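for: This paper helps language models distinguish in-distribution (ID) samples from out-of-distribution (OOD) samples, so that unreliable inputs can be flagged or rejected instead of receiving overconfident predictions.
methods: The paper proposes a simple but effective method called Pseudo Outlier Exposure (POE), which constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes and uses it to train a rejection network. POE needs no external OOD data and can be implemented within off-the-shelf Transformers.
results: POE is competitive with state-of-the-art algorithms on several text classification benchmarks.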

Abstract
For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful to alert users or reject such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples with a similar semantic representation to ID samples since these OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets brings an additional burden for data collection. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD sample introduced by POE shows a similar representation to ID data, which is most effective in training a rejection network. Our method does not require any external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.
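The construction of a surrogate OOD sample can be sketched as masking the tokens most related to the ID class. How POE scores token relevance is not described in the abstract, so the `relevance` scores, the function name, and the example sentence below are hypothetical.

```python
def mask_class_tokens(tokens, relevance, num_masks, mask_token="[MASK]"):
    """Replace the num_masks most class-related tokens with a mask token.

    relevance[i] is a hypothetical score for how strongly tokens[i] signals
    the in-distribution class (for example, derived from attention weights).
    """
    # Visit positions from most to least class-related.
    order = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    masked = list(tokens)
    for position in order[:num_masks]:
        masked[position] = mask_token
    return masked


tokens = ["the", "team", "won", "the", "match"]
relevance = [0.0, 0.9, 0.4, 0.0, 0.8]  # made-up scores for a sports class
pseudo_outlier = mask_class_tokens(tokens, relevance, num_masks=2)
```

The masked sentence keeps the surface form of ID data but loses its class evidence, which is why such samples sit close to the ID manifold and make useful training signal for a rejection network.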

Summary
Detecting out-of-distribution (OOD) samples lets real-world language applications alert users or reject unreliable inputs, yet over-parameterized language models tend to be overconfident on both in-distribution (ID) and OOD samples, especially OOD samples that lie near the ID manifold. Training a rejection network normally requires collecting auxiliary outlier data. Pseudo Outlier Exposure (POE) avoids that cost by sequentially masking tokens related to ID classes to build a surrogate OOD dataset whose representations resemble ID data. The method needs no external OOD data, works with off-the-shelf Transformers, and is competitive with state-of-the-art algorithms on several text classification benchmarks.
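The sequential masking idea from the abstract above can be pictured with a minimal sketch. It assumes that a per-token relevance score for the ID class is already available; how POE computes that relevance is not described here, so the `relevance` argument, the `MASK_TOKEN` symbol, and the function name are all hypothetical and this is not the authors' implementation.

```python
from typing import List, Sequence

MASK_TOKEN = "[MASK]"  # assumed mask symbol, as used by BERT-style tokenizers


def sequentially_mask(
    tokens: Sequence[str],
    relevance: Sequence[float],
    num_steps: int,
) -> List[List[str]]:
    """Mask the most class-relevant tokens one at a time.

    Returns one variant per step; step k has the k most relevant tokens masked.
    `relevance` is a hypothetical per-token score (higher = more ID-class related).
    """
    if len(tokens) != len(relevance):
        raise ValueError("tokens and relevance must have the same length")

    # Token positions ordered from most to least relevant.
    order = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)

    current = list(tokens)
    variants: List[List[str]] = []
    for position in order[:num_steps]:
        current[position] = MASK_TOKEN
        variants.append(list(current))  # snapshot after each masking step
    return variants


if __name__ == "__main__":
    example = sequentially_mask(
        ["the", "movie", "was", "great"], [0.0, 0.6, 0.1, 0.9], num_steps=2
    )
    for variant in example:
        print(" ".join(variant))
```

Each variant could then serve as a surrogate outlier when training a rejection network, since it stays close to the original sentence while losing its class-bearing tokens.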

Let’s ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

paper_url: http://arxiv.org/abs/2307.09416
repo_url: None
paper_authors: Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe
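for: This paper addresses automatic evaluation in image generation, using language models and visual question answering to assess the quality of generated images and their consistency with the prompt.
methods: The paper proposes a new automated evaluation method, Visual Concept Evaluation (ViCE), which assesses images through a process that replicates human cognition. ViCE combines large language models with visual question answering and investigates the image through a Q&A system to produce an evaluation.
results: According to the paper, ViCE is still at a preliminary assessment stage, but the results are promising and suggest it could become a useful automatic evaluation tool for image generation and image target editing tasks.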

Abstract
Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content and the adherence with the prompted requests: so far, only human-based evaluations have been adopted for quality satisfaction and for comparing different generative methods. We introduce a novel automated method for Visual Concept Evaluation (ViCE), i.e. to assess consistency between a generated/edited image and the corresponding prompt/instructions, with a process inspired by the human cognitive behaviour. ViCE combines the strengths of Large Language Models (LLMs) and Visual Question Answering (VQA) into a unified pipeline, aiming to replicate the human cognitive process in quality assessment. This method outlines visual concepts, formulates image-specific verification questions, utilizes the Q&A system to investigate the image, and scores the combined outcome. Although this brave new hypothesis of mimicking humans in the image evaluation process is in its preliminary assessment stage, results are promising and open the door to a new form of automatic evaluation which could have significant impact as the image generation or the image target editing tasks become more and more sophisticated.

Summary
Image generation still lacks a methodical framework for quantitatively measuring the quality of generated content and its adherence to the prompt; so far, only human evaluation has been used. To address this issue, we propose a novel automated method for Visual Concept Evaluation (ViCE), which assesses the consistency between a generated/edited image and the corresponding prompt/instructions. This method is inspired by human cognitive behavior and combines the strengths of large language models (LLMs) and visual question answering (VQA) into a unified pipeline. The ViCE method outlines visual concepts, formulates image-specific verification questions, utilizes a Q&A system to investigate the image, and scores the combined outcome. Although this approach is in its preliminary assessment stage, the results are promising and open the door to a new form of automatic evaluation that could have significant impact as image generation and image target editing tasks become more sophisticated.
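The question-and-answer loop described above can be sketched as follows. This is only an illustration under stated assumptions: `propose_questions` stands in for the LLM step, `answer_question` stands in for the VQA system, and scoring by the fraction of matching answers is an assumed aggregation rule, not one confirmed by the paper.

```python
from typing import Callable, List, Tuple

# A verification question paired with the answer expected if the image
# matches the prompt (hypothetical representation).
Question = Tuple[str, str]


def consistency_score(
    prompt: str,
    image: object,
    propose_questions: Callable[[str], List[Question]],
    answer_question: Callable[[object, str], str],
) -> float:
    """Return the fraction of verification questions the image satisfies."""
    questions = propose_questions(prompt)
    if not questions:
        raise ValueError("no verification questions were generated")

    matches = 0
    for question, expected in questions:
        observed = answer_question(image, question)
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if observed.strip().lower() == expected.strip().lower():
            matches += 1
    return matches / len(questions)
```

Passing the two models in as callables keeps the sketch independent of any particular LLM or VQA library, so it can be exercised with simple stubs.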

Zero-shot Query Reformulation for Conversational Search

paper_url: http://arxiv.org/abs/2307.09384
repo_url: https://github.com/dayuyang1999/zeqr
paper_authors: Dayu Yang, Yue Zhang, Hui Fang
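for: Improve retrieval effectiveness in conversational search and address the data sparsity problem.
methods: Proposes a Zero-shot Query Reformulation (ZeQR) framework that uses language models to resolve coreference and omission in raw queries, without requiring supervision from conversational search data.
results: Experiments show that ZeQR outperforms existing baselines, improving query intent understanding and retrieval effectiveness.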

Abstract
As the popularity of voice assistants continues to surge, conversational search has gained increased attention in Information Retrieval. However, data sparsity issues in conversational search significantly hinder the progress of supervised conversational search methods. Consequently, researchers are focusing more on zero-shot conversational search approaches. Nevertheless, existing zero-shot methods face three primary limitations: they are not universally applicable to all retrievers, their effectiveness lacks sufficient explainability, and they struggle to resolve common conversational ambiguities caused by omission. To address these limitations, we introduce a novel Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based on previous dialogue contexts without requiring supervision from conversational search data. Specifically, our framework utilizes language models designed for machine reading comprehension tasks to explicitly resolve two common ambiguities: coreference and omission, in raw queries. In comparison to existing zero-shot methods, our approach is universally applicable to any retriever without additional adaptation or indexing. It also provides greater explainability and effectively enhances query intent understanding because ambiguities are explicitly and proactively resolved. Through extensive experiments on four TREC conversational datasets, we demonstrate the effectiveness of our method, which consistently outperforms state-of-the-art baselines.

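Summary
As voice assistants grow more popular, conversational search is receiving increasing attention in information retrieval. However, data sparsity in conversational search severely hinders supervised methods, so researchers are turning to zero-shot approaches. Existing zero-shot methods have three main limitations: they are not applicable to all retrievers, their effectiveness lacks sufficient explainability, and they struggle to resolve common conversational ambiguities caused by omission. To address these limitations, we propose a Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based on the preceding dialogue context, without supervision from conversational search data. Specifically, the framework uses language models designed for machine reading comprehension to resolve two common ambiguities in raw queries: coreference and omission. Unlike existing zero-shot methods, our approach applies to any retriever without additional adaptation or indexing. It also offers greater explainability, because ambiguities are resolved explicitly and proactively. Extensive experiments on four TREC conversational datasets show that the method consistently outperforms state-of-the-art baselines.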
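To make the coreference part of the reformulation concrete, here is a minimal sketch that rewrites pronouns in a raw query using a reading-comprehension model. The pronoun list, the question template, and the `answer` callable (question, context) -> answer are illustrative assumptions; the authors' actual implementation lives in the repository linked above, and omission handling is not shown.

```python
import re
from typing import Callable, List

# Small illustrative pronoun list; a real system would need a fuller one.
PRONOUNS = ("it", "they", "them", "he", "she")
PRONOUN_PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def resolve_coreference(
    query: str,
    history: List[str],
    answer: Callable[[str, str], str],
) -> str:
    """Replace each pronoun in `query` with the referent found in `history`.

    `answer` is a hypothetical machine reading comprehension model that
    extracts an answer span from the context for a given question.
    """
    context = " ".join(history)

    def substitute(match: "re.Match[str]") -> str:
        pronoun = match.group(0)
        question = f'In "{query}", what does "{pronoun}" refer to?'
        referent = answer(question, context).strip()
        # Keep the pronoun if the model returns nothing.
        return referent or pronoun

    return PRONOUN_PATTERN.sub(substitute, query)
```

Because the output is an ordinary text query, it can be passed to any retriever unchanged, which is the property the abstract highlights.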
