cs.CL - 2023-10-19

NameGuess: Column Name Expansion for Tabular Data

  • paper_url: http://arxiv.org/abs/2310.13196
  • repo_url: https://github.com/amazon-science/nameguess
  • paper_authors: Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Shen Wang, Huzefa Rangwala, George Karypis
  • for: This paper aims to address the challenge of abbreviated column names in large volumes of tabular data, which can negatively impact performance on various data search, access, and understanding tasks.
  • methods: The paper introduces a new task called NameGuess, which expands column names in database schemas as a natural language generation problem. The authors create a training dataset of 384K abbreviated-expanded column pairs using a new data fabrication method and a human-annotated evaluation benchmark. They enhance auto-regressive language models by conditioning on table content and column header names to improve performance.
  • results: The fine-tuned model (with 2.7B parameters) matches human performance in the NameGuess task, and the authors conduct a comprehensive analysis to validate the effectiveness of table content in NameGuess and identify promising future opportunities. The code for the paper has been made available at https://github.com/amazon-science/nameguess.
    Abstract Recent advances in large language models have revolutionized many sectors, including the database industry. One common challenge when dealing with large volumes of tabular data is the pervasive use of abbreviated column names, which can negatively impact performance on various data search, access, and understanding tasks. To address this issue, we introduce a new task, called NameGuess, to expand column names (used in database schema) as a natural language generation problem. We create a training dataset of 384K abbreviated-expanded column pairs using a new data fabrication method and a human-annotated evaluation benchmark that includes 9.2K examples from real-world tables. To tackle the complexities associated with polysemy and ambiguity in NameGuess, we enhance auto-regressive language models by conditioning on table content and column header names -- yielding a fine-tuned model (with 2.7B parameters) that matches human performance. Furthermore, we conduct a comprehensive analysis (on multiple LLMs) to validate the effectiveness of table content in NameGuess and identify promising future opportunities. Code has been made available at https://github.com/amazon-science/nameguess.
    Summary To create a training dataset for NameGuess, the authors employ a new data fabrication method and compile a human-annotated evaluation benchmark that includes 9.2K examples from real-world tables. To tackle the polysemy and ambiguity inherent in NameGuess, they enhance auto-regressive language models by conditioning on table content and column header names, resulting in a fine-tuned model with 2.7B parameters that matches human performance. They also conduct a comprehensive analysis of multiple large language models to validate the effectiveness of table content in NameGuess and identify promising future opportunities. The code for NameGuess is available on GitHub at https://github.com/amazon-science/nameguess.
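
    The idea of conditioning on table content and column header names can be sketched as a prompt-building step. This is a minimal illustration only: the function name `build_nameguess_prompt`, the prompt wording, and the sample table are assumptions, not the serialization used in the paper or its repository.

```python
from typing import Dict, List


def build_nameguess_prompt(table: Dict[str, List[str]], max_rows: int = 3) -> str:
    """Serialize sampled rows plus abbreviated headers into one prompt string.

    Hypothetical format: the paper's actual prompt template is not given here.
    """
    if not table:
        raise ValueError("table must contain at least one column")
    headers = list(table.keys())
    # Use only as many rows as every column can supply.
    n_rows = min(max_rows, min(len(values) for values in table.values()))
    lines = ["Table rows (sampled):"]
    for i in range(n_rows):
        lines.append(" | ".join(f"{h}: {table[h][i]}" for h in headers))
    lines.append("Abbreviated column names: " + ", ".join(headers))
    lines.append("Expanded column names:")
    return "\n".join(lines)


# Toy table (invented for illustration).
example = {
    "cust_nm": ["Ada Lovelace", "Alan Turing"],
    "dob": ["1815-12-10", "1912-06-23"],
}
prompt = build_nameguess_prompt(example)
print(prompt)
```

    The resulting string would be passed to an auto-regressive language model, which is expected to continue with expansions such as "customer name, date of birth". The cell values are what help disambiguate an abbreviation like `dob`.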

Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection

  • paper_url: http://arxiv.org/abs/2310.13183
  • repo_url: None
  • paper_authors: Jianwei Li, Weizhi Gao, Qi Lei, Dongkuan Xu
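  • for: To obtain accurate yet compact models by pruning redundant neurons or weights from a large model.
  • methods: Proposes a strategy that generates several pruning masks in a designed random way, then applies an effective mask-selection rule to choose the best mask from the pool of candidates.
  • results: Extensive experiments on GLUE datasets achieve state-of-the-art performance, with particularly strong results at high sparsity levels.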
    Abstract It is widely acknowledged that large and sparse models have higher accuracy than small and dense models under the same model size constraints. This motivates us to train a large model and then remove its redundant neurons or weights by pruning. Most existing works pruned the networks in a deterministic way, the performance of which solely depends on a single pruning criterion and thus lacks variety. Instead, in this paper, we propose a model pruning strategy that first generates several pruning masks in a designed random way. Subsequently, along with an effective mask-selection rule, the optimal mask is chosen from the pool of mask candidates. To further enhance efficiency, we introduce an early mask evaluation strategy, mitigating the overhead associated with training multiple masks. Our extensive experiments demonstrate that this approach achieves state-of-the-art performance across eight datasets from GLUE, particularly excelling at high levels of sparsity.
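    Summary It is widely acknowledged that large, sparse models achieve higher accuracy than small, dense models under the same model size constraint. This motivates training a large model and then removing its redundant neurons or weights by pruning. Most existing work prunes deterministically, so performance depends on a single pruning criterion and lacks variety. This paper proposes a pruning strategy that first generates multiple candidate pruning masks in a designed random way, then applies an effective mask-selection rule to choose the best mask from the candidate pool. To improve efficiency further, an early mask evaluation strategy mitigates the overhead of training multiple masks. Extensive experiments show that the approach achieves state-of-the-art performance on GLUE datasets, particularly at high sparsity levels.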
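
    The generate-then-select idea can be sketched as follows. This is an illustration under stated assumptions: sampling kept weights with probability proportional to their magnitude, and scoring a mask by retained magnitude, are stand-ins chosen for this example. The paper's actual randomization scheme, selection rule, and early mask evaluation are not described in this digest.

```python
import random
from typing import Callable, List


def sample_mask(weights: List[float], sparsity: float, rng: random.Random) -> List[int]:
    """Sample a 0/1 mask keeping round((1 - sparsity) * n) weights.

    Assumption: weights are kept with probability proportional to |w|,
    sampled without replacement.
    """
    n_keep = max(1, round((1.0 - sparsity) * len(weights)))
    candidates = list(range(len(weights)))
    scores = [abs(w) + 1e-12 for w in weights]  # avoid all-zero sampling weights
    kept = set()
    for _ in range(n_keep):
        pick = rng.choices(range(len(candidates)), weights=scores, k=1)[0]
        kept.add(candidates.pop(pick))
        scores.pop(pick)
    return [1 if i in kept else 0 for i in range(len(weights))]


def select_mask(
    weights: List[float],
    sparsity: float,
    n_candidates: int,
    evaluate: Callable[[List[int]], float],
    seed: int = 0,
) -> List[int]:
    """Generate several random masks and return the one with the highest score."""
    rng = random.Random(seed)
    masks = [sample_mask(weights, sparsity, rng) for _ in range(n_candidates)]
    return max(masks, key=evaluate)


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]


def retained_magnitude(mask: List[int]) -> float:
    # Toy selection rule (hypothetical): total weight magnitude that survives pruning.
    return sum(abs(w) * m for w, m in zip(weights, mask))


best = select_mask(weights, sparsity=0.5, n_candidates=8, evaluate=retained_magnitude)
print(best)
```

    In practice the `evaluate` callable would be a validation metric measured after applying the mask, which is where an early evaluation strategy would reduce cost.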

Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models

  • paper_url: http://arxiv.org/abs/2310.13127
  • repo_url: None
  • paper_authors: Zhihan Zhang, Shuohang Wang, Wenhao Yu, Yichong Xu, Dan Iter, Qingkai Zeng, Yang Liu, Chenguang Zhu, Meng Jiang
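  • for: To improve the task performance of large language models (LLMs) without task-specific fine-tuning.
  • methods: Uses the inherent generative ability of LLMs to produce multiple candidate instructions, then ranks them with a trained scoring model.
  • results: On 118 out-of-domain tasks, Auto-Instruct outperforms both human-written instructions and existing baselines of LLM-generated instructions, and it generalizes well to other LLMs that were not part of its training process.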
    Abstract Large language models (LLMs) can perform a wide range of tasks by following natural language instructions, without the necessity of task-specific fine-tuning. Unfortunately, the performance of LLMs is greatly influenced by the quality of these instructions, and manually writing effective instructions for each task is a laborious and subjective process. In this paper, we introduce Auto-Instruct, a novel method to automatically improve the quality of instructions provided to LLMs. Our method leverages the inherent generative ability of LLMs to produce diverse candidate instructions for a given task, and then ranks them using a scoring model trained on a variety of 575 existing NLP tasks. In experiments on 118 out-of-domain tasks, Auto-Instruct surpasses both human-written instructions and existing baselines of LLM-generated instructions. Furthermore, our method exhibits notable generalizability even with other LLMs that are not incorporated into its training process.
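    Summary Large language models (LLMs) can perform a wide range of tasks simply by following natural language instructions, without task-specific fine-tuning. However, LLM performance depends heavily on instruction quality, and manually writing effective instructions for each task is laborious and subjective. This paper introduces Auto-Instruct, a new method that automatically improves the quality of instructions given to LLMs. The method uses the inherent generative ability of LLMs to produce multiple candidate instructions for a task, then ranks them with a scoring model trained on 575 existing NLP tasks. In experiments on 118 out-of-domain tasks, Auto-Instruct surpasses human-written instructions and existing baselines of LLM-generated instructions. The method also shows notable generalizability to other LLMs.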
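
    The generate-then-rank pipeline can be sketched in a few lines. The scoring function below is a hypothetical keyword counter standing in for the trained scoring model, and the candidate instructions are invented for illustration. In the paper the candidates come from an LLM and the scorer is learned.

```python
from typing import Callable, List, Tuple


def rank_instructions(
    candidates: List[str], score: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score every candidate instruction and sort from best to worst."""
    scored = [(c, score(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


KEYWORDS = ("answer", "label", "output")


def toy_score(instruction: str) -> float:
    # Hypothetical stand-in for the trained scoring model:
    # counts how many format-related keywords the instruction mentions.
    text = instruction.lower()
    return float(sum(keyword in text for keyword in KEYWORDS))


candidates = [
    "Classify the review.",
    "Read the review and output a label: positive or negative.",
    "Answer with the sentiment.",
]
ranked = rank_instructions(candidates, toy_score)
best_instruction = ranked[0][0]
print(best_instruction)
```

    Only the top-ranked instruction would then be prepended to the task input when querying the black-box LLM.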

Unsupervised Candidate Answer Extraction through Differentiable Masker-Reconstructor Model

  • paper_url: http://arxiv.org/abs/2310.13106
  • repo_url: None
  • paper_authors: Zhuoer Wang, Yicheng Wang, Ziwei Zhu, James Caverlee
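  • for: To improve the accuracy and effectiveness of candidate answer extraction in question generation systems.
  • methods: Proposes a novel unsupervised candidate answer extraction method that exploits the inherent structure of context passages through a Differentiable Masker-Reconstructor (DMR) model.
  • results: Evaluated on two curated datasets with exhaustively annotated answers, the method performs best among unsupervised methods and is comparable to supervised methods.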
    Abstract Question generation is a widely used data augmentation approach with extensive applications, and extracting qualified candidate answers from context passages is a critical step for most question generation systems. However, existing methods for candidate answer extraction are reliant on linguistic rules or annotated data that face the partial annotation issue and challenges in generalization. To overcome these limitations, we propose a novel unsupervised candidate answer extraction approach that leverages the inherent structure of context passages through a Differentiable Masker-Reconstructor (DMR) Model with the enforcement of self-consistency for picking up salient information tokens. We curated two datasets with exhaustively-annotated answers and benchmark a comprehensive set of supervised and unsupervised candidate answer extraction methods. We demonstrate the effectiveness of the DMR model by showing its performance is superior among unsupervised methods and comparable to supervised methods.
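    Summary Question generation is a widely used data augmentation approach, and extracting suitable candidate answers from context passages is a critical step in most question generation systems. Existing candidate answer extraction methods, however, rely on linguistic rules or annotated data, and face the partial annotation issue and difficulties in generalization. To address these limitations, the authors propose a new unsupervised candidate answer extraction method that exploits the inherent structure of context passages through a Differentiable Masker-Reconstructor (DMR) model, enforcing self-consistency to pick out salient information tokens. They curate two datasets with exhaustively annotated answers and benchmark a comprehensive set of supervised and unsupervised candidate answer extraction methods. The DMR model performs best among unsupervised methods and is comparable to supervised methods.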

Do Language Models Learn about Legal Entity Types during Pretraining?

  • paper_url: http://arxiv.org/abs/2310.13092
  • repo_url: https://github.com/clairebarale/probing_legal_entity_types
  • paper_authors: Claire Barale, Michael Rovatsos, Nehal Bhuta
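  • for: This study examines how far the knowledge that language models (LMs) acquire during pretraining covers domain-specific legal knowledge, and whether that knowledge can serve downstream tasks.
  • methods: Uses Entity Typing as a proxy task for evaluating legal knowledge, with systematic evaluation and analysis under two types of prompting (cloze sentences and QA-based templates).
  • results: The study finds that (1) Llama2 performs well on certain entity types and may improve substantially with optimized prompt templates; (2) law-oriented LMs perform inconsistently, possibly due to variations in their training corpora; (3) LMs can type entities even when they span multiple tokens; (4) all models struggle with entities belonging to sub-domains of the law; (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.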
    Abstract Language Models (LMs) have proven their ability to acquire diverse linguistic knowledge during the pretraining phase, potentially serving as a valuable source of incidental supervision for downstream tasks. However, there has been limited research conducted on the retrieval of domain-specific knowledge, and specifically legal knowledge. We propose to explore the task of Entity Typing, serving as a proxy for evaluating legal knowledge as an essential aspect of text comprehension, and a foundational task to numerous downstream legal NLP applications. Through systematic evaluation and analysis and two types of prompting (cloze sentences and QA-based templates) and to clarify the nature of these acquired cues, we compare diverse types and lengths of entities both general and domain-specific entities, semantics or syntax signals, and different LM pretraining corpus (generic and legal-oriented) and architectures (encoder BERT-based and decoder-only with Llama2). We show that (1) Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates, (2) law-oriented LMs show inconsistent performance, possibly due to variations in their training corpus, (3) LMs demonstrate the ability to type entities even in the case of multi-token entities, (4) all models struggle with entities belonging to sub-domains of the law (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.
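    Summary Language models (LMs) have proven able to acquire diverse linguistic knowledge during pretraining, which may serve as a valuable source of incidental supervision for downstream tasks. However, little research has addressed the retrieval of domain-specific knowledge, and legal knowledge in particular. The authors explore Entity Typing as a proxy for evaluating legal knowledge, an essential aspect of legal text comprehension and a foundational task for many downstream legal NLP applications. Through systematic evaluation and analysis with two types of prompting (cloze sentences and QA-based templates), they compare entities of different types and lengths, semantic versus syntactic signals, different LM pretraining corpora (generic and legal-oriented), and different architectures (BERT-based encoders and decoder-only Llama2). They find that: (1) Llama2 performs well on certain entities and has potential for substantial improvement with optimized prompt templates. (2) Law-oriented LMs perform inconsistently, possibly due to variations in their training corpora. (3) LMs can type entities even when they consist of multiple tokens. (4) All models struggle with entities belonging to sub-domains of the law. (5) Llama2 frequently overlooks syntactic cues, a shortcoming less present in BERT-based architectures.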

GARI: Graph Attention for Relative Isomorphism of Arabic Word Embeddings

  • paper_url: http://arxiv.org/abs/2310.13068
  • repo_url: https://github.com/asif6827/gari
  • paper_authors: Muhammad Asif Ali, Maha Alshmrani, Jianbin Qin, Yan Hu, Di Wang
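  • for: To improve the relative isomorphism between embedding spaces, on which Bilingual Lexical Induction (BLI) relies.
  • methods: Combines distributional training objectives with multiple isomorphism losses guided by a graph attention network, so that semantic variations of words inform the relative isomorphism of the embedding spaces.
  • results: On the Arabic language data set, GARI improves average P@1 over prior work by a relative score of up to 40.95% in the in-domain setting and 76.80% in the domain mismatch setting.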
    Abstract Bilingual Lexical Induction (BLI) is a core challenge in NLP, it relies on the relative isomorphism of individual embedding spaces. Existing attempts aimed at controlling the relative isomorphism of different embedding spaces fail to incorporate the impact of semantically related words in the model training objective. To address this, we propose GARI that combines the distributional training objectives with multiple isomorphism losses guided by the graph attention network. GARI considers the impact of semantical variations of words in order to define the relative isomorphism of the embedding spaces. Experimental evaluation using the Arabic language data set shows that GARI outperforms the existing research by improving the average P@1 by a relative score of up to 40.95% and 76.80% for in-domain and domain mismatch settings respectively. We release the codes for GARI at https://github.com/asif6827/GARI.
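    Summary Bilingual Lexical Induction (BLI) is a core challenge in NLP that relies on the relative isomorphism of individual embedding spaces. Existing attempts to control the relative isomorphism of different embedding spaces fail to incorporate the impact of semantically related words in the model training objective. To address this, the authors propose GARI, which combines distributional training objectives with multiple isomorphism losses guided by a graph attention network. GARI takes the semantic variations of words into account when defining the relative isomorphism of the embedding spaces. Experiments on the Arabic language data set show that GARI outperforms existing research, improving average P@1 by a relative score of up to 40.95% in the in-domain setting and 76.80% in the domain mismatch setting. The code for GARI is released at https://github.com/asif6827/GARI.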
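
    The reported numbers are relative improvements in precision at 1 (P@1). The sketch below shows how P@1 and a relative improvement are typically computed. The helper names and the toy dictionary of transliterated Arabic words are assumptions made for illustration, and the baseline value is invented, so the output does not reproduce any figure from the paper.

```python
from typing import Dict, Set


def precision_at_1(predictions: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose top-1 predicted translation is in the gold set."""
    hits = sum(1 for src, pred in predictions.items() if pred in gold.get(src, set()))
    return hits / len(predictions)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy data (hypothetical): transliterated Arabic source words with English translations.
gold = {
    "kitab": {"book"},
    "qalam": {"pen", "pencil"},
    "bayt": {"house", "home"},
    "shams": {"sun"},
}
predictions = {"kitab": "book", "qalam": "pen", "bayt": "home", "shams": "moon"}

p_at_1 = precision_at_1(predictions, gold)
gain = relative_improvement(p_at_1, baseline=0.5)  # baseline value is invented
print(p_at_1, gain)
```

    Note that a relative score of 40.95% means the new P@1 is 1.4095 times the baseline P@1, not that P@1 rose by 40.95 percentage points.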

SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving

  • paper_url: http://arxiv.org/abs/2310.12960
  • repo_url: None
  • paper_authors: Xueliang Zhao, Xinting Huang, Wei Bi, Lingpeng Kong
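  • for: To improve the ability of large language models to solve mathematical problems.
  • methods: Proposes a new framework, SEGO, that links the subgoal breakdown process to the probability of solving a problem in order to identify better subgoals, generating problem-specific subgoals and adjusting them according to carefully designed criteria.
  • results: Experiments on two benchmarks, GSM8K and MATH, show that SEGO outperforms existing methods, indicating its potential for AI-driven mathematical problem-solving.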
    Abstract Large Language Models (LLMs) have driven substantial progress in artificial intelligence in recent years, exhibiting impressive capabilities across a wide range of tasks, including mathematical problem-solving. Inspired by the success of subgoal-based methods, we propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems. By establishing a connection between the subgoal breakdown process and the probability of solving problems, SEGO aims to identify better subgoals with theoretical guarantees. Addressing the challenge of identifying suitable subgoals in a large solution space, our framework generates problem-specific subgoals and adjusts them according to carefully designed criteria. Incorporating these optimized subgoals into the policy model training leads to significant improvements in problem-solving performance. We validate SEGO's efficacy through experiments on two benchmarks, GSM8K and MATH, where our approach outperforms existing methods, highlighting the potential of SEGO in AI-driven mathematical problem-solving. Data and code associated with this paper will be available at https://github.com/zhaoxlpku/SEGO

On the Representational Capacity of Recurrent Neural Language Models

  • paper_url: http://arxiv.org/abs/2310.12942
  • repo_url: https://github.com/rycolab/rnn-turing-completeness
  • paper_authors: Franz Nowak, Anej Svete, Li Du, Ryan Cotterell
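  • for: This study investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs).
  • methods: Builds on the classic result that RNNs with rational weights and hidden states and unbounded computation time are Turing complete, and extends the analysis to the probabilistic setting, since RNN LMs (RLMs) define weightings over strings rather than only language membership.
  • results: A rationally weighted RLM with unbounded computation time can simulate any probabilistic Turing machine (PTM). Because RLMs in practice run in real time, processing one symbol per time step, this result is treated as an upper bound on their expressivity. As a lower bound, the study shows that under the restriction to real-time computation, RLMs can simulate deterministic real-time rational PTMs.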
    Abstract This work investigates the computational expressivity of language models (LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) famously showed that RNNs with rational weights and hidden states and unbounded computation time are Turing complete. However, LMs define weightings over strings in addition to just (unweighted) language membership and the analysis of the computational power of RNN LMs (RLMs) should reflect this. We extend the Turing completeness result to the probabilistic case, showing how a rationally weighted RLM with unbounded computation time can simulate any probabilistic Turing machine (PTM). Since, in practice, RLMs work in real-time, processing a symbol at every time step, we treat the above result as an upper bound on the expressivity of RLMs. We also provide a lower bound by showing that under the restriction to real-time computation, such models can simulate deterministic real-time rational PTMs.

A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

  • paper_url: http://arxiv.org/abs/2310.12936
  • repo_url: None
  • paper_authors: Yi Zhou, Jose Camacho-Collados, Danushka Bollegala
  • for: This paper studies the relationship between various factors of pre-trained Masked Language Models (MLMs) and the social biases they learn, as well as their downstream task performance.
  • methods: The authors conduct a comprehensive study of 39 pre-trained MLMs covering different model sizes, training objectives, tokenization methods, training data domains, and languages.
  • results: The study sheds light on important factors often neglected in prior literature, such as tokenization or model objectives, and provides insights into how these factors relate to the social biases learned by MLMs.
    Abstract Various types of social biases have been reported with pretrained Masked Language Models (MLMs) in prior work. However, multiple underlying factors are associated with an MLM such as its model size, size of the training data, training objectives, the domain from which pretraining data is sampled, tokenization, and languages present in the pretrained corpora, to name a few. It remains unclear as to which of those factors influence social biases that are learned by MLMs. To study the relationship between model factors and the social biases learned by an MLM, as well as the downstream task performance of the model, we conduct a comprehensive study over 39 pretrained MLMs covering different model sizes, training objectives, tokenization methods, training data domains and languages. Our results shed light on important factors often neglected in prior literature, such as tokenization or model objectives.
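    Summary Prior work has reported various types of social biases in pretrained Masked Language Models (MLMs). However, an MLM is associated with multiple underlying factors: its model size, the size of the training data, the training objectives, the domain of the pretraining data, tokenization, and the languages present in the pretraining corpora, among others. It remains unclear which of these factors influence the social biases that MLMs learn. To study the relationship between model factors and learned social biases, as well as downstream task performance, the authors conduct a comprehensive study of 39 pretrained MLMs covering different model sizes, training objectives, tokenization methods, training data domains, and languages. The results highlight factors often neglected in prior literature, such as tokenization or model objectives.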

A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

  • paper_url: http://arxiv.org/abs/2310.12892
  • repo_url: None
  • paper_authors: Songbo Hu, Han Zhou, Moy Yuan, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Anna Korhonen, Ivan Vulić
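  • for: To take stock of and empirically analyze the task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems across languages.
  • methods: Defines new quantitative measures of absolute and relative equivalence in system performance, then runs a series of controlled experiments showing that disparities depend on the nature of the ToD task, the underlying pretrained language model, the target language, and the amount of annotated ToD data.
  • results: The analysis demonstrates adaptation and intrinsic biases in current ToD systems. For example, ToD systems trained for Arabic or Turkish on annotated data fully parallel to the English ToD data still show diminished task performance. The analysis also offers practical tips for ToD data collection and system development for new languages.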
    Abstract Achieving robust language technologies that can perform well across the world's many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at hand, the underlying pretrained language model, the target language, and the amount of ToD annotated data. We empirically prove the existence of the adaptation and intrinsic biases in current ToD systems: e.g., ToD systems trained for Arabic or Turkish using annotated ToD data fully parallel to English ToD data still exhibit diminished ToD task performance. Beyond providing a series of insights into the performance disparities of ToD systems in different languages, our analyses offer practical tips on how to approach ToD data collection and system development for new languages.
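    Summary Achieving robust language technologies that perform well across the world's many languages is a central goal of multilingual NLP. This work takes stock of and empirically analyzes the performance disparities between multilingual task-oriented dialogue (ToD) systems. The authors first define new measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, they show that performance disparities depend on several factors: the nature of the ToD task, the underlying pretrained language model, the target language, and the amount of annotated ToD data. They demonstrate empirically that current ToD systems carry adaptation and intrinsic biases: for example, ToD systems trained for Arabic or Turkish on annotated data fully parallel to the English ToD data still exhibit diminished task performance. Beyond insights into the performance disparities of ToD systems in different languages, the analysis offers practical advice on ToD data collection and system development for new languages.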

StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding

  • paper_url: http://arxiv.org/abs/2310.12874
  • repo_url: None
  • paper_authors: Cheng Jiayang, Lin Qiu, Tsz Ho Chan, Tianqing Fang, Weiqi Wang, Chunkit Chan, Dongyu Ru, Qipeng Guo, Hongming Zhang, Yangqiu Song, Yue Zhang, Zheng Zhang
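  • for: To evaluate the ability to identify and generate analogies by constructing the first large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory.
  • methods: Designs a set of tests on StoryAnalogy, providing the first evaluation of story-level analogy identification and generation.
  • results: Analogy identification is very difficult even for recent large language models (LLMs) such as ChatGPT and LLaMa: ChatGPT reaches only around 30% accuracy on multiple-choice questions, compared to over 85% for humans. StoryAnalogy data also improves the quality of analogy generation in LLMs, with a fine-tuned FlanT5-xxl model performing comparably to zero-shot ChatGPT.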
    Abstract Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in StoryAnalogy can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.
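    Summary Analogy-making between narratives is crucial for human reasoning. This paper evaluates the ability to identify and generate analogies by constructing the first large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. The authors design a set of tests that constitute the first evaluation of story-level analogy identification and generation. Notably, analogy identification proves very difficult not only for sentence embedding models but also for recent large language models (LLMs) such as ChatGPT and LLaMa: ChatGPT reaches only around 30% accuracy on multiple-choice questions, compared to over 85% for humans. In addition, the data in StoryAnalogy can improve the quality of analogy generation in LLMs, with a fine-tuned FlanT5-xxl model performing comparably to zero-shot ChatGPT.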

The Locality and Symmetry of Positional Encodings

  • paper_url: http://arxiv.org/abs/2310.12864
  • repo_url: https://github.com/tigerchen52/locality_symmetry
  • paper_authors: Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek
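  • for: This paper studies the role of Positional Encodings (PEs), which inject word-order information into transformer-based language models and can enhance the quality of sentence representations.
  • methods: Conducts a systematic study of PEs in BERT-style bidirectional masked language models, covering an analysis of their properties, a study of how those properties relate to downstream task performance, and two new probing tasks that expose the weaknesses of current PEs.
  • results: PEs exhibit two common properties, Locality and Symmetry, which are closely correlated with downstream task performance, while current PEs perform poorly on the two new probing tasks. These results may serve as a basis for developing better PEs. The code is available at https://github.com/tigerchen52/locality_symmetry.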
    Abstract Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that various positional encodings are insensitive to word order. In this work, we conduct a systematic study of positional encodings in Bidirectional Masked Language Models (BERT-style), which complements existing work in three aspects: (1) We uncover the core function of PEs by identifying two common properties, Locality and Symmetry; (2) We show that the two properties are closely correlated with the performances of downstream tasks; (3) We quantify the weakness of current PEs by introducing two new probing tasks, on which current PEs perform poorly. We believe that these results are the basis for developing better PEs for transformer-based language models. The code is available at https://github.com/tigerchen52/locality_symmetry
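    Summary Positional encodings (PEs) inject word-order information into transformer-based language models. Although they can significantly improve the quality of sentence representations, their specific contribution to language models is not fully understood, especially given recent findings that various positional encodings are insensitive to word order. This work presents a systematic study with three contributions: (1) It uncovers the core function of PEs by identifying two common properties, Locality and Symmetry. (2) It shows that these two properties are closely correlated with downstream task performance. (3) It introduces two new probing tasks on which current PEs perform poorly. The authors regard these results as a basis for developing better positional encodings. The code is available at https://github.com/tigerchen52/locality_symmetry.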
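
    Locality and Symmetry can be made concrete with the fixed sinusoidal encoding of the original Transformer, for which the dot product between two positions depends only on their offset. The sketch below is an assumption-laden illustration: it uses sinusoidal encodings and raw dot-product similarity, whereas the paper studies encodings in BERT-style models and may define and measure the two properties differently.

```python
import math
from typing import List


def sinusoidal_pe(pos: int, dim: int = 64) -> List[float]:
    """Fixed sinusoidal positional encoding for one position (dim must be even)."""
    pe = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        pe.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return pe


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


center = sinusoidal_pe(10)

# Symmetry: positions at equal distance on either side are equally similar.
sim_left = dot(center, sinusoidal_pe(7))
sim_right = dot(center, sinusoidal_pe(13))

# Locality: a neighbouring position is more similar than a distant one.
sim_near = dot(center, sinusoidal_pe(11))
sim_far = dot(center, sinusoidal_pe(20))

print(sim_left, sim_right, sim_near, sim_far)
```

    The symmetry here follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so each frequency contributes a term that is even in the offset.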

Probing LLMs for hate speech detection: strengths and vulnerabilities

  • paper_url: http://arxiv.org/abs/2310.12860
  • repo_url: None
  • paper_authors: Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, Punyajoy Saha
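  • for: To probe how well large language models detect hateful or toxic language, and whether explanations, additional context, and victim community information improve detection.
  • methods: Varies prompts and input information and evaluates large language models in a zero-shot setting (without any in-context examples), using three models (GPT-3.5, text-davinci, and Flan-T5) and three datasets (HateXplain, implicit hate, and ToxicSpans).
  • results: On average, including target information in the pipeline improves model performance substantially (~20-30%) over the baseline, and adding rationales/explanations also helps (~10-20%). The paper further provides a typology of error cases in which the models fail to classify or to explain their decisions. These vulnerable points automatically constitute 'jailbreak' prompts, so industry-scale safeguards are needed to make the models robust.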
    Abstract Recently efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aim to use explanation, additional context and victim community information in the detection process. We utilise different prompt variation, input information and evaluate large language models in zero shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans. We find that on average including the target information in the pipeline improves the model performance substantially (~20-30%) over the baseline across the datasets. There is also a considerable effect of adding the rationales/explanations into the pipeline (~10-20%) over the baseline across the datasets. In addition, we further provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. Such vulnerable points automatically constitute 'jailbreak' prompts for these models and industry scale safeguard techniques need to be developed to make the models robust against such prompts.
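
    The prompt variations described above (baseline, plus target information, plus explanation) can be sketched as a small template function. The wording of the template and the function name `build_prompt` are assumptions for illustration. The paper's actual prompts are not reproduced in this digest.

```python
from typing import Optional


def build_prompt(
    post: str, target: Optional[str] = None, rationale: Optional[str] = None
) -> str:
    """Assemble a zero-shot classification prompt with optional extra inputs.

    Hypothetical template: field names and phrasing are invented for this sketch.
    """
    parts = ["Classify the post as 'hateful' or 'not hateful'."]
    if target:
        parts.append(f"Targeted community: {target}")
    if rationale:
        parts.append(f"Explanation: {rationale}")
    parts.append(f"Post: {post}")
    parts.append("Label:")
    return "\n".join(parts)


post = "<post text>"  # placeholder, not real data
baseline_prompt = build_prompt(post)
target_prompt = build_prompt(post, target="<community>")
full_prompt = build_prompt(post, target="<community>", rationale="<annotator rationale>")
print(full_prompt)
```

    Comparing model accuracy across the three prompt variants on the same posts is what isolates the contribution of each added input.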

EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.12851
  • repo_url: None
  • paper_authors: Hanan Hamza, Fiza Gafoor, Fathima Sithara, Gayathri Anil, V. S. Anoop
  • for: This paper aims to improve the accuracy of speech emotion recognition by integrating deep learning techniques and addressing the challenges of speaker diarization and emotion identification.
  • methods: The proposed method combines a pre-existing speaker diarization pipeline with a Convolutional Neural Network (CNN) based emotion identification model, using features such as MFCC, ZCR, RMS, and data augmentation techniques.
  • results: The proposed model achieved an unweighted accuracy of 63% in identifying emotional states within speech signals, demonstrating its effectiveness in accurately recognizing emotions in spoken language.
    Abstract In the era of advanced artificial intelligence and human-computer interaction, identifying emotions in spoken language is paramount. This research explores the integration of deep learning techniques in speech emotion recognition, offering a comprehensive solution to the challenges associated with speaker diarization and emotion identification. It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN) to achieve higher precision. The proposed model was trained on data from five speech emotion datasets, namely, RAVDESS, CREMA-D, SAVEE, TESS, and Movie Clips, out of which the latter is a speech emotion dataset created specifically for this research. The features extracted from each sample include Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), Root Mean Square (RMS), and various data augmentation algorithms like pitch, noise, stretch, and shift. This feature extraction approach aims to enhance prediction accuracy while reducing computational complexity. The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
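    Summary In the era of advanced artificial intelligence and human-computer interaction, identifying emotions in spoken language is highly important. This research explores the use of deep learning in speech emotion recognition and offers a solution to the challenges of speaker diarization and emotion identification. It proposes a framework that combines a pre-existing speaker diarization pipeline with an emotion identification model based on a Convolutional Neural Network (CNN) to achieve higher precision. The model is trained on five speech emotion datasets: RAVDESS, CREMA-D, SAVEE, TESS, and Movie Clips, the last of which was created specifically for this research. The features extracted from each sample include Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square (RMS), together with data augmentation such as pitch, noise, stretch, and shift. This feature extraction approach aims to improve prediction accuracy while reducing computational complexity. The model reaches an unweighted accuracy of 63% in identifying emotional states in speech signals.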
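
    Two of the listed features, Zero Crossing Rate and Root Mean Square, have simple closed-form definitions that can be computed per audio frame. The sketch below uses one common convention (sign changes divided by the number of adjacent sample pairs). Audio libraries may normalize differently, and the paper's exact feature pipeline, including MFCC extraction, is not shown here.

```python
import math
from typing import List


def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def root_mean_square(frame: List[float]) -> float:
    """Square root of the mean squared amplitude, a measure of frame energy."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


# Toy frame (invented): alternates sign at every sample.
frame = [0.5, -0.5, 0.5, -0.5]
zcr = zero_crossing_rate(frame)
rms = root_mean_square(frame)
print(zcr, rms)
```

    A noisy or fricative-heavy frame tends to have a high ZCR, while RMS tracks loudness. Both are cheap to compute, which fits the stated aim of keeping computational complexity low.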

Knowledge-Augmented Language Model Verification

  • paper_url: http://arxiv.org/abs/2310.12836
  • repo_url: https://github.com/JinheonBaek/KALMV
  • paper_authors: Jinheon Baek, Soyeong Jeong, Minki Kang, Jong C. Park, Sung Ju Hwang
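  • for: To improve the factual correctness of text generated by language models (LMs), whose internalized knowledge may be inaccurate, incomplete, or outdated.
  • methods: Augments LMs with knowledge retrieved from an external source and uses a separate small LM as a verifier to detect and rectify errors.
  • results: Experiments on multiple question answering benchmarks show that the verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.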
    Abstract Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters. Yet, LMs often generate the factually incorrect responses to the given queries, since their knowledge may be inaccurate, incomplete, and outdated. To address this problem, previous works propose to augment LMs with the knowledge retrieved from an external knowledge source. However, such approaches often show suboptimal text generation performance due to two reasons: 1) the model may fail to retrieve the knowledge relevant to the given query, or 2) the model may not faithfully reflect the retrieved knowledge in the generated text. To overcome these, we propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier, which is a small LM that is trained to detect those two types of errors through instruction-finetuning. Then, when the verifier recognizes an error, we can rectify it by either retrieving new knowledge or generating new text. Further, we use an ensemble of the outputs from different instructions with a single verifier to enhance the reliability of the verification processes. We validate the effectiveness of the proposed verification steps on multiple question answering benchmarks, whose results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs. Our code is available at https://github.com/JinheonBaek/KALMV.
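    Summary Recent language models (LMs) are highly capable text generators, but they often produce factually incorrect answers because their knowledge may be inaccurate, incomplete, or outdated. Previous work proposes augmenting LMs with knowledge retrieved from an external source, but such approaches often yield suboptimal generation for two reasons: the model may fail to retrieve knowledge relevant to the query, or it may not faithfully reflect the retrieved knowledge in the generated text. To address these problems, the authors propose a separate verifier, a small LM trained through instruction-finetuning to detect these two types of errors. When the verifier detects an error, it can be rectified by either retrieving new knowledge or generating new text. The authors also ensemble the outputs from different instructions to make verification more reliable. Experiments on multiple question answering benchmarks show that the verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs. The code is available at https://github.com/JinheonBaek/KALMV.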
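
    The verify-then-rectify loop can be sketched with plain callables. Everything concrete below is a hypothetical stand-in: the function names, the three verdict strings, and the toy retriever, generator, and verifier are invented for illustration. In the paper the verifier is a small instruction-finetuned LM and its verdicts are ensembled across instructions, which this sketch omits.

```python
from typing import Callable, List


def answer_with_verification(
    query: str,
    retrieve: Callable[[str, int], str],
    generate: Callable[[str, str], str],
    verify: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Retrieve, generate, then let a verifier trigger re-retrieval or re-generation."""
    attempt = 0
    knowledge = retrieve(query, attempt)
    answer = generate(query, knowledge)
    for _ in range(max_rounds):
        # Assumed verdicts: 'ok', 'retrieval_error', or 'generation_error'.
        verdict = verify(query, knowledge, answer)
        if verdict == "ok":
            break
        if verdict == "retrieval_error":
            attempt += 1
            knowledge = retrieve(query, attempt)  # fetch different knowledge
        answer = generate(query, knowledge)  # regenerate in both error cases
    return answer


# Toy components (all hypothetical).
DOCS: List[str] = ["Paris is in Texas.", "Paris is the capital of France."]


def toy_retrieve(query: str, attempt: int) -> str:
    return DOCS[min(attempt, len(DOCS) - 1)]


def toy_generate(query: str, knowledge: str) -> str:
    return "France" if "France" in knowledge else "Texas"


def toy_verify(query: str, knowledge: str, answer: str) -> str:
    if "capital" not in knowledge:
        return "retrieval_error"
    return "ok" if answer in knowledge else "generation_error"


result = answer_with_verification(
    "Which country is Paris the capital of?", toy_retrieve, toy_generate, toy_verify
)
print(result)
```

    Separating the two error types matters because each calls for a different fix: irrelevant knowledge needs a new retrieval, while an unfaithful answer only needs a new generation from the same knowledge.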

GestureGPT: Zero-shot Interactive Gesture Understanding and Grounding with Large Language Model Agents

  • paper_url: http://arxiv.org/abs/2310.12821
  • repo_url: None
  • paper_authors: Xin Zeng, Xiaoyu Wang, Tengxiang Zhang, Chun Yu, Shengdong Zhao, Yiqiang Chen
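  • for: Extending current gesture recognition systems so that recognized gestures can be connected to interactive GUI elements or system functions.
  • methods: Uses large language models (LLMs): gesture descriptions derived from hand landmark coordinates are fed into a dual-agent dialogue system, in which a gesture agent interprets the descriptions and queries a context agent about the interaction context until it can ground the user's intent to an interactive function.
  • results: Tested in two real-world settings, video streaming and smart home IoT control, reaching highest zero-shot Top-5 grounding accuracies of 80.11% and 90.78%, respectively.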
    Abstract Current gesture recognition systems primarily focus on identifying gestures within a predefined set, leaving a gap in connecting these gestures to interactive GUI elements or system functions (e.g., linking a 'thumb-up' gesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture understanding and grounding framework leveraging large language models (LLMs). Gesture descriptions are formulated based on hand landmark coordinates from gesture videos and fed into our dual-agent dialogue system. A gesture agent deciphers these descriptions and queries about the interaction context (e.g., interface, history, gaze data), which a context agent organizes and provides. Following iterative exchanges, the gesture agent discerns user intent, grounding it to an interactive function. We validated the gesture description module using public first-view and third-view gesture datasets and tested the whole system in two real-world settings: video streaming and smart home IoT control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video streaming and 90.78% for smart home tasks, showing potential of the new gesture understanding paradigm.
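    For readers unfamiliar with the reported metric, Top-5 grounding accuracy counts a prediction as correct when the target function appears among the five highest-ranked candidates. A minimal sketch follows; the function name and the example labels are hypothetical and not taken from the paper.

```python
def top_k_accuracy(ranked_predictions, targets, k=5):
    """Fraction of examples whose target appears in the first k ranked candidates."""
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)
```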

Causal-structure Driven Augmentations for Text OOD Generalization

  • paper_url: http://arxiv.org/abs/2310.12803
  • repo_url: None
  • paper_authors: Amir Feder, Yoav Wald, Claudia Shi, Suchi Saria, David Blei
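  • for: Making text classifiers more robust at deployment, particularly in safety-critical domains such as healthcare, where reliance on spurious correlations leads to poor generalization.
  • methods: Counterfactual data augmentation guided by knowledge of the causal structure of the data, used to simulate interventions on spurious features; examples are matched using auxiliary data (diff-in-diff methodology), and a large language model represents a conditional probability of text.
  • results: Experiments on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data show improved out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.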
    Abstract The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.
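    Summary Text classifiers that rely on spurious correlations can generalize poorly at deployment, which raises concerns about their use in safety-critical domains such as healthcare. This work uses counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and learn more robust text classifiers. The strategy is shown to be appropriate for prediction problems in which the label is spuriously correlated with an attribute. Under the assumptions of such problems, the authors discuss the favorable sample complexity of counterfactual data augmentation compared with importance re-weighting. In practice, examples are matched using auxiliary data, following diff-in-diff methodology, and a large language model (LLM) represents a conditional probability of text. Extensive experiments on medical narratives and semi-synthetic data show that simulating interventions in this way improves out-of-distribution (OOD) accuracy over baseline invariant learning algorithms.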

MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

  • paper_url: http://arxiv.org/abs/2310.12798
  • repo_url: https://github.com/acharkq/molca
  • paper_authors: Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, Tat-Seng Chua
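  • for: Helping language models perceive the 2D graph structure of molecules in order to improve their molecule understanding.
  • methods: Proposes MolCA, which uses a cross-modal projector (implemented as a Q-Former) and a uni-modal adapter (LoRA) to connect a graph encoder's representation space with the LM's text space.
  • results: On molecule captioning, IUPAC name prediction, and molecule-text retrieval, MolCA significantly outperforms the baselines.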
    Abstract Language Models (LMs) have demonstrated impressive molecule understanding ability on various 1D text-related tasks. However, they inherently lack 2D graph perception - a critical ability of human professionals in comprehending molecules' topological structures. To bridge this gap, we propose MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. MolCA enables an LM (e.g., Galactica) to understand both text- and graph-based molecular contents via the cross-modal projector. Specifically, the cross-modal projector is implemented as a Q-Former to connect a graph encoder's representation space and an LM's text space. Further, MolCA employs a uni-modal adapter (i.e., LoRA) for the LM's efficient adaptation to downstream tasks. Unlike previous studies that couple an LM with a graph encoder via cross-modal contrastive learning, MolCA retains the LM's ability of open-ended text generation and augments it with 2D graph information. To showcase its effectiveness, we extensively benchmark MolCA on tasks of molecule captioning, IUPAC name prediction, and molecule-text retrieval, on which MolCA significantly outperforms the baselines. Our codes and checkpoints can be found at https://github.com/acharkq/MolCA.
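    Summary Language models (LMs) show strong molecule understanding on various 1D text-related tasks, but they lack 2D graph perception, a core ability that human professionals use to comprehend molecules' topological structures. To bridge this gap, the authors propose MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter. MolCA enables an LM (e.g., Galactica) to understand both text-based and graph-based molecular content. The cross-modal projector is implemented as a Q-Former that connects a graph encoder's representation space to the LM's text space, and a uni-modal adapter (LoRA) adapts the LM efficiently to downstream tasks. Unlike earlier studies that couple an LM with a graph encoder through cross-modal contrastive learning, MolCA retains the LM's open-ended text generation ability and augments it with 2D graph information. Extensive benchmarking on molecule captioning, IUPAC name prediction, and molecule-text retrieval shows that MolCA significantly outperforms the baselines. Code and checkpoints are available at https://github.com/acharkq/MolCA.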

Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization

  • paper_url: http://arxiv.org/abs/2310.12794
  • repo_url: https://github.com/ningyuxu/structural_concepts_correspondence
  • paper_authors: Ningyu Xu, Qi Zhang, Jingting Ye, Menghan Zhang, Xuanjing Huang
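  • for: Investigating whether explicitly aligning conceptual correspondence between languages can enhance cross-lingual generalization.
  • methods: Uses the syntactic aspect of language as a testbed to analyze how alignable the spaces of structural concepts are, and proposes a meta-learning-based method that learns to align the conceptual spaces of different languages.
  • results: The approach achieves results competitive with state-of-the-art methods and narrows the performance gap between languages, particularly benefiting low-resource ones.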
    Abstract Large language models (LLMs) have exhibited considerable cross-lingual generalization abilities, whereby they implicitly transfer knowledge across languages. However, the transfer is not equally successful for all languages, especially for low-resource ones, which poses an ongoing challenge. It is unclear whether we have reached the limits of implicit cross-lingual generalization and if explicit knowledge transfer is viable. In this paper, we investigate the potential for explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization. Using the syntactic aspect of language as a testbed, our analyses of 43 languages reveal a high degree of alignability among the spaces of structural concepts within each language for both encoder-only and decoder-only LLMs. We then propose a meta-learning-based method to learn to align conceptual spaces of different languages, which facilitates zero-shot and few-shot generalization in concept classification and also offers insights into the cross-lingual in-context learning phenomenon. Experiments on syntactic analysis tasks show that our approach achieves competitive results with state-of-the-art methods and narrows the performance gap between languages, particularly benefiting those with limited resources.
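    Summary Large language models (LLMs) show considerable cross-lingual generalization, implicitly transferring knowledge across languages. The transfer is not equally successful for all languages, however, especially low-resource ones, which remains an ongoing challenge. It is unclear whether implicit cross-lingual generalization has reached its limits and whether explicit knowledge transfer is viable. This paper investigates the potential of explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization. Using the syntactic aspect of language as a testbed, analyses of 43 languages reveal a high degree of alignability among the spaces of structural concepts, for both encoder-only and decoder-only LLMs. The authors then propose a meta-learning-based method that learns to align the conceptual spaces of different languages, which facilitates zero-shot and few-shot generalization in concept classification. Experiments show that the approach is competitive with existing methods and narrows the performance gap between languages.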

Label-Aware Automatic Verbalizer for Few-Shot Text Classification

  • paper_url: http://arxiv.org/abs/2310.12778
  • repo_url: None
  • paper_authors: Thanakorn Thaminkaew, Piyawat Lertvittayakumjorn, Peerapon Vateekul
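  • for: Improving few-shot text classification.
  • methods: Label-Aware Automatic Verbalizer (LAAV), which joins the manual label with the conjunction "and" to induce the model to generate more effective words for the verbalizer.
  • results: Experiments on five datasets across five languages show that LAAV significantly outperforms existing verbalizers and suggests more relevant words, especially in mid-to-low resource languages.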
    Abstract Prompt-based learning has shown its effectiveness in few-shot text classification. One important factor in its success is a verbalizer, which translates output from a language model into a predicted class. Notably, the simplest and widely acknowledged verbalizer employs manual labels to represent the classes. However, manual selection does not guarantee the optimality of the selected words when conditioned on the chosen language model. Therefore, we propose Label-Aware Automatic Verbalizer (LAAV), effectively augmenting the manual labels to achieve better few-shot classification results. Specifically, we use the manual labels along with the conjunction "and" to induce the model to generate more effective words for the verbalizer. The experimental results on five datasets across five languages demonstrate that LAAV significantly outperforms existing verbalizers. Furthermore, our analysis reveals that LAAV suggests more relevant words compared to similar approaches, especially in mid-to-low resource languages.
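    The idea of pairing the manual label with "and" can be sketched as below. This is an illustration under assumptions, not the LAAV implementation: the template wording, the function names, and the score dictionaries (standing in for a masked language model's predictions at the mask slot) are all hypothetical.

```python
from collections import defaultdict


def build_prompt(text, manual_label, mask_token="[MASK]"):
    """Hypothetical template: the manual label, then "and", then a mask slot."""
    return f"{text} This is about {manual_label} and {mask_token}."


def select_verbalizer_words(mask_scores_by_label, k=2):
    """Sum mask-slot word scores per class and keep the k highest-scoring words.

    mask_scores_by_label maps each manual label to a list of {word: score}
    dictionaries, one per training example of that class (assumed to come
    from a masked language model).
    """
    verbalizer = {}
    for label, score_dicts in mask_scores_by_label.items():
        totals = defaultdict(float)
        for scores in score_dicts:
            for word, score in scores.items():
                totals[word] += score
        # Sort by descending total score, breaking ties alphabetically.
        ranked = sorted(totals, key=lambda w: (-totals[w], w))
        verbalizer[label] = [label] + ranked[:k]
    return verbalizer
```

    Keeping the manual label as the first verbalizer word reflects the abstract's framing of augmenting, rather than replacing, the manual labels.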

  • paper_url: http://arxiv.org/abs/2310.12766
  • repo_url: https://github.com/sociovestix/lenu
  • paper_authors: Alexander Arimond, Mauro Molteni, Dominik Jany, Zornitsa Manolova, Damian Borth, Andreas G. F. Hoepner
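  • for: Classifying entity legal forms from raw legal entity names with Transformer-based language models.
  • methods: Employs various BERT variants and compares them against multiple traditional baselines.
  • results: Pre-trained BERT variants outperform traditional text classification approaches in F1 score and perform comparably well in Macro F1; third-party expert reviews in ten selected jurisdictions support the proposal.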
    Abstract We propose the application of Transformer-based language models for classifying entity legal forms from raw legal entity names. Specifically, we employ various BERT variants and compare their performance against multiple traditional baselines. Our evaluation encompasses a substantial subset of freely available Legal Entity Identifier (LEI) data, comprising over 1.1 million legal entities from 30 different legal jurisdictions. The ground truth labels for classification per jurisdiction are taken from the Entity Legal Form (ELF) code standard (ISO 20275). Our findings demonstrate that pre-trained BERT variants outperform traditional text classification approaches in terms of F1 score, while also performing comparably well in the Macro F1 Score. Moreover, the validity of our proposal is supported by the outcome of third-party expert reviews conducted in ten selected jurisdictions. This study highlights the significant potential of Transformer-based models in advancing data standardization and data integration. The presented approaches can greatly benefit financial institutions, corporations, governments and other organizations in assessing business relationships, understanding risk exposure, and promoting effective governance.
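    Summary The authors propose using Transformer-based language models to classify entity legal forms from raw legal entity names. They employ various BERT variants and compare them against multiple traditional baselines. The evaluation covers a substantial subset of freely available Legal Entity Identifier (LEI) data, comprising over 1.1 million legal entities from 30 legal jurisdictions. Ground truth labels per jurisdiction are taken from the Entity Legal Form (ELF) code standard (ISO 20275). Pre-trained BERT variants outperform traditional text classification approaches in F1 score while performing comparably well in Macro F1, and third-party expert reviews in ten selected jurisdictions support the proposal. The study highlights the potential of Transformer-based models for data standardization and data integration. The approaches can help financial institutions, corporations, governments, and other organizations assess business relationships, understand risk exposure, and promote effective governance.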
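    The abstract reports both an F1 score and a Macro F1 score, which can diverge sharply when a few legal forms dominate a jurisdiction. The sketch below contrasts a micro-averaged F1 with Macro F1 in plain Python. Treating the paper's plain "F1 score" as micro-averaged is an assumption, and the example labels are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification.

    Assumes non-empty inputs of equal length.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    macro = sum(per_class) / len(per_class)  # every class counts equally
    return micro, macro
```

    A classifier that always predicts the majority legal form scores well on the micro average but poorly on Macro F1, which is why reporting both is informative.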

Character-level Chinese Backpack Language Models

  • paper_url: http://arxiv.org/abs/2310.12751
  • repo_url: https://github.com/swordelucidator/nanobackpacklm
  • paper_authors: Hao Sun, John Hewitt
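  • for: Studying the performance and interpretability of Backpack language models on character-tokenized Chinese.
  • methods: Trains, evaluates, interprets, and controls Backpack language models in character-tokenized Chinese.
  • results: The Backpack model performs comparably to a Transformer and learns rich character-level meanings that compose log-additively into word meanings. In SimLex-style lexical semantic evaluations, simple averages of Backpack character senses outperform a Transformer's input embeddings. The authors also localize a source of gender bias to specific character senses and intervene to reduce it.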
    Abstract The Backpack is a Transformer alternative shown to improve interpretability in English language modeling by decomposing predictions into a weighted sum of token sense components. However, Backpacks' reliance on token-defined meaning raises questions as to their potential for languages other than English, a language for which subword tokenization provides a reasonable approximation for lexical items. In this work, we train, evaluate, interpret, and control Backpack language models in character-tokenized Chinese, in which words are often composed of many characters. We find that our (134M parameter) Chinese Backpack language model performs comparably to a (104M parameter) Transformer, and learns rich character-level meanings that log-additively compose to form word meanings. In SimLex-style lexical semantic evaluations, simple averages of Backpack character senses outperform input embeddings from a Transformer. We find that complex multi-character meanings are often formed by using the same per-character sense weights consistently across context. Exploring interpretability-through control, we show that we can localize a source of gender bias in our Backpacks to specific character senses and intervene to reduce the bias.
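    Summary The Backpack is a Transformer alternative shown to improve interpretability in English language modeling. However, its reliance on token-defined meaning raises the question of whether it works for languages other than English. In this work, the authors train, evaluate, interpret, and control Backpack language models in character-tokenized Chinese. Their (134M parameter) Chinese Backpack language model performs comparably to a (104M parameter) Transformer and learns rich character-level meanings that compose log-additively to form word meanings. In SimLex-style lexical semantic evaluations, simple averages of Backpack character senses outperform input embeddings from a Transformer. Complex multi-character meanings are often formed by using the same per-character sense weights consistently across contexts. Exploring interpretability through control, the authors localize a source of gender bias to specific character senses and intervene to reduce it.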
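    The phrase "weighted sum of token sense components" has a concrete consequence that a small numeric sketch makes visible: if the output representation is a weighted sum of sense vectors, the logit of any candidate word splits into one additive contribution per sense. The function names and toy vectors below are hypothetical; this is not code from the nanobackpacklm repository.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_logit(weighted_senses, word_embedding):
    """Return (logit, per-sense contributions) for one candidate next word.

    weighted_senses is a list of (weight, sense_vector) pairs drawn from the
    characters in the context (toy values, assumed for illustration).
    """
    contributions = [
        weight * dot(sense, word_embedding) for weight, sense in weighted_senses
    ]
    return sum(contributions), contributions
```

    Because the logit is a plain sum, removing or rescaling one sense changes the logit by exactly that sense's contribution, so its effect on the log-probability is additive up to the softmax normalizer. This is the property that sense-level interventions rely on.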

Representing and Computing Uncertainty in Phonological Reconstruction

  • paper_url: http://arxiv.org/abs/2310.12727
  • repo_url: https://github.com/lingpy/fuzzy
  • paper_authors: Johann-Mattis List, Nathan W. Hill, Robert Forkel, Frederic Blum
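  • for: Addressing how to represent uncertainty when reconstructing proto-forms in historical linguistics.
  • methods: Builds on recent supervised phonological reconstruction, in which an algorithm learns to reconstruct words in a given proto-language from previously annotated data, and on improved methods for automated word prediction from cognate sets.
  • results: Presents a new framework for representing uncertainty in linguistic reconstruction, including a workflow for computing fuzzy reconstructions from linguistic data.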
    Abstract Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.
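    Summary Although reconstructions in historical linguistics are inherently fuzzy, most scholars do not represent their uncertainty when proposing proto-forms. With the growing success of recently proposed approaches to automating parts of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent supervised phonological reconstruction methods, the authors present a new framework that allows uncertainty in linguistic reconstruction to be represented and that includes a workflow for computing fuzzy reconstructions from linguistic data.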
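    One simple way to picture a "fuzzy" reconstruction is as a distribution over sounds at each position, computed from several candidate proto-forms. The sketch below is an assumption-laden illustration, not the API or algorithm of the lingpy/fuzzy repository: it assumes the candidate forms are already aligned to equal length, and the example forms are invented.

```python
from collections import Counter


def fuzzy_reconstruction(aligned_forms):
    """Per-position sound frequencies across aligned candidate proto-forms."""
    if not aligned_forms:
        raise ValueError("need at least one candidate form")
    length = len(aligned_forms[0])
    if any(len(form) != length for form in aligned_forms):
        raise ValueError("forms must be aligned to the same length")
    fuzzy = []
    for position in zip(*aligned_forms):
        counts = Counter(position)
        total = sum(counts.values())
        # Most frequent sound first, each with its relative frequency.
        fuzzy.append({sound: n / total for sound, n in counts.most_common()})
    return fuzzy
```

    Candidate forms could come, for example, from models trained on different subsets of the annotated data; that sourcing is also an assumption here.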

Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing

  • paper_url: http://arxiv.org/abs/2310.12664
  • repo_url: None
  • paper_authors: Yue Guo, Zian Xu, Yi Yang
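  • for: Evaluating the overall ability of large language models (LLMs) in the financial domain.
  • methods: Uses the FinLMEval framework, which comprises nine financial-language datasets, to compare encoder-only and decoder-only language models.
  • results: Some decoder-only LLMs perform notably well on most financial tasks via zero-shot prompting, but they generally lag behind fine-tuned expert models, especially on proprietary datasets.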
    Abstract The emergence of Large Language Models (LLMs), such as ChatGPT, has revolutionized general natural language processing (NLP) tasks. However, their expertise in the financial domain lacks a comprehensive evaluation. To assess the ability of LLMs to solve financial NLP tasks, we present FinLMEval, a framework for Financial Language Model Evaluation, comprising nine datasets designed to evaluate the performance of language models. This study compares the performance of encoder-only language models and the decoder-only language models. Our findings reveal that while some decoder-only LLMs demonstrate notable performance across most financial tasks via zero-shot prompting, they generally lag behind the fine-tuned expert models, especially when dealing with proprietary datasets. We hope this study provides foundational evaluations for continuing efforts to build more advanced LLMs in the financial domain.
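    Summary Large language models (LLMs) such as ChatGPT have revolutionized general natural language processing (NLP) tasks, but their expertise in the financial domain still lacks a comprehensive evaluation. To assess how well language models solve financial NLP tasks, the authors present FinLMEval, a framework comprising nine datasets for evaluating language model performance. The study compares encoder-only and decoder-only language models. Although some decoder-only LLMs perform notably well on most financial tasks via zero-shot prompting, they generally lag behind fine-tuned expert models, especially on proprietary datasets. The authors hope the study provides foundational evaluations for continued efforts to build more advanced LLMs in the financial domain.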

Towards Real-World Streaming Speech Translation for Code-Switched Speech

  • paper_url: http://arxiv.org/abs/2310.12648
  • repo_url: https://github.com/apple/ml-codeswitching-translations
  • paper_authors: Belen Alastruey, Matthias Sperber, Christian Gollan, Dominic Telaar, Tim Ng, Aashish Agarwal
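  • for: Studying speech translation of code-switched (CS) speech in streaming settings and into a third language.
  • methods: Extends the Fisher and Miami test and validation datasets with new targets in Spanish and German.
  • results: Trains a model for both offline and streaming speech translation and establishes baseline results for the two settings.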
    Abstract Code-switching (CS), i.e. mixing different languages in a single sentence, is a common phenomenon in communication and can be challenging in many Natural Language Processing (NLP) settings. Previous studies on CS speech have shown promising results for end-to-end speech translation (ST), but have been limited to offline scenarios and to translation to one of the languages present in the source (monolingual transcription). In this paper, we focus on two essential yet unexplored areas for real-world CS speech translation: streaming settings, and translation to a third language (i.e., a language not included in the source). To this end, we extend the Fisher and Miami test and validation datasets to include new targets in Spanish and German. Using this data, we train a model for both offline and streaming ST and we establish baseline results for the two settings mentioned earlier.

Non-Autoregressive Sentence Ordering

  • paper_url: http://arxiv.org/abs/2310.12640
  • repo_url: https://github.com/steven640pixel/nonautoregressive-sentence-ordering
  • paper_authors: Yi Bin, Wenhao Shi, Bin Ji, Jipeng Zhang, Yujuan Ding, Yang Yang
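  • for: Improving on existing sentence ordering methods.
  • methods: Proposes a Non-Autoregressive Ordering Network (NAON), which predicts the sentence for each position in parallel and exploits bilateral dependencies between sentences.
  • results: Extensive experiments on several commonly used datasets show that NAON outperforms all autoregressive approaches and is competitive with the state of the art. Code is available at https://github.com/steven640pixel/nonautoregressive-sentence-ordering.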
    Abstract Existing sentence ordering approaches generally employ encoder-decoder frameworks with the pointer net to recover the coherence by recurrently predicting each sentence step-by-step. Such an autoregressive manner only leverages unilateral dependencies during decoding and cannot fully explore the semantic dependency between sentences for ordering. To overcome these limitations, in this paper, we propose a novel Non-Autoregressive Ordering Network, dubbed NAON, which explores bilateral dependencies between sentences and predicts the sentence for each position in parallel. We claim that the non-autoregressive manner is not just applicable but also particularly suitable to the sentence ordering task because of two peculiar characteristics of the task: 1) each generation target has a deterministic length, and 2) the sentences and positions should match exclusively. Furthermore, to address the repetition issue of the naive non-autoregressive Transformer, we introduce an exclusive loss to constrain the exclusiveness between positions and sentences. To verify the effectiveness of the proposed model, we conduct extensive experiments on several commonly used datasets and the experimental results show that our method outperforms all the autoregressive approaches and yields competitive performance compared with the state of the art. The codes are available at: https://github.com/steven640pixel/nonautoregressive-sentence-ordering.
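    Summary Existing sentence ordering methods usually adopt an encoder-decoder framework with a pointer network to recover coherence by predicting sentences one step at a time. This autoregressive approach exploits only unilateral dependencies during decoding and cannot fully explore the semantic dependencies between sentences. To overcome these limitations, the paper proposes a Non-Autoregressive Ordering Network, called NAON, which explores bilateral dependencies between sentences and predicts the sentence for each position in parallel. The authors argue that non-autoregressive decoding is not only applicable but particularly suitable for sentence ordering, because each generation target has a deterministic length and sentences and positions must match exclusively. To address the repetition problem of the naive non-autoregressive Transformer, they introduce an exclusive loss that constrains the exclusiveness between positions and sentences. Extensive experiments on several commonly used datasets show that the method outperforms all autoregressive approaches and is competitive with the state of the art. Code is available at https://github.com/steven640pixel/nonautoregressive-sentence-ordering.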
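    The repetition issue mentioned in the abstract can be illustrated with a toy score matrix: when each position picks its sentence independently, two positions may choose the same sentence. The sketch below contrasts that with a one-to-one assignment found by brute force. This is only an illustration of the exclusiveness requirement at decoding time; the paper addresses it with an exclusive loss during training, whose formulation is not reproduced here. All names and scores are hypothetical.

```python
from itertools import permutations


def independent_argmax(scores):
    """Each position picks its best sentence on its own; repeats are possible."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def best_one_to_one(scores):
    """Highest-scoring assignment that uses every sentence exactly once.

    Brute force over permutations, so suitable only for tiny examples.
    scores[pos][sent] is the score of placing sentence sent at position pos.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[pos][perm[pos]] for pos in range(n)),
    )
    return list(best)
```

    For realistic document lengths a polynomial-time assignment solver would replace the brute-force search.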

Predict the Future from the Past? On the Temporal Data Distribution Shift in Financial Sentiment Classifications

  • paper_url: http://arxiv.org/abs/2310.12620
  • repo_url: None
  • paper_authors: Yue Guo, Chenxi Hu, Yi Yang
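  • for: Asking how a financial sentiment analysis system can be trained in a volatile market environment so that it infers sentiment accurately and stays robust to temporal data distribution shifts.
  • methods: An empirical study of financial sentiment analysis under temporal distribution shift, plus a new method that combines out-of-distribution detection with time series modeling.
  • results: The proposed method improves the model's ability to adapt to evolving temporal shifts in a volatile financial market.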
    Abstract Temporal data distribution shift is prevalent in the financial text. How can a financial sentiment analysis system be trained in a volatile market environment that can accurately infer sentiment and be robust to temporal data distribution shifts? In this paper, we conduct an empirical study on the financial sentiment analysis system under temporal data distribution shifts using a real-world financial social media dataset that spans three years. We find that the fine-tuned models suffer from general performance degradation in the presence of temporal distribution shifts. Furthermore, motivated by the unique temporal nature of the financial text, we propose a novel method that combines out-of-distribution detection with time series modeling for temporal financial sentiment analysis. Experimental results show that the proposed method enhances the model's capability to adapt to evolving temporal shifts in a volatile financial market.
    Summary Temporal data distribution shift is prevalent in financial text. How can a financial sentiment analysis system be trained in a volatile market environment so that it infers sentiment accurately and is robust to such shifts? Using a real-world financial social media dataset spanning three years, the authors conduct an empirical study and find that fine-tuned models suffer general performance degradation under temporal distribution shift. Motivated by the distinctive temporal nature of financial text, they propose a new method that combines out-of-distribution detection with time series modeling for temporal financial sentiment analysis. Experiments show that the method improves the model's ability to adapt to evolving temporal shifts in a volatile financial market.
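    Measuring degradation under temporal shift generally requires a chronological split rather than a random one, so that the model is evaluated on data from after its training period. The sketch below shows such a split; the record fields, the cutoff date, and the `accuracy` helper are hypothetical and do not come from the paper.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on posts before the cutoff date; evaluate on the cutoff onward."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


def accuracy(model, records):
    """Share of records whose predicted label matches the gold label."""
    if not records:
        raise ValueError("records must not be empty")
    return sum(model(r["text"]) == r["label"] for r in records) / len(records)
```

    Comparing accuracy on a held-out slice from the training period with accuracy on the later slice gives a rough estimate of the degradation caused by the shift.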

Multilingual estimation of political-party positioning: From label aggregation to long-input Transformers

  • paper_url: http://arxiv.org/abs/2310.12575
  • repo_url: https://github.com/macleginn/party-positioning-code
  • paper_authors: Dmitry Nikolaev, Tanise Ceron, Sebastian Padó
  • for: Automating scaling analysis in computational political science, which scores the position of a political actor (such as a politician or party) from a long text (such as a parliamentary speech or election manifesto).
  • methods: Compares two approaches: label aggregation, a pipeline strategy that relies on annotations of individual manifesto statements, and long-input-Transformer-based models, which compute scaling values directly from raw text.
  • results: On the Comparative Manifestos Project dataset, covering 41 countries and 27 languages, state-of-the-art models solve the task efficiently, with label aggregation producing the best results.
    Abstract Scaling analysis is a technique in computational political science that assigns a political actor (e.g. politician or party) a score on a predefined scale based on a (typically long) body of text (e.g. a parliamentary speech or an election manifesto). For example, political scientists have often used the left-right scale to systematically analyse political landscapes of different countries. NLP methods for automatic scaling analysis can find broad application provided they (i) are able to deal with long texts and (ii) work robustly across domains and languages. In this work, we implement and compare two approaches to automatic scaling analysis of political-party manifestos: label aggregation, a pipeline strategy relying on annotations of individual statements from the manifestos, and long-input-Transformer-based models, which compute scaling values directly from raw text. We carry out the analysis of the Comparative Manifestos Project dataset across 41 countries and 27 languages and find that the task can be efficiently solved by state-of-the-art models, with label aggregation producing the best results.
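    Summary Scaling analysis is a technique in computational political science that assigns a political actor (such as a politician or party) a score on a predefined scale, based on a typically long body of text (such as a parliamentary speech or an election manifesto). Political scientists often use the left-right scale, for example, to analyze the political landscapes of different countries systematically. NLP methods can automate scaling analysis provided they handle long texts and work robustly across domains and languages. This work implements and compares two approaches to automatic scaling analysis of party manifestos: label aggregation and long-input-Transformer-based models. Analysis of the Comparative Manifestos Project dataset across 41 countries and 27 languages shows that state-of-the-art models solve the task efficiently, with label aggregation producing the best results.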
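    Label aggregation turns per-statement annotations into a single score for a manifesto. The sketch below uses a share-difference formula that resembles the RILE-style left-right index from the manifesto literature; whether the paper aggregates with exactly this formula is an assumption, and the three-way label set is a hypothetical simplification.

```python
def aggregate_left_right(statement_labels):
    """Score in [-1, 1]: share of right statements minus share of left ones.

    statement_labels holds one label per statement; any label other than
    "left" or "right" counts toward the denominator only.
    """
    if not statement_labels:
        raise ValueError("need at least one labeled statement")
    left = statement_labels.count("left")
    right = statement_labels.count("right")
    return (right - left) / len(statement_labels)
```

    In a pipeline setting the labels would come from a statement-level classifier, so the score is computed from short inputs rather than from the full manifesto text.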

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong

  • paper_url: http://arxiv.org/abs/2310.12558
  • repo_url: None
  • paper_authors: Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber
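  • for: Studying how reliable and factual large language models (LLMs) are when they help users verify information.
  • methods: Experiments with 80 crowdworkers comparing language models with search engines at helping users fact-check claims.
  • results: Users reading LLM explanations are significantly more efficient than users of search engines at similar accuracy, but they tend to over-rely on the LLM when its explanation is wrong. Contrastive explanations mitigate this over-reliance but cannot significantly outperform search engines.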
    Abstract Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they're getting, LLMs should not only provide but also help users fact-check information. In this paper, we conduct experiments with 80 crowdworkers in total to compare language models with search engines (information retrieval systems) at facilitating fact-checking by human users. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than using search engines with similar accuracy. However, they tend to over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. However, showing both search engine results and LLM explanations offers no complementary benefits as compared to search engines alone. Taken together, natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages yet, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.
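    Summary Large language models (LLMs) are increasingly used to access information on the web, so their truthfulness and factuality are of great interest. To help users make the right decisions about information, LLMs should not only provide it but also help users fact-check it. In experiments with 80 crowdworkers, the authors compare LLMs with search engines (information retrieval systems) at facilitating fact-checking by human users. The LLM is prompted to validate a given claim and provide an explanation. Users reading LLM explanations are significantly more efficient than users of search engines at similar accuracy, but they tend to over-rely on the LLM when its explanation is wrong. To reduce over-reliance, the LLM is asked for contrastive information, explaining both why the claim is true and why it is false, and both sides are shown to users. This mitigates over-reliance but does not significantly outperform search engines. Showing both search engine results and LLM explanations offers no complementary benefit over search engines alone. Natural language explanations by LLMs may therefore not yet be a reliable replacement for reading retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could have critical consequences.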

Product Attribute Value Extraction using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.12537
  • repo_url: https://github.com/wbsg-uni-mannheim/extractgpt
  • paper_authors: Alexander Brinkmann, Roee Shraga, Christian Bizer
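  • for: Exploring large language models (LLMs) as a training-data-efficient and robust alternative for attribute/value extraction from product descriptions.
  • methods: Evaluates hosted LLMs such as GPT-3.5 and GPT-4, as well as open-source LLMs based on Llama2, with different prompt designs, example attribute values, in-context demonstrations, and fine-tuning of GPT-3.5.
  • results: GPT-4 achieves an average F1-score of 85% on the two evaluation datasets, while the best PLM-based techniques perform about 5% worse with the same amount of training data. The fine-tuned GPT-3.5 model reaches performance similar to GPT-4 while being significantly more cost-efficient.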
    Abstract E-commerce applications such as faceted product search or product comparison are based on structured product descriptions like attribute/value pairs. The vendors on e-commerce platforms do not provide structured product descriptions but describe offers using titles or descriptions. To process such offers, it is necessary to extract attribute/value pairs from textual product attributes. State-of-the-art attribute/value extraction techniques rely on pre-trained language models (PLMs), such as BERT. Two major drawbacks of these models for attribute/value extraction are that (i) the models require significant amounts of task-specific training data and (ii) the fine-tuned models face challenges in generalizing to attribute values not included in the training data. This paper explores the potential of large language models (LLMs) as a training data-efficient and robust alternative to PLM-based attribute/value extraction methods. We consider hosted LLMs, such as GPT-3.5 and GPT-4, as well as open-source LLMs based on Llama2. We evaluate the models in a zero-shot scenario and in a scenario where task-specific training data is available. In the zero-shot scenario, we compare various prompt designs for representing information about the target attributes of the extraction. In the scenario with training data, we investigate (i) the provision of example attribute values, (ii) the selection of in-context demonstrations, and (iii) the fine-tuning of GPT-3.5. Our experiments show that GPT-4 achieves an average F1-score of 85% on the two evaluation datasets while the best PLM-based techniques perform on average 5% worse using the same amount of training data. GPT-4 achieves a 10% higher F1-score than the best open-source LLM. The fine-tuned GPT-3.5 model reaches a similar performance as GPT-4 while being significantly more cost-efficient.
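    Summary E-commerce applications such as faceted product search or product comparison rely on structured product descriptions in the form of attribute/value pairs. Vendors on e-commerce platforms do not provide structured descriptions but describe offers using titles or free-text descriptions, so attribute/value pairs must be extracted from text. State-of-the-art extraction techniques rely on pre-trained language models (PLMs) such as BERT, which have two drawbacks: (i) they need large amounts of task-specific training data, and (ii) the fine-tuned models struggle to generalize to attribute values not seen in training. This paper explores large language models (LLMs) as a training-data-efficient and robust alternative. The authors consider hosted LLMs, such as GPT-3.5 and GPT-4, as well as open-source LLMs based on Llama2, and evaluate them in a zero-shot scenario and in a scenario with task-specific training data. In the zero-shot scenario, they compare prompt designs for representing information about the target attributes. With training data, they investigate (i) providing example attribute values, (ii) selecting in-context demonstrations, and (iii) fine-tuning GPT-3.5. GPT-4 achieves an average F1-score of 85% on the two evaluation datasets, while the best PLM-based techniques perform on average 5% worse with the same amount of training data. GPT-4 achieves a 10% higher F1-score than the best open-source LLM. The fine-tuned GPT-3.5 model reaches performance similar to GPT-4 while being significantly more cost-efficient.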

ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

  • paper_url: http://arxiv.org/abs/2310.12531
  • repo_url: https://github.com/gjwubyron/icu
  • paper_authors: Guojun Wu
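  • for: This paper addresses the difficulty of achieving both multilingual and multimodal capabilities in multilingual vision-and-language (V&L) research.
  • methods: The paper proposes ICU (Image Caption Understanding), which splits a V&L task into two stages: a V&L model first captions the image in English, and a multilingual language model (mLM) then takes the caption as alt text and performs cross-lingual language understanding. This shifts the burden of multilingual processing onto the mLM.
  • results: In experiments on two tasks across 9 languages in the IGLUE benchmark, ICU achieves new state-of-the-art results for five languages and comparable results for the rest.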
    Abstract Most multilingual vision-and-language (V&L) research aims to accomplish multilingual and multimodal capabilities within one model. However, the scarcity of multilingual captions for images has hindered the development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs crosslingual language understanding. The burden of multilingual processing is lifted off V&L model and placed on mLM. Since the multilingual text data is relatively of higher abundance and quality, ICU can facilitate the conquering of language barriers for V&L models. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
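    Summary Most multilingual vision-and-language (V&L) research tries to build multilingual and multimodal capabilities into a single model, but the scarcity of multilingual image captions has held this work back. ICU (Image Caption Understanding) instead splits a V&L task into two stages: a V&L model generates an English caption for the image, and a multilingual language model (mLM) takes that caption as alt text and performs cross-lingual language understanding, which moves the burden of multilingual processing from the V&L model to the mLM. Because multilingual text data is comparatively abundant and of high quality, ICU can help V&L models overcome language barriers. In experiments on two tasks in the IGLUE benchmark, ICU achieves new state-of-the-art results for five languages and comparable results for the remaining ones.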
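
    The two-stage split described above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `icu_predict`, `toy_captioner`, and `toy_mlm` are hypothetical names, the two stand-in functions replace real models, and the classification-style task with an "entailment"/"neutral" label is assumed for the example rather than taken from the paper.

```python
from typing import Callable


def icu_predict(
    image: object,
    target_text: str,
    caption_fn: Callable[[object], str],
    understand_fn: Callable[[str, str], str],
) -> str:
    """Two-stage prediction: caption in English, then reason in any language."""
    # Stage 1: a V&L model (here any callable) produces an English caption.
    caption = caption_fn(image)
    # Stage 2: a multilingual LM treats the caption as alt text and
    # relates it to text written in the target language.
    return understand_fn(caption, target_text)


# Toy stand-ins (assumptions for illustration, not real models).
def toy_captioner(image: object) -> str:
    return "a dog runs on the beach"


def toy_mlm(alt_text: str, target_text: str) -> str:
    # Pretends to "understand" German by matching one known word pair.
    return "entailment" if "dog" in alt_text and "Hund" in target_text else "neutral"


label = icu_predict(object(), "Ein Hund läuft am Strand", toy_captioner, toy_mlm)
print(label)
```

    Passing the two stages as callables keeps the sketch independent of any particular captioning or language model, which mirrors the point that only the second stage needs to handle multiple languages.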

Named Entity Recognition for Monitoring Plant Health Threats in Tweets: a ChouBERT Approach

  • paper_url: http://arxiv.org/abs/2310.12522
  • repo_url: None
  • paper_authors: Shufan Jiang, Rafael Angarita, Stéphane Cormier, Francis Rousseaux
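  • for: This study targets the detection of plant health threats in precision agriculture from textual data, which remains under-explored compared with sensor-based data analysis.
  • methods: The study uses user-generated text from Twitter and examines ChouBERT, a French pre-trained language model, on token-level annotation tasks over small labeled sets.
  • results: ChouBERT can identify tweets that report observations of plant health issues, including unfamiliar ones, and generalizes to unseen natural hazards.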
    Abstract An important application scenario of precision agriculture is detecting and measuring crop health threats using sensors and data analysis techniques. However, the textual data are still under-explored among the existing solutions due to the lack of labelled data and fine-grained semantic resources. Recent research suggests that the increasing connectivity of farmers and the emergence of online farming communities make social media like Twitter a participatory platform for detecting unfamiliar plant health events if we can extract essential information from unstructured textual data. ChouBERT is a French pre-trained language model that can identify Tweets concerning observations of plant health issues with generalizability on unseen natural hazards. This paper tackles the lack of labelled data by further studying ChouBERT's know-how on token-level annotation tasks over small labeled sets.
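    Summary An important application of precision agriculture is detecting and measuring crop health threats with sensors and data analysis techniques. Textual data remains under-explored in existing solutions because labelled data and fine-grained semantic resources are lacking. Recent research suggests that growing connectivity among farmers and the emergence of online farming communities make social media such as Twitter a participatory platform for detecting unfamiliar plant health events. ChouBERT is a French pre-trained language model that can identify tweets reporting observations of plant health issues and generalizes to unseen natural hazards. The paper addresses the lack of labelled data by further studying ChouBERT's capability on token-level annotation tasks.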

Lost in Translation: When GPT-4V(ision) Can’t See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

  • paper_url: http://arxiv.org/abs/2310.12520
  • repo_url: None
  • paper_authors: Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi
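  • for: This paper examines recent advances in multimodal techniques and how the resulting models perform on tasks involving text, audio, and image processing.
  • methods: The paper conducts a comprehensive analysis of Vision Large Language Models (VLLMs), which combine computer vision and language modeling, across a range of tasks.
  • results: When tasks are relatively simple, models such as GPT-4V behave fairly consistently across modalities; as tasks become more complex, the reliability of the vision modality declines. The paper also proposes "Vision Description Prompting", which effectively improves performance on challenging vision tasks.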
    Abstract Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex text and image tasks. Numerous prior research endeavors have diligently examined the performance of these Vision Large Language Models (VLLMs) across tasks like object detection, image captioning and others. However, these analyses often focus on evaluating the performance of each modality in isolation, lacking insights into their cross-modal interactions. Specifically, questions concerning whether these vision-language models execute vision and language tasks consistently or independently have remained unanswered. In this study, we draw inspiration from recent investigations into multilingualism and conduct a comprehensive analysis of models' cross-modal interactions. We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting and provide a set of datasets designed for these evaluations. Our findings reveal that models like GPT-4V tend to perform consistently across modalities when the tasks are relatively simple. However, the trustworthiness of results derived from the vision modality diminishes as the tasks become more challenging. Expanding on our findings, we introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
    Summary Advances in multimodal techniques have opened the way for models that excel at tasks involving text, audio, and image processing. Models such as GPT-4V, which blend computer vision and language modeling, perform well on complex text and image tasks. Many earlier studies have examined how these Vision Large Language Models (VLLMs) perform on individual tasks, but they usually evaluate each modality in isolation and offer little insight into cross-modal interactions. In particular, whether these models carry out vision and language tasks consistently has remained unanswered. Drawing inspiration from recent investigations into multilingualism, the paper conducts a comprehensive analysis of the models' cross-modal interactions. It introduces a systematic framework that quantifies the capability disparities between modalities in the multimodal setting and provides a set of datasets designed for these evaluations. The findings show that models like GPT-4V tend to perform consistently across modalities when tasks are relatively simple, while the trustworthiness of results derived from the vision modality diminishes as tasks become more challenging. Building on these findings, the paper introduces "Vision Description Prompting", a method that effectively improves performance on challenging vision-related tasks.

Attack Prompt Generation for Red Teaming and Defending Large Language Models

  • paper_url: http://arxiv.org/abs/2310.12505
  • repo_url: https://github.com/aatrox103/sap
  • paper_authors: Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He
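  • for: Large language models (LLMs) are susceptible to red teaming attacks that induce them to generate harmful content; the paper targets both generating such attacks and defending against them.
  • methods: An integrated approach that combines manual and automatic methods to construct high-quality attack prompts economically.
  • results: Experiments validate the effectiveness of the proposed attack and defense frameworks, and the authors release a series of attack prompt datasets (SAP) of varying sizes.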
    Abstract Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompts datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs. Our code and dataset are available at https://github.com/Aatrox103/SAP.

Co$^2$PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning

  • paper_url: http://arxiv.org/abs/2310.12490
  • repo_url: https://github.com/dongxiangjue/co2pt
  • paper_authors: Xiangjue Dong, Ziwei Zhu, Zhuoer Wang, Maria Teleki, James Caverlee
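  • for: Reducing social biases in pre-trained language models.
  • methods: A debias-while-prompt-tuning method that mitigates bias through counterfactual contrastive prompt tuning on downstream tasks.
  • results: Experiments on three extrinsic bias benchmarks show that Co$^2$PT mitigates bias effectively on downstream tasks and can be combined with existing upstream debiased language models.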
    Abstract Pre-trained Language Models are widely used in many important real-world applications. However, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. To address this challenge, we propose Co$^2$PT, an efficient and effective debias-while-prompt tuning method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. Our experiments conducted on three extrinsic bias benchmarks demonstrate the effectiveness of Co$^2$PT on bias mitigation during the prompt tuning process and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co$^2$PT and provide promising avenues for further enhancement in bias mitigation on downstream tasks.
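    Summary Pre-trained language models are widely used in many important real-world applications. Recent studies show, however, that these models can absorb social biases from large pre-training corpora and even amplify them in downstream applications. To address this, the paper proposes Co$^2$PT, an efficient debias-while-prompt-tuning method that mitigates bias through counterfactual contrastive prompt tuning on downstream tasks. Experiments on three extrinsic bias benchmarks demonstrate its effectiveness at mitigating bias during prompt tuning and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co$^2$PT and point to further improvements in bias mitigation on downstream tasks.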
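
    To make the phrase "counterfactual contrastive" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the word-swap table `SWAPS`, the helper names, and the use of an InfoNCE-style loss as a generic contrastive objective. The abstract does not spell out the loss, so this should not be read as the authors' formulation.

```python
import math

# Hypothetical, deliberately tiny swap table for building counterfactuals.
SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man"}


def counterfactual(sentence: str) -> str:
    """Swap demographic terms token by token (lowercase input assumed)."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic contrastive loss: low when anchor and positive are close."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


original = "the doctor said he was late"
flipped = counterfactual(original)

# Toy 2-d "representations": a debiased encoder should map the original and
# its counterfactual to nearby vectors, which gives a small loss.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(flipped, aligned_loss < misaligned_loss)
```

    In this toy setup the loss rewards an encoder for treating a sentence and its counterfactual alike, which is the intuition behind using counterfactual pairs as positives.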

MedAI Dialog Corpus (MEDIC): Zero-Shot Classification of Doctor and AI Responses in Health Consultations

  • paper_url: http://arxiv.org/abs/2310.12489
  • repo_url: None
  • paper_authors: Olumide E. Ojo, Olaronke O. Adebanji, Alexander Gelbukh, Hiram Calvo, Anna Feldman
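  • for: This study examines how well pre-trained language models can classify doctor-written and AI-generated text in health consultations through zero-shot learning.
  • methods: Pre-trained language models are applied in a zero-shot manner to classify doctor and AI responses in health consultations.
  • results: The models show clear limitations in classifying doctor and AI responses in this setting. The results lay the groundwork for future research on medical text classification.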
    Abstract Zero-shot classification enables text to be classified into classes not seen during training. In this research, we investigate the effectiveness of pre-trained language models to accurately classify responses from Doctors and AI in health consultations through zero-shot learning. Our study aims to determine whether these models can effectively detect if a text originates from human or AI models without specific corpus training. We collect responses from doctors to patient inquiries about their health and pose the same question/response to AI models. While zero-shot language models show a good understanding of language in general, they have limitations in classifying doctor and AI responses in healthcare consultations. This research lays the groundwork for further research into this field of medical text classification, informing the development of more effective approaches to accurately classify doctor-generated and AI-generated text in health consultations.
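    Summary Zero-shot classification allows text to be assigned to classes that were not seen during training. The study investigates how accurately pre-trained language models classify responses in health consultations, with the goal of determining whether they can detect if a text comes from a human or from an AI model without corpus-specific training. The authors collect doctors' responses to patient inquiries and pose the same questions to AI models. Although zero-shot language models show a good general understanding of language, they have limitations in classifying doctor and AI responses in healthcare consultations. The study lays the groundwork for further research on medical text classification and informs the development of more effective approaches to distinguishing doctor-generated from AI-generated text.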

Contrastive Learning for Inference in Dialogue

  • paper_url: http://arxiv.org/abs/2310.12467
  • repo_url: https://github.com/hltchkust/contrastive_inference_dialogue
  • paper_authors: Etsuko Ishii, Yan Xu, Bryan Wilie, Ziwei Ji, Holy Lovenia, Willy Chung, Pascale Fung
  • for: This paper aims to improve the ability of language models in inductive reasoning, specifically in generating correct inferences when not all information is present in the context.
  • methods: The paper uses contrastive learning, where negative samples are fed to the model to help it understand what is wrong and improve its inference generation.
  • results: The experiments suggest that using negative samples improves the model’s ability to generate correct inferences, mitigating the information gap between dialogue contexts and desired inferences.
    Abstract Inference, especially those derived from inductive processes, is a crucial component in our conversation to complement the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, is far behind deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap -- which distinguishes inductive and deductive reasoning (Johnson-Laird, 1988, 1993). Our analysis reveals that the disparity in information between dialogue contexts and desired inferences poses a significant challenge to the inductive inference process. To mitigate this information gap, we investigate a contrastive learning approach by feeding negative samples. Our experiments suggest negative samples help models understand what is wrong and improve their inference generations.
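    Summary Inference in dialogue, especially inference derived from inductive processes, is a crucial part of conversation because it complements the information a speaker conveys implicitly or explicitly. Recent large language models show remarkable advances on inference tasks, but their performance on inductive reasoning lags far behind their performance on deductive reasoning. The paper analyzes model behavior according to task difficulty defined by the semantic information gap. The analysis shows that the disparity in information between dialogue contexts and desired inferences poses a significant challenge. To mitigate this gap, the authors investigate a contrastive learning approach that feeds negative samples to the model. The experiments suggest that negative samples help models understand what is wrong and improve the inferences they generate.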

Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights

  • paper_url: http://arxiv.org/abs/2310.12462
  • repo_url: None
  • paper_authors: Yichuan Deng, Zhao Song, Shenghao Xie, Chiwun Yang
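  • for: This study investigates whether the data fed into a transformer can be recovered from its attention weights and outputs.
  • methods: The authors propose a theoretical framework that recovers the input data X by minimizing a loss function L(X).
  • results: The findings suggest that input data can be recovered from attention weights and outputs, which points to potential security and privacy vulnerabilities in the design of LLMs.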
    Abstract In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
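    Summary In deep learning, transformers have become a dominant architecture, particularly for natural language processing tasks. With their widespread adoption, concerns have arisen about the security and privacy of the data these models process. The paper addresses a central question: can the input data $X \in \mathbb{R}^{d \times n}$ be recovered from a transformer's attention weights and outputs? The authors introduce a theoretical framework and present an algorithm that, given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$, aims to recover $X$ by minimizing a loss function $L(X)$. This loss captures the discrepancy between the expected output and the actual output of the transformer. The findings have significant security and privacy implications for the Localized Layer-wise Mechanism (LLM) and suggest potential vulnerabilities in the model's design. The work underscores the importance of understanding and safeguarding the internal workings of transformers to keep processed data confidential.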

A Read-and-Select Framework for Zero-shot Entity Linking

  • paper_url: http://arxiv.org/abs/2310.12450
  • repo_url: https://github.com/hitsz-tmg/read-and-select
  • paper_authors: Zhenran Xu, Yulin Chen, Baotian Hu, Min Zhang
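  • for: The paper proposes a method for zero-shot entity linking (EL), a task that challenges a model's generalization ability.
  • methods: The paper uses a read-and-select (ReS) framework that models the main components of entity disambiguation: mention-entity matching and cross-entity comparison.
  • results: The method achieves state-of-the-art performance on the ZESHEL dataset without the multi-phase pre-training used in most previous work, showing the effectiveness of both mention-entity and cross-entity interaction.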
    Abstract Zero-shot entity linking (EL) aims at aligning entity mentions to unseen entities to challenge the generalization ability. Previous methods largely focus on the candidate retrieval stage and ignore the essential candidate ranking stage, which disambiguates among entities and makes the final linking prediction. In this paper, we propose a read-and-select (ReS) framework by modeling the main components of entity disambiguation, i.e., mention-entity matching and cross-entity comparison. First, for each candidate, the reading module leverages mention context to output mention-aware entity representations, enabling mention-entity matching. Then, in the selecting module, we frame the choice of candidates as a sequence labeling problem, and all candidate representations are fused together to enable cross-entity comparison. Our method achieves the state-of-the-art performance on the established zero-shot EL dataset ZESHEL with a 2.55% micro-average accuracy gain, with no need for laborious multi-phase pre-training used in most of the previous work, showing the effectiveness of both mention-entity and cross-entity interaction.
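    Summary Zero-shot entity linking (EL) aligns entity mentions with unseen entities and thereby tests generalization ability. Previous methods focus mainly on the candidate retrieval stage and neglect the candidate ranking stage, which disambiguates among entities and makes the final linking prediction. The paper proposes a read-and-select (ReS) framework that models the main components of entity disambiguation: mention-entity matching and cross-entity comparison. First, for each candidate, the reading module uses the mention context to produce mention-aware entity representations, which enables mention-entity matching. Then the selecting module frames the choice among candidates as a sequence labeling problem and fuses all candidate representations to enable cross-entity comparison. The method achieves state-of-the-art performance on the ZESHEL dataset with a 2.55% gain in micro-average accuracy and without laborious multi-phase pre-training, showing the effectiveness of both mention-entity and cross-entity interaction.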

Revisiting Sparse Retrieval for Few-shot Entity Linking

  • paper_url: http://arxiv.org/abs/2310.12444
  • repo_url: https://github.com/hitsz-tmg/sparse-retrieval-fewshot-el
  • paper_authors: Yulin Chen, Zhenran Xu, Baotian Hu, Min Zhang
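  • for: Improving entity linking when only a limited amount of in-domain labeled data is available, a setting in which dense retrievers degrade significantly.
  • methods: An ELECTRA-based keyword extractor that denoises the mention context and builds a better query expression for sparse retrieval.
  • results: In experiments on the ZESHEL dataset, the proposed method outperforms existing models by a significant margin across all test domains, demonstrating the effectiveness of keyword-enhanced sparse retrieval.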
    Abstract Entity linking aims to link ambiguous mentions to their corresponding entities in a knowledge base. One of the key challenges comes from insufficient labeled data for specific domains. Although dense retrievers have achieved excellent performance on several benchmarks, their performance decreases significantly when only a limited amount of in-domain labeled data is available. In such a few-shot setting, we revisit the sparse retrieval method, and propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression. For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions. Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains, showing the effectiveness of keyword-enhanced sparse retrieval.
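    Summary Entity linking aims to link ambiguous mentions to their corresponding entities in a knowledge base. A key challenge is the shortage of labeled data for specific domains. Dense retrievers perform very well on several benchmarks, but their performance drops markedly when only a limited amount of in-domain labeled data is available. For this few-shot setting, the paper revisits sparse retrieval and proposes an ELECTRA-based keyword extractor that denoises the mention context and builds a better query expression. To train the extractor, the authors propose a distant supervision method that automatically generates training data from the tokens shared by mention contexts and entity descriptions. Experiments on the ZESHEL dataset show that the method outperforms state-of-the-art models by a significant margin across all test domains, demonstrating the effectiveness of keyword-enhanced sparse retrieval.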
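
    The distant supervision idea, labeling context tokens by their overlap with the entity description, can be sketched in a few lines. This is a simplified illustration under assumptions: the tokenizer, the small `STOPWORDS` set, and the function names are hypothetical, and the example sentence is invented. The paper's actual labeling rules may differ.

```python
import re

# Hypothetical stopword list; overlap on function words carries no signal.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "was", "by", "on", "for"}


def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


def distant_labels(mention_context: str, entity_description: str):
    """Label each context token 1 if it also occurs in the description."""
    description_vocab = set(tokenize(entity_description)) - STOPWORDS
    return [(tok, 1 if tok in description_vocab else 0) for tok in tokenize(mention_context)]


def build_query(mention: str, labeled_tokens):
    """Sparse-retrieval query: the mention plus the tokens labeled as keywords."""
    keywords = [tok for tok, label in labeled_tokens if label == 1]
    return " ".join([mention] + keywords)


context = "The captain ordered the starship to leave orbit"
description = "A starship captain in the fleet"

labeled = distant_labels(context, description)
query = build_query("Kirk", labeled)
print(labeled)
print(query)
```

    Labels produced this way would serve as training targets for the keyword extractor; at inference time the extractor, not the description overlap, selects the keywords, since the gold entity is unknown.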

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

  • paper_url: http://arxiv.org/abs/2310.12442
  • repo_url: None
  • paper_authors: Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao
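  • for: Keeping transformer performance on natural language processing tasks with long sequences while lowering computational cost.
  • methods: The paper proposes MASFormer, a transformer variant with mixed attention spans that combines full attention and sparse attention to improve computational efficiency.
  • results: On natural language modeling and generation tasks, MASFormer achieves performance competitive with vanilla full-attention transformers while reducing computational cost by up to 75%.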
    Abstract Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs high computational cost - quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASFormer only employs sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention while significantly reducing computational cost (up to 75%). Additionally, we investigate the effectiveness of continual training with long sequence data and how sequence length impacts downstream generation performance, which may be of independent interest.
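    Summary Pretrained transformer models perform remarkably well across natural language processing tasks. They use the attention mechanism to capture long- and short-range dependencies in a sequence. Full attention, however, has a computational cost that is quadratic in the sequence length, which is unaffordable for tasks with long sequences. Sparse attention improves efficiency but limits modeling capacity and often fails to capture complicated dependencies in long sequences. To address this, the paper proposes MASFormer, an easy-to-implement transformer variant with mixed attention spans. MASFormer uses full attention to capture long-range dependencies, but only at a small number of layers; the remaining layers use sparse attention to capture short-range dependencies. Experiments show that a decoder-only MASFormer with 1.3B parameters achieves performance competitive with vanilla full-attention transformers while significantly reducing computational cost (by up to 75%). The paper also studies continual training on long-sequence data and how sequence length affects downstream generation performance, which may be of independent interest.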
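
    The mixed-span idea can be illustrated with attention masks alone. The sketch below is a toy under stated assumptions: the function names are hypothetical, sliding-window attention stands in for "sparse attention", and the choice of which layer gets full attention (`full_layers={3}`) is arbitrary here. Counting allowed query-key pairs gives a rough proxy for attention cost.

```python
def causal_mask(n, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives full causal attention; otherwise each position sees
    only itself and the previous (window - 1) positions.
    """
    return [
        [(j <= i) and (window is None or i - j < window) for j in range(n)]
        for i in range(n)
    ]


def layer_masks(num_layers, full_layers, n, window):
    """Full attention on the layers in full_layers, windowed attention elsewhere."""
    return [
        causal_mask(n, None if layer in full_layers else window)
        for layer in range(num_layers)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows: a proxy for attention cost."""
    return sum(sum(row) for row in mask)


seq_len, window, num_layers = 8, 2, 4
mixed = layer_masks(num_layers, full_layers={3}, n=seq_len, window=window)
all_full = layer_masks(num_layers, full_layers=set(range(num_layers)), n=seq_len, window=window)

mixed_cost = sum(attended_pairs(m) for m in mixed)
full_cost = sum(attended_pairs(m) for m in all_full)
print(mixed_cost, full_cost)
```

    With these toy sizes the mixed configuration allows 81 pairs against 144 for full attention everywhere. The gap widens with sequence length, because the windowed layers grow linearly while full-attention layers grow quadratically.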

DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond

  • paper_url: http://arxiv.org/abs/2310.12430
  • repo_url: https://github.com/alibabaresearch/advancedliteratemachinery
  • paper_authors: Cong Yao
  • for: The paper is written for document parsing and structured representation of unstructured documents.
  • methods: The paper proposes a powerful open-source toolchain called DocXChain, which includes basic capabilities such as text detection, text recognition, table structure recognition, and layout analysis, as well as fully functional pipelines for document parsing, including general text reading, table parsing, and document structurization.
  • results: The paper demonstrates the effectiveness of DocXChain in automatically converting rich information embodied in unstructured documents into structured representations that are readable and manipulable by machines. The paper also shows that DocXChain is concise, modularized, and flexible, and can be readily integrated with existing tools, libraries, or models to construct more powerful systems for various applications related to documents in real-world scenarios.
    Abstract In this report, we introduce DocXChain, a powerful open-source toolchain for document parsing, which is designed and developed to automatically convert the rich information embodied in unstructured documents, such as text, tables and charts, into structured representations that are readable and manipulable by machines. Specifically, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. Upon these basic capabilities, we also build a set of fully functional pipelines for document parsing, i.e., general text reading, table parsing, and document structurization, to drive various applications related to documents in real-world scenarios. Moreover, DocXChain is concise, modularized and flexible, such that it can be readily integrated with existing tools, libraries or models (such as LangChain and ChatGPT), to construct more powerful systems that can accomplish more complicated and challenging tasks. The code of DocXChain is publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain
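    Summary The report introduces DocXChain, a powerful open-source toolchain for document parsing that automatically converts the rich information in unstructured documents, such as text, tables, and charts, into structured, machine-readable and machine-manipulable representations. It provides basic capabilities including text detection, text recognition, table structure recognition, and layout analysis. On top of these, the authors build complete document parsing pipelines, namely general text reading, table parsing, and document structurization, to drive document applications in real-world scenarios. DocXChain is concise, modular, and flexible, so it can be readily integrated with existing tools, libraries, or models (such as LangChain and ChatGPT) to build more powerful systems for more complicated and challenging tasks. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/Applications/DocXChain.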

The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions

  • paper_url: http://arxiv.org/abs/2310.12418
  • repo_url: https://github.com/ozyyshr/ShareGPT_investigation
  • paper_authors: Siru Ouyang, Shuohang Wang, Yang Liu, Ming Zhong, Yizhu Jiao, Dan Iter, Reid Pryzant, Chenguang Zhu, Heng Ji, Jiawei Han
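  • for: This study examines whether current research on large language models (LLMs) accurately reflects what users need.
  • methods: The study uses a large-scale collection of user-GPT conversations to analyze the divergence between current NLP research and user needs.
  • results: Tasks that users frequently request, such as "design" and "planning", are largely neglected in academic research or differ from traditional NLP benchmarks. The paper also investigates the practical challenges these overlooked tasks pose and how to align LLMs better with user needs.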
    Abstract Recent progress in Large Language Models (LLMs) has produced models that exhibit remarkable performance across a variety of NLP tasks. However, it remains unclear whether the existing focus of NLP research accurately captures the genuine requirements of human users. This paper provides a comprehensive analysis of the divergence between current NLP research and the needs of real-world NLP applications via a large-scale collection of user-GPT conversations. We analyze a large-scale collection of real user queries to GPT. We compare these queries against existing NLP benchmark tasks and identify a significant gap between the tasks that users frequently request from LLMs and the tasks that are commonly studied in academic research. For example, we find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected or different from traditional NLP benchmarks. We investigate these overlooked tasks, dissect the practical challenges they pose, and provide insights toward a roadmap to make LLMs better aligned with user needs.

FinEntity: Entity-level Sentiment Classification for Financial Texts

  • paper_url: http://arxiv.org/abs/2310.12406
  • repo_url: https://github.com/yixuantt/finentity
  • paper_authors: Yixuan Tang, Yi Yang, Allen H Huang, Andy Tam, Justin Z Tang
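  • for: The paper introduces FinEntity, a new entity-level sentiment classification dataset that annotates financial entity spans and their sentiment (positive, neutral, negative) in financial news.
  • methods: The paper documents the dataset construction process and benchmarks several pre-trained models (BERT, FinBERT, etc.) and ChatGPT on entity-level sentiment classification.
  • results: A case study on monitoring cryptocurrency markets demonstrates the practical utility of FinEntity for assessing sentiment toward financial entities. Data and code are available on GitHub: https://github.com/yixuantt/FinEntity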
    Abstract In the financial domain, conducting entity-level sentiment analysis is crucial for accurately assessing the sentiment directed toward a specific financial entity. To our knowledge, no publicly available dataset currently exists for this purpose. In this work, we introduce an entity-level sentiment classification dataset, called FinEntity, that annotates financial entity spans and their sentiment (positive, neutral, and negative) in financial news. We document the dataset construction process in the paper. Additionally, we benchmark several pre-trained models (BERT, FinBERT, etc.) and ChatGPT on entity-level sentiment classification. In a case study, we demonstrate the practical utility of using FinEntity in monitoring cryptocurrency markets. The data and code of FinEntity are available at https://github.com/yixuantt/FinEntity
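    Summary In the financial domain, entity-level sentiment analysis is key to accurately assessing the sentiment directed at a specific financial entity. To the authors' knowledge, no publicly available dataset exists for this purpose. The paper introduces FinEntity, an entity-level sentiment classification dataset that annotates financial entity spans and their sentiment (positive, neutral, negative) in financial news, and describes how the dataset was constructed. The authors also benchmark several pre-trained models (such as BERT and FinBERT) and ChatGPT on entity-level sentiment classification. A case study shows the practical utility of FinEntity for monitoring cryptocurrency markets. The data and code are available at https://github.com/yixuantt/FinEntity.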
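
    Entity-level annotation differs from sentence-level sentiment in that one sentence can carry several entities with different labels. The sketch below shows one plausible in-memory shape for such annotations. The class name, field names, and example sentence are hypothetical and are not taken from the FinEntity repository, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EntityAnnotation:
    start: int  # character offset where the entity span begins
    end: int    # character offset just past the span
    label: str  # "positive", "neutral", or "negative"


def annotate(text: str, surface: str, label: str) -> EntityAnnotation:
    """Build an annotation for the first occurrence of surface in text."""
    start = text.index(surface)
    return EntityAnnotation(start, start + len(surface), label)


def entity_sentiments(text: str, annotations):
    """Return (entity surface form, sentiment label) pairs."""
    return [(text[a.start:a.end], a.label) for a in annotations]


news = "Bitcoin rallied while Acme Corp shares fell."
annotations = [
    annotate(news, "Bitcoin", "positive"),
    annotate(news, "Acme Corp", "negative"),
]

pairs = entity_sentiments(news, annotations)
label_counts = Counter(label for _, label in pairs)
print(pairs, dict(label_counts))
```

    A single sentence-level label would have to call this example mixed or neutral, whereas span-level labels keep the opposite sentiments toward the two entities apart.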

Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing

  • paper_url: http://arxiv.org/abs/2310.12404
  • repo_url: None
  • paper_authors: Yixiao Zhang, Akira Maezawa, Gus Xia, Kazuhiko Yamamoto, Simon Dixon
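  • for: This study aims to provide an interactive music creation system that helps users generate music and refine it step by step.
  • methods: The system uses a large language model to interpret user intentions and select appropriate AI models for task execution. Each backend model is specialized for a specific task, and their outputs are aggregated to meet the user's requirements.
  • results: Semi-structured interviews and questionnaires with participants indicate that the system is useful for music creation and has potential for broader applications.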
    Abstract Creating music is iterative, requiring varied methods at each stage. However, existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To address this gap, we introduce Loop Copilot, a novel system that enables users to generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpret user intentions and select appropriate AI models for task execution. Each backend model is specialized for a specific task, and their outputs are aggregated to meet the user's requirements. To ensure musical coherence, essential attributes are maintained in a centralized table. We evaluate the effectiveness of the proposed system through semi-structured interviews and questionnaires, highlighting its utility not only in facilitating music creation but also its potential for broader applications.
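    Summary Creating music is an iterative process that calls for different methods at each stage, yet existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To close this gap, the paper introduces Loop Copilot, a system that lets users generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpret user intentions and select the appropriate AI model for each task. Each backend model is specialized for a specific task, and the outputs are aggregated to meet the user's requirements. To keep the music coherent, essential attributes are maintained in a centralized table. The authors evaluate the system through semi-structured interviews and questionnaires, highlighting its utility for music creation and its potential for broader applications.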
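
    The control flow described above, interpret a request, pick a specialized backend, and keep shared attributes in one table, can be sketched as follows. This is a toy under stated assumptions: `pick_backend` is a keyword rule standing in for the large language model, the three backends and the attribute names ("key", "tracks", "tempo") are invented for the example, and none of this reflects the system's real interfaces.

```python
from typing import Callable, Dict

AttributeTable = Dict[str, str]


def pick_backend(request: str) -> str:
    """Keyword stand-in for the LLM that interprets user intent."""
    text = request.lower()
    if "tempo" in text or "bpm" in text:
        return "set_tempo"
    if "add" in text:
        return "add_track"
    return "generate"


def generate(request: str, table: AttributeTable) -> AttributeTable:
    return {**table, "tracks": "base loop"}


def add_track(request: str, table: AttributeTable) -> AttributeTable:
    instrument = request.lower().split("add", 1)[1].strip()
    return {**table, "tracks": table.get("tracks", "") + " + " + instrument}


def set_tempo(request: str, table: AttributeTable) -> AttributeTable:
    digits = "".join(ch for ch in request if ch.isdigit())
    return {**table, "tempo": digits}


BACKENDS: Dict[str, Callable[[str, AttributeTable], AttributeTable]] = {
    "generate": generate,
    "add_track": add_track,
    "set_tempo": set_tempo,
}


def run_dialogue(requests) -> AttributeTable:
    """Apply each round's backend to the shared attribute table."""
    table: AttributeTable = {"key": "C minor"}  # toy initial attribute
    for request in requests:
        table = BACKENDS[pick_backend(request)](request, table)
    return table


final_table = run_dialogue(["make a lo-fi loop", "add drums", "set tempo to 90 bpm"])
print(final_table)
```

    Each backend returns an updated copy of the table, so attributes set in earlier rounds, such as the key, persist across later edits. That is the role the abstract assigns to the centralized table.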