methods: The paper uses five high-resource languages and two NLP tasks to study ChatGPT's performance and the accuracy of its self-reported confidence.
results: The results show that all selected high-resource languages perform similarly, and that ChatGPT's confidence calibration is poor: it is often overconfident and never gives low confidence values.
Abstract
ChatGPT took the world by storm for its impressive abilities. Due to its release without documentation, scientists immediately attempted to identify its limits, mainly through its performance in natural language processing (NLP) tasks. This paper aims to join the growing literature regarding ChatGPT's abilities by focusing on its performance in high-resource languages and on its capacity to predict its answers' accuracy by giving a confidence level. The analysis of high-resource languages is of interest as studies have shown that low-resource languages perform worse than English in NLP tasks, but no study so far has analysed whether high-resource languages perform as well as English. The analysis of ChatGPT's confidence calibration has not been carried out before either and is critical to learn about ChatGPT's trustworthiness. In order to study these two aspects, five high-resource languages and two NLP tasks were chosen. ChatGPT was asked to perform both tasks in the five languages and to give a numerical confidence value for each answer. The results show that all the selected high-resource languages perform similarly and that ChatGPT does not have a good confidence calibration, often being overconfident and never giving low confidence values.
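As a rough illustration of what "confidence calibration" means operationally, the sketch below computes Expected Calibration Error (ECE) over self-reported confidences. The function and data are illustrative, not the paper's protocol.

```python
# Hypothetical sketch: quantifying confidence calibration with Expected
# Calibration Error (ECE), assuming we have collected, for each answer,
# a self-reported confidence in [0, 1] and a correctness flag.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# An overconfident model: high stated confidence, mediocre accuracy.
print(expected_calibration_error([0.9, 0.95, 0.9, 0.85], [1, 0, 0, 1]))
```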
Knowledge Graphs are not Created Equal: Exploring the Properties and Structure of Real KGs
results: The study identifies structural patterns and properties across many KGs and makes several recommendations for KG-based model development and evaluation.
Abstract
Despite the recent popularity of knowledge graph (KG) related tasks and benchmarks such as KG embeddings, link prediction, entity alignment and evaluation of the reasoning abilities of pretrained language models as KGs, the structure and properties of real KGs are not well studied. In this paper, we perform a large scale comparative study of 29 real KG datasets from diverse domains such as the natural sciences, medicine, and NLP to analyze their properties and structural patterns. Based on our findings, we make several recommendations regarding KG-based model development and evaluation. We believe that the rich structural information contained in KGs can benefit the development of better KG models across fields and we hope this study will contribute to breaking the existing data silos between different areas of research (e.g., ML, NLP, AI for sciences).
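A minimal sketch of the kind of structural profiling the paper performs, assuming a KG given as (head, relation, tail) triples; this is illustrative, not the authors' code.

```python
# Illustrative: computing a few structural properties of a KG represented
# as (head, relation, tail) triples, using networkx.
import networkx as nx

triples = [("aspirin", "treats", "headache"),
           ("aspirin", "is_a", "drug"),
           ("headache", "symptom_of", "migraine")]

g = nx.MultiDiGraph()
for h, r, t in triples:
    g.add_edge(h, t, relation=r)

print("entities:", g.number_of_nodes())
print("triples:", g.number_of_edges())
print("relations:", len({d["relation"] for _, _, d in g.edges(data=True)}))
degrees = [d for _, d in g.degree()]
print("mean degree:", sum(degrees) / len(degrees))
```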
Analyzing Modular Approaches for Visual Question Decomposition
results: The study finds that ViperGPT's reported gains come mainly from its selection of task-specific modules rather than from the BLIP-2 model, and that ViperGPT retains much of its performance even when its module selection is substantially altered. Moreover, on some benchmarks, modular approaches outperform prompting-based methods because they can represent subtasks with natural language.
Abstract
Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from and how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes vs. additional symbolic components. To do so, we conduct a controlled study (comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks). We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules, and when we run ViperGPT using a more task-agnostic selection of modules, these gains go away. Additionally, ViperGPT retains much of its performance if we make prominent alterations to its selection of modules: e.g. removing or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches significantly benefit by representing subtasks with natural language, instead of code.
Autoregressive Language Models For Estimating the Entropy of Epic EHR Audit Logs
paper_authors: Benjamin C. Warner, Thomas Kannampallil, Seunghwan Kim
for: This study aims to characterize clinician workflow on the electronic health record (EHR) through EHR audit logs.
methods: The study uses a transformer-based tabular language model (tabular LM) to measure the entropy, or disorderedness, of action sequences within workflow.
results: The study finds that the tabular LM can measure the complexity of action sequences within workflow, and the evaluated models are released publicly for future research.
Abstract
EHR audit logs are a highly granular stream of events that capture clinician activities, and are a significant area of interest for research in characterizing clinician workflow on the electronic health record (EHR). Existing techniques to measure the complexity of workflow through EHR audit logs (audit logs) involve time- or frequency-based cross-sectional aggregations that are unable to capture the full complexity of an EHR session. We briefly evaluate the usage of a transformer-based tabular language model (tabular LM) in measuring the entropy or disorderedness of action sequences within workflow and release the evaluated models publicly.
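The core measurement idea, sketched under simplifying assumptions: score an action sequence by its average negative log-likelihood under an autoregressive LM. The paper uses a tabular LM over EHR audit-log events; GPT-2 over a serialized session is only a stand-in here.

```python
# Rough sketch of the idea, not the paper's tabular LM: score an action
# sequence's disorderedness as its average negative log-likelihood
# (entropy in nats per token) under an autoregressive language model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Audit-log actions serialized as text (a simplification; the paper
# treats EHR audit logs as tabular event streams).
session = "open_chart review_labs order_med close_chart"
ids = tokenizer(session, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
print(f"entropy estimate: {loss.item():.2f} nats/token")
```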
Distilling Large Language Models using Skill-Occupation Graph Context for HR-Related Tasks
for: This paper aims to bridge the gap in HR applications by introducing a benchmark for various HR tasks, including matching and explaining resumes to job descriptions, extracting skills and experiences from resumes, and editing resumes.
methods: The benchmark is created by distilling domain-specific knowledge from a large language model (LLM) and relying on a curated skill-occupation graph to ensure diversity and provide context for LLM generation.
results: The student models achieve near or better performance than the teacher model (GPT-4) across various HR tasks, and the benchmark is effective on out-of-distribution data for skill extraction and resume-job description matching in zero-shot and weak-supervision settings.
Abstract
Numerous HR applications are centered around resumes and job descriptions. While they can benefit from advancements in NLP, particularly large language models, their real-world adoption faces challenges due to the absence of comprehensive benchmarks for various HR tasks, and the lack of smaller models with competitive capabilities. In this paper, we aim to bridge this gap by introducing the Resume-Job Description Benchmark (RJDB). We meticulously craft this benchmark to cater to a wide array of HR tasks, including matching and explaining resumes to job descriptions, extracting skills and experiences from resumes, and editing resumes. To create this benchmark, we propose to distill domain-specific knowledge from a large language model (LLM). We rely on a curated skill-occupation graph to ensure diversity and provide context for LLM generation. Our benchmark includes over 50 thousand triples of job descriptions, matched resumes and unmatched resumes. Using RJDB, we train multiple smaller student models. Our experiments reveal that the student models achieve near/better performance than the teacher model (GPT-4), affirming the effectiveness of the benchmark. Additionally, we explore the utility of RJDB on out-of-distribution data for skill extraction and resume-job description matching, in a zero-shot and weak-supervision manner. We release our datasets and code to foster further research and industry applications.
Transfer Learning for Structured Pruning under Limited Task Data
results: Experimental results show that the framework yields pruned models with better generalization than strong baselines.
Abstract
Large, pre-trained models are problematic to use in resource constrained applications. Fortunately, task-aware structured pruning methods offer a solution. These approaches reduce model size by dropping structural units like layers and attention heads in a manner that takes into account the end-task. However, these pruning algorithms require more task-specific data than is typically available. We propose a framework which combines structured pruning with transfer learning to reduce the need for task-specific data. Our empirical results answer questions such as: How should the two tasks be coupled? What parameters should be transferred? And, when during training should transfer learning be introduced? Leveraging these insights, we demonstrate that our framework results in pruned models with improved generalization over strong baselines.
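For concreteness, a minimal sketch of structured pruning of attention heads using the transformers API; the paper's task-aware criterion for choosing which heads to drop, and its transfer-learning coupling, are not reproduced here.

```python
# Minimal illustration of dropping structural units (here, attention
# heads) from a pre-trained model; the choice of heads below is
# arbitrary, not a task-aware criterion.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print("params before:", model.num_parameters())

# Remove heads 0 and 1 in layer 0, and head 2 in layer 1.
model.prune_heads({0: [0, 1], 1: [2]})
print("params after:", model.num_parameters())
```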
results: DEMUX outperforms strong baselines in 84% of test cases in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), gains of up to 8-11 F1 points are observed for token-level tasks, and 2-5 F1 points for complex tasks. Code: https://github.com/simran-khanuja/demux.
Abstract
We consider the task of optimally fine-tuning pre-trained multilingual models, given small amounts of unlabelled target data and an annotation budget. In this paper, we introduce DEMUX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. Unlike most prior works, our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations. Our active learning strategies rely upon distance and uncertainty measures to select task-specific neighbors that are most informative to label, given a model. DeMuX outperforms strong baselines in 84% of the test cases, in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), we observe gains of up to 8-11 F1 points for token-level tasks, and 2-5 F1 for complex tasks. Our code is released here: https://github.com/simran-khanuja/demux.
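A simplified, hypothetical sketch of the distance-based half of such a selection strategy: pick the unlabelled points whose representations are nearest to the target set. DEMUX additionally uses uncertainty measures and task-specific neighbor definitions, which are omitted here.

```python
# Simplified, hypothetical version of distance-based selection: label
# the unlabelled points whose model representations are closest to the
# target set (uncertainty measures omitted).
import numpy as np

rng = np.random.default_rng(0)
target_emb = rng.normal(size=(20, 768))        # embeddings of target data
unlabelled_emb = rng.normal(size=(1000, 768))  # embeddings of candidate pool
budget = 10

# Distance from each candidate to its nearest target point.
dists = np.linalg.norm(
    unlabelled_emb[:, None, :] - target_emb[None, :, :], axis=-1
).min(axis=1)
to_label = np.argsort(dists)[:budget]  # most target-like candidates
print(to_label)
```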
Heaps’ Law in GPT-Neo Large Language Model Emulated Corpora
results: The study finds that the generated abstracts adhere to Heaps' law, and that as the parameter size of the GPT-Neo model grows, the generated vocabulary increasingly adheres to Heaps' law, as observed in human-authored text.
Abstract
Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to large language model generated text remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law, as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize enhancing model size or refining the model architecture to curtail vocabulary repetition.
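Heaps' law states V(n) ≈ K·n^β, where V is the vocabulary size after the first n tokens. A quick, illustrative way to check it on any token stream is to fit a line in log-log space; the toy corpus below is invented.

```python
# Sketch: checking Heaps' law V(n) ~ K * n^beta on a token stream by
# fitting a line in log-log space (illustrative, not the paper's code).
import numpy as np

tokens = ("the cat sat on the mat and the dog sat on the log " * 50).split()
ns, vs, seen = [], [], set()
for i, tok in enumerate(tokens, start=1):
    seen.add(tok)
    ns.append(i)       # corpus size so far
    vs.append(len(seen))  # vocabulary size so far

beta, log_k = np.polyfit(np.log(ns), np.log(vs), 1)
print(f"estimated beta={beta:.2f}, K={np.exp(log_k):.2f}")
```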
Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach
paper_authors: Maxime Delmas, Magdalena Wysocka, André Freitas
for: This paper aims to address the issue of limited labeled data in relation extraction tasks, specifically in the context of natural products literature.
methods: The authors developed a new sampler inspired by diversity metrics in ecology, called the Greedy Maximum Entropy sampler (GME-sampler), to curate an evaluation dataset for relation extraction models. They also explored few-shot learning with open large language models (LLaMA 7B-65B) and synthetic data generation using Vicuna-13B.
results: The authors achieved substantial improvements in relation extraction performance when fine-tuning models on synthetic abstracts rather than the noisy original data. Their best-performing model, BioGPT-Large, achieved an f1-score of 59.0. They also provide the generated synthetic data and the evaluation dataset for future use.
Abstract
The sparsity of labelled data is an obstacle to the development of Relation Extraction models and the completion of databases in various biomedical areas. While being of high interest in drug-discovery, the natural-products literature, reporting the identification of potential bioactive compounds from organisms, is a concrete example of such an overlooked topic. To mark the start of this new task, we created the first curated evaluation dataset and extracted literature items from the LOTUS database to build training sets. To this end, we developed a new sampler inspired by diversity metrics in ecology, named Greedy Maximum Entropy sampler, or GME-sampler (https://github.com/idiap/gme-sampler). The strategic optimization of both balance and diversity of the selected items in the evaluation set is important given the resource-intensive nature of manual curation. After quantifying the noise in the training set, in the form of discrepancies between the input abstracts text and the expected output labels, we explored different strategies accordingly. Framing the task as an end-to-end Relation Extraction, we evaluated the performance of standard fine-tuning as a generative task and few-shot learning with open Large Language Models (LLaMA 7B-65B). In addition to their evaluation in few-shot settings, we explore the potential of open Large Language Models (Vicuna-13B) as synthetic data generator and propose a new workflow for this purpose. All evaluated models exhibited substantial improvements when fine-tuned on synthetic abstracts rather than the original noisy data. We provide our best performing (f1-score=59.0) BioGPT-Large model for end-to-end RE of natural-products relationships along with all the generated synthetic data and the evaluation dataset. See more details at https://github.com/idiap/abroad-re.
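One plausible reading of greedy maximum-entropy sampling, heavily simplified (the GME-sampler repository linked above is the authoritative source): greedily add items so as to maximize the Shannon entropy of the selected set's label distribution, which balances the classes represented.

```python
# Heavily simplified reading of greedy maximum-entropy sampling; see the
# GME-sampler repo for the real algorithm. We greedily add the item
# whose label most increases the Shannon entropy of the selected set's
# label distribution.
from collections import Counter
from math import log

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * log(c / n) for c in counts.values() if c)

def gme_select(items, labels, k):
    selected, counts = [], Counter()
    pool = list(range(len(items)))
    for _ in range(k):
        best = max(pool, key=lambda i: entropy(counts + Counter([labels[i]])))
        selected.append(items[best])
        counts[labels[best]] += 1
        pool.remove(best)
    return selected

print(gme_select(["a", "b", "c", "d"], ["x", "x", "y", "z"], 3))
```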
methods: The study compares definitions from three published dictionaries with definitions generated by variants of ChatGPT.
results: The study finds that (i) definitions from different traditional dictionaries exhibit more surface-form similarity than model-generated definitions; (ii) ChatGPT definitions are highly accurate, comparable to traditional dictionaries; and (iii) ChatGPT-based definitions retain their accuracy even on low-frequency words, much better than GloVe and FastText word embeddings.
Abstract
Dictionary definitions are historically the arbitrator of what words mean, but this primacy has come under threat by recent progress in NLP, including word embeddings and generative models like ChatGPT. We present an exploratory study of the degree of alignment between word definitions from classical dictionaries and these newer computational artifacts. Specifically, we compare definitions from three published dictionaries to those generated from variants of ChatGPT. We show that (i) definitions from different traditional dictionaries exhibit more surface form similarity than do model-generated definitions, (ii) that the ChatGPT definitions are highly accurate, comparable to traditional dictionaries, and (iii) ChatGPT-based embedding definitions retain their accuracy even on low frequency words, much better than GloVE and FastText word embeddings.
Schema Graph-Guided Prompt for Multi-Domain Dialogue State Tracking
results: Our experiments show that the graph-based method outperforms other multi-domain DST approaches while using similar or fewer trainable parameters. We also conduct a comprehensive study of schema graph architectures, parameter usage, and module ablation to demonstrate the effectiveness of our model on multi-domain dialogue state tracking.
Abstract
Tracking dialogue states is an essential topic in task-oriented dialogue systems, which involve filling in the necessary information in pre-defined slots corresponding to a schema. While general pre-trained language models have been shown effective in slot-filling, their performance is limited when applied to specific domains. We propose a graph-based framework that learns domain-specific prompts by incorporating the dialogue schema. Specifically, we embed domain-specific schema encoded by a graph neural network into the pre-trained language model, which allows for relations in the schema to guide the model for better adaptation to the specific domain. Our experiments demonstrate that the proposed graph-based method outperforms other multi-domain DST approaches while using similar or fewer trainable parameters. We also conduct a comprehensive study of schema graph architectures, parameter usage, and module ablation that demonstrate the effectiveness of our model on multi-domain dialogue state tracking.
Argumentation Element Annotation Modeling using XLNet
paper_authors: Christopher Ormerod, Amy Burkhardt, Mackenzie Young, Sue Lottridge
for: This paper demonstrates the effectiveness of XLNet for annotating argumentative elements in persuasive essays, providing automated feedback on essay organization.
methods: The paper uses XLNet, a transformer-based language model, with a recurrent mechanism to model long-term dependencies in lengthy texts. The model is fine-tuned on three datasets annotated with different schemes.
results: The XLNet models achieved strong performance across all datasets, even surpassing human agreement levels in some cases. The paper highlights the suitability of XLNet for providing automated feedback on essay organization, and provides insights into the relationships between the annotation tags.
Abstract
This study demonstrates the effectiveness of XLNet, a transformer-based language model, for annotating argumentative elements in persuasive essays. XLNet's architecture incorporates a recurrent mechanism that allows it to model long-term dependencies in lengthy texts. Fine-tuned XLNet models were applied to three datasets annotated with different schemes - a proprietary dataset using the Annotations for Revisions and Reflections on Writing (ARROW) scheme, the PERSUADE corpus, and the Argument Annotated Essays (AAE) dataset. The XLNet models achieved strong performance across all datasets, even surpassing human agreement levels in some cases. This shows XLNet capably handles diverse annotation schemes and lengthy essays. Comparisons between the model outputs on different datasets also revealed insights into the relationships between the annotation tags. Overall, XLNet's strong performance on modeling argumentative structures across diverse datasets highlights its suitability for providing automated feedback on essay organization.
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
results: The paper presents a grounded theory of how and why people attack large language models, connecting practitioners' motivations and goals, the strategies and techniques they deploy, and the crucial role the community plays.
Abstract
Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail. We relate and connect this activity between its practitioners' motivations and goals; the strategies and techniques they deploy; and the crucial role the community plays. As a result, this paper presents a grounded theory of how and why people attack large language models: LLM red teaming in the wild.
A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words?
results: The study finds that a word's importance depends on the domain, and that no standout lexical entries systematically cause the differences in sentiment scores.
Abstract
Lexicon-based approaches to sentiment analysis of text are based on each word or lexical entry having a pre-defined weight indicating its sentiment polarity. These are usually manually assigned, but their accuracy when compared against machine learning based approaches to computing sentiment is not known. It may be that there are lexical entries whose sentiment values cause a lexicon-based approach to give results which are very different to a machine learning approach. In this paper we compute sentiment for more than 150,000 English language texts drawn from 4 domains using the Hedonometer, a lexicon-based technique, and Azure, a contemporary machine-learning based approach which is part of the Azure Cognitive Services family of APIs and is easy to use. We model differences in sentiment scores between approaches for documents in each domain using a regression and analyse the independent variables (Hedonometer lexical entries) as indicators of each word's importance and contribution to the score differences. Our findings are that the importance of a word depends on the domain and there are no standout lexical entries which systematically cause differences in sentiment scores.
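A toy sketch of the lexicon-based principle described above: a text's sentiment is the average of pre-assigned per-word weights. The Hedonometer works on per-word happiness scores of this kind; the weights below are invented.

```python
# Toy illustration of lexicon-based sentiment: average the pre-defined
# weights of the words that appear in the lexicon (weights are made up).
lexicon = {"love": 8.4, "happy": 8.3, "ok": 5.0, "sad": 2.4, "hate": 2.2}

def lexicon_sentiment(text):
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else None

print(lexicon_sentiment("I love this but the ending made me sad"))
```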
results: The paper shows that this Hopf-algebra-based model can help resolve some recent controversies about the implications of large language models for generative linguistics, and provides a new way to describe the extraction of meaning from syntactic expressions.
Abstract
We extend our formulation of Merge and Minimalism in terms of Hopf algebras to an algebraic model of a syntactic-semantic interface. We show that methods adopted in the formulation of renormalization (extraction of meaningful physical values) in theoretical physics are relevant to describe the extraction of meaning from syntactic expressions. We show how this formulation relates to computational models of semantics and we answer some recent controversies about implications for generative linguistics of the current functioning of large language models.
Is it indeed bigger better? The comprehensive study of claim detection LMs applied for disinformation tackling
paper_authors: Martin Hyben, Sebastian Kula, Ivan Srba, Robert Moro, Jakub Simko
for: Compares the performance of fine-tuned models and extremely large language models on the task of check-worthy claim detection.
methods: Uses a multilingual and multi-topical dataset and benchmark analysis to determine the most general multilingual and multi-topical claim detector.
results: Despite technological progress in natural language processing, fine-tuned models still outperform zero-shot approaches in cross-domain settings.
Abstract
This study compares the performance of (1) fine-tuned models and (2) extremely large language models on the task of check-worthy claim detection. For the purpose of the comparison we composed a multilingual and multi-topical dataset comprising texts of various sources and styles. Building on this, we performed a benchmark analysis to determine the most general multilingual and multi-topical claim detector. We chose three state-of-the-art models in the check-worthy claim detection task and fine-tuned them. Furthermore, we selected three state-of-the-art extremely large language models without any fine-tuning. We made modifications to adapt the models for multilingual settings and carried out extensive experimentation and evaluation. We assessed the performance of all the models in terms of accuracy, recall, and F1-score in in-domain and cross-domain scenarios. Our results demonstrate that despite the technological progress in the area of natural language processing, the models fine-tuned for the task of check-worthy claim detection still outperform the zero-shot approaches in cross-domain settings.
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
results: The study finds that existing membership inference techniques cannot effectively reveal privacy leakage from practical fine-tuned large language models (LLMs). This is because existing methods rely on the hypothesis that training records have a higher probability of being sampled, an assumption that is weakened by the regularization applied during training and by the generalization of LLMs.
Abstract
Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.
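For intuition, here is a simplified sketch of the reference-based membership signal the abstract discusses: compare the record's likelihood under the target model and under a reference model. SPV-MIA's probabilistic-variation signal and self-prompted reference construction are not reproduced; the models and threshold below are stand-ins.

```python
# Simplified sketch of a reference-based MIA signal: a record is flagged
# as a training member when the target model assigns it a much higher
# likelihood (lower NLL) than a reference model does.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
target = GPT2LMHeadModel.from_pretrained("gpt2").eval()           # stand-in
reference = GPT2LMHeadModel.from_pretrained("distilgpt2").eval()  # stand-in

def nll(model, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

record = "Patient X was prescribed 20mg of drug Y."
signal = nll(reference, record) - nll(target, record)
print("member" if signal > 0.5 else "non-member", signal)  # 0.5: toy threshold
```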
Multi-Label Topic Model for Financial Textual Data
results: The author finds that stock market reactions differ across topics and their combinations. For example, announcements of new Large Scale Projects or Bankruptcy Filings trigger strong positive or negative market reactions, while some other topics show no significant price effects. Moreover, in contrast to previous studies, the multi-label structure allows an analysis of the effects of co-occurring topics on stock market reactions.
Abstract
This paper presents a multi-label topic model for financial texts like ad-hoc announcements, 8-K filings, finance related news or annual reports. I train the model on a new financial multi-label database consisting of 3,044 German ad-hoc announcements that are labeled manually using 20 predefined, economically motivated topics. The best model achieves a macro F1 score of more than 85%. Translating the data results in an English version of the model with similar performance. As application of the model, I investigate differences in stock market reactions across topics. I find evidence for strong positive or negative market reactions for some topics, like announcements of new Large Scale Projects or Bankruptcy Filings, while I do not observe significant price effects for some other topics. Furthermore, in contrast to previous studies, the multi-label structure of the model allows to analyze the effects of co-occurring topics on stock market reactions. For many cases, the reaction to a specific topic depends heavily on the co-occurrence with other topics. For example, if allocated capital from a Seasoned Equity Offering (SEO) is used for restructuring a company in the course of a Bankruptcy Proceeding, the market reacts positively on average. However, if that capital is used for covering unexpected, additional costs from the development of new drugs, the SEO implies negative reactions on average.
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences
results: On real-world tasks such as information extraction, question answering, and dialogue generation, ChiMed-GPT significantly outperforms general-domain language models. The authors also analyze possible biases by prompting the model with attitude scales regarding discrimination of patients, contributing to the responsible development of LLMs in the medical domain.
Abstract
Recently, the increasing demand for superior medical services has highlighted the discrepancies in the medical infrastructure. With big data, especially texts, forming the foundation of medical services, there is an exigent need for effective natural language processing (NLP) solutions tailored to the healthcare domain. Conventional approaches leveraging pre-trained models present promising results in this domain and current large language models (LLMs) offer advanced foundation for medical text processing. However, most medical LLMs are trained only with supervised fine-tuning (SFT), even though it efficiently empowers LLMs to understand and respond to medical instructions but is ineffective in learning domain knowledge and aligning with human preference. Another engineering barrier that prevents current medical LLM from better text processing ability is their restricted context length (e.g., 2,048 tokens), making it hard for the LLMs to process long context, which is frequently required in the medical domain. In this work, we propose ChiMed-GPT, a new benchmark LLM designed explicitly for Chinese medical domain, with enlarged context length to 4,096 tokens and undergoes a comprehensive training regime with pre-training, SFT, and RLHF. Evaluations on real-world tasks including information extraction, question answering, and dialogue generation demonstrate ChiMed-GPT's superior performance over general domain LLMs. Furthermore, we analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients, so as to contribute to further responsible development of LLMs in the medical domain. The code and model are released at https://github.com/synlp/ChiMed-GPT.
Large Language Models are Zero Shot Hypothesis Proposers
for: investigate whether LLMs can propose scientific hypotheses
methods: construct a dataset of background knowledge and hypothesis pairs from biomedical literature, evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings
results: LLMs surprisingly generate untrained yet validated hypotheses from testing literature, and increasing uncertainty facilitates candidate generation, potentially enhancing zero-shot hypothesis generation capabilities.
Abstract
Significant scientific discoveries have driven the progress of human civilisation. The explosion of scientific literature and data has created information barriers across disciplines that have slowed the pace of scientific discovery. Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down these information barriers and foster a new wave of scientific discovery. However, the potential of LLMs for scientific discovery has not been formally explored. In this paper, we start by investigating whether LLMs can propose scientific hypotheses. To this end, we construct a dataset consisting of background knowledge and hypothesis pairs from biomedical literature. The dataset is divided into training, seen, and unseen test sets based on the publication date to control visibility. We subsequently evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings, including both closed and open-source LLMs. Additionally, we introduce an LLM-based multi-agent cooperative framework with different role designs and external tools to enhance the capabilities related to generating hypotheses. We also design four metrics through a comprehensive review to evaluate the generated hypotheses for both ChatGPT-based and human evaluations. Through experiments and analyses, we arrive at the following findings: 1) LLMs surprisingly generate untrained yet validated hypotheses from testing literature. 2) Increasing uncertainty facilitates candidate generation, potentially enhancing zero-shot hypothesis generation capabilities. These findings strongly support the potential of LLMs as catalysts for new scientific discoveries and guide further exploration.
Chain of Thought with Explicit Evidence Reasoning for Few-shot Relation Extraction
results: The paper proposes a novel approach named CoT-ER (chain-of-thought with explicit evidence reasoning), which uses large language models to generate evidence and then explicitly incorporates it into chain-of-thought prompting for relation extraction. Experimental results show that CoT-ER achieves competitive performance on the FewRel1.0 and FewRel2.0 datasets compared with the fully-supervised (100% training data) state-of-the-art approach.
Abstract
Few-shot relation extraction involves identifying the type of relationship between two specific entities within a text, using a limited number of annotated samples. A variety of solutions to this problem have emerged by applying meta-learning and neural graph techniques which typically necessitate a training process for adaptation. Recently, the strategy of in-context learning has been demonstrating notable results without the need for training. A few studies have already utilized in-context learning for zero-shot information extraction. Unfortunately, the evidence for inference is either not considered or implicitly modeled during the construction of chain-of-thought prompts. In this paper, we propose a novel approach for few-shot relation extraction using large language models, named CoT-ER, chain-of-thought with explicit evidence reasoning. In particular, CoT-ER first induces large language models to generate evidences using task-specific and concept-level knowledge. Then these evidences are explicitly incorporated into chain-of-thought prompting for relation extraction. Experimental results demonstrate that our CoT-ER approach (with 0% training data) achieves competitive performance compared to the fully-supervised (with 100% training data) state-of-the-art approach on the FewRel1.0 and FewRel2.0 datasets.
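An illustrative prompt skeleton (not the paper's exact template) for evidence-first chain-of-thought relation extraction; all field names and the example below are invented.

```python
# Illustrative evidence-first CoT prompt for relation extraction.
prompt = """Sentence: {sentence}
Head entity: {head}   Tail entity: {tail}

Step 1 - Evidence about the head entity: {head} is a {head_concept}.
Step 2 - Evidence about the tail entity: {tail} is a {tail_concept}.
Step 3 - Relational evidence: based on the sentence, explain how the two
entities are connected, then choose one relation from: {relation_labels}.
Answer:"""

print(prompt.format(
    sentence="Marie Curie was born in Warsaw.",
    head="Marie Curie", head_concept="person",
    tail="Warsaw", tail_concept="city",
    relation_labels="birthplace, employer, spouse",
))
```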
Citation Recommendation on Scholarly Legal Articles
paper_authors: Doğukan Arslan, Saadet Sena Erdoğan, Gülşen Eryiğit
for: This paper introduces a scholarly legal dataset for the task of citation recommendation.
methods: The paper conducts experiments with existing state-of-the-art models and compares their performance on this dataset to examine how they perform in the legal domain.
results: The results show that pre-fetching with BM25+ followed by re-ranking with SciNCL improves the baseline performance from 0.26 to 0.30 MAP@10, and that fine-tuning yields considerable gains for pre-trained models.
Abstract
Citation recommendation is the task of finding appropriate citations based on a given piece of text. The proposed datasets for this task consist mainly of several scientific fields, lacking some core ones, such as law. Furthermore, citation recommendation is used within the legal domain to identify supporting arguments, utilizing non-scholarly legal articles. In order to alleviate the limitations of existing studies, we gather the first scholarly legal dataset for the task of citation recommendation. Also, we conduct experiments with state-of-the-art models and compare their performance on this dataset. The study suggests that, while BM25 is a strong benchmark for the legal citation recommendation task, the most effective method involves implementing a two-step process that entails pre-fetching with BM25+, followed by re-ranking with SciNCL, which enhances the performance of the baseline from 0.26 to 0.30 MAP@10. Moreover, fine-tuning leads to considerable performance increases in pre-trained models, which shows the importance of including legal articles in the training data of these models.
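A sketch of the two-step retrieval pipeline described above, assuming the rank_bm25 and sentence-transformers packages; the dense encoder below is a stand-in for SciNCL, and the corpus is invented.

```python
# Sketch: BM25 pre-fetching followed by dense re-ranking.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["court held that the contract was void",
          "the defendant appealed the ruling",
          "copyright protects original works of authorship"]
query = "when is a contract unenforceable"

# Step 1: pre-fetch top-k candidates with BM25.
bm25 = BM25Okapi([doc.split() for doc in corpus])
scores = bm25.get_scores(query.split())
top_k = sorted(range(len(corpus)), key=lambda i: -scores[i])[:2]

# Step 2: re-rank candidates by embedding similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for SciNCL
q_emb = encoder.encode(query, convert_to_tensor=True)
d_emb = encoder.encode([corpus[i] for i in top_k], convert_to_tensor=True)
reranked = sorted(zip(top_k, util.cos_sim(q_emb, d_emb)[0].tolist()),
                  key=lambda p: -p[1])
print(reranked)
```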
Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
paper_authors: Reza Esfandiarpoor, Stephen H. Bach
for: This paper aims to improve the performance of vision-language models like CLIP for image classification by extending class descriptions with related attributes.
methods: The proposed method, Follow-up Differential Descriptions (FuDD), uses a Large Language Model (LLM) to generate new class descriptions that differentiate between ambiguous classes.
results: FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets, and the high-quality natural language class descriptions produced by FuDD yield performance comparable to few-shot adaptation methods.
Abstract
A promising approach for improving the performance of vision-language models like CLIP for image classification is to extend the class descriptions (i.e., prompts) with related attributes, e.g., using brown sparrow instead of sparrow. However, current zero-shot methods select a subset of attributes regardless of commonalities between the target classes, potentially providing no useful information that would have helped to distinguish between them. For instance, they may use color instead of bill shape to distinguish between sparrows and wrens, which are both brown. We propose Follow-up Differential Descriptions (FuDD), a zero-shot approach that tailors the class descriptions to each dataset and leads to additional attributes that better differentiate the target classes. FuDD first identifies the ambiguous classes for each image, and then uses a Large Language Model (LLM) to generate new class descriptions that differentiate between them. The new class descriptions resolve the initial ambiguity and help predict the correct label. In our experiments, FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets. We show that differential descriptions are an effective tool to resolve class ambiguities, which otherwise significantly degrade the performance. We also show that high quality natural language class descriptions produced by FuDD result in comparable performance to few-shot adaptation methods.
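A minimal sketch of the mechanism FuDD builds on: score each class by the similarity between the image embedding and the averaged embeddings of that class's descriptions in CLIP. The descriptions and image path below are invented; FuDD would have an LLM generate differential descriptions for the classes the model confuses.

```python
# Sketch: CLIP zero-shot classification with per-class description sets.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_descriptions = {  # invented, differential descriptions
    "sparrow": ["a sparrow, a small brown bird with a conical bill"],
    "wren": ["a wren, a small brown bird with a thin, pointed bill"],
}

image = Image.open("bird.jpg")  # placeholder path
scores = {}
with torch.no_grad():
    img = model.get_image_features(
        **processor(images=image, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    for cls, descs in class_descriptions.items():
        txt = model.get_text_features(
            **processor(text=descs, return_tensors="pt", padding=True))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Average the class's description embeddings, then compare.
        scores[cls] = (img @ txt.mean(dim=0, keepdim=True).T).item()
print(max(scores, key=scores.get))
```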
Trends in Integration of Knowledge and Large Language Models: A Survey and Taxonomy of Methods, Benchmarks, and Applications
results: The paper proposes future research directions, including data augmentation, knowledge editing, and model enhancement.
Abstract
Large language models (LLMs) exhibit superior performance on various natural language tasks, but they are susceptible to issues stemming from outdated data and domain-specific limitations. In order to address these challenges, researchers have pursued two primary strategies, knowledge editing and retrieval augmentation, to enhance LLMs by incorporating external information from different aspects. Nevertheless, there is still a notable absence of a comprehensive survey. In this paper, we propose a review to discuss the trends in integration of knowledge and large language models, including taxonomy of methods, benchmarks, and applications. In addition, we conduct an in-depth analysis of different methods and point out potential research directions in the future. We hope this survey offers the community quick access and a comprehensive overview of this research area, with the intention of inspiring future research endeavors.
results: The results show that PRM-based methods improve accuracy on simple mathematical reasoning (GSM8K) but, unexpectedly, reduce performance on complex tasks (MATH), and that the reward aggregation function plays a critical role in model performance.
Abstract
While recent advances have boosted LM proficiency in linguistic benchmarks, LMs consistently struggle to reason correctly on complex tasks like mathematics. We turn to Reinforcement Learning from Human Feedback (RLHF) as a method with which to shape model reasoning processes. In particular, we explore two reward schemes, outcome-supervised reward models (ORMs) and process-supervised reward models (PRMs), to optimize for logical reasoning. Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning (GSM8K) while, unexpectedly, reducing performance in complex tasks (MATH). Furthermore, we show the critical role reward aggregation functions play in model performance. Providing promising avenues for future research, our study underscores the need for further exploration into fine-grained reward modeling for more reliable language models.
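A toy illustration of the difference between the two reward schemes and why aggregation matters for PRMs; the scores and aggregation functions below are hypothetical, not the paper's.

```python
# Toy illustration: an ORM scores only the final outcome, while a PRM
# scores each reasoning step, and the per-step rewards must then be
# aggregated into a single solution-level reward.
steps = [0.9, 0.8, 0.95, 0.4]  # hypothetical per-step PRM scores

def aggregate(step_rewards, how="min"):
    if how == "min":    # solution is as good as its weakest step
        return min(step_rewards)
    if how == "prod":   # probability that all steps are correct
        out = 1.0
        for r in step_rewards:
            out *= r
        return out
    return sum(step_rewards) / len(step_rewards)  # "mean"

orm_reward = 1.0  # outcome-only: the final answer happened to be right
for how in ("min", "prod", "mean"):
    print(how, round(aggregate(steps, how), 3))
print("orm", orm_reward)
```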
CFBenchmark: Chinese Financial Assistant Benchmark for Large Language Model
results: Experiments on several existing language models show that while some models excel at specific tasks, overall there is still significant room for improvement in basic financial text processing.
Abstract
Large language models (LLMs) have demonstrated great potential in the financial domain. Thus, it becomes important to assess the performance of LLMs in financial tasks. In this work, we introduce CFBenchmark, to evaluate the performance of LLMs as a Chinese financial assistant. The basic version of CFBenchmark is designed to evaluate the basic ability in Chinese financial text processing from three aspects (i.e., recognition, classification, and generation), spanning eight tasks and including financial texts ranging in length from 50 to over 1,800 characters. We conduct experiments on several LLMs available in the literature with CFBenchmark-Basic, and the experimental results indicate that while some LLMs show outstanding performance in specific tasks, overall, there is still significant room for improvement in basic tasks of financial text processing with existing models. In the future, we plan to explore the advanced version of CFBenchmark, aiming to further explore the extensive capabilities of language models in more profound dimensions as a financial assistant in Chinese. Our codes are released at https://github.com/TongjiFinLab/CFBenchmark.