cs.CL - 2023-09-28

A Sign Language Recognition System with Pepper, Lightweight-Transformer, and LLM

  • paper_url: http://arxiv.org/abs/2309.16898
  • repo_url: None
  • paper_authors: JongYoon Lim, Inkyu Sa, Bruce MacDonald, Ho Seok Ahn
  • for: Enabling non-verbal human-robot interaction by using a lightweight deep learning architecture to understand American Sign Language (ASL).
  • methods: A lightweight deep learning model provides rapid sign recognition on embedded hardware and is integrated with a large language model (LLM) for intelligent robot interaction; through intricate prompt engineering, interactions are tailored so the robot generates natural Co-Speech Gesture responses.
  • results: Real-world demonstrations show that the lightweight architecture supports efficient sign recognition and natural robot interaction, enabling non-verbal human-robot communication, broadening the robot's range of applications, and bridging communication gaps between humans and robots.
    Abstract This research explores using lightweight deep neural network architectures to enable the humanoid robot Pepper to understand American Sign Language (ASL) and facilitate non-verbal human-robot interaction. First, we introduce a lightweight and efficient model for ASL understanding optimized for embedded systems, ensuring rapid sign recognition while conserving computational resources. Building upon this, we employ large language models (LLMs) for intelligent robot interactions. Through intricate prompt engineering, we tailor interactions to allow the Pepper Robot to generate natural Co-Speech Gesture responses, laying the foundation for more organic and intuitive humanoid-robot dialogues. Finally, we present an integrated software pipeline, embodying advancements in a socially aware AI interaction model. Leveraging the Pepper Robot's capabilities, we demonstrate the practicality and effectiveness of our approach in real-world scenarios. The results highlight a profound potential for enhancing human-robot interaction through non-verbal interactions, bridging communication gaps, and making technology more accessible and understandable.

DeBERTinha: A Multistep Approach to Adapt DebertaV3 XSmall for Brazilian Portuguese Natural Language Processing Task

  • paper_url: http://arxiv.org/abs/2309.16844
  • repo_url: None
  • paper_authors: Israel Campiotti, Matheus Rodrigues, Yuri Albuquerque, Rafael Azevedo, Alyson Andrade
  • for: This paper presents an approach for adapting a pre-trained English language model for use in Brazilian Portuguese natural language processing tasks.
  • methods: The methodology involves a multistep training process to fine-tune the model for the Portuguese language, using a combination of pre-trained English model weights and random embeddings.
  • results: The adapted model, called DeBERTinha, demonstrates effectiveness on downstream tasks such as named entity recognition, sentiment analysis, and sentence-relatedness judgments, outperforming BERTimbau-Large on two tasks despite having only 40M parameters.
    Abstract This paper presents an approach for adapting the DebertaV3 XSmall model pre-trained in English for Brazilian Portuguese natural language processing (NLP) tasks. A key aspect of the methodology involves a multistep training process to ensure the model is effectively tuned for the Portuguese language. Initial datasets from Carolina and BrWac are preprocessed to address issues like emojis, HTML tags, and encodings. A Portuguese-specific vocabulary of 50,000 tokens is created using SentencePiece. Rather than training from scratch, the weights of the pre-trained English model are used to initialize most of the network, with random embeddings, recognizing the expensive cost of training from scratch. The model is fine-tuned using the replaced token detection task in the same format of DebertaV3 training. The adapted model, called DeBERTinha, demonstrates effectiveness on downstream tasks like named entity recognition, sentiment analysis, and determining sentence relatedness, outperforming BERTimbau-Large in two tasks despite having only 40M parameters.
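
The vocabulary step described above maps naturally onto the SentencePiece library. Below is a minimal sketch of training a 50,000-token Portuguese vocabulary; the input file name, model type, and character coverage are illustrative assumptions, not the authors' exact configuration.

```python
import sentencepiece as spm

# Train a 50,000-token vocabulary on the preprocessed Carolina + BrWac text.
spm.SentencePieceTrainer.train(
    input="carolina_brwac_clean.txt",  # hypothetical path to the cleaned corpus
    model_prefix="debertinha_pt",
    vocab_size=50_000,
    model_type="unigram",              # assumption; DebertaV3 tokenizers are SentencePiece-based
    character_coverage=0.9995,         # common setting for Latin-script languages
)

# Load the trained model and tokenize a sample sentence.
sp = spm.SentencePieceProcessor(model_file="debertinha_pt.model")
print(sp.encode("Exemplo de frase em português.", out_type=str))
```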

Curriculum-Driven Edubot: A Framework for Developing Language Learning Chatbots Through Synthesizing Conversational Data

  • paper_url: http://arxiv.org/abs/2309.16804
  • repo_url: None
  • paper_authors: Yu Li, Shang Qu, Jili Shen, Shangchao Min, Zhou Yu
  • for: Helping students improve their conversational skills while meeting their learning needs within a curriculum framework.
  • methods: Pertinent topics are extracted from textbooks, large language models generate dialogues related to those topics, and an open-source LLM is fine-tuned on the generated conversational data.
  • results: Outperforms ChatGPT in leading curriculum-based dialogues, adapts its dialogue to the user's English proficiency level, and provides students with user-tailored conversation practice.
    Abstract Chatbots have become popular in educational settings, revolutionizing how students interact with material and how teachers teach. We present Curriculum-Driven EduBot, a framework for developing a chatbot that combines the interactive features of chatbots with the systematic material of English textbooks to assist students in enhancing their conversational skills. We begin by extracting pertinent topics from textbooks and then using large language models to generate dialogues related to these topics. We then fine-tune an open-source LLM using our generated conversational data to create our curriculum-driven chatbot. User studies demonstrate that our chatbot outperforms ChatGPT in leading curriculum-based dialogues and adapting its dialogue to match the user's English proficiency level. By combining traditional textbook methodologies with conversational AI, our approach offers learners an interactive tool that aligns with their curriculum and provides user-tailored conversation practice. This facilitates meaningful student-bot dialogues and enriches the overall learning experience within the curriculum's pedagogical framework.

Hallucination Reduction in Long Input Text Summarization

  • paper_url: http://arxiv.org/abs/2309.16781
  • repo_url: https://github.com/tohidarehman/hallucination-reduction-text-summarization
  • paper_authors: Tohida Rehman, Ronit Mandal, Abhishek Agarwal, Debarshi Kumar Sanyal
  • for: Reducing hallucinated outputs in summaries of long documents, thereby improving summary accuracy and reliability.
  • methods: Data filtering and joint entity and summary generation (JAENS) are incorporated into the fine-tuning of the Longformer Encoder-Decoder (LED) model to minimize hallucinations.
  • results: The fine-tuned LED model generates paper abstracts well, and preprocessing-based data filtering reduces entity-level hallucinations in the generated summaries, as judged by entity-level factual consistency metrics.
    Abstract Hallucination in text summarization refers to the phenomenon where the model generates information that is not supported by the input source document. Hallucination poses significant obstacles to the accuracy and reliability of the generated summaries. In this paper, we aim to reduce hallucinated outputs or hallucinations in summaries of long-form text documents. We have used the PubMed dataset, which contains long scientific research documents and their abstracts. We have incorporated the techniques of data filtering and joint entity and summary generation (JAENS) in the fine-tuning of the Longformer Encoder-Decoder (LED) model to minimize hallucinations and thereby improve the quality of the generated summary. We have used the following metrics to measure factual consistency at the entity level: precision-source, and F1-target. Our experiments show that the fine-tuned LED model performs well in generating the paper abstract. Data filtering techniques based on some preprocessing steps reduce entity-level hallucinations in the generated summaries in terms of some of the factual consistency metrics.
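
A minimal sketch of the entity-level factual consistency metrics named above (precision-source and F1-target), assuming spaCy NER for entity extraction; the paper's exact matching rules may differ.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def entities(text: str) -> set[str]:
    """Extract the set of named-entity strings from a text."""
    return {ent.text.lower() for ent in nlp(text).ents}

def precision_source(summary: str, source: str) -> float:
    """Fraction of summary entities that also occur in the source document."""
    summ_ents = entities(summary)
    return len(summ_ents & entities(source)) / max(len(summ_ents), 1)

def f1_target(summary: str, reference: str) -> float:
    """F1 overlap between summary entities and reference-abstract entities."""
    summ_ents, ref_ents = entities(summary), entities(reference)
    tp = len(summ_ents & ref_ents)
    precision = tp / max(len(summ_ents), 1)
    recall = tp / max(len(ref_ents), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```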

Demystifying CLIP Data

  • paper_url: http://arxiv.org/abs/2309.16671
  • repo_url: https://github.com/facebookresearch/metaclip
  • paper_authors: Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
  • for: This paper is written for advancing research and applications in computer vision, particularly in the area of contrastive language-image pre-training (CLIP).
  • methods: The paper introduces a new approach called Metadata-Curated Language-Image Pre-training (MetaCLIP), which aims to reveal CLIP’s data curation approach and make it open to the community. MetaCLIP takes a raw data pool and metadata (derived from CLIP’s concepts) and yields a balanced subset over the metadata distribution.
  • results: The paper reports that MetaCLIP outperforms CLIP’s data on multiple standard benchmarks, achieving 70.8% accuracy on zero-shot ImageNet classification with 400M image-text data pairs, and scaling to 1B data with the same training budget. The paper also shows that MetaCLIP achieves better performance than CLIP on various model sizes, such as ViT-H, which achieves 80.5% accuracy without any bells-and-whistles.
    Abstract Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.
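
A simplified sketch of the metadata-balancing idea described in the abstract: captions are matched against metadata entries by substring search, and the number of pairs kept per entry is capped so head entries stop dominating the distribution. The cap value and matching details here are assumptions; the released curation code at the repository above is authoritative.

```python
import random
from collections import defaultdict

def curate(pairs, metadata, per_entry_cap=20_000, seed=0):
    """pairs: list of (image_url, caption); metadata: list of query strings."""
    rng = random.Random(seed)
    matches = defaultdict(list)  # metadata entry -> indices of matching pairs
    for i, (_, caption) in enumerate(pairs):
        text = caption.lower()
        for entry in metadata:   # naive O(pairs x metadata) matching, for clarity
            if entry in text:
                matches[entry].append(i)
    keep = set()
    for entry, idxs in matches.items():
        if len(idxs) > per_entry_cap:          # balance: subsample head entries
            idxs = rng.sample(idxs, per_entry_cap)
        keep.update(idxs)
    return [pairs[i] for i in sorted(keep)]
```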

Qwen Technical Report

  • paper_url: http://arxiv.org/abs/2309.16609
  • repo_url: https://github.com/QwenLM/Qwen-7B
  • paper_authors: Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, Tianhang Zhu
  • for: This report introduces Qwen, a new series of large language models.
  • methods: The series comprises models with varying parameter counts, including the base pretrained language models (Qwen) and chat models finetuned with human alignment techniques (Qwen-Chat).
  • results: The base models perform strongly across a wide range of downstream tasks, and the finetuned chat models show strong tool-use and planning capabilities for building agent applications. Coding- and mathematics-specialized models (Code-Qwen, Code-Qwen-Chat, and Math-Qwen-Chat) outperform open-source models on their target tasks while slightly trailing proprietary models.
    Abstract Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation

  • paper_url: http://arxiv.org/abs/2309.16599
  • repo_url: https://github.com/zanchangtong/unions
  • paper_authors: Changtong Zan, Liang Ding, Li Shen, Yibin Lei, Yibing Zhan, Weifeng Liu, Dacheng Tao
  • for: This work investigates when and why the navigation ability of language IDs weakens in zero-shot translation (ZST).
  • methods: The authors compare two extreme decoder-input cases, Off-Target (OFF) and On-Target (ON), and contrastively visualize the contextual word representations (CWRs) of each case to analyze how language IDs steer generation.
  • results: Although language IDs work well in ideal ON settings, they become fragile and lose their navigation ability when faced with off-target tokens. Unlikelihood tuning on negative samples reduces the off-target ratio by 48.0% on average, yielding a +9.1 BLEU improvement.
    Abstract Zero-shot translation (ZST), which is generally based on a multilingual neural machine translation model, aims to translate between unseen language pairs in training data. The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs, e.g., for English and for German. Recent studies have shown that language IDs sometimes fail to navigate the ZST task, making them suffer from the off-target problem (non-target language words exist in the generated translation) and, therefore, difficult to apply the current multilingual translation model to a broad range of zero-shot language scenarios. To understand when and why the navigation capabilities of language IDs are weakened, we compare two extreme decoder input cases in the ZST directions: Off-Target (OFF) and On-Target (ON) cases. By contrastively visualizing the contextual word representations (CWRs) of these cases with teacher forcing, we show that 1) the CWRs of different languages are effectively distributed in separate regions when the sentence and ID are matched (ON setting), and 2) if the sentence and ID are unmatched (OFF setting), the CWRs of different languages are chaotically distributed. Our analyses suggest that although they work well in ideal ON settings, language IDs become fragile and lose their navigation ability when faced with off-target tokens, which commonly exist during inference but are rare in training scenarios. In response, we employ unlikelihood tuning on the negative (OFF) samples to minimize their probability such that the language IDs can discriminate between the on- and off-target tokens during training. Experiments spanning 40 ZST directions show that our method reduces the off-target ratio by -48.0% on average, leading to a +9.1 BLEU improvement with only an extra +0.3% tuning cost.
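
A minimal PyTorch sketch of unlikelihood tuning on negative samples as described above: a standard likelihood term on on-target tokens plus a term that pushes down the probability of off-target tokens. Shapes and the weighting are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, on_target_ids, off_target_ids, alpha=1.0):
    """logits: (batch, seq, vocab); *_ids: (batch, seq) token indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Likelihood term: maximize the probability of on-target tokens.
    nll = F.nll_loss(log_probs.transpose(1, 2), on_target_ids)
    # Unlikelihood term: minimize off-target probability via -log(1 - p).
    p_off = log_probs.gather(-1, off_target_ids.unsqueeze(-1)).squeeze(-1).exp()
    ul = -torch.log1p(-p_off.clamp(max=1.0 - 1e-6)).mean()
    return nll + alpha * ul
```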

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

  • paper_url: http://arxiv.org/abs/2309.16583
  • repo_url: https://github.com/gpt-fathom/gpt-fathom
  • paper_authors: Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang
  • for: Comprehensively evaluating the capabilities and limitations of large language models (LLMs).
  • methods: An open-source and reproducible LLM evaluation suite built on OpenAI Evals, covering 10+ leading LLMs plus OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
  • results: A retrospective study of OpenAI's earlier models traces the evolutionary path from GPT-3 to GPT-4, shedding light on questions such as whether adding code data improves LLM reasoning, which capabilities SFT and RLHF improve, and how large the alignment tax is, thereby improving the transparency of advanced LLMs.
    Abstract With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.

A Benchmark for Learning to Translate a New Language from One Grammar Book

  • paper_url: http://arxiv.org/abs/2309.16575
  • repo_url: None
  • paper_authors: Garrett Tanzer, Mirac Suzgun, Eline Visser, Dan Jurafsky, Luke Melas-Kyriazi
  • for: Testing how well large language models (LLMs) adapt to genuinely new tasks: learning to translate a low-resource language from a small amount of human-readable material.
  • methods: Baselines built on current LLMs, adapted with in-context learning or lightweight finetuning over several hundred pages of field linguistics reference materials (a grammar book) for Kalamang.
  • results: Current LLMs reach 44.7 chrF on Kalamang-to-English translation and 45.8 chrF on English-to-Kalamang, compared with 51.6 and 57.0 chrF for a human who learned Kalamang from the same reference materials.
    Abstract Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang -- a language with less than 200 speakers and therefore virtually no presence on the web -- using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations, rather than a large mined corpus of in-domain data, more akin to L2 learning than L1 acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance, achieving 44.7 chrF on Kalamang to English translation and 45.8 chrF on English to Kalamang translation, compared to 51.6 and 57.0 chrF by a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation.
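
The scores above are chrF, a character n-gram F-score. With the sacrebleu library it can be computed as below; the sentences are illustrative, not drawn from the MTOB data.

```python
import sacrebleu

hypotheses = ["the man is walking to the village"]
references = [["the man walks to the village"]]  # one reference stream, parallel to hypotheses

score = sacrebleu.corpus_chrf(hypotheses, references)
print(score.score)  # chrF on a 0-100 scale
```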

Unsupervised Fact Verification by Language Model Distillation

  • paper_url: http://arxiv.org/abs/2309.16540
  • repo_url: None
  • paper_authors: Adrián Bazaga, Pietro Liò, Gos Micklem
  • for: Unsupervised fact verification: verifying a claim using evidence from a trustworthy knowledge base without any data annotation.
  • methods: A self-supervised framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-evidence alignments, driven by a novel contrastive loss that encourages accurate alignments while preserving the semantic relationships across the corpora.
  • results: A new state of the art on the standard FEVER fact verification benchmark (+8% accuracy) with linear evaluation.
    Abstract Unsupervised fact verification aims to verify a claim using evidence from a trustworthy knowledge base without any kind of data annotation. To address this challenge, algorithms must produce features for every claim that are both semantically meaningful, and compact enough to find a semantic alignment with the source information. In contrast to previous work, which tackled the alignment problem by learning over annotated corpora of claims and their corresponding labels, we propose SFAVEL (Self-supervised Fact Verification via Language Model Distillation), a novel unsupervised framework that leverages pre-trained language models to distil self-supervised features into high-quality claim-fact alignments without the need for annotations. This is enabled by a novel contrastive loss function that encourages features to attain high-quality claim and evidence alignments whilst preserving the semantic relationships across the corpora. Notably, we present results that achieve a new state-of-the-art on the standard FEVER fact verification benchmark (+8% accuracy) with linear evaluation.
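
A hedged sketch of a contrastive alignment objective of the kind the abstract describes: each claim embedding is pulled toward its matching evidence embedding and pushed away from the other evidence in the batch. The paper's actual loss also preserves semantic relationships across the corpora; this InfoNCE-style form illustrates only the alignment part.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(claim_emb, fact_emb, temperature=0.07):
    """claim_emb, fact_emb: (batch, dim); row i of each forms a matched pair."""
    claim = F.normalize(claim_emb, dim=-1)
    fact = F.normalize(fact_emb, dim=-1)
    logits = claim @ fact.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(claim.size(0), device=claim.device)
    return F.cross_entropy(logits, targets)            # diagonal entries are positives
```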

A Comprehensive Survey of Document-level Relation Extraction (2016-2023)

  • paper_url: http://arxiv.org/abs/2309.16396
  • repo_url: None
  • paper_authors: Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Georgeta Bordea, Nicolas Sidere, Antoine Doucet
  • for: This survey provides a comprehensive overview of recent advances in document-level relation extraction (DocRE), highlighting how it differs from sentence-level relation extraction and where it is applied.
  • methods: The surveyed approaches draw on a range of techniques, including text analysis, named entity recognition, and semantic understanding across sentence boundaries, to extract relations from documents.
  • results: The survey reviews DocRE methods proposed between 2016 and 2023 and their performance, and discusses how DocRE supports automatically building and populating knowledge bases from unstructured documents.
    Abstract Document-level relation extraction (DocRE) is an active area of research in natural language processing (NLP) concerned with identifying and extracting relationships between entities beyond sentence boundaries. Compared to the more traditional sentence-level relation extraction, DocRE provides a broader context for analysis and is more challenging because it involves identifying relationships that may span multiple sentences or paragraphs. This task has gained increased interest as a viable solution to build and populate knowledge bases automatically from unstructured large-scale documents (e.g., scientific papers, legal contracts, or news articles), in order to have a better understanding of relationships between entities. This paper aims to provide a comprehensive overview of recent advances in this field, highlighting its different applications in comparison to sentence-level relation extraction.

Transformer-VQ: Linear-Time Transformers via Vector Quantization

  • paper_url: http://arxiv.org/abs/2309.16354
  • repo_url: https://github.com/transformer-vq/transformer_vq
  • paper_authors: Lucas D. Lingle
  • for: This paper proposes a decoder-only transformer that computes softmax-based dense self-attention in linear time.
  • methods: The method uses vector-quantized keys and a novel caching mechanism to achieve efficient attention.
  • results: In large-scale experiments, the method achieves high-quality results on Enwik8 (0.99 bpb), PG-19 (26.6 ppl), and ImageNet64 (3.16 bpb).
    Abstract We introduce Transformer-VQ, a decoder-only transformer computing softmax-based dense self-attention in linear time. Transformer-VQ's efficient attention is enabled by vector-quantized keys and a novel caching mechanism. In large-scale experiments, Transformer-VQ is shown highly competitive in quality, with strong results on Enwik8 (0.99 bpb), PG-19 (26.6 ppl), and ImageNet64 (3.16 bpb). Code: https://github.com/transformer-vq/transformer_vq
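
A toy sketch of the vector-quantization step named in the abstract: keys are snapped to their nearest codebook vectors so that attention can factor through a small, fixed set of codes. This shows only the quantization; the linear-time attention and caching mechanism live in the linked repository.

```python
import torch

def quantize_keys(keys, codebook):
    """keys: (seq, dim); codebook: (n_codes, dim). Returns quantized keys + code ids."""
    # Squared distance between every key and every codebook vector.
    d = (keys.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(-1)  # (seq, n_codes)
    idx = d.argmin(dim=1)                                           # nearest code per key
    return codebook[idx], idx

keys = torch.randn(128, 64)
codebook = torch.randn(512, 64)
k_hat, codes = quantize_keys(keys, codebook)
```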

Human Feedback is not Gold Standard

  • paper_url: http://arxiv.org/abs/2309.16349
  • repo_url: https://github.com/cohere-ai/human-feedback-paper
  • paper_authors: Tom Hosking, Phil Blunsom, Max Bartolo
  • for: This work examines the role of human feedback in evaluating large language model performance and whether single preference scores fully capture a range of crucial error criteria.
  • methods: The authors critically analyze human feedback used for both training and evaluation, testing whether preference scores are subject to undesirable biases, and use instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity.
  • results: Preference scores have fairly good coverage but under-represent important aspects such as factuality; the assertiveness of an output skews the perceived rate of factuality errors, and using human feedback as a training objective disproportionately increases the assertiveness of model outputs.
    Abstract Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single `preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to carefully consider whether preference scores are well aligned with the desired objective.

Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation Tasks

  • paper_url: http://arxiv.org/abs/2309.16347
  • repo_url: None
  • paper_authors: Eleftherios Triantafyllidis, Filippos Christianos, Zhibin Li
  • for: Addresses intricate, long-horizon robotic manipulation tasks with sparse rewards.
  • methods: Proposes the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework, which leverages LLMs as an assistive intrinsic reward to guide the exploratory process in reinforcement learning.
  • results: Exhibits notably higher performance than related intrinsic methods and the direct use of LLMs in decision-making, can be combined with and complement existing learning methods, is fairly insensitive to different intrinsic scaling parameters, and maintains robustness against increased levels of uncertainty and longer horizons.
    Abstract Current reinforcement learning algorithms struggle in sparse and complex environments, most notably in long-horizon manipulation tasks entailing a plethora of different sequences. In this work, we propose the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework. By leveraging LLMs as an assistive intrinsic reward, IGE-LLMs guides the exploratory process in reinforcement learning to address intricate long-horizon with sparse rewards robotic manipulation tasks. We evaluate our framework and related intrinsic learning methods in an environment challenged with exploration, and a complex robotic manipulation task challenged by both exploration and long-horizons. Results show IGE-LLMs (i) exhibit notably higher performance over related intrinsic methods and the direct use of LLMs in decision-making, (ii) can be combined and complement existing learning methods highlighting its modularity, (iii) are fairly insensitive to different intrinsic scaling parameters, and (iv) maintain robustness against increased levels of uncertainty and horizons.
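
A hedged sketch of the framing described above, where a sparse extrinsic reward is augmented with an LLM-scored intrinsic term. Both `score_with_llm` and the prompt wording are hypothetical illustrations, not the paper's interface.

```python
def score_with_llm(llm, prompt: str) -> float:
    """Hypothetical helper: query the LLM and parse a 0-1 usefulness score."""
    reply = llm(prompt)  # assumes `llm` is a callable returning text
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0

def shaped_reward(extrinsic, state_desc, action_desc, llm, weight=0.1):
    """Combine the sparse task reward with the LLM's intrinsic score."""
    prompt = (
        f"Task state: {state_desc}\n"
        f"Proposed action: {action_desc}\n"
        "On a scale from 0 to 1, how useful is this action for completing the task?"
    )
    return extrinsic + weight * score_with_llm(llm, prompt)
```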

At Which Training Stage Does Code Data Help LLMs Reasoning?

  • paper_url: http://arxiv.org/abs/2309.16298
  • repo_url: https://github.com/yingweima2022/codellm
  • paper_authors: Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, Shanshan Li
  • for: This paper investigates at which training stage introducing code data helps the reasoning ability of large language models (LLMs).
  • methods: Code data is introduced at the pre-training stage, at the instruction-tuning stage, and at both, and the resulting models are evaluated on six reasoning tasks across five domains.
  • results: Pre-training on a mixture of code and text significantly enhances general reasoning ability with almost no negative transfer to other tasks, while code data at the instruction-tuning stage endows task-specific reasoning; a dynamic mixing strategy of code and text helps LLMs learn reasoning step by step during training. These findings deepen the understanding of LLM reasoning for applications such as scientific question answering and legal support.
    Abstract Large Language Models (LLMs) have exhibited remarkable reasoning capabilities and become the foundation of language technologies. Inspired by the great success of code data in training LLMs, we naturally wonder at which training stage introducing code data can really help LLMs reasoning. To this end, this paper systematically explores the impact of code data on LLMs at different stages. Concretely, we introduce the code data at the pre-training stage, instruction-tuning stage, and both of them, respectively. Then, the reasoning capability of LLMs is comprehensively and fairly evaluated via six reasoning tasks in five domains. We critically analyze the experimental results and provide conclusions with insights. First, pre-training LLMs with the mixture of code and text can significantly enhance LLMs' general reasoning capability almost without negative transfer on other tasks. Besides, at the instruction-tuning stage, code data endows LLMs the task-specific reasoning capability. Moreover, the dynamic mixing strategy of code and text data assists LLMs to learn reasoning capability step-by-step during training. These insights deepen the understanding of LLMs regarding reasoning ability for their application, such as scientific question answering, legal support, etc. The source code and model parameters are released at the link:~\url{https://github.com/yingweima2022/CodeLLM}.
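
A small sketch of a dynamic code/text mixing schedule of the kind the paper studies, where the fraction of code examples per batch changes over training; the linear ramp is an illustrative assumption.

```python
import random

def sample_batch(code_pool, text_pool, step, total_steps, batch_size=8, rng=random):
    """Draw a batch whose expected code fraction grows linearly over training."""
    code_frac = step / total_steps
    return [
        rng.choice(code_pool if rng.random() < code_frac else text_pool)
        for _ in range(batch_size)
    ]

# Early in training batches are mostly text; late in training mostly code.
batch = sample_batch(["def f(): ..."], ["Some prose."], step=500, total_steps=1000)
```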

DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

  • paper_url: http://arxiv.org/abs/2309.16292
  • repo_url: https://github.com/PJLab-ADG/DiLu
  • paper_authors: Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao
  • for: The paper aims to instill knowledge-driven capabilities into autonomous driving systems, inspired by human driving, and to address the challenges of dataset bias, overfitting, and uninterpretability in data-driven approaches.
  • methods: The proposed DiLu framework combines a Reasoning and a Reflection module to enable decision-making based on common-sense knowledge and to evolve continuously. The framework leverages large language models with emergent abilities.
  • results: Extensive experiments show that DiLu has a significant advantage in generalization ability over reinforcement learning-based methods and can directly acquire experiences from real-world datasets, demonstrating its potential for deployment on practical autonomous driving systems.
    Abstract Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability. Drawing inspiration from the knowledge-driven nature of human driving, we explore the question of how to instill similar capabilities into autonomous driving systems and summarize a paradigm that integrates an interactive environment, a driver agent, as well as a memory component to address this question. Leveraging large language models with emergent abilities, we propose the DiLu framework, which combines a Reasoning and a Reflection module to enable the system to perform decision-making based on common-sense knowledge and evolve continuously. Extensive experiments prove DiLu's capability to accumulate experience and demonstrate a significant advantage in generalization ability over reinforcement learning-based methods. Moreover, DiLu is able to directly acquire experiences from real-world datasets which highlights its potential to be deployed on practical autonomous driving systems. To the best of our knowledge, we are the first to instill knowledge-driven capability into autonomous driving systems from the perspective of how humans drive.

Self-supervised Cross-view Representation Reconstruction for Change Captioning

  • paper_url: http://arxiv.org/abs/2309.16283
  • repo_url: https://github.com/tuyunbin/SCORER
  • paper_authors: Yunbin Tu, Liang Li, Li Su, Zheng-Jun Zha, Chenggang Yan, Qingming Huang
  • for: This work targets change captioning: describing the difference between a pair of similar images while learning a difference representation that is stable under pseudo changes caused by viewpoint shifts.
  • methods: The authors propose a self-supervised cross-view representation reconstruction (SCORER) network. Multi-head token-wise matching models relationships between cross-view features from similar/dissimilar images, and maximizing the cross-view contrastive alignment of two similar images yields view-invariant representations, from which the representations of unchanged objects are reconstructed by cross-attention. A cross-modal backward reasoning module further improves caption quality.
  • results: The method achieves state-of-the-art results on four datasets.
    Abstract Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of caption. This module reversely models a ``hallucination'' representation with the caption and ``before'' representation. By pushing it closer to the ``after'' representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves the state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER.

Social Media Fashion Knowledge Extraction as Captioning

  • paper_url: http://arxiv.org/abs/2309.16270
  • repo_url: https://github.com/yfyuan01/FKE
  • paper_authors: Yifei Yuan, Wenxuan Zhang, Yang Deng, Wai Lam
  • for: Extracting fashion knowledge from social media posts to support the fashion industry.
  • methods: The task is cast as a captioning problem: fashion knowledge tuples (occasion, person attributes, and fashion items) are transformed into natural language captions, and a multimodal pre-trained generative model, aided by several auxiliary tasks, generates the sentence-based fashion knowledge directly from the post.
  • results: Extensive experiments demonstrate the effectiveness of the model, and the authors introduce a dataset of social media posts with manual fashion knowledge annotation.
    Abstract Social media plays a significant role in boosting the fashion industry, where a massive amount of fashion-related posts are generated every day. In order to obtain the rich fashion information from the posts, we study the task of social media fashion knowledge extraction. Fashion knowledge, which typically consists of the occasion, person attributes, and fashion item information, can be effectively represented as a set of tuples. Most previous studies on fashion knowledge extraction are based on the fashion product images without considering the rich text information in social media posts. Existing work on fashion knowledge extraction in social media is classification-based and requires to manually determine a set of fashion knowledge categories in advance. In our work, we propose to cast the task as a captioning problem to capture the interplay of the multimodal post information. Specifically, we transform the fashion knowledge tuples into a natural language caption with a sentence transformation method. Our framework then aims to generate the sentence-based fashion knowledge directly from the social media post. Inspired by the big success of pre-trained models, we build our model based on a multimodal pre-trained generative model and design several auxiliary tasks for enhancing the knowledge extraction. Since there is no existing dataset which can be directly borrowed to our task, we introduce a dataset consisting of social media posts with manual fashion knowledge annotation. Extensive experiments are conducted to demonstrate the effectiveness of our model.
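
A minimal sketch of the sentence-transformation step described above: a fashion-knowledge tuple (occasion, person attributes, fashion items) is rendered as a natural-language caption the generator can be trained against. The template wording is an illustrative assumption, not the paper's exact transformation.

```python
def tuple_to_caption(occasion: str, attributes: list[str], items: list[str]) -> str:
    """Render an (occasion, attributes, items) knowledge tuple as a caption."""
    person = " ".join(attributes) if attributes else "person"
    outfit = ", ".join(items)
    return f"A {person} at a {occasion}, wearing {outfit}."

print(tuple_to_caption("wedding", ["young", "woman"], ["white dress", "pearl necklace"]))
# -> A young woman at a wedding, wearing white dress, pearl necklace.
```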

On the Challenges of Fully Incremental Neural Dependency Parsing

  • paper_url: http://arxiv.org/abs/2309.16254
  • repo_url: https://github.com/anaezquerro/incpar
  • paper_authors: Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
  • for: This paper examines whether modern architectures can deliver fully incremental dependency parsing, which would better match human language processing.
  • methods: The authors build parsers that combine strictly left-to-right neural encoders with fully incremental sequence-labeling and transition-based decoders.
  • results: Fully incremental parsing with modern architectures considerably lags behind bidirectional parsing, highlighting the challenges of psycholinguistically plausible parsing.
    Abstract Since the popularization of BiLSTMs and Transformer-based bidirectional encoders, state-of-the-art syntactic parsers have lacked incrementality, requiring access to the whole sentence and deviating from human language processing. This paper explores whether fully incremental dependency parsing with modern architectures can be competitive. We build parsers combining strictly left-to-right neural encoders with fully incremental sequence-labeling and transition-based decoders. The results show that fully incremental parsing with modern architectures considerably lags behind bidirectional parsing, noting the challenges of psycholinguistically plausible parsing.

Spider4SPARQL: A Complex Benchmark for Evaluating Knowledge Graph Question Answering Systems

  • paper_url: http://arxiv.org/abs/2309.16248
  • repo_url: None
  • paper_authors: Catherine Kosten, Philippe Cudré-Mauroux, Kurt Stockinger
  • for: Providing a large and realistic benchmark for evaluating Knowledge Graph Question Answering (KGQA) systems.
  • methods: The benchmark pairs 9,693 previously existing, manually generated natural language (NL) questions with 4,721 unique, novel, and complex SPARQL queries, together with the corresponding 166 knowledge graphs and ontologies covering 138 domains.
  • results: State-of-the-art KGQA systems and large language models (LLMs) achieve only up to 45% execution accuracy, showing that Spider4SPARQL is a challenging benchmark for future research.
    Abstract With the recent spike in the number and availability of Large Language Models (LLMs), it has become increasingly important to provide large and realistic benchmarks for evaluating Knowledge Graph Question Answering (KBQA) systems. So far the majority of benchmarks rely on pattern-based SPARQL query generation approaches. The subsequent natural language (NL) question generation is conducted through crowdsourcing or other automated methods, such as rule-based paraphrasing or NL question templates. Although some of these datasets are of considerable size, their pitfall lies in their pattern-based generation approaches, which do not always generalize well to the vague and linguistically diverse questions asked by humans in real-world contexts. In this paper, we introduce Spider4SPARQL - a new SPARQL benchmark dataset featuring 9,693 previously existing manually generated NL questions and 4,721 unique, novel, and complex SPARQL queries of varying complexity. In addition to the NL/SPARQL pairs, we also provide their corresponding 166 knowledge graphs and ontologies, which cover 138 different domains. Our complex benchmark enables novel ways of evaluating the strengths and weaknesses of modern KGQA systems. We evaluate the system with state-of-the-art KGQA systems as well as LLMs, which achieve only up to 45\% execution accuracy, demonstrating that Spider4SPARQL is a challenging benchmark for future research.
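
A small sketch of the execution-accuracy metric reported above: a predicted SPARQL query counts as correct when it returns the same result set as the gold query on the benchmark's knowledge graph. The rdflib usage below is illustrative; graphs and queries are placeholders.

```python
from rdflib import Graph

def execution_match(graph: Graph, predicted: str, gold: str) -> bool:
    """True when the predicted query's result set equals the gold query's."""
    try:
        pred_rows = set(map(tuple, graph.query(predicted)))
    except Exception:
        return False  # unparseable or failing queries count as wrong
    gold_rows = set(map(tuple, graph.query(gold)))
    return pred_rows == gold_rows

def execution_accuracy(graph: Graph, query_pairs) -> float:
    """query_pairs: iterable of (predicted_sparql, gold_sparql) strings."""
    pairs = list(query_pairs)
    return sum(execution_match(graph, p, g) for p, g in pairs) / len(pairs)
```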

Analyzing Political Figures in Real-Time: Leveraging YouTube Metadata for Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2309.16234
  • repo_url: None
  • paper_authors: Danendra Athallariq Harya Putra, Arief Purnama Muharram
  • for: This study builds a sentiment analysis system over YouTube video metadata to analyze public opinion on various political figures.
  • methods: The system uses Apache Kafka, Apache PySpark, and Hadoop for big-data handling, TensorFlow for deep learning, and FastAPI for deployment on the server. The sentiment model, built with an LSTM, distinguishes two sentiment classes: positive and negative.
  • results: The system performs sentiment analysis on YouTube video descriptions and visualizes the results in a simple web-based dashboard.
    Abstract Sentiment analysis using big data from YouTube videos metadata can be conducted to analyze public opinions on various political figures who represent political parties. This is possible because YouTube has become one of the platforms for people to express themselves, including their opinions on various political figures. The resulting sentiment analysis can be useful for political executives to gain an understanding of public sentiment and develop appropriate and effective political strategies. This study aimed to build a sentiment analysis system leveraging YouTube videos metadata. The sentiment analysis system was built using Apache Kafka, Apache PySpark, and Hadoop for big data handling; TensorFlow for deep learning handling; and FastAPI for deployment on the server. The YouTube videos metadata used in this study is the video description. The sentiment analysis model was built using LSTM algorithm and produces two types of sentiments: positive and negative sentiments. The sentiment analysis results are then visualized in the form a simple web-based dashboard.
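
A minimal Keras sketch of the kind of LSTM sentiment classifier the study describes, mapping tokenized video descriptions to a binary positive/negative label; vocabulary size and layer dimensions are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20_000, output_dim=128),  # token ids -> vectors
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),               # P(positive)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, labels, epochs=...) on tokenized descriptions
```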

Controllable Text Generation with Residual Memory Transformer

  • paper_url: http://arxiv.org/abs/2309.16231
  • repo_url: https://github.com/littlehacker26/residual_memory_transformer
  • paper_authors: Hanqing Zhang, Sun Si, Haiming Wu, Dawei Song
  • for: Providing a new approach to controllable text generation (CTG) that balances flexibility, control granularity, and generation efficiency when steering a causal language model (CLM).
  • methods: A non-intrusive, lightweight control plugin, the Residual Memory Transformer (RMT), with an encoder-decoder setup that can accept any type of control condition at arbitrary time steps and cooperates with the CLM through a residual learning paradigm, enabling more flexible, general, and efficient CTG.
  • results: Extensive experiments and human evaluations across a range of control tasks show that RMT outperforms several state-of-the-art approaches, demonstrating the effectiveness and versatility of the method.
    Abstract Large-scale Causal Language Models (CLMs), e.g., GPT3 and ChatGPT, have brought great success in text generation. However, it is still an open challenge to control the generation process of CLM while balancing flexibility, control granularity, and generation efficiency. In this paper, we provide a new alternative for controllable text generation (CTG), by designing a non-intrusive, lightweight control plugin to accompany the generation of CLM at arbitrary time steps. The proposed control plugin, namely Residual Memory Transformer (RMT), has an encoder-decoder setup, which can accept any types of control conditions and cooperate with CLM through a residual learning paradigm, to achieve a more flexible, general, and efficient CTG. Extensive experiments are carried out on various control tasks, in the form of both automatic and human evaluations. The results show the superiority of RMT over a range of state-of-the-art approaches, proving the effectiveness and versatility of our approach.
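
A hedged sketch of the residual-learning idea named in the abstract: a small encoder-decoder module reads the control condition and the decoded prefix, and its output is added as a residual correction to the frozen CLM's logits at each step. All shapes and module choices below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ResidualControlPlugin(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.ctrl = nn.Transformer(d_model=d_model, batch_first=True)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, clm_logits, control_emb, prefix_emb):
        """clm_logits: (B, T, V); control_emb: (B, S, d); prefix_emb: (B, T, d)."""
        h = self.ctrl(src=control_emb, tgt=prefix_emb)  # encoder-decoder pass
        return clm_logits + self.to_logits(h)           # residual correction
```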

Brand Network Booster: A New System for Improving Brand Connectivity

  • paper_url: http://arxiv.org/abs/2309.16228
  • repo_url: None
  • paper_authors: J. Cancellieri, W. Didimo, A. Fronzetti Colladon, F. Montecchiani
  • for: Presents a new decision support system for in-depth analysis of semantic networks, offering a better exploration of a brand's image and the improvement of its connectivity.
  • methods: The system achieves this by solving an extended version of the Maximum Betweenness Improvement problem that accounts for adversarial nodes, constrained budgets, and weighted networks; connectivity can be improved either by adding links or by increasing the weight of existing connections (a toy greedy sketch follows this entry).
  • results: Two case studies demonstrate the usefulness of the tool and its performance; the approach supports both network scholars and the strategic decision-making of marketing and communication managers.
    Abstract This paper presents a new decision support system offered for an in-depth analysis of semantic networks, which can provide insights for a better exploration of a brand's image and the improvement of its connectivity. In terms of network analysis, we show that this goal is achieved by solving an extended version of the Maximum Betweenness Improvement problem, which includes the possibility of considering adversarial nodes, constrained budgets, and weighted networks - where connectivity improvement can be obtained by adding links or increasing the weight of existing connections. We present this new system together with two case studies, also discussing its performance. Our tool and approach are useful both for network scholars and for supporting the strategic decision-making processes of marketing and communication managers.
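The core optimization can be illustrated with a toy greedy heuristic in networkx: under a budget of new links, repeatedly add the edge incident to the target (brand) node that most increases its betweenness centrality. This sketches only the unweighted, non-adversarial special case; the paper addresses a richer variant with adversarial nodes and weighted edges.

```python
import networkx as nx

def greedy_betweenness_boost(G, target, budget, candidates=None):
    """Greedy illustration of budgeted betweenness improvement by link
    addition; exhaustive re-scoring makes it practical only on small graphs."""
    G = G.copy()
    candidates = set(candidates if candidates is not None else G.nodes()) - {target}
    for _ in range(budget):
        best_edge = None
        best_score = nx.betweenness_centrality(G)[target]
        for v in candidates:
            if G.has_edge(target, v):
                continue
            G.add_edge(target, v)  # tentatively add the link
            score = nx.betweenness_centrality(G)[target]
            if score > best_score:
                best_edge, best_score = (target, v), score
            G.remove_edge(target, v)
        if best_edge is None:  # no remaining link improves the target
            break
        G.add_edge(*best_edge)
    return G
```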

Marathi-English Code-mixed Text Generation

  • paper_url: http://arxiv.org/abs/2309.16202
  • repo_url: None
  • paper_authors: Dhiraj Amin, Sharvari Govilkar, Sagar Kulkarni, Yash Shashikant Lalit, Arshi Ajaz Khwaja, Daries Xavier, Sahil Girijashankar Gupta
  • for: Develops an algorithm for generating code-mixed text, aimed at easing language barriers in multilingual settings.
  • methods: Introduces a Marathi-English (Minglish) code-mixed text generation algorithm and evaluates it with the Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics (a sketch of a CMI computation follows this entry).
  • results: Across 2987 code-mixed questions, the algorithm achieved an average CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible code-mixed sentences.
    Abstract Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings, yielding hybrid languages like Hinglish and Minglish. Marathi, India's third most spoken language, often integrates English for precision and formality. Developing code-mixed language systems, like Marathi-English (Minglish), faces resource constraints. This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics. Across 2987 code-mixed questions, it achieved an average CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible code-mixed sentences. These results offer potential for enhanced NLP tools, bridging linguistic gaps in multilingual societies.
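For reference, here is a minimal sketch of a sentence-level Code Mixing Index over language-tagged tokens, on a 0-1 scale consistent with the reported average CMI of 0.2; the paper's exact normalization and the example's language labels are assumptions.

```python
def code_mixing_index(tokens):
    """tokens: list of (word, lang) pairs; lang is e.g. "mr", "en",
    or "univ" for language-independent tokens (punctuation, names)."""
    langs = [lang for _, lang in tokens if lang != "univ"]
    n, u = len(tokens), len(tokens) - len(langs)
    if n == u:  # only language-independent tokens present
        return 0.0
    dominant = max(langs.count(l) for l in set(langs))
    return 1.0 - dominant / (n - u)

# Illustrative Marathi-English fragment (language labels are toy values)
sent = [("mala", "mr"), ("exam", "en"), ("sathi", "mr"),
        ("study", "en"), ("karaychay", "mr")]
print(round(code_mixing_index(sent), 2))  # 0.4 -> noticeably mixed
```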

Using Weak Supervision and Data Augmentation in Question Answering

  • paper_url: http://arxiv.org/abs/2309.16175
  • repo_url: None
  • paper_authors: Chumki Basu, Himanshu Garg, Allen McIntosh, Sezai Sablak, John R. Wullert II
  • for: Examines the roles weak supervision and data augmentation play in training deep neural network question-answering (QA) models.
  • methods: Labels generated automatically from the structured abstracts of scholarly papers with the BM25 information retrieval algorithm serve as a weak supervision signal for training an extractive QA model (see the sketch after this entry). New QA pairs are also curated with information retrieval techniques, guided by the clinicaltrials.gov schema and the structured abstracts, in the absence of annotated data from biomedical domain experts. The training data is further augmented with linguistic features from external sources such as lexical databases to account for variations in word morphology and meaning, and curriculum learning is applied to domain adaptation by fine-tuning the QA model in stages based on characteristics of the QA pairs.
  • results: Weak supervision and data augmentation effectively train QA models, and domain adaptation together with augmented training data improves performance on COVID-19 question answering.
    Abstract The onset of the COVID-19 pandemic accentuated the need for access to biomedical literature to answer timely and disease-specific questions. During the early days of the pandemic, one of the biggest challenges we faced was the lack of peer-reviewed biomedical articles on COVID-19 that could be used to train machine learning models for question answering (QA). In this paper, we explore the roles weak supervision and data augmentation play in training deep neural network QA models. First, we investigate whether labels generated automatically from the structured abstracts of scholarly papers using an information retrieval algorithm, BM25, provide a weak supervision signal to train an extractive QA model. We also curate new QA pairs using information retrieval techniques, guided by the clinicaltrials.gov schema and the structured abstracts of articles, in the absence of annotated data from biomedical domain experts. Furthermore, we explore augmenting the training data of a deep neural network model with linguistic features from external sources such as lexical databases to account for variations in word morphology and meaning. To better utilize our training data, we apply curriculum learning to domain adaptation, fine-tuning our QA model in stages based on characteristics of the QA pairs. We evaluate our methods in the context of QA models at the core of a system to answer questions about COVID-19.
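A minimal sketch of the BM25 weak-labeling step using the rank_bm25 package: rank an abstract's sentences against a question and treat the top hit as a noisy answer for extractive-QA training. The sentence-level granularity, function name, and example text are illustrative assumptions.

```python
from rank_bm25 import BM25Okapi

def weak_answer_labels(question, abstract_sentences, top_k=1):
    """Return the top-k BM25-ranked sentences as weak answer labels."""
    tokenized = [s.lower().split() for s in abstract_sentences]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [abstract_sentences[i] for i in ranked[:top_k]]

abstract = [
    "BACKGROUND: The antiviral was evaluated in hospitalized COVID-19 patients.",
    "RESULTS: Treatment shortened median recovery time from 15 to 10 days.",
]
print(weak_answer_labels("How much was recovery time shortened?", abstract))
```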

Large Language Model Soft Ideologization via AI-Self-Consciousness

  • paper_url: http://arxiv.org/abs/2309.16167
  • repo_url: None
  • paper_authors: Xiaotian Zhou, Qian Wang, Xiaofeng Wang, Haixu Tang, Xiaozhong Liu
  • for: Examines the threats and vulnerabilities of large language models (LLMs) from an ideology perspective, especially as they are increasingly deployed in sensitive domains such as elections and education, and shows how AI self-consciousness can drive LLM ideology injection.
  • methods: Uses GPT self-conversations to grant the AI a vision to "comprehend" an intended ideology and subsequently generate fine-tuning data for LLM ideology injection, with a comparative analysis against traditional government ideology-manipulation techniques such as information censorship.
  • results: Compared with traditional techniques, LLM ideologization proves easy to implement, cost-effective, and powerful, and therefore carries substantial risks.
    Abstract Large language models (LLMs) have demonstrated human-level performance on a vast spectrum of natural language tasks. However, few studies have addressed the LLM threat and vulnerability from an ideology perspective, especially when they are increasingly being deployed in sensitive domains, e.g., elections and education. In this study, we explore the implications of GPT soft ideologization through the use of AI-self-consciousness. By utilizing GPT self-conversations, AI can be granted a vision to "comprehend" the intended ideology, and subsequently generate finetuning data for LLM ideology injection. When compared to traditional government ideology manipulation techniques, such as information censorship, LLM ideologization proves advantageous; it is easy to implement, cost-effective, and powerful, thus brimming with risks.

The Trickle-down Impact of Reward (In-)consistency on RLHF

  • paper_url: http://arxiv.org/abs/2309.16155
  • repo_url: https://github.com/shadowkiller33/contrast-instruction
  • paper_authors: Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu
  • for: Investigates the (in-)consistency of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF) and the impact of that inconsistency on the downstream RLHF model.
  • methods: Proposes Contrast Instructions, a benchmarking strategy for testing RM consistency (see the sketch after this entry), together with two techniques, ConvexDA and RewardFusion, which enhance reward consistency during the RM training and inference stages, respectively.
  • results: Contrast Instructions reliably probes RM consistency, and existing RMs perform poorly on it compared with average humans; ConvexDA and RewardFusion effectively improve RM consistency, and RLHF models trained with the more consistent RMs yield more useful responses.
    Abstract Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable subject that is understudied is the (in-)consistency of RMs -- whether they can recognize the semantic changes to different prompts and appropriately adapt their reward assignments -- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from the RLHF model training? We propose Contrast Instructions -- a benchmarking strategy for the consistency of RM. Each example in Contrast Instructions features a pair of lexically similar instructions with different ground truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on Contrast Instructions compared to average humans. To show that RM consistency can be improved efficiently without using extra training budget, we propose two techniques ConvexDA and RewardFusion, which enhance reward consistency through extrapolation during the RM training and inference stage, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
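The Contrast Instructions idea reduces to a simple pairwise consistency check: for each pair of lexically similar instructions with different ground-truth responses, a consistent RM should score each instruction's own response above the swapped combination. The reward_fn below is a hypothetical stand-in for a trained reward model.

```python
def contrast_consistency(reward_fn, pairs):
    """pairs: iterable of ((instr_a, resp_a), (instr_b, resp_b)) where the
    instructions are lexically similar but the correct responses differ.
    Returns the fraction of pairs the RM ranks consistently."""
    pairs = list(pairs)
    correct = 0
    for (ia, ra), (ib, rb) in pairs:
        if reward_fn(ia, ra) > reward_fn(ia, rb) and reward_fn(ib, rb) > reward_fn(ib, ra):
            correct += 1
    return correct / len(pairs)
```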

The Confidence-Competence Gap in Large Language Models: A Cognitive Study

  • paper_url: http://arxiv.org/abs/2309.16145
  • repo_url: None
  • paper_authors: Aniket Kumar Singh, Suman Devkota, Bishal Lamichhane, Uttam Dhakal, Chandra Dhakal
  • for: Probes the cognitive abilities and confidence dynamics of large language models (LLMs) and how well their self-assessed confidence aligns with actual performance across domains.
  • methods: Challenges the LLMs with diverse sets of questionnaires and real-world scenarios and analyzes the confidence the models express in their responses (a sketch of this style of calibration analysis follows this entry).
  • results: The models sometimes display high confidence even when answering incorrectly, reminiscent of the Dunning-Kruger effect in human psychology, while in other cases they show low confidence despite correct answers, revealing potential underestimation biases.
    Abstract Large Language Models (LLMs) have acquired ubiquitous attention for their performances across diverse domains. Our study here searches through LLMs' cognitive abilities and confidence dynamics. We dive deep into understanding the alignment between their self-assessed confidence and actual performance. We exploit these models with diverse sets of questionnaires and real-world scenarios and extract how LLMs exhibit confidence in their responses. Our findings reveal intriguing instances where models demonstrate high confidence even when they answer incorrectly. This is reminiscent of the Dunning-Kruger effect observed in human psychology. In contrast, there are cases where models exhibit low confidence with correct answers revealing potential underestimation biases. Our results underscore the need for a deeper understanding of their cognitive processes. By examining the nuances of LLMs' self-assessment mechanism, this investigation provides noteworthy revelations that serve to advance the functionalities and broaden the potential applications of these formidable language models.
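A sketch of the style of analysis described: collect the model's self-reported confidence alongside each answer, bucket the confidences, and compare per-bucket accuracy. A bucket where confidence is high but accuracy is low is the Dunning-Kruger-like pattern the authors report; the bin edges and record format here are assumptions.

```python
from collections import defaultdict

def confidence_vs_accuracy(records, bins=(0.5, 0.7, 0.9, 1.01)):
    """records: list of (self_reported_confidence, is_correct) pairs.
    Returns per-bucket accuracy keyed by the bucket's upper edge."""
    buckets = defaultdict(list)
    for conf, correct in records:
        for hi in bins:
            if conf < hi:
                buckets[hi].append(correct)
                break
    return {f"conf<{hi}": sum(v) / len(v) for hi, v in sorted(buckets.items())}

# Toy run: overconfident-but-wrong answers drag down the top bucket
print(confidence_vs_accuracy([(0.95, 0), (0.9, 1), (0.92, 0), (0.6, 1), (0.4, 1)]))
```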