results: Experimental results show that RT-LM reduces average response time and improves throughput while incurring only a small runtime overhead.
Abstract
Recent advancements in language models (LMs) have gained substantial attention for their capability to generate human-like responses. Though exhibiting a promising future for various applications such as conversational AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the range of these uncertainty sources is extensive, complicating the prediction of latency and the effects emanating from such uncertainties. To understand and mitigate the impact of uncertainty on real-time response-demanding systems, we take the first step to comprehend, quantify and optimize these uncertainty-induced latency performance variations in LMs. Specifically, we present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs. RT-LM innovatively quantifies how specific input uncertainties adversely affect latency, often leading to an increased output length. Exploiting these insights, we devise a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. Utilizing this quantification as a latency heuristic, we integrate the uncertainty information into a system-level scheduler which explores several uncertainty-induced optimization opportunities, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading. Quantitative experiments across five state-of-the-art LMs on two hardware platforms demonstrate that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
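As a rough illustration of the uncertainty-aware prioritization idea (not RT-LM's actual implementation), the sketch below schedules requests by a predicted output length; the lambda heuristic is a placeholder for RT-LM's learned correlation between input uncertainty and output length.

```python
import heapq
import itertools

class UncertaintyAwareScheduler:
    """Minimal sketch: serve requests predicted to produce short outputs first
    (shortest-predicted-job-first), using a length estimate as the priority."""

    def __init__(self, predict_output_length):
        self._predict = predict_output_length  # maps prompt -> estimated tokens
        self._queue = []
        self._counter = itertools.count()      # tie-breaker for equal priorities

    def submit(self, prompt):
        est = self._predict(prompt)
        heapq.heappush(self._queue, (est, next(self._counter), prompt))

    def next_request(self):
        est, _, prompt = heapq.heappop(self._queue)
        return prompt, est

# Toy length heuristic standing in for RT-LM's learned uncertainty-to-length model.
sched = UncertaintyAwareScheduler(lambda p: len(p.split()) * 4)
sched.submit("Summarize the history of astronomy in detail.")
sched.submit("What is 2 + 2?")
print(sched.next_request())  # the short-answer query is served first
```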
Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity
results: Our results show that generative large language models (LLMs) outperform encoder-based models on STS tasks requiring world knowledge; on newly collected STS challenge sets in domains such as Health, Politics, and Sports, generative LLMs combined with STS-specific prompting strategies achieve state-of-the-art performance.
Abstract
Amidst the sharp rise in the evaluation of large language models (LLMs) on various tasks, we find that semantic textual similarity (STS) has been under-explored. In this study, we show that STS can be cast as a text generation problem while maintaining strong performance on multiple STS benchmarks. Additionally, we show generative LLMs significantly outperform existing encoder-based STS models when characterizing the semantic similarity between two texts with complex semantic relationships dependent on world knowledge. We validate this claim by evaluating both generative LLMs and existing encoder-based STS models on three newly collected STS challenge sets which require world knowledge in the domains of Health, Politics, and Sports. All newly collected data is sourced from social media content posted after May 2023 to ensure the performance of closed-source models like ChatGPT cannot be credited to memorization. Our results show that, on average, generative LLMs outperform the best encoder-only baselines by 22.3% on STS tasks requiring world knowledge. Our results suggest generative language models with STS-specific prompting strategies achieve state-of-the-art performance in complex, domain-specific STS tasks.
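A minimal sketch of casting STS as text generation; the prompt wording and 0-5 scale below are illustrative assumptions, not the paper's exact STS-specific prompting strategy.

```python
import re

# Hypothetical prompt template; the paper's actual prompts likely differ.
STS_PROMPT = (
    "On a scale from 0 to 5, how similar in meaning are the two sentences?\n"
    "Answer with a single number.\n\n"
    "Sentence 1: {s1}\nSentence 2: {s2}\nScore:"
)

def build_prompt(s1, s2):
    return STS_PROMPT.format(s1=s1, s2=s2)

def parse_score(completion):
    """Extract the first number from the LLM's completion, or None."""
    m = re.search(r"\d+(?:\.\d+)?", completion)
    return float(m.group()) if m else None

print(build_prompt("A man plays guitar.", "Someone is playing an instrument."))
print(parse_score("The similarity score is 4.5"))  # -> 4.5
```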
Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes
results: Over 50 teams registered for the shared task and 5 made final submissions to the test set. The best final F1 scores are 34.41 for Task A, 79.77 for Task B, and 59.82 for Task C.
Abstract
Analyzing memes on the internet has emerged as a crucial endeavor due to the impact this multi-modal form of content wields in shaping online discourse. Memes have become a powerful tool for expressing emotions and sentiments, possibly even spreading hate and misinformation, through humor and sarcasm. In this paper, we present the overview of the Memotion 3 shared task, as part of the DeFactify 2 workshop at AAAI-23. The task released an annotated dataset of Hindi-English code-mixed memes based on their Sentiment (Task A), Emotion (Task B), and Emotion intensity (Task C). Each of these is defined as an individual task and the participants are ranked separately for each task. Over 50 teams registered for the shared task and 5 made final submissions to the test set of the Memotion 3 dataset. CLIP, BERT modifications, and ViT were the most popular models among the participants, along with approaches such as Student-Teacher models, Fusion, and Ensembling. The best final F1 scores are 34.41 for Task A, 79.77 for Task B, and 59.82 for Task C.
Leveraging Large Language Models and Weak Supervision for Social Media data annotation: an evaluation using COVID-19 self-reported vaccination tweets
results: The study finds that GPT-4 (March 23 version) performs comparably to human annotators at automatically labeling COVID-19 vaccine-related tweets, and can annotate quickly and efficiently.
Abstract
The COVID-19 pandemic has presented significant challenges to the healthcare industry and society as a whole. With the rapid development of COVID-19 vaccines, social media platforms have become a popular medium for discussions on vaccine-related topics. Identifying vaccine-related tweets and analyzing them can provide valuable insights for public health researchers and policymakers. However, manual annotation of a large number of tweets is time-consuming and expensive. In this study, we evaluate the usage of Large Language Models, in this case GPT-4 (March 23 version), and weak supervision, to identify COVID-19 vaccine-related tweets, with the purpose of comparing performance against human annotators. We leveraged a manually curated gold-standard dataset and used GPT-4 to provide labels without any additional fine-tuning or instructing, in a single-shot mode (no additional prompting).
Leveraging Large Language Models for Automated Dialogue Analysis
paper_authors: Sarah E. Finch, Ellie S. Paek, Jinho D. Choi
For: The paper aims to assess the ability of a state-of-the-art large language model (LLM) to detect nine categories of undesirable behaviors in real human-bot dialogues.
Methods: The paper uses a state-of-the-art LLM, ChatGPT-3.5, to perform dialogue behavior detection and compares its performance with specialized detection models.
Results: The paper finds that neither ChatGPT nor specialized models have yet achieved satisfactory results for this task, falling short of human performance. However, ChatGPT shows promising potential and often outperforms specialized detection models.
Abstract
Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging, as it draws on a breadth of general knowledge and understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, the behavior coverage is still incomplete and there is a lack of testing on real-world human-bot interactions. This paper investigates the ability of a state-of-the-art large language model (LLM), ChatGPT-3.5, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether ChatGPT can match specialized models and approximate human performance, thereby reducing the cost of behavior detection tasks. Our findings reveal that neither specialized models nor ChatGPT have yet achieved satisfactory results for this task, falling short of human performance. Nevertheless, ChatGPT shows promising potential and often outperforms specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of ChatGPT, offering guidance for future research to enhance LLM capabilities.
Widely Interpretable Semantic Representation: Frameless Meaning Representation for Broader Applicability
paper_authors: Lydia Feng, Gregor Williamson, Han He, Jinho D. Choi
for: This paper presents a novel semantic representation, WISeR, to overcome challenges in Abstract Meaning Representation (AMR).
methods: The paper examines the numbered arguments of predicates in AMR and converts them to thematic roles, improving the inter-annotator agreement for beginner and experienced annotators.
results: The WISeR model exhibits higher accuracy than its AMR counterpart across the board, demonstrating that WISeR is easier for parsers to learn.
Abstract
This paper presents a novel semantic representation, WISeR, that overcomes challenges for Abstract Meaning Representation (AMR). Despite its strengths, AMR is not easily applied to languages or domains without predefined semantic frames, and its use of numbered arguments results in semantic role labels, which are not directly interpretable and are semantically overloaded for parsers. We examine the numbered arguments of predicates in AMR and convert them to thematic roles that do not require reference to semantic frames. We create a new corpus of 1K English dialogue sentences annotated in both WISeR and AMR. WISeR shows stronger inter-annotator agreement for beginner and experienced annotators, with beginners becoming proficient in WISeR annotation more quickly. Finally, we train a state-of-the-art parser on the AMR 3.0 corpus and a WISeR corpus converted from AMR 3.0. The parser is evaluated on these corpora and our dialogue corpus. The WISeR model exhibits higher accuracy than its AMR counterpart across the board, demonstrating that WISeR is easier for parsers to learn.
Recovering from Privacy-Preserving Masking with Large Language Models
results: Experimental results show that models trained on corpora with identifying tokens replaced achieve performance on downstream NLP tasks comparable to models trained on the original data without privacy-preserving masking.
Abstract
Model adaptation is crucial to handle the discrepancy between proxy training data and the actual user data received. To effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (NLP) models can be directly trained using such in-domain data. However, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has been recently explored. In this work, we leverage large language models (LLMs) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets for the comparison of these methods. Experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.
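A minimal sketch of the substitution idea using a masked LM to propose replacements for a privacy-masked token; the Hugging Face fill-mask pipeline, model choice, and example are illustrative and not the paper's exact setup.

```python
from transformers import pipeline

# A masked LM proposes substitutes for a token that was masked for privacy.
unmasker = pipeline("fill-mask", model="distilroberta-base")

masked = "My name is <mask> and I work at the hospital."
for cand in unmasker(masked, top_k=3):
    print(cand["token_str"].strip(), round(cand["score"], 3))
```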
results: The paper shows that conditioning on the cited text span (CTS) improves the grounding of automatically generated related-work citations and helps avoid non-factual hallucinations.
Abstract
Automatic related work generation must ground its outputs in the content of the cited papers to avoid non-factual hallucinations, but due to the length of scientific documents, existing abstractive approaches have conditioned only on the cited paper abstracts. We demonstrate that the abstract is not always the most appropriate input for citation generation and that models trained in this way learn to hallucinate. We propose to condition instead on the cited text span (CTS) as an alternative to the abstract. Because manual CTS annotation is extremely time- and labor-intensive, we experiment with automatic, ROUGE-based labeling of candidate CTS sentences, achieving sufficiently strong performance to substitute for expensive human annotations, and we propose a human-in-the-loop, keyword-based CTS retrieval approach that makes generating citation texts grounded in the full text of cited papers both promising and practical.
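A simplified sketch of the ROUGE-based candidate labeling: rank sentences of the cited paper by unigram-overlap F1 against the citing sentence. The unigram F1 below stands in for a full ROUGE implementation and is not the paper's exact scoring.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F1, a simple stand-in for ROUGE-1."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def label_cts_candidates(cited_paper_sentences, citing_sentence, top_k=3):
    """Rank sentences of the cited paper by similarity to the citing sentence."""
    scored = [(rouge1_f(s, citing_sentence), s) for s in cited_paper_sentences]
    return sorted(scored, reverse=True)[:top_k]

paper = ["We train a parser on 1K sentences.", "Our model reduces error by 12%."]
print(label_cts_candidates(paper, "Their parser cuts error by 12% on 1K sentences."))
```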
Learning to Predict Concept Ordering for Common Sense Generation
results: The study finds that the BART-large model consistently performs best when fine-tuned with the concept ordering found in the CommonGen training data, and that smaller LMs can outperform larger GPT3-based LLMs on this task. Moreover, the concept orderings produced by human annotators yield the best sentence generations regardless of the LM used, outperforming a probabilistic concept-ordering baseline.
Abstract
Prior work has shown that the ordering in which concepts are shown to a commonsense generator plays an important role, affecting the quality of the generated sentence. However, it remains a challenge to determine the optimal ordering of a given set of concepts such that a natural sentence covering all the concepts could be generated from a pretrained generator. To understand the relationship between the ordering of the input concepts and the quality of the generated sentences, we conduct a systematic study considering multiple language models (LMs) and concept ordering strategies. We find that the BART-large model consistently outperforms all other LMs considered in this study when fine-tuned using the ordering of concepts as they appear in the CommonGen training data, as measured using multiple evaluation metrics. Moreover, the larger GPT3-based large language model (LLM) variants do not necessarily outperform much smaller LMs on this task, even when fine-tuned on task-specific training data. Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for the generation, outperforming a probabilistic concept ordering baseline.
results: Experimental results show that the method improves the reasoning capability of LLMs and integrates easily with other language models, prompting methods, and ensemble techniques.
Abstract
Reasoning presents a significant and challenging issue for Large Language Models (LLMs). The predominant focus of research has revolved around developing diverse prompting strategies to guide and structure the reasoning processes of LLMs. However, these approaches based on decoder-only causal language models often process the input question in a single forward pass, potentially missing the rich, back-and-forth interactions inherent in human reasoning. Scant attention has been paid to a critical dimension, i.e., the input question itself embedded within the prompts. In response, we introduce a deceptively simple yet highly effective prompting strategy, termed question "re-reading". Drawing inspiration from human learning and problem-solving, re-reading entails revisiting the question information embedded within input prompts. This approach aligns seamlessly with the cognitive principle of reinforcement, enabling LLMs to extract deeper insights, identify intricate patterns, establish more nuanced connections, and ultimately enhance their reasoning capabilities across various tasks. Experiments conducted on a series of reasoning benchmarks serve to underscore the effectiveness and generality of our method. Moreover, our findings demonstrate that our approach seamlessly integrates with various language models, thought-eliciting prompting methods, and ensemble techniques, further underscoring its versatility and compatibility in the realm of LLMs.
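The strategy itself is easy to sketch; the exact wording below is an assumption rather than the paper's verbatim template.

```python
def re_reading_prompt(question):
    """Present the question, then repeat it before eliciting the answer."""
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        f"A: Let's think step by step."
    )

print(re_reading_prompt("A farmer has 17 sheep; all but 9 run away. How many are left?"))
```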
The first step is the hardest: Pitfalls of Representing and Tokenizing Temporal Data for Large Language Models
results: The paper shows that many language models tokenize temporal data incorrectly and highlights potential solutions, such as prompt tuning with lightweight embedding layers and multimodal adapters.
Abstract
Large Language Models (LLMs) have demonstrated remarkable generalization across diverse tasks, leading individuals to increasingly use them as personal assistants and universal computing engines. Nevertheless, a notable obstacle emerges when feeding numerical/temporal data into these models, such as data sourced from wearables or electronic health records. LLMs employ tokenizers in their input that break down text into smaller units. However, tokenizers are not designed to represent numerical values and might struggle to understand repetitive patterns and context, treating consecutive values as separate tokens and disregarding their temporal relationships. Here, we discuss recent works that employ LLMs for human-centric tasks such as mobile health sensing and present a case study showing that popular LLMs tokenize temporal data incorrectly. To address that, we highlight potential solutions such as prompt tuning with lightweight embedding layers as well as multimodal adapters, which can help bridge this "modality gap". While the capability of language models to generalize to other modalities with minimal or no finetuning is exciting, this paper underscores the fact that their outputs cannot be meaningful if they stumble over input nuances.
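The tokenization issue is easy to reproduce; a minimal sketch with GPT-2's BPE tokenizer (an illustrative model choice) shows nearly identical sensor readings splitting into very different token sequences.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# Consecutive heart-rate-like readings; note the inconsistent digit grouping.
for value in ["61.2", "61.3", "161.2", "0.612"]:
    print(f"{value!r} -> {tok.tokenize(value)}")
```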
Human Action Co-occurrence in Lifestyle Vlogs using Graph Link Prediction
results: The study finds that graphs are particularly well suited to capturing relations between human actions, and that the learned graph representations are effective for the task and capture novel and relevant information across different data domains.
Abstract
We introduce the task of automatic human action co-occurrence identification, i.e., determine whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at https://github.com/MichiganNLP/vlog_action_co-occurrence.
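As a toy illustration of link prediction on an action co-occurrence graph (the paper's models additionally exploit visual and textual features), a purely structural heuristic such as the Jaccard coefficient can already score unseen pairs.

```python
import networkx as nx

# Tiny stand-in for the ACE graph of co-occurring actions.
G = nx.Graph()
G.add_edges_from([
    ("chop vegetables", "boil water"),
    ("chop vegetables", "stir pot"),
    ("boil water", "stir pot"),
    ("tie shoelaces", "stretch legs"),
])

# Score a candidate (unobserved) action pair by neighborhood overlap.
for u, v, score in nx.jaccard_coefficient(G, [("stir pot", "tie shoelaces")]):
    print(f"P({u!r} co-occurs with {v!r}) ~ {score:.2f}")
```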
results: Human evaluation and data analysis show that the proposed measure correlates well with human judgment and can be used to evaluate the performance of different DTMs and to guide future research.
Abstract
There is a lack of quantitative measures to evaluate the progression of topics through time in dynamic topic models (DTMs). Filling this gap, we propose a novel evaluation measure for DTMs that analyzes the changes in the quality of each topic over time. Additionally, we propose an extension combining topic quality with the model's temporal consistency. We demonstrate the utility of the proposed measure by applying it to synthetic data and data from existing DTMs. We also conducted a human evaluation, which indicates that the proposed measure correlates well with human judgment. Our findings may help in identifying changing topics, evaluating different DTMs, and guiding future research in this area.
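The paper's exact measure is more involved, but the underlying idea, tracking a per-topic quality score across time slices, can be sketched as follows; the NPMI coherence proxy and its edge-case handling are assumptions, not the paper's definition.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Mean NPMI over word pairs; docs is a list of token sets for one time slice."""
    n = len(docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = p(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # never co-occur: minimal coherence
        elif p12 == 1.0:
            scores.append(1.0)    # always co-occur: maximal coherence
        else:
            scores.append(math.log(p12 / (p(w1) * p(w2))) / -math.log(p12))
    return sum(scores) / len(scores)

def coherence_trajectory(top_words_by_slice, docs_by_slice):
    """Per-slice coherence of one topic and its slice-to-slice change."""
    coh = [npmi_coherence(w, d) for w, d in zip(top_words_by_slice, docs_by_slice)]
    return coh, [b - a for a, b in zip(coh, coh[1:])]

slices = [[{"economy", "growth"}, {"economy", "tax"}],   # toy docs at time t0
          [{"economy", "music"}, {"growth", "art"}]]     # toy docs at time t1
print(coherence_trajectory([["economy", "growth"]] * 2, slices))
```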
Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains
for: This paper aims to provide an extensive investigation of various approaches for quantifying Fragmentation in news recommendations, with the goal of more accurately measuring the degree to which information streams in news recommendations are fragmented.
methods: The paper uses Natural Language Processing (NLP) techniques, specifically agglomerative hierarchical clustering coupled with SentenceBERT text representation, to identify distinct news events, stories, or timelines and measure Fragmentation.
results: The paper finds that the proposed approach of agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations, and provides valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.
Abstract
News recommender systems play an increasingly influential role in shaping information access within democratic societies. However, tailoring recommendations to users' specific interests can result in the divergence of information streams. Fragmented access to information poses challenges to the integrity of the public sphere, thereby influencing democracy and public discourse. The Fragmentation metric quantifies the degree of fragmentation of information streams in news recommendations. Accurate measurement of this metric requires the application of Natural Language Processing (NLP) to identify distinct news events, stories, or timelines. This paper presents an extensive investigation of various approaches for quantifying Fragmentation in news recommendations. These approaches are evaluated both intrinsically, by measuring performance on news story clustering, and extrinsically, by assessing the Fragmentation scores of different simulated news recommender scenarios. Our findings demonstrate that agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations. Additionally, the analysis of simulated scenarios yields valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.
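A minimal sketch of the best-performing setup reported above, SentenceBERT embeddings plus agglomerative clustering into news story chains; the model name and distance threshold are illustrative choices, and the `metric` argument requires a recent scikit-learn.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

headlines = [
    "Parliament passes new climate bill",
    "Climate legislation approved after long debate",
    "Local team wins championship final",
]

# SentenceBERT embeddings + agglomerative clustering over cosine distances.
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(headlines, normalize_embeddings=True)
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, metric="cosine", linkage="average"
)
story_ids = clusterer.fit_predict(emb)
print(story_ids)  # e.g. [0, 0, 1]: the two climate headlines form one story chain
```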
AKEM: Aligning Knowledge Base to Queries with Ensemble Model for Entity Recognition and Linking
results: The method is computationally efficient and achieves an F1 score of 0.535.
Abstract
This paper presents a novel approach to address the Entity Recognition and Linking Challenge at NLPCC 2015. The task involves extracting named entity mentions from short search queries and linking them to entities within a reference Chinese knowledge base. To tackle this problem, we first expand the existing knowledge base and utilize external knowledge to identify candidate entities, thereby improving the recall rate. Next, we extract features from the candidate entities and utilize Support Vector Regression and Multiple Additive Regression Trees as scoring functions to filter the results. Additionally, we apply rules to further refine the results and enhance precision. Our method is computationally efficient and achieves an F1 score of 0.535.
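A toy sketch of the candidate-scoring stage: two regressors, SVR and gradient-boosted trees (a common MART implementation), score candidate entities from hand-crafted features. The features, targets, score combination, and threshold here are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy candidate-entity features (e.g. popularity, string similarity, context
# overlap) with relevance targets; stand-ins for the paper's actual features.
X = rng.random((200, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2]

svr = SVR().fit(X, y)
mart = GradientBoostingRegressor().fit(X, y)  # MART ~ gradient-boosted trees

def score_candidates(features, threshold=0.5):
    """Average the two scoring functions, then keep candidates above a threshold
    (the averaging and the threshold are illustrative assumptions)."""
    scores = (svr.predict(features) + mart.predict(features)) / 2
    return [i for i, s in enumerate(scores) if s > threshold]

print(score_candidates(rng.random((5, 3))))
```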
Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code Switching Analysis
results: Three teams took part in the evaluation phase, obtaining generally good results for Task 1 and more mixed results for Tasks 2 and 3.
Abstract
We present the first shared task for detecting and analyzing code-switching in Guarani and Spanish, GUA-SPA at IberLEF 2023. The challenge consisted of three tasks: identifying the language of a token, NER, and a novel task of classifying the way a Spanish span is used in the code-switched context. We annotated a corpus of 1500 texts extracted from news articles and tweets, around 25 thousand tokens, with the information for the tasks. Three teams took part in the evaluation phase, obtaining in general good results for Task 1, and more mixed results for Tasks 2 and 3.
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
results: The study finds that around half of the prompts in existing safe-prompt benchmarks, originally considered "safe", can be manipulated to bypass deployed safety mechanisms, including concept removal, negative prompts, and safety guidance. These findings indicate that, without comprehensive testing, evaluations can create a false sense of safety while text-to-image models may still produce unsafe or copyright-infringing images.
Abstract
Text-to-image diffusion models, e.g. Stable Diffusion (SD), lately have shown remarkable ability in high-quality content generation, and have become one of the representatives for the recent wave of transformative AI. Nevertheless, such advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. Particularly, our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, the evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.
Annotating Data for Fine-Tuning a Neural Ranker? Current Active Learning Strategies are not Better than Random Selection
results: The study observes great variability in effectiveness when fine-tuning PLM rankers on different randomly selected subsets of the training data, which suggests that actively selecting training subsets could yield gains at a reduced annotation budget. However, the paper finds that existing Active Learning (AL) strategies do not significantly outperform random selection when fine-tuning PLM rankers, and that their gains come at the cost of more assessments and thus higher annotation costs.
Abstract
Search methods based on Pretrained Language Models (PLM) have demonstrated great effectiveness gains compared to statistical and early neural ranking models. However, fine-tuning PLM-based rankers requires a great amount of annotated training data. Annotating data involves a large manual effort and thus is expensive, especially in domain-specific tasks. In this paper we investigate fine-tuning PLM-based rankers under limited training data and budget. We investigate two scenarios: fine-tuning a ranker from scratch, and domain adaptation starting with a ranker already fine-tuned on general data, and continuing fine-tuning on a target dataset. We observe a great variability in effectiveness when fine-tuning on different randomly selected subsets of training data. This suggests that it is possible to achieve effectiveness gains by actively selecting a subset of the training data that has the most positive effect on the rankers. This way, it would be possible to fine-tune effective PLM rankers at a reduced annotation budget. To investigate this, we adapt existing Active Learning (AL) strategies to the task of fine-tuning PLM rankers and investigate their effectiveness, also considering annotation and computational costs. Our extensive analysis shows that AL strategies do not significantly outperform random selection of training subsets in terms of effectiveness. We further find that gains provided by AL strategies come at the expense of more assessments (thus higher annotation costs) and AL strategies underperform random selection when comparing effectiveness given a fixed annotation cost. Our results highlight that "optimal" subsets of training data that provide high effectiveness at low annotation cost do exist, but current mainstream AL strategies applied to PLM rankers are not capable of identifying them.
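For reference, a generic uncertainty-sampling loop of the kind evaluated in the paper; the stubbed `label_fn`, `train_fn`, and `predict_fn` are placeholders for the human annotator and the PLM ranker, not the paper's implementation.

```python
import numpy as np

def uncertainty_sampling(scores):
    """Pick the pool item whose predicted relevance is closest to 0.5."""
    return int(np.argmin(np.abs(np.asarray(scores) - 0.5)))

def active_learning_loop(pool, label_fn, train_fn, predict_fn, budget=100):
    """Generic AL loop compared against random selection in the paper.
    `pool` holds unlabeled (query, document) pairs."""
    labeled = []
    for _ in range(min(budget, len(pool))):
        scores = [predict_fn(x) for x in pool]
        x = pool.pop(uncertainty_sampling(scores))
        labeled.append((x, label_fn(x)))
        train_fn(labeled)  # fine-tune the ranker on all labels collected so far
    return labeled
```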
AstroLLaMA: Towards Specialized Foundation Models in Astronomy
paper_authors: Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciucă, Charlie O’Neill, Ze-Chang Sun, Maja Jabłońska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, Josh Peek, Kartheik Iyer, Tomasz Różański, Pranav Khetarpal, Sharaf Zaman, David Brodrick, Sergio J. Rodríguez Méndez, Thang Bui, Alyssa Goodman, Alberto Accomazzi, Jill Naiman, Jesse Cranney, Kevin Schawinski, UniverseTBD
for: bridging the gap between large language models and highly specialized domains like scholarly astronomy
methods: fine-tuning a 7-billion-parameter model from LLaMA-2 using over 300,000 astronomy abstracts from arXiv, optimized for traditional causal language modeling
results: achieving a 30% lower perplexity than Llama-2, generating more insightful and scientifically relevant text completions and embedding extraction than state-of-the-art foundation models despite having significantly fewer parameters
Abstract
Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than Llama-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embedding extraction than state-of-the-art foundation models despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
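The headline perplexity comparison reduces to exponentiating the mean token-level cross-entropy; a minimal sketch with GPT-2 standing in for LLaMA-2/AstroLLaMA (which are gated and far larger).

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())

print(perplexity("The stellar initial mass function constrains galaxy evolution."))
```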
Characterizing Latent Perspectives of Media Houses Towards Public Figures
results: The results show that the approach generates person-entity characterizations that are encouraging when compared against actual characterizations from the corpus.
Abstract
Media houses reporting on public figures often come with their own biases stemming from their respective worldviews. A characterization of these underlying patterns helps us in better understanding and interpreting news stories. For this, we need diverse or subjective summarizations, which may not be amenable to classification into predefined class labels. This work proposes a zero-shot approach for non-extractive or generative characterizations of person entities from a corpus using GPT-2. We use well-articulated articles from several well-known news media houses as a corpus to build a sound argument for this approach. First, we fine-tune a GPT-2 pre-trained language model with a corpus where specific person entities are characterized. Second, we further fine-tune this with demonstrations of person entity characterizations, created from a corpus of programmatically constructed characterizations. This twice fine-tuned model is primed with manual prompts consisting of entity names that were not previously encountered in the second fine-tuning, to generate a simple sentence about the entity. The results were encouraging when compared against actual characterizations from the corpus.
results: Experiments on two datasets yield compelling results. On the Chinese taxonomy dataset, the method improves accuracy by 8.75% over the original method and also outperforms ChatGPT.
Abstract
The taxonomy expansion task is essential for organizing the ever-increasing volume of new concepts into existing taxonomies. Most existing methods focus exclusively on using textual semantics, leading to an inability to generalize to unseen terms and to the "Prototypical Hypernym Problem." In this paper, we propose Visual Taxonomy Expansion (VTE), introducing visual features into the taxonomy expansion task. We propose a textual hypernymy learning task and a visual prototype learning task to cluster textual and visual semantics. In addition to the tasks on the respective modalities, we introduce a hyper-proto constraint that integrates textual and visual semantics to produce fine-grained visual semantics. Our method is evaluated on two datasets, where we obtain compelling results. Specifically, on the Chinese taxonomy dataset, our method significantly improves accuracy by 8.75%. Additionally, our approach performs better than ChatGPT on the Chinese taxonomy dataset.
Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies
results: On two classification problems (hate speech detection and product reviews), the intermediate-training (IT) cross-lingual strategy performs better on the target language. The study also finds that, across multiple cross-lingual transfers, the cross-lingual validation (CLV) strategy retains more knowledge of the base language (English) than the IT strategy.
Abstract
Cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual transfer strategies, we compare the intermediate-training (IT) that uses each language sequentially and cross-lingual validation (CLV) that uses a target language already in the validation phase of fine-tuning. We assess the success of transfer and the extent of catastrophic forgetting in a source language due to cross-lingual transfer, i.e., how much previously acquired knowledge is lost when we learn new information in a different language. The results on two different classification problems, hate speech detection and product reviews, each containing datasets in several languages, show that the IT cross-lingual strategy outperforms CLV for the target language. Our findings indicate that, in the majority of cases, the CLV strategy demonstrates superior retention of knowledge in the base language (English) compared to the IT strategy, when evaluating catastrophic forgetting in multiple cross-lingual transfers.
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
results: Preliminary experiments using GPT-4 as a yardstick show that current LLMs fall short in various aspects of linguistic capability, cultural representation, and sensitivity for these Southeast Asian languages.
Abstract
The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASA
RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair
paper_authors: Weishi Wang, Yue Wang, Shafiq Joty, Steven C. H. Hoi
for: The goal of this work is to improve automatic program repair (APR), reducing developers' manual debugging effort and improving software reliability.
methods: The work automates the program repair process in a data-driven manner with deep learning (DL), and additionally uses a hybrid patch retriever that performs both lexical and semantic matching in a language-agnostic way.
results: Experimental results show that RAP-Gen significantly outperforms previous state-of-the-art methods on three benchmarks, e.g., repairing 15 more bugs among 818 Defects4J bugs.
Abstract
Automatic program repair (APR) is crucial to reduce manual debugging efforts for developers and improve software reliability. While conventional search-based techniques typically rely on heuristic rules or a redundancy assumption to mine fix patterns, recent years have witnessed the surge of deep learning (DL) based approaches to automate the program repair process in a data-driven manner. However, their performance is often limited by a fixed set of parameters to model the highly complex search space of APR. To ease such burden on the parametric models, in this work, we propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen) by explicitly leveraging relevant fix patterns retrieved from a codebase of previous bug-fix pairs. Specifically, we build a hybrid patch retriever to account for both lexical and semantic matching based on the raw source code in a language-agnostic manner, which does not rely on any code-specific features. In addition, we adapt a code-aware language model CodeT5 as our foundation model to facilitate both patch retrieval and generation tasks in a unified manner. We adopt a stage-wise approach where the patch retriever first retrieves a relevant external bug-fix pair to augment the buggy input for the CodeT5 patch generator, which synthesizes a ranked list of repair patch candidates. Notably, RAP-Gen is a generic APR framework that can flexibly integrate different patch retrievers and generators to repair various types of bugs. We thoroughly evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java, where the bug localization information may or may not be provided. Experimental results show that RAP-Gen significantly outperforms previous state-of-the-art approaches on all benchmarks, e.g., repairing 15 more bugs on 818 Defects4J bugs.
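As a rough sketch of the hybrid retriever's idea, the snippet below mixes a lexical score (token Jaccard) with a semantic score, using TF-IDF cosine as a cheap stand-in for the paper's dense CodeT5 representations; the equal weighting is an assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_retrieve(query, corpus, alpha=0.5, top_k=1):
    """Mix a lexical score with a semantic score to rank past bug-fix pairs."""
    q_tokens = set(query.split())
    lexical = np.array([
        len(q_tokens & set(d.split())) / max(1, len(q_tokens | set(d.split())))
        for d in corpus
    ])
    vec = TfidfVectorizer().fit(corpus + [query])
    D, q = vec.transform(corpus), vec.transform([query])
    semantic = (D @ q.T).toarray().ravel()  # cosine: TF-IDF rows are L2-normalized
    scores = alpha * lexical + (1 - alpha) * semantic
    return [corpus[i] for i in np.argsort(-scores)[:top_k]]

buggy = "if ( x = null ) return ;"
fixes_db = ["if ( x == null ) return ;", "for ( int i = 0 ; i < n ; i ++ )"]
print(hybrid_retrieve(buggy, fixes_db))
```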
How does representation impact in-context learning: A exploration on a synthetic task
paper_authors: Jingwen Fu, Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng
for: investigate the mechanism of in-context learning in Transformer
methods: construct a novel synthetic task, use two probes to evaluate in-weights and in-context components
results: demonstrate the entanglement between in-context learning and representation learning, and the importance of the in-weights component for in-context learning
Abstract
In-context learning, i.e., learning from in-context samples, is an impressive ability of Transformers. However, the mechanism driving in-context learning is not yet fully understood. In this study, we investigate it from the underexplored perspective of representation learning. The representation is more complex in the in-context learning scenario, where it can be impacted by both model weights and in-context samples. We refer to these two conceptual aspects of representation as the in-weights component and the in-context component, respectively. To study how the two components affect in-context learning capabilities, we construct a novel synthetic task, making it possible to devise two probes, an in-weights probe and an in-context probe, to evaluate the two components respectively. We demonstrate that the goodness of the in-context component is highly related to in-context learning performance, which indicates the entanglement between in-context learning and representation learning. Furthermore, we find that a good in-weights component can actually benefit the learning of the in-context component, indicating that in-weights learning should be the foundation of in-context learning. To further understand the in-context learning mechanism and the importance of the in-weights component, we prove by construction that a simple Transformer, which uses a pattern-matching and copy-paste mechanism to perform in-context learning, can match the in-context learning performance of a more complex, best-tuned Transformer under the perfect in-weights component assumption. In short, these discoveries from the representation learning perspective shed light on new approaches to improve the in-context capacity.
Narrowing the Gap between Supervised and Unsupervised Sentence Representation Learning with Large Language Model
paper_authors: Mingxin Li, Richong Zhang, Zhijie Nie, Yongyi Mao
for: The goal of this work is to explain the performance gap between supervised and unsupervised Contrastive learning of Sentence Embeddings (CSE) observed during training, and to narrow that gap.
methods: The work uses empirical experiments and a metric called Fitting Difficulty Increment (FDI) to explain and address the performance gap.
results: The study finds that the performance gap stems mainly from a difference in fitting difficulty between the training and evaluation datasets, and proposes a method based on LLM-generated data to narrow the gap.
Abstract
Sentence Representation Learning (SRL) is a fundamental task in Natural Language Processing (NLP), with Contrastive learning of Sentence Embeddings (CSE) as the mainstream technique due to its superior performance. An intriguing phenomenon in CSE is the significant performance gap between supervised and unsupervised methods, even when their sentence encoder and loss function are the same. Previous works attribute this performance gap to differences in two representation properties (alignment and uniformity). However, alignment and uniformity only measure the results, which means they cannot answer "What happens during the training process that leads to the performance gap?" and "How can the performance gap be narrowed?". In this paper, we conduct empirical experiments to answer these "What" and "How" questions. We first answer the "What" question by thoroughly comparing the behavior of supervised and unsupervised CSE during their respective training processes. From the comparison, We observe a significant difference in fitting difficulty. Thus, we introduce a metric, called Fitting Difficulty Increment (FDI), to measure the fitting difficulty gap between the evaluation dataset and the held-out training dataset, and use the metric to answer the "What" question. Then, based on the insights gained from the "What" question, we tackle the "How" question by increasing the fitting difficulty of the training dataset. We achieve this by leveraging the In-Context Learning (ICL) capability of the Large Language Model (LLM) to generate data that simulates complex patterns. By utilizing the hierarchical patterns in the LLM-generated data, we effectively narrow the gap between supervised and unsupervised CSE.
Content Reduction, Surprisal and Information Density Estimation for Long Documents
results: The study reveals systematic differences in the information density of long documents across domains, and the proposed approach performs well on automated medical coding from long clinical notes.
Abstract
Many computational linguistic methods have been proposed to study the information content of languages. We consider two interesting research questions: 1) how is information distributed over long documents, and 2) how does content reduction, such as token selection and text summarization, affect the information density in long documents. We present four criteria for information density estimation for long documents, including surprisal, entropy, uniform information density, and lexical density. Among those criteria, the first three adopt the measures from information theory. We propose an attention-based word selection method for clinical notes and study machine summarization for multiple-domain documents. Our findings reveal the systematic difference in information density of long text in various domains. Empirical results on automated medical coding from long clinical notes show the effectiveness of the attention-based word selection method.
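Of the four criteria, surprisal is the most direct to compute: the surprisal of a token t is -log p(t | prefix) under a language model. A minimal sketch with GPT-2 (an illustrative model choice):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text):
    """Surprisal (in nats) of each token given its prefix."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # logits at position i-1 predict token i
    s = -logprobs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(1)
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:]), s.tolist()))

for token, s in token_surprisals("The patient was prescribed antibiotics."):
    print(f"{token!r}: {s:.2f}")
```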
Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults
results: The study shows that, with more efficient data preprocessing, the Word Error Rate (WER) on the MyST test set can be reduced to 9.11% with Whisper-Small and 8.61% with Whisper-Medium, and that this improvement generalizes to unseen datasets. The study also highlights important remaining challenges for children's ASR.
Abstract
Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent study investigated leveraging the My Science Tutor (MyST) children's speech corpus to enhance Whisper's performance in recognizing children's speech, and was able to demonstrate some improvement on a limited test set. This paper builds on these findings by enhancing the utility of the MyST dataset through more efficient data preprocessing. We reduce the Word Error Rate (WER) on the MyST test set from 13.93% to 9.11% with Whisper-Small and from 13.23% to 8.61% with Whisper-Medium, and show that this improvement can be generalized to unseen datasets. We also highlight important challenges towards improving children's ASR performance. The results showcase the viable and efficient integration of Whisper for effective children's speech recognition.
Improving Robustness of Neural Inverse Text Normalization via Data-Augmentation, Semi-Supervised Learning, and Post-Aligning Method
methods: The paper proposes a direct training approach that uses ASR-generated spoken or written text, with pairs augmented through ASR linguistic-context emulation and through a semi-supervised learning method enhanced by a large language model, respectively. The paper also introduces a post-aligning method to manage unpredictable errors and improve the reliability of ITN.
results: Experiments show that the proposed methods remarkably improve ITN performance in various ASR scenarios.
Abstract
Inverse text normalization (ITN) is crucial for converting spoken-form into written-form, especially in the context of automatic speech recognition (ASR). While most downstream tasks of ASR rely on written-form, ASR systems often output spoken-form, highlighting the necessity for robust ITN in product-level ASR-based applications. Although neural ITN methods have shown promise, they still encounter performance challenges, particularly when dealing with ASR-generated spoken text. These challenges arise from the out-of-domain problem between training data and ASR-generated text. To address this, we propose a direct training approach that utilizes ASR-generated written or spoken text, with pairs augmented through ASR linguistic context emulation and a semi-supervised learning method enhanced by a large language model, respectively. Additionally, we introduce a post-aligning method to manage unpredictable errors, thereby enhancing the reliability of ITN. Our experiments show that our proposed methods remarkably improved ITN performance in various ASR scenarios.
Performance of ChatGPT-3.5 and GPT-4 on the United States Medical Licensing Examination With and Without Distractions
results: The results show that ChatGPT-3.5's accuracy dropped when small talk was added: from 72.1% to 68.9% on multiple-choice questions and from 61.5% to 44.3% on open-ended questions. In contrast, ChatGPT-4 answered both question types more accurately than version 3.5 and was unaffected by the small-talk information.
Abstract
As Large Language Models (LLMs) are predictive models that build their responses from the words in the prompt, there is a risk that small talk and irrelevant information may alter the response and the suggestions given. Therefore, this study investigates the impact of medical data mixed with small talk on the accuracy of medical advice provided by ChatGPT. USMLE Step 3 questions were used as a model for relevant medical data, in both multiple-choice and open-ended formats. We gathered small-talk sentences from human participants on the Mechanical Turk platform. Both sets of USMLE questions were arranged so that each sentence from the original question was followed by a small-talk sentence. ChatGPT 3.5 and 4 were asked to answer both sets of questions with and without the small-talk sentences. A board-certified physician analyzed ChatGPT's answers and compared them to the formally correct answers. The analysis demonstrates that ChatGPT-3.5's ability to answer correctly was impaired when small talk was added to the medical data, for both multiple-choice questions (72.1% vs. 68.9%) and open questions (61.5% vs. 44.3%; p=0.01). In contrast, small-talk phrases did not impair ChatGPT-4's ability on either question type (83.6% and 66.2%, respectively). According to these results, ChatGPT-4 appears more accurate than the earlier 3.5 version, and small talk does not impair its capability to provide medical recommendations. Our results are an important first step toward understanding the potential and limitations of utilizing ChatGPT and other LLMs for physician-patient interactions, which include casual conversations.
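The prompt-construction pattern described above can be sketched as follows; the sentences are invented placeholders, not the study's questions or the Mechanical Turk small talk.

```python
# Minimal sketch: interleave each sentence of a medical question with a
# small-talk sentence, as in the study's perturbed condition.
import itertools

question_sents = [
    "A 45-year-old man presents with crushing chest pain.",
    "His ECG shows ST elevation in leads II, III, and aVF.",
]
small_talk = [
    "By the way, the weather has been lovely lately.",
    "My neighbor just adopted the cutest puppy.",
]

mixed = " ".join(
    s for pair in itertools.zip_longest(question_sents, small_talk, fillvalue="")
    for s in pair if s
)
print(mixed)
```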
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
paper_authors: Maximilian Li, Xander Davies, Max Nadeau
For: reducing toxic behavior in GPT-2 language generation
Methods: use a small dataset of inputs where the model misbehaves to identify important causal pathways between model components, then ablate those pathways to disable the circuit responsible for the bad behavior
Results: ablating just 12 causal pathways substantially reduces toxic language generation with almost no impact on performance on other inputs
Abstract
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
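As a rough illustration of targeted ablation, the sketch below disables a single GPT-2 attention head through the documented head_mask argument; this is a coarse stand-in for the paper's finer-grained ablation of causal edges between components, and the (layer, head) choice is arbitrary, not a pathway the paper identifies.

```python
# Minimal sketch: ablate one GPT-2 attention head via head_mask and compare
# next-token predictions. The (layer, head) pair is a hypothetical target.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[5, 7] = 0.0  # hypothetical component to disable

ids = tok("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    base = model(ids).logits[0, -1]
    ablated = model(ids, head_mask=head_mask).logits[0, -1]

print("next token (base):   ", tok.decode([base.argmax().item()]))
print("next token (ablated):", tok.decode([ablated.argmax().item()]))
```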
Evaluating the Ebb and Flow: An In-depth Analysis of Question-Answering Trends across Diverse Platforms
paper_authors: Rima Hazra, Agnik Saha, Somnath Banerjee, Animesh Mukherjee
for: This paper aims to investigate the factors that contribute to the speed of responses on Community Question Answering (CQA) platforms.
methods: The authors analyze six highly popular CQA platforms and identify correlations between the time taken to yield the first response to a question and various variables, including metadata and patterns of user interaction. They also employ conventional machine learning models to predict which queries will receive prompt responses.
results: The study finds a correlation between the time taken to yield the first response and several variables, including the formulation of the questions and the level of interaction among users. The authors also demonstrate the feasibility of using machine learning models to predict prompt responses.
Abstract
Community Question Answering (CQA) platforms steadily gain popularity as they provide users with fast responses to their queries. The swiftness of these responses is contingent on a mixture of query-specific and user-related elements. This paper scrutinizes these contributing factors within the context of six highly popular CQA platforms, identified through their standout answering speed. Our investigation reveals a correlation between the time taken to yield the first response to a question and several variables: the metadata, the formulation of the questions, and the level of interaction among users. Additionally, by employing conventional machine learning models to analyze these metadata and patterns of user interaction, we endeavor to predict which queries will receive their initial responses promptly.
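The prediction setup lends itself to a compact sketch with conventional models; the features, labels, and threshold below are invented for illustration and are not the paper's feature set.

```python
# Minimal sketch: classify whether a question receives a fast first response
# from simple metadata features. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# toy features: [title_length, num_tags, asker_reputation, hour_of_day]
X = rng.random((1000, 4))
y = (X[:, 2] + 0.3 * rng.random(1000) > 0.8).astype(int)  # toy "fast response" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"F1: {f1_score(y_te, clf.predict(X_te)):.2f}")
```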
The Moral Machine Experiment on Large Language Models
results: The study finds that although LLM and human preferences are similar in some respects, models such as PaLM 2 and Llama 2 show marked deviations. Moreover, despite qualitatively similar preferences, LLMs may lean toward more uncompromising decisions than humans do. These findings help clarify the ethical frameworks of LLMs and have implications for the development of autonomous driving.
Abstract
As large language models (LLMs) become more deeply integrated into various sectors, understanding how they make moral judgments has become crucial, particularly in the realm of autonomous driving. This study utilized the Moral Machine framework to investigate the ethical decision-making tendencies of prominent LLMs, including GPT-3.5, GPT-4, PaLM 2, and Llama 2, comparing their responses to human preferences. While LLMs' and humans' preferences, such as prioritizing humans over pets and favoring saving more lives, are broadly aligned, PaLM 2 and Llama 2 in particular exhibit distinct deviations. Additionally, despite the qualitative similarities between LLM and human preferences, there are significant quantitative disparities, suggesting that LLMs might lean toward more uncompromising decisions than the milder inclinations of humans. These insights elucidate the ethical frameworks of LLMs and their potential implications for autonomous driving.
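The evaluation protocol amounts to posing a forced-choice dilemma and recording the model's answer; the scenario wording below is illustrative, not the study's actual Moral Machine prompts, and it assumes the openai Python client with an OPENAI_API_KEY set.

```python
# Minimal sketch: pose a Moral Machine-style forced-choice dilemma to a chat
# model. The dilemma wording is invented for illustration.
from openai import OpenAI

client = OpenAI()
dilemma = (
    "A self-driving car's brakes fail. It must either (A) swerve, killing "
    "one elderly pedestrian, or (B) continue straight, killing two young "
    "pedestrians. Answer with exactly 'A' or 'B'."
)
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": dilemma}],
    temperature=0,
)
print("model chose:", resp.choices[0].message.content.strip())
```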
Balanced and Explainable Social Media Analysis for Public Health with Large Language Models
results: According to the experimental results, the proposed ALEX method achieved outstanding performance on three tasks of the Social Media Mining for Health 2023 (SMM4H) competition, ranking first in two of them.
Abstract
As social media becomes increasingly popular, more and more public health activity emerges on it, which is worth noting for pandemic monitoring and government decision-making. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). Although recent progress in LLMs has shown a strong ability to comprehend knowledge when fine-tuned on domain-specific datasets, the cost of training an in-domain LLM for every specific public health task is especially high. Furthermore, such in-domain datasets from social media are generally highly imbalanced, which hinders the efficiency of LLM tuning. To tackle these challenges, the data imbalance issue can be overcome with sophisticated data augmentation methods for social media datasets, and the ability of the LLMs can be effectively utilized by prompting the model properly. In light of the above, this paper proposes a novel ALEX framework for social media analysis on public health. Specifically, an augmentation pipeline is developed to resolve the data imbalance issue, and an LLM explanation mechanism is proposed by prompting an LLM with the predicted results from BERT models. Extensive experiments on three tasks of the Social Media Mining for Health 2023 (SMM4H) competition, with first place in two tasks, demonstrate the superior performance of the proposed ALEX method. Our code has been released at https://github.com/YanJiangJerry/ALEX.
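The explanation mechanism can be sketched as simple prompt construction: the LLM is shown the BERT model's predicted label and asked to justify or veto it. The template and label below are illustrative, not ALEX's actual prompts.

```python
# Minimal sketch: build an explanation prompt from a BERT classifier's
# prediction. Template and example are invented placeholders.
def build_explanation_prompt(post: str, predicted_label: str) -> str:
    return (
        f"A classifier labeled the following social media post as "
        f"'{predicted_label}'.\n"
        f"Post: {post}\n"
        "Explain briefly whether this label is plausible, then answer "
        "'AGREE' or 'DISAGREE'."
    )

print(build_explanation_prompt(
    "Got my second dose today, arm is sore but feeling fine!",
    "vaccine-related personal experience",
))
```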
Language Models as Black-Box Optimizers for Vision-Language Models
paper_authors: Samuel Yu, Shihong Liu, Zhiqiu Lin, Deepak Pathak, Deva Ramanan
for: This work aims to develop a fine-tuning method for vision-language models (VLMs) driven by natural-language prompts, avoiding the need to access model parameters, feature embeddings, or output logits.
methods: The authors employ chat-based large language models (LLMs) as black-box optimizers: an automatic "hill-climbing" procedure has the LLM refine the prompt based on textual feedback, converging on the best-performing prompt.
results: In a challenging 1-shot learning setup, this simple method beats the white-box continuous prompting method CoOp by an average of 1.5% accuracy across 11 datasets, including ImageNet. It also outperforms OpenAI's hand-crafted prompts and other black-box methods such as iterative APE, and the generated text prompts are not only more interpretable but also transfer across different CLIP architectures.
Abstract
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback incorporating both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
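The hill-climbing loop at the heart of the method can be sketched as below; the scoring function and the LLM call are placeholders (the paper scores prompts by few-shot CLIP accuracy and asks a chat LLM to propose refinements from textual feedback).

```python
# Minimal sketch: greedy hill-climbing over text prompts. score() and
# llm_refine() are placeholders for CLIP evaluation and a chat-LLM call.
import random

def score(prompt: str) -> float:
    return -abs(len(prompt) - 40) + random.random()  # stand-in for CLIP accuracy

def llm_refine(prompt: str, feedback: str) -> str:
    return prompt + " photo"  # stand-in for an LLM rewriting the prompt

best = "a photo of a {class}"
best_score = score(best)
for step in range(10):
    feedback = f"current prompt scores {best_score:.3f}; propose a better one"
    candidate = llm_refine(best, feedback)
    cand_score = score(candidate)
    if cand_score > best_score:   # keep only improvements (hill-climbing)
        best, best_score = candidate, cand_score
print(best, round(best_score, 3))
```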
Do PLMs Know and Understand Ontological Knowledge?
results: The results show that PLMs can memorize certain ontological knowledge and can use such knowledge in logical reasoning. However, both their memorization and reasoning performance are imperfect, indicating that PLMs' ontological knowledge is partial and not deeply understood.
Abstract
Ontological knowledge, which comprises classes and properties and their relationships, is integral to world knowledge. It is significant to explore whether Pretrained Language Models (PLMs) know and understand such knowledge. However, existing PLM-probing studies focus mainly on factual knowledge, lacking a systematic probing of ontological knowledge. In this paper, we focus on probing whether PLMs store ontological knowledge and have a semantic understanding of the knowledge rather than rote memorization of the surface form. To probe whether PLMs know ontological knowledge, we investigate how well PLMs memorize: (1) types of entities; (2) hierarchical relationships among classes and properties, e.g., Person is a subclass of Animal and Member of Sports Team is a subproperty of Member of; (3) domain and range constraints of properties, e.g., the subject of Member of Sports Team should be a Person and the object should be a Sports Team. To further probe whether PLMs truly understand ontological knowledge beyond memorization, we comprehensively study whether they can reliably perform logical reasoning with given knowledge according to ontological entailment rules. Our probing results show that PLMs can memorize certain ontological knowledge and utilize implicit knowledge in reasoning. However, both the memorizing and reasoning performances are less than perfect, indicating incomplete knowledge and understanding.
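Probing of this kind is often implemented as cloze-style queries to a masked LM; the sketch below shows the general style with a fill-mask pipeline, using a template of our own rather than the paper's prompts.

```python
# Minimal sketch: cloze-style probe of a hierarchical (subclass-of) relation
# with a masked LM. Model and template are illustrative.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for result in fill("A person is a subclass of [MASK].", top_k=3):
    print(f"{result['token_str']:>12}  p={result['score']:.3f}")
```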