results: Through theoretical analysis and qualitative evaluation, this study characterizes how these methods work and their respective strengths and weaknesses, and outlines possible new architectures. Future empirical studies will compare these methods to understand their relative strengths in different settings.
Abstract
The previous work on controllable text generation is organized using a new schema we provide in this study. Seven components make up the schema, and each one is crucial to the creation process. To accomplish controlled generation for scientific literature, we describe the various modulation strategies utilised to modulate each of the seven components. We also offer a theoretical study and qualitative examination of these methods. This insight makes possible new architectures based on combinations of these components. Future research will compare these methods empirically to learn more about their strengths and utility.
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
results: Extensive experiments show that the proposed detection and mitigation approach reduces the hallucination rate of the GPT-3.5 model from 47.5% to 14.5% on average. The approach also applies effectively to different question types, including multi-hop and false-premise questions.
Abstract
Recently developed large language models have achieved remarkable success in generating fluent and coherent text. However, these models often tend to 'hallucinate' which critically hampers their reliability. In this work, we address this crucial problem and propose an approach that actively detects and mitigates hallucinations during the generation process. Specifically, we first identify the candidates of potential hallucination leveraging the model's logit output values, check their correctness through a validation procedure, mitigate the detected hallucinations, and then continue with the generation process. Through extensive experiments with GPT-3.5 (text-davinci-003) on the 'article generation task', we first demonstrate the individual efficacy of our detection and mitigation techniques. Specifically, the detection technique achieves a recall of ~88% and the mitigation technique successfully mitigates 57.6% of the correctly detected hallucinations. Importantly, our mitigation technique does not introduce new hallucinations even in the case of incorrectly detected hallucinations, i.e., false positives. Then, we show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the GPT-3.5 model from 47.5% to 14.5% on average. We further demonstrate the effectiveness and wide applicability of our approach through additional studies including performance on different types of questions (multi-hop and false premise questions) and with another LLM from a different model family (Vicuna). In summary, our work contributes to improving the reliability and trustworthiness of large language models, a crucial step en route to enabling their widespread adoption in real-world applications.
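As a rough, hypothetical sketch of the detection step described above, the snippet below flags tokens whose probability (derived from the model's log-probability output) falls below a threshold; the 0.3 cutoff and the (token, logprob) input format are illustrative assumptions, not the paper's exact procedure.

```python
import math

# Hypothetical illustration: flag low-confidence tokens from per-token log-probabilities,
# the signal used to nominate hallucination candidates. The 0.3 threshold and the
# (token, logprob) input format are assumptions, not the authors' exact settings.
def flag_low_confidence(tokens_with_logprobs, prob_threshold=0.3):
    candidates = []
    for i, (token, logprob) in enumerate(tokens_with_logprobs):
        prob = math.exp(logprob)
        if prob < prob_threshold:
            candidates.append((i, token, prob))  # position, surface form, model confidence
    return candidates

# Example: tokens from a generated sentence with their log-probabilities
generation = [("Paris", -0.05), ("was", -0.10), ("founded", -0.40), ("in", -0.08), ("1523", -2.30)]
print(flag_low_confidence(generation))  # the low-probability year would be sent to validation
```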
Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task
results: We find that the performance of LLMs on automatic evaluation metrics falls short of previous state-of-the-art models, largely due to an over-correction problem. We also observe substantial variation in LLM performance across different data distributions. These findings indicate that further investigation is needed into applying LLMs to the Chinese GEC task.
Abstract
Large-scale language models (LLMs) have shown remarkable capability in a variety of Natural Language Processing (NLP) tasks and have attracted a lot of attention recently. However, some studies indicate that large language models fail to achieve promising results beyond the state-of-the-art models in English grammatical error correction (GEC) tasks. In this report, we aim to explore how large language models perform on Chinese grammatical error correction tasks and provide guidance for future work. We conduct experiments with 3 different LLMs of different model scales on 4 Chinese GEC datasets. Our experimental results indicate that the performance of LLMs on automatic evaluation metrics falls short of the previous SOTA models because of the problem of over-correction. Furthermore, we also discover notable variations in the performance of LLMs when evaluated on different data distributions. Our findings demonstrate that further investigation is required for the application of LLMs on the Chinese GEC task.
Is ChatGPT a Good Personality Recognizer? A Preliminary Study
paper_authors: Yu Ji, Wen Wu, Hong Zheng, Yi Hu, Xi Chen, Liang He
for: This study evaluates ChatGPT's ability on the text-based personality recognition task, with the goal of generating effective personality data.
methods: A variety of prompting strategies are employed, in particular zero-shot chain-of-thought prompting and a level-oriented prompting strategy, to probe ChatGPT's ability to analyze and reason over text.
results: Experiments show that ChatGPT with zero-shot chain-of-thought prompting performs impressively on text-based personality recognition and can provide natural-language explanations. Refining this prompting with the level-oriented strategy further narrows the performance gap to the corresponding state-of-the-art model. However, ChatGPT shows unfairness towards certain sensitive attributes such as gender and age. In addition, eliciting ChatGPT's personality recognition ability improves its performance on related downstream tasks such as sentiment classification and stress prediction.
Abstract
In recent years, personality has been regarded as a valuable personal factor being incorporated into numerous tasks such as sentiment analysis and product recommendation. This has led to widespread attention to the text-based personality recognition task, which aims to identify an individual's personality based on given text. Considering that ChatGPT has recently exhibited remarkable abilities on various natural language processing tasks, we provide a preliminary evaluation of ChatGPT on the text-based personality recognition task for generating effective personality data. Concretely, we employ a variety of prompting strategies to explore ChatGPT's ability in recognizing personality from given text, especially the level-oriented prompting strategy we designed for guiding ChatGPT in analyzing given text at a specified level. The experimental results on two representative real-world datasets reveal that ChatGPT with zero-shot chain-of-thought prompting exhibits impressive personality recognition ability and is capable of providing natural language explanations through text-based logical reasoning. Furthermore, by employing the level-oriented prompting strategy to optimize zero-shot chain-of-thought prompting, the performance gap between ChatGPT and the corresponding state-of-the-art model has been narrowed even more. However, we observe that ChatGPT shows unfairness towards certain sensitive demographic attributes such as gender and age. Additionally, we discover that eliciting the personality recognition ability of ChatGPT helps improve its performance on personality-related downstream tasks such as sentiment classification and stress prediction.
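For concreteness, a zero-shot chain-of-thought prompt of the general kind described above might be assembled as in the sketch below; the wording, trait, and "level" instruction are hypothetical and are not the authors' actual prompts.

```python
# Hypothetical prompt construction for text-based personality recognition with
# zero-shot chain-of-thought and a level-oriented instruction. The exact wording,
# trait inventory, and analysis levels are assumptions for illustration only.
def build_personality_prompt(text, trait="Extraversion", level="sentence"):
    return (
        f"Analyze the following text at the {level} level and decide whether the "
        f"author is high or low on {trait}.\n\n"
        f"Text: {text}\n\n"
        "Let's think step by step."   # zero-shot chain-of-thought trigger
    )

print(build_personality_prompt("I spent the weekend at three different parties."))
```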
Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators
results: The survey finds that many projects billing themselves as 'open source' inherit undocumented data of dubious legality, few share the human instruction-tuning data, and careful scientific documentation is exceedingly rare.
Abstract
Large language models that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of OpenAI's ChatGPT, a proprietary large language model for text generation fine-tuned through reinforcement learning from human feedback (LLM+RLHF). We review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. The main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. We evaluate projects in terms of openness of code, training data, model weights, RLHF data, licensing, scientific documentation, and access methods. We find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. Degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.
On decoder-only architecture for speech-to-text and large language model integration
for: This work explores integrating speech signals into text-based large language models to improve natural-language human-computer interaction.
methods: The approach uses Connectionist Temporal Classification and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the text-based LLM.
results: Experiments on multilingual speech-to-text translation tasks show significant improvements over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
Abstract
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
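A rough sketch of the general idea, not the authors' implementation: a small audio encoder that downsamples compressed acoustic features and projects them into the LLM's embedding space. The feature dimension, stride, and LLM hidden size below are assumptions.

```python
import torch
import torch.nn as nn

# Rough sketch (not the authors' implementation): downsample compressed acoustic
# features and project them into the continuous embedding space of a decoder-only LLM.
# Feature dimension, stride, and hidden size are illustrative assumptions.
class SimpleAudioEncoder(nn.Module):
    def __init__(self, feat_dim=80, llm_dim=4096, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(feat_dim, feat_dim, kernel_size=stride, stride=stride)
        self.project = nn.Linear(feat_dim, llm_dim)   # map into the LLM's embedding space

    def forward(self, features):                      # features: (batch, time, feat_dim)
        x = self.downsample(features.transpose(1, 2)).transpose(1, 2)
        return self.project(x)                        # (batch, time/stride, llm_dim)

speech_embeddings = SimpleAudioEncoder()(torch.randn(2, 400, 80))
print(speech_embeddings.shape)  # these would be prepended to the text token embeddings
```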
Answering Ambiguous Questions via Iterative Prompting
methods: integrates an answering model with a prompting model in an iterative manner, together with a task-specific post-pretraining approach
results: achieves state-of-the-art or competitive results while using less memory and incurring lower inference latency than competing approaches, and performs well in low-resource settings
Abstract
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist. To provide feasible answers to an ambiguous question, one approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity. An alternative is to gather candidate answers and aggregate them, but this method can be computationally costly and may neglect dependencies among answers. In this paper, we present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions. Specifically, we integrate an answering model with a prompting model in an iterative manner. The prompting model adaptively tracks the reading process and progressively triggers the answering model to compose distinct and relevant answers. Additionally, we develop a task-specific post-pretraining approach for both the answering model and the prompting model, which greatly improves the performance of our framework. Empirical studies on two commonly-used open benchmarks show that AmbigPrompt achieves state-of-the-art or competitive results while using less memory and having a lower inference latency than competing approaches. Additionally, AmbigPrompt also performs well in low-resource settings. The code is available at: https://github.com/sunnweiwei/AmbigPrompt.
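The iterative interplay between the prompting model and the answering model might look roughly like the loop below; the model interfaces and the stopping rule are hypothetical placeholders, not the AmbigPrompt implementation.

```python
# Hypothetical sketch of the iterative prompting loop described above: a prompting model
# builds a prompt from the question and the answers found so far, and an answering model
# proposes the next answer until no new, distinct answer is produced. The callables and
# the stopping rule are illustrative assumptions.
def answer_ambiguous_question(question, prompting_model, answering_model, max_rounds=5):
    answers = []
    for _ in range(max_rounds):
        prompt = prompting_model(question=question, answers_so_far=answers)
        candidate = answering_model(prompt)
        if candidate is None or candidate in answers:  # no new, distinct answer found
            break
        answers.append(candidate)
    return answers
```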
Incomplete Utterance Rewriting as Sequential Greedy Tagging
results: Experiments on multiple public datasets show that the model achieves the best results on all nine restoration scores, while its other metric scores are comparable to previous state-of-the-art models. Thanks to the model's simplicity, the approach also outperforms most previous models in inference speed.
Abstract
The task of incomplete utterance rewriting has recently gotten much attention. Previous models struggled to extract information from the dialogue context, as evidenced by the low restoration scores. To address this issue, we propose a novel sequence tagging-based model, which is more adept at extracting information from context. Meanwhile, we introduce speaker-aware embedding to model speaker variation. Experiments on multiple public datasets show that our model achieves optimal results on all nine restoration scores while having other metric scores comparable to previous state-of-the-art models. Furthermore, benefitting from the model's simplicity, our approach outperforms most previous models on inference speed.
Embedding Mental Health Discourse for Community Recommendation
paper_authors: Hy Dang, Bang Nguyen, Noah Ziems, Meng Jiang
for: investigate the use of discourse embedding techniques to develop a community recommendation system for mental health support groups on social media
methods: use content-based and collaborative filtering techniques to enhance the performance of the recommendation system
results: the proposed approach outperforms the use of each technique separately and provides interpretability in the recommendation process.
Abstract
Our paper investigates the use of discourse embedding techniques to develop a community recommendation system that focuses on mental health support groups on social media. Social media platforms provide a means for users to anonymously connect with communities that cater to their specific interests. However, with the vast number of online communities available, users may face difficulties in identifying relevant groups to address their mental health concerns. To address this challenge, we explore the integration of discourse information from various subreddit communities using embedding techniques to develop an effective recommendation system. Our approach involves the use of content-based and collaborative filtering techniques to enhance the performance of the recommendation system. Our findings indicate that the proposed approach outperforms the use of each technique separately and provides interpretability in the recommendation process.
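A minimal sketch of the hybrid scoring idea, assuming a cosine-similarity content score over discourse embeddings combined with an externally computed collaborative-filtering score; the 0.5 weighting and the example inputs are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: combine a content-based score (cosine similarity between a user's
# discourse embedding and a community's embedding) with a collaborative-filtering score.
# The alpha weighting and the example vectors are assumptions for illustration.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def hybrid_score(user_embedding, community_embedding, collaborative_score, alpha=0.5):
    content_score = cosine(user_embedding, community_embedding)
    return alpha * content_score + (1 - alpha) * collaborative_score

user_vec = np.array([0.2, 0.7, 0.1])        # discourse embedding of the user's posts
community_vec = np.array([0.3, 0.6, 0.2])   # embedding of a support community
print(hybrid_score(user_vec, community_vec, collaborative_score=0.4))
```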
MDACE: MIMIC Documents Annotated with Code Evidence
methods: The paper implements several evidence-extraction methods based on the EffectiveCAN model (Liu et al., 2021) to establish baseline performance on the dataset.
results: The paper introduces MDACE, the first publicly available code evidence dataset, built on a subset of MIMIC-III clinical records and annotated by professional medical coders. It consists of 302 Inpatient charts with 3,934 evidence spans and 52 Profee charts with 5,563 evidence spans.
Abstract
We introduce a dataset for evidence/rationale extraction on an extreme multi-label classification task over long medical documents. One such task is Computer-Assisted Coding (CAC) which has improved significantly in recent years, thanks to advances in machine learning technologies. Yet simply predicting a set of final codes for a patient encounter is insufficient as CAC systems are required to provide supporting textual evidence to justify the billing codes. A model able to produce accurate and reliable supporting evidence for each code would be a tremendous benefit. However, a human annotated code evidence corpus is extremely difficult to create because it requires specialized knowledge. In this paper, we introduce MDACE, the first publicly available code evidence dataset, which is built on a subset of the MIMIC-III clinical records. The dataset -- annotated by professional medical coders -- consists of 302 Inpatient charts with 3,934 evidence spans and 52 Profee charts with 5,563 evidence spans. We implemented several evidence extraction methods based on the EffectiveCAN model (Liu et al., 2021) to establish baseline performance on this dataset. MDACE can be used to evaluate code evidence extraction methods for CAC systems, as well as the accuracy and interpretability of deep learning models for multi-label classification. We believe that the release of MDACE will greatly improve the understanding and application of deep learning technologies for medical coding and document classification.
Subjective Crowd Disagreements for Subjective Data: Uncovering Meaningful CrowdOpinion with Population-level Learning
results: Experiments on five publicly available social media benchmark datasets show that the approach effectively models annotator disagreement, and an in-the-wild experiment on Facebook data demonstrates its feasibility.
Abstract
Human-annotated data plays a critical role in the fairness of AI systems, including those that deal with life-altering decisions or moderating human-created web/social media content. Conventionally, annotator disagreements are resolved before any learning takes place. However, researchers are increasingly identifying annotator disagreement as pervasive and meaningful. They also question the performance of a system when annotators disagree, particularly when minority views are disregarded, especially among groups that may already be underrepresented in the annotator population. In this paper, we introduce \emph{CrowdOpinion}\footnote{Accepted for publication at ACL 2023}, an unsupervised learning based approach that uses language features and label distributions to pool similar items into larger samples of label distributions. We experiment with four generative and one density-based clustering method, applied to five linear combinations of label distributions and features. We use five publicly available benchmark datasets (with varying levels of annotator disagreements) from social media (Twitter, Gab, and Reddit). We also experiment in the wild using a dataset from Facebook, where annotations come from the platform itself by users reacting to posts. We evaluate \emph{CrowdOpinion} as a label distribution prediction task using KL-divergence and a single-label problem using accuracy measures.
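As an illustration of the label-distribution evaluation mentioned above, the sketch below computes the KL divergence between the empirical crowd label distribution and a model's predicted distribution; the example counts are made up and this is not the authors' code.

```python
import numpy as np
from scipy.stats import entropy

# Illustrative sketch (not the authors' code): compare a predicted label distribution
# against the empirical distribution of crowd annotations with KL divergence, the
# label-distribution metric mentioned in the abstract. The example counts are made up.
def kl_to_crowd(predicted_probs, annotation_counts, eps=1e-9):
    crowd = np.asarray(annotation_counts, dtype=float)
    crowd = crowd / crowd.sum()                       # empirical crowd label distribution
    pred = np.asarray(predicted_probs, dtype=float) + eps
    pred = pred / pred.sum()
    return entropy(crowd + eps, pred)                 # KL(crowd || predicted)

# Example: 10 annotators split 6/3/1 over three labels vs. a model's predicted distribution
print(kl_to_crowd([0.55, 0.35, 0.10], [6, 3, 1]))
```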
Linguistic representations for fewer-shot relation extraction across domains
methods: The study constructs syntactic and semantic graphs with freely available off-the-shelf tools and combines them with a popular transformer-based architecture to improve generalization.
results: Incorporating linguistic representations leads to significantly higher performance in few-shot transfer, though both types of graph display roughly equivalent utility.
Abstract
Recent work has demonstrated the positive impact of incorporating linguistic representations as additional context and scaffolding on the in-domain performance of several NLP tasks. We extend this work by exploring the impact of linguistic representations on cross-domain performance in a few-shot transfer setting. An important question is whether linguistic representations enhance generalizability by providing features that function as cross-domain pivots. We focus on the task of relation extraction on three datasets of procedural text in two domains, cooking and materials science. Our approach augments a popular transformer-based architecture by alternately incorporating syntactic and semantic graphs constructed by freely available off-the-shelf tools. We examine their utility for enhancing generalization, and investigate whether earlier findings, e.g. that semantic representations can be more helpful than syntactic ones, extend to relation extraction in multiple domains. We find that while the inclusion of these graphs results in significantly higher performance in few-shot transfer, both types of graph exhibit roughly equivalent utility.
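As an example of building a syntactic graph with a freely available off-the-shelf tool, the snippet below extracts dependency edges with spaCy; spaCy is our choice for illustration and is not necessarily the tool used in the paper.

```python
import spacy

# Example of constructing a syntactic graph with a freely available off-the-shelf tool,
# in the spirit of the approach above (spaCy is our illustrative choice, not necessarily
# the paper's tool). Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Whisk the eggs into the flour until the batter is smooth.")

# Dependency edges as (head, relation, dependent) triples, usable as graph input
edges = [(token.head.text, token.dep_, token.text) for token in doc if token.head != token]
for head, rel, dep in edges:
    print(f"{head} --{rel}--> {dep}")
```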
results: The study finds that sampling adapters lead to higher-quality generated text, and that this improvement holds across different test sets. These techniques also help the model better conform to the regularities of language, improving the readability and coherence of the text.
Abstract
Sampling is a common strategy for generating text from probabilistic models, yet standard ancestral sampling often results in text that is incoherent or ungrammatical. To alleviate this issue, various modifications to a model's sampling distribution, such as nucleus or top-k sampling, have been introduced and are now ubiquitously used in language generation systems. We propose a unified framework for understanding these techniques, which we term sampling adapters. Sampling adapters often lead to qualitatively better text, which raises the question: From a formal perspective, how are they changing the (sub)word-level distributions of language generation models? And why do these local changes lead to higher-quality text? We argue that the shift they enforce can be viewed as a trade-off between precision and recall: while the model loses its ability to produce certain strings, its precision rate on desirable text increases. While this trade-off is not reflected in standard metrics of distribution quality (such as perplexity), we find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution. Further, these measures correlate with higher sequence-level quality scores, specifically, Mauve.
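A minimal sketch of two common sampling adapters, top-k and nucleus (top-p) truncation, applied to a next-token distribution; the renormalization scheme and parameter values below are illustrative.

```python
import numpy as np

# Hedged sketch of "sampling adapters" in the sense described above: functions that
# reshape the model's next-token distribution before sampling. Shown for top-k and
# nucleus (top-p) truncation; parameter values and the toy vocabulary are illustrative.
def top_k_adapter(probs, k=50):
    out = np.zeros_like(probs)
    keep = np.argsort(probs)[-k:]          # indices of the k most probable tokens
    out[keep] = probs[keep]
    return out / out.sum()                 # renormalize over the kept tokens

def nucleus_adapter(probs, p=0.9):
    order = np.argsort(probs)[::-1]        # tokens sorted by decreasing probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix with mass >= p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

vocab_probs = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
print(top_k_adapter(vocab_probs, k=3))      # only the 3 most probable tokens survive
print(nucleus_adapter(vocab_probs, p=0.9))  # low-probability tail is zeroed out
```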
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
paper_authors: Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel
for: An automatic code generation approach to support quantized generative inference for LLMs such as LLaMA or OPT on off-the-shelf CPUs.
methods: The approach is informed by the target architecture and a performance model, covering both hardware characteristics and method-specific accuracy constraints.
results: CPU-based inference for LLaMA models achieves high performance and high accuracy, comparing favorably to the best existing open-source solution.
Abstract
We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution. A preliminary implementation is available at https://github.com/IST-DASLab/QIGen.
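As a conceptual illustration of the arithmetic such generated kernels perform, the Python sketch below dequantizes 4-bit weight codes group by group and accumulates a dot product with the activations; the group size and the packing (one code per byte) are assumptions, and real QIGen output is optimized low-level code rather than Python.

```python
import numpy as np

# Python sketch of the arithmetic a quantized-inference kernel performs: dequantize
# 4-bit weight codes group by group (scale and zero-point per group) and accumulate a
# dot product with the activations. Group size and packing are illustrative assumptions.
def quantized_dot(q_weights, scales, zeros, activations, group_size=64):
    acc = 0.0
    for g in range(0, len(q_weights), group_size):
        q = q_weights[g:g + group_size].astype(np.float32)
        w = (q - zeros[g // group_size]) * scales[g // group_size]   # dequantize the group
        acc += float(np.dot(w, activations[g:g + group_size]))
    return acc

n = 128
q_w = np.random.randint(0, 16, size=n, dtype=np.uint8)   # 4-bit codes, one per byte here
scales = np.random.rand(n // 64).astype(np.float32)
zeros = np.full(n // 64, 8.0, dtype=np.float32)
print(quantized_dot(q_w, scales, zeros, np.random.rand(n).astype(np.float32)))
```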
Improving Automatic Quotation Attribution in Literary Novels
results: The study shows that benchmarking state-of-the-art models on each sub-task independently yields high accuracy scores. In particular, a simple sequential prediction model for speaker attribution achieves accuracy on par with existing models.
Abstract
Current models for quotation attribution in literary novels assume varying levels of available information in their training and test data, which poses a challenge for in-the-wild inference. Here, we approach quotation attribution as a set of four interconnected sub-tasks: character identification, coreference resolution, quotation identification, and speaker attribution. We benchmark state-of-the-art models on each of these sub-tasks independently, using a large dataset of annotated coreferences and quotations in literary novels (the Project Dialogism Novel Corpus). We also train and evaluate models for the speaker attribution task in particular, showing that a simple sequential prediction model achieves accuracy scores on par with state-of-the-art models.
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
methods: INT-FP-QSim combines existing open-source libraries such as TensorRT, QPytorch, and AIMET to support various floating-point and integer formats.
results: Using INT-FP-QSim, we evaluate the performance of large language models and vision transformers at various precisions and compare the impact of recently proposed methods including Adaptive Block Floating Point, SmoothQuant, GPTQ, and RPTQ.
Abstract
The recent rise of large language models (LLMs) has resulted in increased efforts towards running LLMs at reduced precision. Running LLMs at lower precision supports resource constraints and furthers their democratization, enabling users to run billion-parameter LLMs on their personal devices. To supplement this ongoing effort, we propose INT-FP-QSim: an open-source simulator that enables flexible evaluation of LLMs and vision transformers at various numerical precisions and formats. INT-FP-QSim leverages existing open-source repositories such as TensorRT, QPytorch and AIMET for a combined simulator that supports various floating point and integer formats. With the help of our simulator, we survey the impact of different numerical formats on the performance of LLMs and vision transformers at 4-bit weights and 4-bit or 8-bit activations. We also compare recently proposed methods like Adaptive Block Floating Point, SmoothQuant, GPTQ and RPTQ on the model performances. We hope INT-FP-QSim will enable researchers to flexibly simulate models at various precisions to support further research in quantization of LLMs and vision transformers.
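The core of such precision simulation can be illustrated with a simple fake-quantization routine: round a float tensor to b-bit integers and immediately dequantize, so downstream computation sees the rounding error of the low-precision format. This is a conceptual sketch, not the INT-FP-QSim API.

```python
import torch

# Conceptual sketch of precision simulation: "fake-quantize" a float tensor to b-bit
# integers and immediately dequantize, so downstream compute sees the rounding error of
# the low-precision format. This mirrors the idea of the simulator, not its actual API.
def fake_quantize(x, bits=4, symmetric=True):
    qmax = 2 ** (bits - 1) - 1 if symmetric else 2 ** bits - 1
    scale = x.abs().max() / qmax                      # per-tensor scale; per-channel is also common
    q = torch.clamp(torch.round(x / scale), -qmax - 1 if symmetric else 0, qmax)
    return q * scale                                  # dequantized values carrying quantization error

weights = torch.randn(4, 8)
print((weights - fake_quantize(weights, bits=4)).abs().max())  # worst-case rounding error
```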
LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad
results: Experimental results show that the method creates better music visualizations than random generation and holds promise for a broader range of music visualization applications.
Abstract
Launchpad is a musical instrument that allows users to create and perform music by pressing illuminated buttons. To assist and inspire the design of the Launchpad light effect, and provide a more accessible approach for beginners to create music visualization with this instrument, we proposed the LaunchpadGPT model to generate music visualization designs on Launchpad automatically. Based on the language model with excellent generation ability, our proposed LaunchpadGPT takes an audio piece of music as input and outputs the lighting effects of Launchpad-playing in the form of a video (Launchpad-playing video). We collect Launchpad-playing videos and process them to obtain music and corresponding video frame of Launchpad-playing as prompt-completion pairs, to train the language model. The experiment result shows the proposed method can create better music visualization than random generation methods and hold the potential for a broader range of music visualization applications. Our code is available at https://github.com/yunlong10/LaunchpadGPT/.