cs.CL - 2023-11-26

Uncertainty-aware Language Modeling for Selective Question Answering

paper_url: http://arxiv.org/abs/2311.15451
repo_url: None
paper_authors: Qi Yang, Shreya Ravikumar, Fynn Schmitt-Ulms, Satvik Lolla, Ege Demir, Iaroslav Elistratov, Alex Lavaee, Sadhana Lolla, Elaheh Ahmadi, Daniela Rus, Alexander Amini, Alejandro Perez
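for: This study develops a method for automatically converting large language models (LLMs) so that they produce an uncertainty estimate with each prediction.
methods: The approach requires no external models or systems, works across different models and datasets, and is computationally efficient.
results: Converted models are evaluated on selective question answering; using the method's uncertainty estimates to decide which questions to answer yields higher accuracy than using model probabilities directly.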

Abstract
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs capable of estimating uncertainty with every prediction. Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems. We evaluate converted models on the selective question answering setting -- to answer as many questions as possible while maintaining a given accuracy, forgoing providing predictions when necessary. As part of our results, we test BERT and Llama 2 model variants on the SQuAD extractive QA task and the TruthfulQA generative QA task. We show that using the uncertainty estimates provided by our approach to selectively answer questions leads to significantly higher accuracy over directly using model probabilities.

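Summary
We propose an approach for automatically converting large language models (LLMs) into models that provide an uncertainty estimate with every prediction. The approach is data- and model-agnostic, computationally efficient, and needs no external models or systems. We evaluate converted models in the selective question answering setting: answer as many questions as possible while maintaining a given accuracy, withholding predictions when necessary. We test BERT and Llama 2 model variants on the SQuAD extractive QA task and the TruthfulQA generative QA task. Selecting which questions to answer with the uncertainty estimates from our approach leads to significantly higher accuracy than using model probabilities directly.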
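
The selective question answering setting described above can be sketched as a threshold search: answer only when confidence is high enough, and pick the threshold that keeps accuracy at the target while answering as many questions as possible. This is a minimal illustration, not the paper's conversion method; the function names, the toy data, and the choice of confidence = 1 - uncertainty are all assumptions.

```python
from typing import List, Optional, Tuple


def selective_accuracy(
    confidences: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accuracy) when answering only if confidence >= threshold."""
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences) if confidences else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


def max_coverage_at_accuracy(
    confidences: List[float], correct: List[bool], target_accuracy: float
) -> Tuple[float, float, Optional[float]]:
    """Scan candidate thresholds and keep the one with the largest coverage
    whose selective accuracy still meets the target."""
    best: Tuple[float, float, Optional[float]] = (0.0, 0.0, None)
    for t in sorted(set(confidences)):
        cov, acc = selective_accuracy(confidences, correct, t)
        if acc >= target_accuracy and cov > best[0]:
            best = (cov, acc, t)
    return best


# Hypothetical toy data: one confidence score and one correctness flag per question.
confidences = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
correct = [True, True, True, False, True, False]
coverage, accuracy, threshold = max_coverage_at_accuracy(confidences, correct, 0.9)
```

With this toy data, a 90% accuracy target is met by answering only the three most confident questions. If no threshold meets the target, the function returns a coverage of 0.0 and a threshold of `None`.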

Learning to Skip for Language Modeling

paper_url: http://arxiv.org/abs/2311.15436
repo_url: https://github.com/Sfedfcv/redesigned-pancake
paper_authors: Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui
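for: Improving the generalization performance of language models in few-shot learning.
methods: Allocates a variable amount of computation to different tokens through a simple routing mechanism.
results: An extensive evaluation on 24 NLP tasks shows improved 1-shot performance over other competitive baselines, with only a mild increase in inference cost.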

Abstract
Overparameterized large-scale language models have impressive generalization performance of in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the complexity or importance of the input data. We argue that in language model pretraining, a variable amount of computation should be assigned to different tokens, and this can be efficiently achieved via a simple routing mechanism. Different from conventional early stopping techniques where tokens can early exit at only early layers, we propose a more general method that dynamically skips the execution of a layer (or module) for any input token with a binary router. In our extensive evaluation across 24 NLP tasks, we demonstrate that the proposed method can significantly improve the 1-shot performance compared to other competitive baselines only at mild extra cost for inference.

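Summary
Large-scale language models show impressive generalization performance, but most of them do not adapt to the complexity or importance of the input data. We argue that language model pretraining should assign different amounts of computation to different tokens, and that this can be done efficiently with a simple routing mechanism. Unlike conventional early stopping techniques, our method can dynamically skip a layer (or module) for any input token. In an extensive evaluation across 24 NLP tasks, our method significantly improves 1-shot performance over other competitive baselines at only mild extra cost for inference.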
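
The per-token layer skipping described above can be sketched as follows. This is only an illustration of the control flow: the paper's router is learned, whereas here it is a fixed linear score with a zero threshold, and `binary_router`, `skip_layer`, and `toy_layer` are hypothetical names.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def binary_router(token: Vector, weights: Vector, bias: float = 0.0) -> bool:
    """Hypothetical router: execute the layer only if a linear score is positive."""
    score = sum(w * x for w, x in zip(weights, token)) + bias
    return score > 0.0


def skip_layer(
    tokens: List[Vector],
    layer: Callable[[Vector], Vector],
    router_weights: Vector,
) -> Tuple[List[Vector], int]:
    """Apply `layer` to routed tokens; skipped tokens pass through unchanged."""
    outputs: List[Vector] = []
    executed = 0
    for tok in tokens:
        if binary_router(tok, router_weights):
            outputs.append(layer(tok))
            executed += 1
        else:
            outputs.append(list(tok))  # skipped: no computation spent on this token
    return outputs, executed


def toy_layer(token: Vector) -> Vector:
    """Stand-in for an expensive layer (assumption: it simply doubles each value)."""
    return [2.0 * x for x in token]


tokens = [[1.0, 0.5], [-1.0, 0.2], [0.3, 0.3]]
outputs, executed = skip_layer(tokens, toy_layer, router_weights=[1.0, 0.0])
```

Here the second token gets a negative router score and bypasses the layer, so only two of the three tokens incur the layer's cost.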

Machine-Generated Text Detection using Deep Learning

paper_url: http://arxiv.org/abs/2311.15425
repo_url: https://github.com/Aryia-Behroziuan/neurons
paper_authors: Raghav Gaggar, Ashish Bhagchandani, Harsh Oza
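for: This study examines how to distinguish text generated by large language models (LLMs) from human-written text, which matters for a range of applications.
methods: Several detectors are used, including a support vector machine (SVM), RoBERTa-base, and RoBERTa-large.
results: Detection results depended predominantly on sentence length.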

Abstract
Our research focuses on the crucial challenge of discerning text produced by Large Language Models (LLMs) from human-generated text, which holds significance for various applications. With ongoing discussions about attaining a model with such functionality, we present supporting evidence regarding the feasibility of such models. We evaluated our models on multiple datasets, including Twitter Sentiment, Football Commentary, Project Gutenberg, PubMedQA, and SQuAD, confirming the efficacy of the enhanced detection approaches. These datasets were sampled with intricate constraints encompassing every possibility, laying the foundation for future research. We evaluate GPT-3.5-Turbo against various detectors such as SVM, RoBERTa-base, and RoBERTa-large. Based on the research findings, the results predominantly relied on the sequence length of the sentence.

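Summary
Our research addresses the core challenge of distinguishing text generated by large language models from human-written text, which is important for many applications. Amid ongoing discussion about achieving a model with this capability, we provide evidence that such models are feasible. We evaluated our models on multiple datasets, including Twitter Sentiment, Football Commentary, Project Gutenberg, PubMedQA, and SQuAD, confirming the efficacy of the enhanced detection approaches. These datasets were sampled under intricate constraints covering every possibility, laying a foundation for future research. We evaluate GPT-3.5-Turbo against several detectors, including SVM, RoBERTa-base, and RoBERTa-large. According to our findings, the results depended predominantly on sentence length.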

Learning Section Weights for Multi-Label Document Classification

paper_url: http://arxiv.org/abs/2311.15402
repo_url: https://github.com/MaziarMF/Learning-Section-Weights-for-Multi-Label-Document-Classification
paper_authors: Maziar Moradi Fard, Paula Sorrolla Bayod, Kiomars Motarjem, Mohammad Alian Nejadi, Saber Akhondi, Camilo Thorne
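for: This paper proposes a new method for multi-label document classification, to better handle documents that carry multiple labels.
methods: The method, Learning Section Weights (LSW), uses multiple feed-forward layers to learn a weight for each section of a document and incorporates those weights in the prediction.
results: On public (arXiv) and private (Elsevier) datasets, LSW outperforms state-of-the-art multi-label document classification methods; on the publicly available arXiv dataset it improves macro-averaged F1 by 1.3% and macro-averaged recall by 1.3%.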

Abstract
Multi-label document classification is a traditional task in NLP. Compared to single-label classification, each document can be assigned multiple classes. This problem is crucially important in various domains, such as tagging scientific articles. Documents are often structured into several sections such as abstract and title. Current approaches treat different sections equally for multi-label classification. We argue that this is not a realistic assumption, leading to sub-optimal results. Instead, we propose a new method called Learning Section Weights (LSW), leveraging the contribution of each distinct section for multi-label classification. Via multiple feed-forward layers, LSW learns to assign weights to each section, and incorporate the weights in the prediction. We demonstrate our approach on scientific articles. Experimental results on public (arXiv) and private (Elsevier) datasets confirm the superiority of LSW, compared to state-of-the-art multi-label document classification methods. In particular, LSW achieves a 1.3% improvement in terms of macro averaged F1-score while it achieves 1.3% in terms of macro averaged recall on the publicly available arXiv dataset.

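Summary
Multi-label document classification is a traditional task in NLP. Unlike single-label classification, each document can be assigned multiple labels. The problem is important in many domains, such as tagging scientific articles. Documents are usually divided into several sections, such as abstract and title. Existing methods treat the different sections as equally important, which leads to sub-optimal results. We argue that this assumption is unrealistic and propose a new method called Learning Section Weights (LSW), which uses the contribution of each distinct section for multi-label classification. Through multiple feed-forward layers, LSW learns to assign a weight to each section and incorporates the weights in the prediction. Experiments on scientific articles show that LSW is superior to state-of-the-art multi-label document classification methods. Specifically, on the publicly available arXiv dataset, LSW improves macro-averaged F1 by 1.3% and macro-averaged recall by 1.3%.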
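
The idea of learning a weight per document section and folding it into the prediction can be sketched as a weighted combination of section embeddings. This is a rough illustration under stated assumptions: the abstract mentions multiple feed-forward layers, while this sketch uses a single linear scorer with fixed parameters and softmax normalization, and all names (`section_score`, `weighted_document`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax so that section weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def section_score(embedding: List[float], w: List[float], b: float) -> float:
    """Assumed one-layer scorer standing in for the feed-forward layers."""
    return sum(wi * xi for wi, xi in zip(w, embedding)) + b


def weighted_document(
    sections: Dict[str, List[float]], w: List[float], b: float = 0.0
) -> Tuple[Dict[str, float], List[float]]:
    """Return per-section weights and the weighted sum of section embeddings."""
    names = list(sections)
    weights = softmax([section_score(sections[n], w, b) for n in names])
    dim = len(w)
    doc = [
        sum(weights[i] * sections[n][d] for i, n in enumerate(names))
        for d in range(dim)
    ]
    return dict(zip(names, weights)), doc


# Hypothetical 2-dimensional section embeddings for one article.
sections = {"title": [1.0, 0.0], "abstract": [0.0, 1.0]}
weights, doc_vector = weighted_document(sections, w=[1.0, 0.0])
```

The resulting `doc_vector` would then feed a multi-label classifier (for example, one sigmoid output per label); in a trained model the scorer parameters would be learned jointly with that classifier.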

Enhancing Empathetic and Emotion Support Dialogue Generation with Prophetic Commonsense Inference

paper_url: http://arxiv.org/abs/2311.15316
repo_url: None
paper_authors: Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Weiping Wang
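for: Improving the quality of dialogue agents' responses so that they better understand people's emotions and situations.
methods: Uses large language models to understand dialogue and make commonsense inferences, and trains tunable models to bridge past and potential future dialogue themes.
results: On EmpatheticDialogues and Emotion Support Conversation, the proposed prophetic commonsense inference significantly improves the quality of dialogue agents' responses.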

Abstract
The interest in Empathetic and Emotional Support conversations among the public has significantly increased. To offer more sensitive and understanding responses, leveraging commonsense knowledge has become a common strategy to better understand psychological aspects and causality. However, such commonsense inferences can be out of context and unable to predict upcoming dialogue themes, resulting in responses that lack coherence and empathy. To remedy this issue, we present Prophetic Commonsense Inference, an innovative paradigm for inferring commonsense knowledge. By harnessing the capabilities of Large Language Models in understanding dialogue and making commonsense deductions, we train tunable models to bridge the gap between past and potential future dialogues. Extensive experiments conducted on EmpatheticDialogues and Emotion Support Conversation show that equipping dialogue agents with our proposed prophetic commonsense inference significantly enhances the quality of their responses.

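Summary
Public interest in empathetic and emotional support conversations has grown. To better understand psychological aspects and causality, leveraging commonsense knowledge has become a common strategy. However, such commonsense inferences can be out of context and fail to predict upcoming dialogue themes, producing responses that lack coherence and empathy. To address this, we propose Prophetic Commonsense Inference, a new paradigm for inferring commonsense knowledge. Using the ability of large language models to understand dialogue and make commonsense deductions, we train tunable models to bridge the gap between past and potential future dialogues. Extensive experiments on EmpatheticDialogues and Emotion Support Conversation show that equipping dialogue agents with prophetic commonsense inference significantly improves the quality of their responses.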

UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

paper_url: http://arxiv.org/abs/2311.15296
repo_url: https://github.com/IAAR-Shanghai/UHGEval
paper_authors: Xun Liang, Shichao Song, Simin Niu, Zhiyu Li, Feiyu Xiong, Bo Tang, Zhaohui Wy, Dawei He, Peng Cheng, Zhonghao Wang, Haiying Deng
for: This paper aims to assess the authentic reliability of large language models (LLMs) in text generation, specifically in the context of the Chinese language.
methods: The paper develops an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark to evaluate the ability of LLMs to generate text without restrictions, and establishes a comprehensive benchmark evaluation framework for scalable and reproducible experiments.
results: The paper conducts extensive experiments using prominent Chinese language models and the GPT series models, providing insights into the challenges of hallucination in text generation and the professional performance of these models.

Abstract
Large language models (LLMs) have emerged as pivotal contributors in contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. These models often produce hallucinated text, compromising their practical utility in professional contexts. To assess the authentic reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently utilize constrained generation techniques due to cost and temporal constraints. These techniques encompass the use of directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced with minimal restrictions by LLMs. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese language models and the GPT series models to derive professional performance insights regarding hallucination challenges.

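Summary
Large language models (LLMs) have become major contributors to contemporary natural language processing and are applied across many industries. However, these large-scale probabilistic statistical models cannot currently guarantee the quality required for professional content generation. They often produce hallucinated text, which limits their practical utility. To assess the real reliability of LLMs in text generation, many projects have developed benchmarks for hallucination. These benchmarks, however, frequently rely on constrained generation techniques, such as directed hallucination induction and deliberately altering authentic text to produce hallucinations, which do not match the unrestricted generation of real-world applications. In addition, a well-established Chinese-language dataset for evaluating hallucination in text generation is lacking. We therefore developed the Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, which compiles outputs that LLMs produce with minimal restrictions. We also built a comprehensive evaluation framework so that later researchers can run scalable and reproducible experiments, and we ran extensive experiments on prominent Chinese language models and the GPT series models to derive insights into their professional performance on hallucination challenges.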

Dataset for Stock Market Forecasting Based on Quantitative Analysis and Qualitative Data

paper_url: http://arxiv.org/abs/2311.15218
repo_url: None
paper_authors: Sai Akash Bathini, Dagli Cihan
for: The paper is written for researchers and practitioners in the field of finance and machine learning, with a focus on stock market forecasting.
methods: The paper uses a combination of numerical stock data and qualitative text data, including news articles, TV news captions, radio transcripts, and tweets, to extract sentiment and provide a holistic view of the stock market.
results: The paper provides an unprecedented, publicly available dataset of technical and fundamental data, sentiment, and daily entries from January 2018 to December 2022 for 8 different companies and the Dow Jones Index as a whole, which can be used to train and deploy deep learning models for stock market forecasting.

Abstract
The application of Machine learning to finance has become a familiar approach, even more so in stock market forecasting. The stock market is highly volatile and huge amounts of data are generated every minute globally. The extraction of effective intelligence from this data is of critical importance. However, a collaboration of numerical stock data with qualitative text data can be a challenging task. In this work, we accomplish this and provide an unprecedented, publicly available dataset with technical and fundamental data, sentiment that we gathered from News Archives, TV news captions, Radio Transcripts, Tweets, Daily financial newspapers, etc. The text data entries used for sentiment extraction total more than 1.4 Million. The dataset comprises of daily entries from January 2018 to December 2022 for 8 different companies and Dow Jones Index as a whole. Holistic Fundamental and Technical data is provided training ready for Model learning and deployment. The predictive power of deep learning models is highly determined by the training data provided. This dataset would be of benefit for research globally incorporating qualitative intelligence for stock market forecasting. The dataset is made available at https://github.com/batking24/Huge-Stock-Dataset.

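Summary
Applying machine learning to finance has become very common, especially in stock market forecasting. The stock market is highly volatile, and large amounts of data are generated globally every minute. Extracting effective intelligence from this data is critically important. However, combining numerical stock data with text data can be a challenging task. In this work we accomplish this and provide an unprecedented, publicly available dataset with technical and fundamental data and sentiment gathered from news archives, TV news captions, radio transcripts, tweets, and other sources. The text entries used for sentiment extraction total more than 1.4 million. The dataset comprises daily entries from January 2018 to December 2022 for 8 companies and the Dow Jones Index as a whole. It includes holistic fundamental and technical data that is ready for training deep learning models. The predictive power of deep learning models depends heavily on the training data provided. This dataset should benefit stock market forecasting research worldwide. The dataset is available on GitHub: https://github.com/batking24/Huge-Stock-Dataset.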

Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

paper_url: http://arxiv.org/abs/2311.15211
repo_url: https://github.com/whynlp/probabilistic-transformer
paper_authors: Haoyi Wu, Kewei Tu
for: This paper is written to propose a new model of contextual word representation that is based on syntactic and probabilistic principles, with the goal of bridging the gap between traditional and neural approaches to NLP.
methods: The proposed model uses a conditional random field to model discrete latent representations of all words in a sentence, as well as dependency arcs between them. The model also uses mean field variational inference for approximate inference.
results: The authors find that their model performs competitively to transformers on small to medium sized datasets, and that the computation graph of their model resembles transformers in terms of dependencies and self-attention.

Abstract
Syntactic structures used to play a vital role in natural language processing (NLP), but since the deep learning revolution, NLP has been gradually dominated by neural models that do not consider syntactic structures in their design. One vastly successful class of neural models is transformers. When used as an encoder, a transformer produces contextual representation of words in the input sentence. In this work, we propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective. Specifically, we design a conditional random field that models discrete latent representations of all words in a sentence as well as dependency arcs between them; and we use mean field variational inference for approximate inference. Strikingly, we find that the computation graph of our model resembles transformers, with correspondences between dependencies and self-attention and between distributions over latent representations and contextual embeddings of words. Experiments show that our model performs competitively to transformers on small to medium sized datasets. We hope that our work could help bridge the gap between traditional syntactic and probabilistic approaches and cutting-edge neural approaches to NLP, and inspire more linguistically-principled neural approaches in the future.

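Summary
Syntactic structures used to play a key role in natural language processing (NLP), but since the deep learning revolution NLP has gradually been taken over by neural models that do not consider syntactic structures. One highly successful class of neural models is transformers. Used as an encoder, a transformer produces a contextual representation of each word in the input sentence. In this work we propose a new model of contextual word representation, not from a neural perspective but from a purely syntactic and probabilistic one. We design a conditional random field that models discrete latent representations of all words in a sentence as well as the dependency arcs between them, and we use mean field variational inference for approximate inference. Surprisingly, the computation graph of our model closely resembles that of transformers, with correspondences between dependencies and self-attention and between distributions over latent representations and contextual word embeddings. Experiments show that our model performs comparably to transformers on small to medium sized datasets. We hope this work helps bridge the gap between traditional syntactic and probabilistic approaches and the latest neural approaches, and inspires more linguistically principled neural approaches in the future.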

Benchmarking Large Language Model Volatility

paper_url: http://arxiv.org/abs/2311.15180
repo_url: None
paper_authors: Boyang Yu
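for: This study investigates the impact of non-deterministic large language model (LLM) outputs on financial text understanding.
methods: A news sentiment classification task serves as a case study of investing in the US equity market.
results: Sentence-level sentiment classification with LLMs shows substantial variability, which leads to larger variations in downstream tasks. Adjusting the temperature parameter can reduce this variability but stifles the model's creativity. Ensembling multiple outputs mitigates the effect but requires substantial computation.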

Abstract
The impact of non-deterministic outputs from Large Language Models (LLMs) is not well examined for financial text understanding tasks. Through a compelling case study on investing in the US equity market via news sentiment analysis, we uncover substantial variability in sentence-level sentiment classification results, underscoring the innate volatility of LLM outputs. These uncertainties cascade downstream, leading to more significant variations in portfolio construction and return. While tweaking the temperature parameter in the language model decoder presents a potential remedy, it comes at the expense of stifled creativity. Similarly, while ensembling multiple outputs mitigates the effect of volatile outputs, it demands a notable computational investment. This work furnishes practitioners with invaluable insights for adeptly navigating uncertainty in the integration of LLMs into financial decision-making, particularly in scenarios dictated by non-deterministic information.

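Summary
The impact of non-deterministic outputs from large language models (LLMs) on financial text understanding tasks has not been well examined. Through a case study of investing in the US equity market via news sentiment analysis, we find substantial variability in sentence-level sentiment classification results, reflecting the inherent volatility of LLM outputs. These uncertainties cascade downstream into larger variations in portfolio construction and returns. Adjusting the temperature parameter in the language model decoder may help, but at the cost of stifled creativity. Likewise, ensembling multiple outputs mitigates the effect of volatile outputs but demands notable computation. This work gives practitioners insights for integrating LLMs into financial decision-making, particularly under non-deterministic information.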
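
Ensembling multiple outputs, mentioned above as one way to dampen volatile LLM outputs, can be sketched as a majority vote over repeated classifications of the same sentence. The abstract does not specify the ensembling scheme, so majority voting is an assumption here, as are the function name and the toy labels.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(labels: List[str]) -> Tuple[str, float]:
    """Return the most frequent label and the fraction of runs that agree with it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical sentiment labels from five runs of the same prompt on one sentence.
runs = ["positive", "negative", "positive", "positive", "neutral"]
label, agreement = majority_vote(runs)
```

The agreement fraction doubles as a rough volatility signal: a low value flags sentences whose label changes from run to run. The cost is one model call per run, which is the computational investment the abstract refers to.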