cs.CL - 2023-09-11

Hi Model, generating ‘nice’ instead of ‘good’ is not as bad as generating ‘rice’! Towards Context and Semantic Infused Dialogue Generation Loss Function and Evaluation Metric

  • paper_url: http://arxiv.org/abs/2309.05804
  • repo_url: None
  • paper_authors: Abhisek Tiwari, Muhammed Sinan, Kaushik Roy, Amit Sheth, Sriparna Saha, Pushpak Bhattacharyya
  • for: This work proposes a new loss function and evaluation metric for dialogue generation, aiming to improve how dialogue models are optimized and evaluated.
  • methods: The authors introduce the Semantic Infused Contextualized diaLogue (SemTextualLogue) loss function and the Dialuation evaluation metric, and run experiments on two dialogue datasets covering task-oriented and open-domain scenarios.
  • results: Dialogue models trained with the SemTextualLogue loss achieve significantly better performance than those trained with the conventional cross-entropy loss, and Dialuation evaluates generated responses more faithfully than lexical metrics.
    Abstract Over the past two decades, dialogue modeling has made significant strides, moving from simple rule-based responses to personalized and persuasive response generation. However, despite these advancements, the objective functions and evaluation metrics for dialogue generation have remained stagnant, i.e., cross-entropy and BLEU, respectively. These lexical-based metrics have the following key limitations: (a) word-to-word matching without semantic consideration: It assigns the same credit for failure to generate 'nice' and 'rice' for 'good'. (b) missing context attribute for evaluating the generated response: Even if a generated response is relevant to the ongoing dialogue context, it may still be penalized for not matching the gold utterance provided in the corpus. In this paper, we first investigate these limitations comprehensively and propose a new loss function called Semantic Infused Contextualized diaLogue (SemTextualLogue) loss function. Furthermore, we formulate a new evaluation metric called Dialuation, which incorporates both context relevance and semantic appropriateness while evaluating a generated response. We conducted experiments on two benchmark dialogue corpora, encompassing both task-oriented and open-domain scenarios. We found that the dialogue generation model trained with SemTextualLogue loss attained superior performance (in both quantitative and qualitative evaluation) compared to the traditional cross-entropy loss function across the datasets and evaluation metrics.
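
A minimal sketch of the underlying idea of a semantics-aware generation loss (not the authors' exact SemTextualLogue formulation): the token-level cross-entropy is down-weighted when the predicted token is semantically close to the reference token, so generating 'nice' for 'good' is penalized less than generating 'rice'. The embedding lookup and the similarity weighting below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_weighted_ce(logits, target_ids, embedding):
    """Cross-entropy scaled by (1 - cosine similarity) between the model's
    top prediction and the reference token. Illustrative only; the paper's
    SemTextualLogue loss additionally injects dialogue-context relevance."""
    # logits: (batch, seq, vocab); target_ids: (batch, seq)
    ce = F.cross_entropy(logits.transpose(1, 2), target_ids, reduction="none")
    pred_ids = logits.argmax(dim=-1)
    sim = F.cosine_similarity(embedding(pred_ids), embedding(target_ids), dim=-1)
    weight = 1.0 - sim.clamp(min=0.0)   # near-synonyms get a smaller penalty
    return (weight * ce).mean()
```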

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

  • paper_url: http://arxiv.org/abs/2309.05653
  • repo_url: https://github.com/TIGER-AI-Lab/MAmmoTH
  • paper_authors: Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
  • for: The paper is written for developing a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
  • methods: The paper uses a meticulously curated instruction tuning dataset called MathInstruct, which includes 13 math datasets with intermediate rationales, six of which were newly curated by the authors. The models are trained on this dataset, which presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales.
  • results: The MAmmoTH series of models substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales, with an average accuracy gain between 13% and 29%. The MAmmoTH-7B model achieves 35% accuracy on MATH, which exceeds the best open-source 7B model (WizardMath) by 25%, and the MAmmoTH-34B model achieves 46% accuracy on MATH, even surpassing GPT-4’s CoT result.
    Abstract We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 13% and 29%. Remarkably, our MAmmoTH-7B model reaches 35% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 25%, and the MAmmoTH-34B model achieves 46% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.
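
For illustration, a program-of-thought (PoT) rationale answers a math question with executable code, whereas a chain-of-thought (CoT) rationale spells out the reasoning in prose; the problem below is a made-up example, not one taken from MathInstruct.

```python
# Question: A shop sells pens at $1.20 each. Ana buys 7 pens and pays with a
# $10 bill. How much change does she receive?
# PoT rationale: the model emits code, and executing it produces the answer
# (a CoT rationale would instead reason through the same steps in text).
price_per_pen = 1.20
num_pens = 7
paid = 10.00
change = paid - price_per_pen * num_pens
print(round(change, 2))  # 1.6
```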

Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

  • paper_url: http://arxiv.org/abs/2309.05619
  • repo_url: None
  • paper_authors: Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel J Perry, Prashant Shiralkar, Zhengzheng Xing, Aaron Colak
  • for: To assess the performance of large language models (LLMs) on unlabeled production data, validating how well they generalize across languages and domains in real-world settings.
  • methods: Ensemble disagreement scores are used as a proxy for human labels to estimate LLM error in zero-shot, few-shot, and fine-tuned settings, evaluated on a keyphrase extraction (KPE) task.
  • results: Ensemble disagreement scores estimate model performance with a mean average error (MAE) as low as 0.4% against human-labeled ground truth, on average 13.8% better than using another LLM as a source of machine (silver) labels.
    Abstract Large language models (LLMs) have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate for a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, per our evaluation on keyphrase extraction (KPE) task. We measure fidelity of the results by comparing to true error measured from human labeled ground truth. We contrast with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show disagreement scores provide a better estimation of model performance with mean average error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.
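
A rough sketch of how an ensemble disagreement score can stand in for human labels (the paper's exact scoring is not reproduced here): several ensemble members label the same unlabeled examples, and the rate at which their votes differ from the deployed model's predictions estimates that model's error. Token-level alignment details for keyphrase extraction are omitted.

```python
from collections import Counter

def disagreement_score(ensemble_predictions, model_predictions):
    """Fraction of ensemble votes that differ from the deployed model's
    prediction, averaged over examples -- a proxy for the model's true error.
    ensemble_predictions: one list of labels per example (one per member).
    model_predictions: one label per example from the model under evaluation."""
    total, disagree = 0, 0
    for votes, pred in zip(ensemble_predictions, model_predictions):
        counts = Counter(votes)
        disagree += sum(c for label, c in counts.items() if label != pred)
        total += len(votes)
    return disagree / total if total else 0.0

# Hypothetical keyphrase-extraction votes (KP = keyphrase, O = other):
ens = [["KP", "KP", "O"], ["O", "O", "O"], ["KP", "O", "O"]]
mine = ["KP", "O", "O"]
print(disagreement_score(ens, mine))  # 0.222...
```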

Incorporating Pre-trained Model Prompting in Multimodal Stock Volume Movement Prediction

  • paper_url: http://arxiv.org/abs/2309.05608
  • repo_url: https://github.com/rayruibochen/promuse
  • paper_authors: Ruibo Chen, Zhiyuan Zhang, Yi Liu, Ruihan Bao, Keiko Harimoto, Xu Sun
  • for: Multimodal prediction of stock trading volume movement from stock-related news and time-series data.
  • methods: A pre-trained language model with prompt learning handles the financial-news text, and a cross-modality contrastive alignment preserves unimodal representations alongside the fusion head when combining the text and time-series modalities.
  • results: The proposed ProMUSE model outperforms existing baselines, and comprehensive analyses validate the effectiveness of the architecture against potential variants and learning mechanisms.
    Abstract Multimodal stock trading volume movement prediction with stock-related news is one of the fundamental problems in the financial area. Existing multimodal works that train models from scratch face the problem of lacking universal knowledge when modeling financial news. In addition, the models ability may be limited by the lack of domain-related knowledge due to insufficient data in the datasets. To handle this issue, we propose the Prompt-based MUltimodal Stock volumE prediction model (ProMUSE) to process text and time series modalities. We use pre-trained language models for better comprehension of financial news and adopt prompt learning methods to leverage their capability in universal knowledge to model textual information. Besides, simply fusing two modalities can cause harm to the unimodal representations. Thus, we propose a novel cross-modality contrastive alignment while reserving the unimodal heads beside the fusion head to mitigate this problem. Extensive experiments demonstrate that our proposed ProMUSE outperforms existing baselines. Comprehensive analyses further validate the effectiveness of our architecture compared to potential variants and learning mechanisms.
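
A hedged sketch of a cross-modality contrastive alignment between news-text and time-series embeddings, in the spirit described above (an InfoNCE-style objective; the paper's exact formulation, temperature, and head design are not reproduced).

```python
import torch
import torch.nn.functional as F

def contrastive_alignment(text_emb, series_emb, temperature=0.07):
    """Symmetric InfoNCE: matching text/time-series pairs from the same sample
    are pulled together, other pairs in the batch are pushed apart.
    Both inputs have shape (batch, dim)."""
    text_emb = F.normalize(text_emb, dim=-1)
    series_emb = F.normalize(series_emb, dim=-1)
    logits = text_emb @ series_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(text_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```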

Long-Range Transformer Architectures for Document Understanding

  • paper_url: http://arxiv.org/abs/2309.05503
  • repo_url: https://github.com/thibaultdouzon/long-range-document-transformer
  • paper_authors: Thibault Douzon, Stefan Duffner, Christophe Garcia, Jérémy Espinas
  • for: This paper applies Transformer-based models to long, multi-page document understanding.
  • methods: Two multimodal (text + layout) long-range models built on efficient Transformer implementations are introduced, together with a 2D relative attention bias that guides self-attention toward relevant tokens.
  • results: The models improve information retrieval on multi-page business documents at a small performance cost on shorter sequences.
    Abstract Since their release, Transformers have revolutionized many fields from Natural Language Understanding to Computer Vision. Document Understanding (DU) was not left behind with first Transformer based models for DU dating from late 2019. However, the computational complexity of the self-attention operation limits their capabilities to small sequences. In this paper we explore multiple strategies to apply Transformer based models to long multi-page documents. We introduce 2 new multi-modal (text + layout) long-range models for DU. They are based on efficient implementations of Transformers for long sequences. Long-range models can process whole documents at once effectively and are less impaired by the document's length. We compare them to LayoutLM, a classical Transformer adapted for DU and pre-trained on millions of documents. We further propose 2D relative attention bias to guide self-attention towards relevant tokens without harming model efficiency. We observe improvements on multi-page business documents on Information Retrieval for a small performance cost on smaller sequences. Relative 2D attention revealed to be effective on dense text for both normal and long-range models.
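
A simplified sketch of a 2D relative attention bias: a learned bias indexed by the bucketed horizontal and vertical offsets between token bounding boxes is added to the attention logits, nudging self-attention toward spatially nearby tokens. The bucketing scheme and shapes below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Relative2DBias(nn.Module):
    """Produces a (seq, seq) bias b[bucket(dx), bucket(dy)] for attention logits."""
    def __init__(self, num_buckets=32, max_distance=1000):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bias = nn.Embedding(num_buckets * num_buckets, 1)

    def _bucket(self, delta):
        # Clip the signed offset and map it to a non-negative bucket index.
        delta = delta.clamp(-self.max_distance, self.max_distance)
        step = (2 * self.max_distance) // self.num_buckets + 1
        return ((delta + self.max_distance) // step).long()

    def forward(self, x_centers, y_centers):
        # x_centers, y_centers: (seq,) token box centers in layout coordinates.
        dx = self._bucket(x_centers[None, :] - x_centers[:, None])
        dy = self._bucket(y_centers[None, :] - y_centers[:, None])
        return self.bias(dx * self.num_buckets + dy).squeeze(-1)  # (seq, seq)

# The returned matrix is added to the attention scores before the softmax.
```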

Personality Detection and Analysis using Twitter Data

  • paper_url: http://arxiv.org/abs/2309.05497
  • repo_url: https://github.com/SRIGURUPRASAD/Trending-Polarity_Diagnosis-Word_Cloud-Profile_analysis
  • paper_authors: Abhilash Datta, Souvic Chakraborty, Animesh Mukherjee
  • for: The paper addresses automatic detection of personality traits from text and explores what a large-scale text dataset can contribute to the study of personality types.
  • methods: The authors collect, quality-control, and release the largest automatically curated dataset for Myers-Briggs (MBTI) prediction, then run extensive qualitative and quantitative analyses along with ablation studies of baseline models.
  • results: Automatic personality detection can predict an individual's personality type, the observed data patterns often follow natural intuition, and the resulting signals are valuable for a range of application areas.
    Abstract Personality types are important in various fields as they hold relevant information about the characteristics of a human being in an explainable format. They are often good predictors of a person's behaviors in a particular environment and have applications ranging from candidate selection to marketing and mental health. Recently automatic detection of personality traits from texts has gained significant attention in computational linguistics. Most personality detection and analysis methods have focused on small datasets making their experimental observations often limited. To bridge this gap, we focus on collecting and releasing the largest automatically curated dataset for the research community which has 152 million tweets and 56 thousand data points for the Myers-Briggs personality type (MBTI) prediction task. We perform a series of extensive qualitative and quantitative studies on our dataset to analyze the data patterns in a better way and infer conclusions. We show how our intriguing analysis results often follow natural intuition. We also perform a series of ablation studies to show how the baselines perform for our dataset.

CrisisTransformers: Pre-trained Language Models and Sentence Encoders for Crisis-Related Social Media Texts

  • paper_url: http://arxiv.org/abs/2309.05494
  • repo_url: None
  • paper_authors: Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
  • for: This paper is written to address the challenges of analyzing crisis-related social media texts and to introduce an ensemble of pre-trained language models and sentence encoders called CrisisTransformers.
  • methods: The authors use an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events to train their models, including BERT and RoBERTa, and evaluate their performance on 18 crisis-specific public datasets.
  • results: The authors find that their pre-trained models outperform strong baselines across all datasets in classification tasks, and their best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, they investigate the impact of model initialization on convergence and the significance of domain-specific models in generating semantically meaningful sentence embeddings.
    Abstract Social media platforms play an essential role in crisis communication, but analyzing crisis-related social media texts is challenging due to their informal nature. Transformer-based pre-trained models like BERT and RoBERTa have shown success in various NLP tasks, but they are not tailored for crisis-related texts. Furthermore, general-purpose sentence encoders are used to generate sentence embeddings, regardless of the textual complexities in crisis-related texts. Advances in applications like text classification, semantic search, and clustering contribute to effective processing of crisis-related texts, which is essential for emergency responders to gain a comprehensive view of a crisis event, whether historical or real-time. To address these gaps in crisis informatics literature, this study introduces CrisisTransformers, an ensemble of pre-trained language models and sentence encoders trained on an extensive corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents. We evaluate existing models and CrisisTransformers on 18 crisis-specific public datasets. Our pre-trained models outperform strong baselines across all datasets in classification tasks, and our best-performing sentence encoder improves the state-of-the-art by 17.43% in sentence encoding tasks. Additionally, we investigate the impact of model initialization on convergence and evaluate the significance of domain-specific models in generating semantically meaningful sentence embeddings. All models are publicly released (https://huggingface.co/crisistransformers), with the anticipation that they will serve as a robust baseline for tasks involving the analysis of crisis-related social media texts.

Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model

  • paper_url: http://arxiv.org/abs/2309.05475
  • repo_url: None
  • paper_authors: Neel Bhate, Ansh Mittal, Zhe He, Xiao Luo
  • for: The study investigates zero-shot learning for extracting demographics, social determinants of health, and family history from clinical notes.
  • methods: A GPT model is prompted with minimal instruction, and its extraction performance is examined using both traditional NER metrics and semantic-similarity metrics.
  • results: GPT-3.5 achieves an average F1 of 0.975 on demographics extraction, 0.615 on social determinants extraction, and 0.722 on family history extraction.
    Abstract Demographics, Social determinants of health, and family history documented in the unstructured text within the electronic health records are increasingly being studied to understand how this information can be utilized with the structured data to improve healthcare outcomes. After the GPT models were released, many studies have applied GPT models to extract this information from the narrative clinical notes. Different from the existing work, our research focuses on investigating the zero-shot learning on extracting this information together by providing minimum information to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model might provide text different from the text in the original data, we explore two sets of evaluation metrics, including the traditional NER evaluation metrics and semantic similarity evaluation metrics, to completely understand the performance. Our results show that the GPT-3.5 method achieved an average of 0.975 F1 on demographics extraction, 0.615 F1 on social determinants extraction, and 0.722 F1 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shots learning. Through the case studies, we also identified the limitations of the GPT models, which need to be addressed in future research.

Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

  • paper_url: http://arxiv.org/abs/2309.05454
  • repo_url: None
  • paper_authors: Joseph Marvin Imperial, Harish Tayyar Madabushi
  • for: The study evaluates how well open- and closed-source instruction-tuned language models perform at story completion and narrative simplification, tasks teachers carry out when assessing the difficulty of classroom materials against standard readability guidelines.
  • methods: A diverse set of open- and closed-source models, including ChatGPT, BLOOMZ, and FlanT5, is prompted with standard-guided instructions that control the target reading level of the generated text.
  • results: Standard-guided prompts that control readability help, but ChatGPT proves less effective on these generative tasks and needs more refined prompts, while open-source models such as BLOOMZ and FlanT5 show more promising results.
    Abstract Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators to properly assess the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate their performances in writing story completions and simplifying narratives$-$tasks that teachers perform$-$using standard-guided prompts controlling text readability. Our extensive findings provide empirical proof of how globally recognized models like ChatGPT may be considered less effective and may require more refined prompts for these generative tasks compared to other open-sourced models such as BLOOMZ and FlanT5$-$which have shown promising results.
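
For reference, the Flesch-Kincaid Grade Level mentioned above is a simple function of average sentence length and average syllables per word; a minimal computation (with a naive vowel-group syllable counter, which is only an approximation) looks like this:

```python
import re

def count_syllables(word):
    """Very rough heuristic; production FKGL tools use better syllable counters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("The cat sat on the mat. It purred softly."), 2))
```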

Evaluating the Deductive Competence of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.05452
  • repo_url: None
  • paper_authors: S. M. Seals, Valerie L. Shalin
  • for: The study assesses the reasoning and problem-solving capabilities of large language models (LLMs).
  • methods: Several LLMs are tested on a classic deductive reasoning problem from the cognitive science literature, with follow-up experiments that vary the presentation format and content.
  • results: The LLMs show limited ability to solve these problems in their conventional form; changing format and content produces performance differences without improving overall performance, and performance interacts with format and content in unexpected ways that differ from human behavior, suggesting the models have reasoning biases only partially predicted by human reasoning.
    Abstract The development of highly fluent large language models (LLMs) has prompted increased interest in assessing their reasoning and problem-solving capabilities. We investigate whether several LLMs can solve a classic type of deductive reasoning problem from the cognitive science literature. The tested LLMs have limited abilities to solve these problems in their conventional form. We performed follow up experiments to investigate if changes to the presentation format and content improve model performance. We do find performance differences between conditions; however, they do not improve overall performance. Moreover, we find that performance interacts with presentation format and content in unexpected ways that differ from human performance. Overall, our results suggest that LLMs have unique reasoning biases that are only partially predicted from human reasoning performance.

TeGit: Generating High-Quality Instruction-Tuning Data with Text-Grounded Task Design

  • paper_url: http://arxiv.org/abs/2309.05447
  • repo_url: None
  • paper_authors: Yongrui Chen, Haiyun Jiang, Xinting Huang, Shuming Shi, Guilin Qi
  • for: High-quality instruction-tuning data is critical to improving LLM capabilities, but existing collection methods are limited by prohibitive manual-annotation costs or by hallucination when relying solely on LLM generation.
  • methods: The paper proposes a scalable method that trains language models to automatically design tasks grounded in human-written texts; requiring the model to generate the instruction, input, and output together filters noise and curbs hallucination.
  • results: Automatic and manual evaluation experiments demonstrate the high quality of the resulting dataset.
    Abstract High-quality instruction-tuning data is critical to improving LLM capabilities. Existing data collection methods are limited by unrealistic manual labeling costs or by the hallucination of relying solely on LLM generation. To address the problems, this paper presents a scalable method to automatically collect high-quality instructional adaptation data by training language models to automatically design tasks based on human-written texts. Intuitively, human-written text helps to help the model attenuate illusions during the generation of tasks. Unlike instruction back-translation-based methods that directly take the given text as a response, we require the model to generate the \textit{instruction}, \textit{input}, and \textit{output} simultaneously to filter the noise. The results of the automated and manual evaluation experiments demonstrate the quality of our dataset.

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2309.05444
  • repo_url: None
  • paper_authors: Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker
  • for: The paper pushes the Mixture of Experts (MoE) architecture into the parameter-efficient fine-tuning (PEFT) regime, avoiding the cost of scaling a conventional MoE.
  • methods: The MoE architecture is combined with lightweight experts, so that fine-tuning updates less than 1% of the parameters of an 11B-parameter model and requires no prior task knowledge.
  • results: The approach outperforms standard PEFT methods and is on par with full fine-tuning while updating far fewer parameters, and it generalizes to unseen tasks.
    Abstract The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts.Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning by only updating the lightweight experts -- less than 1% of an 11B parameters model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints. Our code used in all the experiments is publicly available here: https://github.com/for-ai/parameter-efficient-moe.
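
A hedged sketch of the general idea of mixing lightweight experts: each "expert" is a small low-rank adapter attached to a frozen dense layer, and a learned router softly mixes their outputs, so only the adapters and the router are trained. The dimensions and the soft routing below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MixtureOfLightweightExperts(nn.Module):
    """Frozen base linear layer plus a soft mixture of low-rank experts."""
    def __init__(self, base_linear, num_experts=4, rank=8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False                  # only adapters/router train
        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.down = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.router = nn.Linear(d_in, num_experts)

    def forward(self, x):                            # x: (batch, seq, d_in)
        gates = torch.softmax(self.router(x), dim=-1)             # (b, s, E)
        expert_out = torch.einsum("bsd,edr,ero->bseo", x, self.down, self.up)
        return self.base(x) + torch.einsum("bse,bseo->bso", gates, expert_out)
```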

Experimenting with UD Adaptation of an Unsupervised Rule-based Approach for Sentiment Analysis of Mexican Tourist Texts

  • paper_url: http://arxiv.org/abs/2309.05312
  • repo_url: None
  • paper_authors: Olga Kellert, Mahmud Uz Zaman, Nicholas Hill Matlis, Carlos Gómez-Rodríguez
  • for: This paper reports results from a Universal Dependencies (UD) adaptation of an Unsupervised, Compositional and Recursive (UCR) rule-based approach to sentiment analysis (SA), submitted to the Shared Task at Rest-Mex 2023 (Team Olga/LyS-SALSA) within the IberLEF 2023 conference.
  • methods: Basic syntactic rules, such as modification and negation rules, are applied to words drawn from sentiment dictionaries, exploiting the advantages of an unsupervised method: (1) interpretability and explainability of SA, (2) robustness across datasets, languages, and domains, and (3) usability by non-experts in NLP.
  • results: The approach considerably outperforms other unsupervised SA methods; the authors also discuss adding modality features as another polarity-shifting rule and using word-disambiguation techniques to identify the correct sentiment words.
    Abstract This paper summarizes the results of experimenting with Universal Dependencies (UD) adaptation of an Unsupervised, Compositional and Recursive (UCR) rule-based approach for Sentiment Analysis (SA) submitted to the Shared Task at Rest-Mex 2023 (Team Olga/LyS-SALSA) (within the IberLEF 2023 conference). By using basic syntactic rules such as rules of modification and negation applied on words from sentiment dictionaries, our approach exploits some advantages of an unsupervised method for SA: (1) interpretability and explainability of SA, (2) robustness across datasets, languages and domains and (3) usability by non-experts in NLP. We compare our approach with other unsupervised approaches of SA that in contrast to our UCR rule-based approach use simple heuristic rules to deal with negation and modification. Our results show a considerable improvement over these approaches. We discuss future improvements of our results by using modality features as another shifting rule of polarity and word disambiguation techniques to identify the right sentiment words.
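
To make the rule-based idea concrete, here is a toy compositional scorer over UD-style dependencies with just two rules, intensifying modification and negation; the actual UCR system's dictionaries and rule set are much richer, so this is only an illustration.

```python
# Toy lexicon and rules: base polarity comes from a sentiment dictionary, an
# "advmod" intensifier scales it, and a negator among the dependents flips it.
LEXICON = {"good": 1.0, "bad": -1.0, "excellent": 2.0, "terrible": -2.0}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}
NEGATORS = {"not", "no", "never"}

def score_token(token, children):
    """token: head word form; children: list of (deprel, form) UD dependents."""
    polarity = LEXICON.get(token.lower(), 0.0)
    for deprel, form in children:
        if deprel == "advmod" and form.lower() in INTENSIFIERS:
            polarity *= INTENSIFIERS[form.lower()]   # modification rule
        if form.lower() in NEGATORS:
            polarity *= -1                           # negation rule
    return polarity

# "not very good": head "good" with dependents "very" and "not"
print(score_token("good", [("advmod", "very"), ("advmod", "not")]))  # -1.5
```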

Analysing Cross-Lingual Transfer in Low-Resourced African Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2309.05311
  • repo_url: https://github.com/michael-beukman/nertransfer
  • paper_authors: Michael Beukman, Manuel Fokam
  • for: The study examines the properties of cross-lingual transfer learning between ten low-resourced languages on a named entity recognition task.
  • methods: The authors investigate how adaptive fine-tuning and the choice of transfer language affect zero-shot transfer performance.
  • results: Models that perform well on a single language often generalize poorly to others, while the best-generalizing models sacrifice individual-language performance; data overlap between the source and target datasets predicts transfer performance better than either geographical or genetic distance between the languages.
    Abstract Transfer learning has led to large gains in performance for nearly all NLP tasks while making downstream models easier and faster to train. This has also been extended to low-resourced languages, with some success. We investigate the properties of cross-lingual transfer learning between ten low-resourced languages, from the perspective of a named entity recognition task. We specifically investigate how much adaptive fine-tuning and the choice of transfer language affect zero-shot transfer performance. We find that models that perform well on a single language often do so at the expense of generalising to others, while models with the best generalisation to other languages suffer in individual language performance. Furthermore, the amount of data overlap between the source and target datasets is a better predictor of transfer performance than either the geographical or genetic distance between the languages.
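
As a rough illustration of the overlap finding, one simple way to quantify data overlap between a source and a target NER dataset is vocabulary overlap, which can then be correlated with observed zero-shot transfer scores; the Jaccard measure below is an assumption about what "overlap" might mean, not the paper's exact definition.

```python
def vocab_overlap(source_sents, target_sents):
    """Jaccard overlap between the word vocabularies of two datasets."""
    src = {w.lower() for s in source_sents for w in s.split()}
    tgt = {w.lower() for s in target_sents for w in s.split()}
    return len(src & tgt) / len(src | tgt)

# Hypothetical toy corpora standing in for two languages' NER training sets:
source = ["the president visited the capital", "heavy rains hit the coast"]
target = ["the president met local farmers", "drought hit the north"]
print(round(vocab_overlap(source, target), 3))  # 0.231
```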

Minuteman: Machine and Human Joining Forces in Meeting Summarization

  • paper_url: http://arxiv.org/abs/2309.05272
  • repo_url: None
  • paper_authors: František Kmječ, Ondřej Bojar
  • for: The paper presents a new meeting-minuting tool that helps note-takers produce high-quality meeting minutes more efficiently.
  • methods: The tool combines speech recognition and summarization models to provide a live transcript and a live meeting summary that users edit collaboratively in real time, correcting ASR errors and imperfect summary points as they appear.
  • results: Tests in varied settings indicate the tool eases the cognitive load of note-takers and helps them catch up on parts of the meeting missed through absence or lapses in attention.
    Abstract Many meetings require creating a meeting summary to keep everyone up to date. Creating minutes of sufficient quality is however very cognitively demanding. Although we currently possess capable models for both audio speech recognition (ASR) and summarization, their fully automatic use is still problematic. ASR models frequently commit errors when transcribing named entities while the summarization models tend to hallucinate and misinterpret the transcript. We propose a novel tool -- Minuteman -- to enable efficient semi-automatic meeting minuting. The tool provides a live transcript and a live meeting summary to the users, who can edit them in a collaborative manner, enabling correction of ASR errors and imperfect summary points in real time. The resulting application eases the cognitive load of the notetakers and allows them to easily catch up if they missed a part of the meeting due to absence or a lack of focus. We conduct several tests of the application in varied settings, exploring the worthiness of the concept and the possible user strategies.

CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling

  • paper_url: http://arxiv.org/abs/2309.05270
  • repo_url: None
  • paper_authors: Mohsin Ali, Kandukuri Sai Teja, Neeharika Gupta, Parth Patwa, Anubhab Chatterjee, Vinija Jain, Aman Chadha, Amitava Das
  • for: The paper proposes CONFLATOR, a neural language modeling approach for code-mixed text.
  • methods: Several positional encoding mechanisms are explored; rotatory positional encodings combined with switching-point information, applied at both unigram and bigram levels, give the best results.
  • results: CONFLATOR outperforms the state of the art on two tasks based on code-mixed Hindi and English (Hinglish): sentiment analysis and machine translation.
    Abstract The mixing of two or more languages is called Code-Mixing (CM). CM is a social norm in multilingual societies. Neural Language Models (NLMs) like transformers have been very effective on many NLP tasks. However, NLM for CM is an under-explored area. Though transformers are capable and powerful, they cannot always encode positional/sequential information since they are non-recurrent. Therefore, to enrich word information and incorporate positional information, positional encoding is defined. We hypothesize that Switching Points (SPs), i.e., junctions in the text where the language switches (L1 -> L2 or L2-> L1), pose a challenge for CM Language Models (LMs), and hence give special emphasis to switching points in the modeling process. We experiment with several positional encoding mechanisms and show that rotatory positional encodings along with switching point information yield the best results. We introduce CONFLATOR: a neural language modeling approach for code-mixed languages. CONFLATOR tries to learn to emphasize switching points using smarter positional encoding, both at unigram and bigram levels. CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English (Hinglish): (i) sentiment analysis and (ii) machine translation.
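
A rough sketch of the intuition: rotary position embeddings rotate query/key vectors by position-dependent angles, and switching-point information can be injected by, for example, restarting the position index at every language switch. The reset strategy below is an illustrative assumption, not necessarily CONFLATOR's exact mechanism.

```python
import torch

def rotary_angles(positions, dim, base=10000.0):
    """Standard RoPE angles for integer positions; returns (len(positions), dim/2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)

def switch_aware_positions(lang_ids):
    """Restart the position counter at every switching point (L1 -> L2 or back)."""
    positions, counter = [], 0
    for i, lang in enumerate(lang_ids):
        if i > 0 and lang != lang_ids[i - 1]:
            counter = 0                      # switching point: reset position
        positions.append(counter)
        counter += 1
    return torch.tensor(positions)

# Hinglish example "mujhe yeh movie really pasand hai":
lang_ids = ["hi", "hi", "en", "en", "hi", "hi"]
angles = rotary_angles(switch_aware_positions(lang_ids), dim=64)
print(angles.shape)  # torch.Size([6, 32])
```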

Exploring the Law of Numbers: Evidence from China’s Real Estate

  • paper_url: http://arxiv.org/abs/2309.05221
  • repo_url: None
  • paper_authors: Fuqian Zhang, Zhenhua Wang
  • for: The paper examines the financial statements of Chinese real-estate companies to characterize the laws governing numbers more comprehensively.
  • methods: Benford's Law is applied to the first digit, and the analysis is extended to two further dimensions of numbers: frequency and length.
  • results: The numbers in these financial statements show irregularities suggestive of data manipulation; beyond that concern, the findings open a discussion of number diversity and usage insights, carrying economic significance and deepening the understanding of numerical phenomena.
    Abstract The renowned proverb, Numbers do not lie, underscores the reliability and insight that lie beneath numbers, a concept of undisputed importance, especially in economics and finance etc. Despite the prosperity of Benford's Law in the first digit analysis, its scope fails to remain comprehensiveness when it comes to deciphering the laws of number. This paper delves into number laws by taking the financial statements of China real estate as a representative, quantitatively study not only the first digit, but also depict the other two dimensions of numbers: frequency and length. The research outcomes transcend mere reservations about data manipulation and open the door to discussions surrounding number diversity and the delineation of the usage insights. This study wields both economic significance and the capacity to foster a deeper comprehension of numerical phenomena.
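
For context, Benford's Law predicts that a leading digit d occurs with probability log10(1 + 1/d); a minimal first-digit check against that expectation (the paper extends the analysis to number frequency and length as well) could look like the following sketch.

```python
import math
from collections import Counter

def first_digit_distribution(values):
    """Empirical distribution of leading digits 1-9 over nonzero values."""
    digits = [int(str(abs(v)).lstrip("0.").replace(".", "")[0])
              for v in values if v != 0]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def benford_expected():
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical line items, not figures from any real financial statement:
amounts = [1834.2, 912.0, 23.5, 188.0, 4020.0, 1290.7, 77.3, 15.0, 260.1]
observed, expected = first_digit_distribution(amounts), benford_expected()
for d in range(1, 10):
    print(d, round(observed[d], 2), round(expected[d], 2))
```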

Understanding the Impact of Post-Training Quantization on Large Language Models

  • paper_url: http://arxiv.org/abs/2309.05210
  • repo_url: None
  • paper_authors: Somnath Roy
  • for: The paper focuses on the deployment and operation of large language models (LLMs) on consumer-grade GPUs, and the impact of hyperparameters on the performance of quantized models.
  • methods: The paper compares and analyzes the performance of different quantization techniques, including nf4, fp4, and fp4-dq, on various LLMs, and investigates the effects of temperature on the performance of these models.
  • results: The study finds that nf4 and fp4 are equally proficient 4-bit quantization techniques, but nf4 displays greater resilience to temperature variations in the case of the llama2 series of models at lower temperature. Additionally, the study shows that 4-bit quantized models of varying sizes exhibit higher sensitivity to temperature in the range of 0.5 to 0.8, and that int8 quantization is associated with significantly slower inference speeds.
    Abstract Large language models (LLMs) are rapidly increasing in size, with the number of parameters becoming a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even the recently released publicly accessible models for commercial usage, such as Falcon and Llama2, come equipped with billions of parameters. This significant increase in the number of parameters makes deployment and operation very costly. The remarkable progress in the field of quantization for large neural networks in general and LLMs in particular, has made these models more accessible by enabling them to be deployed on consumer-grade GPUs. Quantized models generally demonstrate comparable performance levels to their unquantized base counterparts. Nonetheless, there exists a notable gap in our comprehensive understanding of how these quantized models respond to hyperparameters, such as temperature, max new tokens, and topk, particularly for next word prediction. The present analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization techniques, characterized by similar attributes such as inference speed, memory consumption, and the quality of generated content. the study identifies nf4 as displaying greater resilience to temperature variations in the case of the llama2 series of models at lower temperature, while fp4 and fp4-dq proves to be a more suitable choice for falcon series of models. It is noteworthy that, in general, 4-bit quantized models of varying sizes exhibit higher sensitivity to temperature in the range of 0.5 to 0.8, unlike their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference speeds, whereas unquantized bfloat16 models consistently yield the fastest inference speeds across models of all sizes.
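
As a practical illustration, the nf4/fp4 settings discussed above correspond to the 4-bit options exposed through the `bitsandbytes` integration in Hugging Face `transformers`; a typical loading configuration looks roughly like this (the checkpoint name is just an example, and exact option availability depends on the installed library versions).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 with double quantization; set bnb_4bit_quant_type="fp4" for the fp4
# variants, and bnb_4bit_use_double_quant toggles the "-dq" setups.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # any causal LM checkpoint can be used here
    quantization_config=bnb_config,
    device_map="auto",
)
```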

From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery

  • paper_url: http://arxiv.org/abs/2309.05203
  • repo_url: None
  • paper_authors: Yuhan Chen, Nuwa Xi, Yanrui Du, Haochun Wang, Chen Jianyu, Sendong Zhao, Bing Qin
  • for: To improve cross-modal molecule discovery methods that suffer from data scarcity in low-resource settings.
  • methods: Artificially-real pseudo data generated by large language models, constructed with a retrieval-based prompting strategy, is used for domain adaptation.
  • results: The pseudo-data approach outperforms existing methods while requiring a smaller model, less data, and lower training cost, and performance keeps improving as the volume of pseudo data grows, demonstrating its efficiency.
    Abstract Molecule discovery serves as a cornerstone in numerous scientific domains, fueling the development of new materials and innovative drug designs. Recent developments of in-silico molecule discovery have highlighted the promising results of cross-modal techniques, which bridge molecular structures with their descriptive annotations. However, these cross-modal methods frequently encounter the issue of data scarcity, hampering their performance and application. In this paper, we address the low-resource challenge by utilizing artificially-real data generated by Large Language Models (LLMs). We first introduce a retrieval-based prompting strategy to construct high-quality pseudo data, then explore the optimal method to effectively leverage this pseudo data. Experiments show that using pseudo data for domain adaptation outperforms all existing methods, while also requiring a smaller model scale, reduced data size and lower training cost, highlighting its efficiency. Furthermore, our method shows a sustained improvement as the volume of pseudo data increases, revealing the great potential of pseudo data in advancing low-resource cross-modal molecule discovery.

  • paper_url: http://arxiv.org/abs/2309.05201
  • repo_url: None
  • paper_authors: Minhao Zhang, Yongliang Ma, Yanzeng Li, Ruoyu Zhang, Lei Zou, Ming Zhou
  • for: The work addresses the limitation that prior question answering over multiple knowledge bases (KBs) cannot exploit the different types of links between KBs.
  • methods: A new Multi-KB-QA task is formulated that leverages both full and partial links among multiple KBs to derive correct answers, and a benchmark with diversified link and query types is constructed for efficient evaluation; the proposed method encodes all link relations in the KB embedding to score and rank candidate answers.
  • results: Experiments show the proposed method markedly surpasses conventional KB-QA systems on Multi-KB-QA, justifying the need for the new task.
    Abstract Incorporating multiple knowledge sources is proven to be beneficial for answering complex factoid questions. To utilize multiple knowledge bases (KB), previous works merge all KBs into a single graph via entity alignment and reduce the problem to question-answering (QA) over the fused KB. In reality, various link relations between KBs might be adopted in QA over multi-KBs. In addition to the identity between the alignable entities (i.e. full link), unalignable entities expressing the different aspects or types of an abstract concept may also be treated identical in a question (i.e. partial link). Hence, the KB fusion in prior works fails to represent all types of links, restricting their ability to comprehend multi-KBs for QA. In this work, we formulate the novel Multi-KB-QA task that leverages the full and partial links among multiple KBs to derive correct answers, a benchmark with diversified link and query types is also constructed to efficiently evaluate Multi-KB-QA performance. Finally, we propose a method for Multi-KB-QA that encodes all link relations in the KB embedding to score and rank candidate answers. Experiments show that our method markedly surpasses conventional KB-QA systems in Multi-KB-QA, justifying the necessity of devising this task.

Does Writing with Language Models Reduce Content Diversity?

  • paper_url: http://arxiv.org/abs/2309.05196
  • repo_url: https://github.com/vishakhpk/hai-diversity
  • paper_authors: Vishakh Padmakumar, He He
  • for: measure the impact of co-writing on diversity in produced content
  • methods: controlled experiment with three setups (base LLM, feedback-tuned LLM, and no model help) and diversity metrics
  • results: writing with InstructGPT (but not GPT3) results in a statistically significant reduction in diversity, with increased similarity between writings of different authors and reduced lexical and content diversity, primarily due to InstructGPT contributing less diverse text to co-written essays.
    Abstract Large language models (LLMs) have led to a surge in collaborative writing with model assistance. As different users incorporate suggestions from the same model, there is a risk of decreased diversity in the produced content, potentially limiting diverse perspectives in public discourse. In this work, we measure the impact of co-writing on diversity via a controlled experiment, where users write argumentative essays in three setups -- using a base LLM (GPT3), a feedback-tuned LLM (InstructGPT), and writing without model help. We develop a set of diversity metrics and find that writing with InstructGPT (but not the GPT3) results in a statistically significant reduction in diversity. Specifically, it increases the similarity between the writings of different authors and reduces the overall lexical and content diversity. We additionally find that this effect is mainly attributable to InstructGPT contributing less diverse text to co-written essays. In contrast, the user-contributed text remains unaffected by model collaboration. This suggests that the recent improvement in generation quality from adapting models to human feedback might come at the cost of more homogeneous and less diverse content.
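
A hedged sketch of two simple diversity measures of the kind such a study relies on (the paper's exact metrics may differ): distinct-n lexical diversity within a collection of essays, and mean pairwise similarity between different authors' essays.

```python
from itertools import combinations

def distinct_n(texts, n=2):
    """Fraction of n-grams that are unique across a collection of texts."""
    ngrams = []
    for t in texts:
        toks = t.lower().split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def mean_pairwise_jaccard(texts):
    """Higher values mean different authors' essays look more alike."""
    sets = [set(t.lower().split()) for t in texts]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical essay openings from three different authors:
essays = ["school uniforms limit self expression",
          "school uniforms reduce costs for families",
          "uniforms make mornings simpler for students"]
print(round(distinct_n(essays), 2), round(mean_pairwise_jaccard(essays), 2))
```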