cs.CL - 2023-11-03

Grounded Intuition of GPT-Vision’s Abilities with Scientific Images

  • paper_url: http://arxiv.org/abs/2311.02069
  • repo_url: https://github.com/ahwang16/grounded-intuition-gpt-vision
  • paper_authors: Alyssa Hwang, Andrew Head, Chris Callison-Burch
  • for: This study aims to help researchers better understand the capabilities and limitations of the new GPT-Vision model.
  • methods: The study draws on grounded theory and thematic analysis from social science and human-computer interaction to establish a rigorous framework for qualitative evaluation of new models in natural language processing.
  • results: GPT-Vision is found to be particularly sensitive to prompting, to counterfactual text in images, and to relative spatial relationships. The method and analysis aim to help researchers build grounded intuitions of new models while exploring how GPT-Vision can be applied to make information more accessible.
    Abstract GPT-Vision has impressed us on a range of vision-language tasks, but it comes with the familiar new challenge: we have little idea of its capabilities and limitations. In our study, we formalize a process that many have instinctively been trying already to develop "grounded intuition" of this new model. Inspired by the recent movement away from benchmarking in favor of example-driven qualitative evaluation, we draw upon grounded theory and thematic analysis in social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing. We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible.

Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

  • paper_url: http://arxiv.org/abs/2311.02025
  • repo_url: None
  • paper_authors: Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto
  • for: This work aims to improve cross-lingual abusive language detection using data augmentation and continual pre-training for domain adaptation.
  • methods: The study analyzes two existing data augmentation techniques based on vicinal risk minimization and proposes MIXAG, a novel method that interpolates pairs of instances based on the angle between their representations (a sketch follows below).
  • results: Experiments show that the data augmentation strategies enhance few-shot cross-lingual abusive language detection, with MIXAG improving significantly and consistently in multidomain and multilingual settings.
    Abstract Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that consistently in all target languages, MIXAG improves significantly in multidomain and multilingual environments. Finally, we show through an error analysis how the domain adaptation can favour the class of abusive texts (reducing false negatives), but at the same time, declines the precision of the abusive language detection model.
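The abstract specifies only that MIXAG interpolates instance pairs based on the angle of their representations. Below is a minimal, hypothetical sketch of that idea in the style of mixup; the mapping from angle to mixing coefficient and the clipping range are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def mixag_pair(x1, x2, y1, y2):
    """Interpolate a pair of instance representations, deriving the mixing
    coefficient from the angle between them (assumed mapping: more
    orthogonal pairs are mixed more evenly)."""
    cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2) + 1e-8)
    angle = np.arccos(np.clip(cos, -1.0, 1.0))   # angle in [0, pi]
    lam = 1.0 - angle / np.pi                    # hypothetical angle -> mixing ratio
    lam = float(np.clip(lam, 0.5, 1.0))          # keep the mix biased toward x1
    x_aug = lam * x1 + (1.0 - lam) * x2          # vicinal instance
    y_aug = lam * y1 + (1.0 - lam) * y2          # soft label, mixup-style
    return x_aug, y_aug

# Toy usage: two sentence embeddings with one-hot abusive/non-abusive labels.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=768), rng.normal(size=768)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_aug, y_aug = mixag_pair(x1, x2, y1, y2)
```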

ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models

  • paper_url: http://arxiv.org/abs/2311.01981
  • repo_url: None
  • paper_authors: Haotian Luo, Kunming Wu, Cheng Dai, Sixian Ding, Xinhao Chen
  • for: Addresses the problem of language models forgetting the prompt during generation.
  • methods: Uses synthetic gradients to teach the model to memorize the prompt during generation, temporarily hard-coding the prompt into model parameters via low-rank gradient approximation (a sketch follows below).
  • results: Experiments show that the method effectively mitigates prompt forgetting during generation.
    Abstract RNN-like language models are getting renewed attention from NLP researchers in recent years and several models have made significant progress, which demonstrates performance comparable to traditional transformers. However, due to the recurrent nature of RNNs, this kind of language model can only store information in a set of fixed-length state vectors. As a consequence, they still suffer from forgetfulness though after a lot of improvements and optimizations, when given complex instructions or prompts. As the prompted generation is the main and most concerned function of LMs, solving the problem of forgetting in the process of generation is no wonder of vital importance. In this paper, focusing on easing the prompt forgetting during generation, we proposed an architecture to teach the model memorizing prompt during generation by synthetic gradient. To force the model to memorize the prompt, we derive the states that encode the prompt, then transform it into model parameter modification using low-rank gradient approximation, which hard-codes the prompt into model parameters temporarily. We construct a dataset for experiments, and the results have demonstrated the effectiveness of our method in solving the problem of forgetfulness in the process of prompted generation. We will release all the code upon acceptance.
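The abstract describes deriving the states that encode the prompt and transforming them into a temporary low-rank modification of model parameters. The sketch below illustrates that mechanism only in spirit: the SVD over the prompt states, the rank, and the scale are all assumptions made for illustration, not the paper's derivation of the synthetic gradient.

```python
import torch

def lowrank_prompt_update(W, prompt_states, rank=1, scale=1e-3):
    """Hypothetical sketch: distill prompt-encoding states into a temporary
    low-rank modification of a weight matrix, approximating the update the
    prompt would induce (the paper's 'synthetic gradient')."""
    # SVD of the prompt states gives dominant directions to hard-code.
    U, S, Vh = torch.linalg.svd(prompt_states, full_matrices=False)
    delta = torch.zeros_like(W)
    for r in range(rank):
        u = U[:, r] * S[r]
        v = Vh[r, :]
        delta += scale * torch.outer(u[: W.shape[0]], v[: W.shape[1]])
    return W + delta  # applied only for this generation, then discarded

W = torch.randn(64, 64)               # a toy recurrent weight matrix
prompt_states = torch.randn(64, 64)   # hidden states that encode the prompt
W_prompted = lowrank_prompt_update(W, prompt_states, rank=2)
```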

Too Much Information: Keeping Training Simple for BabyLMs

  • paper_url: http://arxiv.org/abs/2311.01955
  • repo_url: None
  • paper_authors: Lukas Edman, Lisa Bylinina
  • for: This paper describes the University of Groningen's work on the BabyLM Challenge.
  • methods: Following the idea that, like babies, language models should be introduced to simpler concepts before building up to more complex ones, the authors examine this simple-then-complex strategy through several lenses: context size, vocabulary, and overall linguistic complexity of the data (a sketch of the context-size lens follows below).
  • results: Only one lens, context size, proved truly beneficial to training a language model, but this simple change yields average improvements of 2 points on (Super)GLUE tasks, 1 point on MSGS tasks, and 12% on BLiMP tasks. The context-limited model outperforms a baseline trained on 10x the amount of data.
    Abstract This paper details the work of the University of Groningen for the BabyLM Challenge. We follow the idea that, like babies, language models should be introduced to simpler concepts first and build off of that knowledge to understand more complex concepts. We examine this strategy of simple-then-complex through a variety of lenses, namely context size, vocabulary, and overall linguistic complexity of the data. We find that only one, context size, is truly beneficial to training a language model. However this simple change to context size gives us improvements of 2 points on average on (Super)GLUE tasks, 1 point on MSGS tasks, and 12\% on average on BLiMP tasks. Our context-limited model outperforms the baseline that was trained on 10$\times$ the amount of data.
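Since the single beneficial change was a smaller context size, the preprocessing it implies is simple to sketch. The window length below is an illustrative assumption, not the value used in the paper.

```python
def chunk_into_contexts(token_ids, context_size=128):
    """Split a long token stream into short, fixed-size training contexts.
    Limiting context size was the only 'simple-then-complex' lens that
    helped in the paper; the concrete size here is an assumption."""
    return [
        token_ids[i : i + context_size]
        for i in range(0, len(token_ids) - context_size + 1, context_size)
    ]

corpus = list(range(1000))            # stand-in for a tokenized corpus
batches = chunk_into_contexts(corpus, context_size=128)
print(len(batches), len(batches[0]))  # 7 128
```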

Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks

  • paper_url: http://arxiv.org/abs/2311.01949
  • repo_url: None
  • paper_authors: Yifan Wang, Qingyan Guo, Xinzhe Ni, Chufan Shi, Lemao Liu, Haiyun Jiang, Yujiu Yang
  • for: Improving the performance of large language models (LLMs) on knowledge-intensive tasks, particularly open-domain question answering.
  • methods: Proposes the Hint-enhanced In-Context Learning (HICL) paradigm, which leverages LLMs' reasoning ability to extract query-related knowledge from demonstrations and then concatenates that knowledge as a more explicit hint in the prompt (a sketch follows below). The source of the knowledge is tracked to identify specific examples, and a Hint-related Example Retriever (HER) selects informative examples.
  • results: Evaluated on 3 open-domain QA benchmarks, HICL with HER yields average gains over the standard setting of 2.89 EM and 2.52 F1 on gpt-3.5-turbo, and 7.62 EM and 7.27 F1 on LLaMA-2-Chat-7B.
    Abstract In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs), enabling them to learn input-label mappings from demonstrations and perform well on downstream tasks. However, under the standard ICL setting, LLMs may sometimes neglect query-related information in demonstrations, leading to incorrect predictions. To address this limitation, we propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering, an important form in knowledge-intensive tasks. HICL leverages LLMs' reasoning ability to extract query-related knowledge from demonstrations, then concatenates the knowledge to prompt LLMs in a more explicit way. Furthermore, we track the source of this knowledge to identify specific examples, and introduce a Hint-related Example Retriever (HER) to select informative examples for enhanced demonstrations. We evaluate HICL with HER on 3 open-domain QA benchmarks, and observe average performance gains of 2.89 EM score and 2.52 F1 score on gpt-3.5-turbo, 7.62 EM score and 7.27 F1 score on LLaMA-2-Chat-7B compared with standard setting.
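A minimal sketch of the prompt-assembly step HICL implies: standard demonstrations first, then the extracted query-related knowledge made explicit, then the actual question. The template wording is an assumption; the paper's actual prompts may differ.

```python
def build_hicl_prompt(demos, hints, query):
    """Assemble a hint-enhanced ICL prompt: demonstrations, followed by
    query-related knowledge ('hints') extracted from them, made explicit
    before the question itself."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    parts.append("Relevant knowledge: " + " ".join(hints))
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

demos = [("Where is the Eiffel Tower located?", "Paris")]
hints = ["The Eiffel Tower stands on the Champ de Mars in Paris, France."]
print(build_hicl_prompt(demos, hints, "Which river flows past the Eiffel Tower?"))
```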

Constructing Temporal Dynamic Knowledge Graphs from Interactive Text-based Games

  • paper_url: http://arxiv.org/abs/2311.01928
  • repo_url: https://github.com/yukw777/temporal-discrete-graph-updater
  • paper_authors: Keunwoo Peter Yu
  • for: Proposes a new graph updater model to improve the representation and learning of dynamic knowledge graphs in text-based games.
  • methods: The model represents dynamic knowledge graphs as a sequence of timestamped graph events and models them with a temporal point based graph neural network, improving knowledge graph accuracy and interpretability (a sketch of the event representation follows below).
  • results: Experiments on the TextWorld dataset show that TDGU outperforms the baseline DGU; an ablation study confirms the importance of temporal information, and a demonstration on more complex environments with same-label objects shows TDGU's stronger generalization.
    Abstract In natural language processing, interactive text-based games serve as a test bed for interactive AI systems. Prior work has proposed to play text-based games by acting based on discrete knowledge graphs constructed by the Discrete Graph Updater (DGU) to represent the game state from the natural language description. While DGU has shown promising results with high interpretability, it suffers from lower knowledge graph accuracy due to its lack of temporality and limited generalizability to complex environments with objects with the same label. In order to address DGU's weaknesses while preserving its high interpretability, we propose the Temporal Discrete Graph Updater (TDGU), a novel neural network model that represents dynamic knowledge graphs as a sequence of timestamped graph events and models them using a temporal point based graph neural network. Through experiments on the dataset collected from a text-based game TextWorld, we show that TDGU outperforms the baseline DGU. We further show the importance of temporal information for TDGU's performance through an ablation study and demonstrate that TDGU has the ability to generalize to more complex environments with objects with the same label. All the relevant code can be found at \url{https://github.com/yukw777/temporal-discrete-graph-updater}.
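A minimal sketch of the representation TDGU builds on: the dynamic knowledge graph as a sequence of timestamped graph events that can be replayed into the current game state. The event vocabulary here is inferred from the abstract and is illustrative only; the actual neural modeling of these events lives in the linked repository.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GraphEvent:
    """One timestamped event in the dynamic knowledge graph."""
    timestamp: int
    op: str                      # assumed vocabulary: "node-add", "node-delete",
    src: str                     #                     "edge-add", "edge-delete"
    dst: Optional[str] = None
    label: Optional[str] = None

def apply_events(events):
    """Replay events in timestamp order to materialize the current graph."""
    nodes, edges = set(), {}
    for e in sorted(events, key=lambda e: e.timestamp):
        if e.op == "node-add":
            nodes.add(e.src)
        elif e.op == "node-delete":
            nodes.discard(e.src)
        elif e.op == "edge-add":
            edges[(e.src, e.dst)] = e.label
        elif e.op == "edge-delete":
            edges.pop((e.src, e.dst), None)
    return nodes, edges

events = [
    GraphEvent(0, "node-add", "player"),
    GraphEvent(0, "node-add", "kitchen"),
    GraphEvent(1, "edge-add", "player", "kitchen", "in"),
    GraphEvent(2, "edge-delete", "player", "kitchen"),
]
print(apply_events(events))  # ({'player', 'kitchen'}, {})
```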

BoschAI @ PLABA 2023: Leveraging Edit Operations in End-to-End Neural Sentence Simplification

  • paper_url: http://arxiv.org/abs/2311.01907
  • repo_url: None
  • paper_authors: Valentin Knappich, Simon Razniewski, Annemarie Friedrich
  • for: This paper presents an automatic simplification system based on Llama 2 to help laypeople comprehend complex scientific text.
  • methods: A language model translates complex language into simple language. The paper proposes sentence-level and token-level loss weights to counteract the weak training signal and conservatively editing models caused by the large token overlap between input and output (a sketch of the token-level weighting follows below).
  • results: Empirically, the approach produces simplifications closer to those created by human annotators (+1.8% / +3.5% SARI), simpler language (-1 / -1.1 FKGL), and more edits (1.6x / 1.8x edit distance) than the same model fine-tuned with standard cross entropy. The hyperparameter $\lambda$ in the token-level loss weights controls the edit distance and the simplicity level (FKGL).
    Abstract Automatic simplification can help laypeople to comprehend complex scientific text. Language models are frequently applied to this task by translating from complex to simple language. In this paper, we describe our system based on Llama 2, which ranked first in the PLABA shared task addressing the simplification of biomedical text. We find that the large portion of shared tokens between input and output leads to weak training signals and conservatively editing models. To mitigate these issues, we propose sentence-level and token-level loss weights. They give higher weight to modified tokens, indicated by edit distance and edit operations, respectively. We conduct an empirical evaluation on the PLABA dataset and find that both approaches lead to simplifications closer to those created by human annotators (+1.8% / +3.5% SARI), simpler language (-1 / -1.1 FKGL) and more edits (1.6x / 1.8x edit distance) compared to the same model fine-tuned with standard cross entropy. We furthermore show that the hyperparameter $\lambda$ in token-level loss weights can be used to control the edit distance and the simplicity level (FKGL).
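A minimal sketch of token-level loss weighting: tokens that edit operations mark as modified relative to the source receive a higher weight in the cross entropy. The scheme below (weight $\lambda$ on modified tokens, 1 elsewhere) is a plausible instantiation consistent with the abstract, not necessarily the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def token_weighted_ce(logits, targets, modified_mask, lam=2.0):
    """Weighted cross entropy: tokens flagged as modified (identified via
    edit operations against the source sentence) receive weight lam,
    unmodified tokens weight 1.0."""
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    mask = modified_mask.view(-1)
    weights = torch.where(mask, torch.full_like(ce, lam), torch.ones_like(ce))
    return (weights * ce).mean()

logits = torch.randn(1, 5, 100)   # (batch, sequence, vocab)
targets = torch.randint(0, 100, (1, 5))
modified = torch.tensor([[False, True, True, False, False]])
loss = token_weighted_ce(logits, targets, modified, lam=2.0)
```

Consistent with the results bullet above, raising `lam` would push the model toward more editing (larger edit distance) and simpler output.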

Indicative Summarization of Long Discussions

  • paper_url: http://arxiv.org/abs/2311.01882
  • repo_url: https://github.com/webis-de/emnlp-23
  • paper_authors: Shahbaz Syed, Dominik Schwabe, Khalid Al-Khatib, Martin Potthast
  • for: Presents a novel unsupervised approach that uses large language models (LLMs) to generate indicative summaries of long discussions, which serve as tables of contents for quickly browsing and understanding them.
  • methods: The approach first clusters argument sentences, then generates cluster labels as abstractive summaries, and finally classifies the generated labels into argumentation frames, yielding a two-level summary (a sketch of the clustering stage follows below).
  • results: Using an extensively optimized prompt engineering approach, 19 LLMs are evaluated for generative cluster labeling and frame classification; a purpose-driven user study shows that the proposed indicative summaries serve as a convenient navigation tool for exploring long discussions.
    Abstract Online forums encourage the exchange and discussion of different stances on many topics. Not only do they provide an opportunity to present one's own arguments, but may also gather a broad cross-section of others' arguments. However, the resulting long discussions are difficult to overview. This paper presents a novel unsupervised approach using large language models (LLMs) to generating indicative summaries for long discussions that basically serve as tables of contents. Our approach first clusters argument sentences, generates cluster labels as abstractive summaries, and classifies the generated cluster labels into argumentation frames resulting in a two-level summary. Based on an extensively optimized prompt engineering approach, we evaluate 19~LLMs for generative cluster labeling and frame classification. To evaluate the usefulness of our indicative summaries, we conduct a purpose-driven user study via a new visual interface called Discussion Explorer: It shows that our proposed indicative summaries serve as a convenient navigation tool to explore long discussions.
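A minimal sketch of the first stage, clustering argument sentences before an LLM labels each cluster. The embedding model and cluster count are illustrative choices, not the paper's configuration (the paper evaluates 19 LLMs for the labeling and frame-classification stages).

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_arguments(sentences, n_clusters=3):
    """Embed argument sentences and group them; each cluster is then
    labeled abstractively by an LLM, and the label is classified into an
    argumentation frame to form the two-level summary."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    clusters = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(int(label), []).append(sentence)
    return clusters

# A labeling prompt for the second stage might then read:
# "Summarize the shared argument of these sentences in at most six words: ..."
```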

Sentiment Analysis through LLM Negotiations

  • paper_url: http://arxiv.org/abs/2311.01876
  • repo_url: None
  • paper_authors: Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, Guoyin Wang
  • for: This paper aims to improve the accuracy of sentiment analysis by introducing a multi-LLM negotiation framework that leverages the complementary abilities of multiple language models to generate more accurate and well-reasoned decisions.
  • methods: The proposed framework consists of a reasoning-infused generator and an explanation-deriving discriminator, which iterate until a consensus is reached. The generator provides decisions along with rationale, while the discriminator evaluates the credibility of the generator’s decisions.
  • results: The proposed approach consistently outperforms the in-context learning (ICL) baseline across all benchmarks, and even achieves superior performances compared to supervised baselines on the Twitter and movie review datasets.
    Abstract A standard paradigm for sentiment analysis is to rely on a singular LLM and makes the decision in a single round under the framework of in-context learning. This framework suffers the key disadvantage that the single-turn output generated by a single LLM might not deliver the perfect decision, just as humans sometimes need multiple attempts to get things right. This is especially true for the task of sentiment analysis where deep reasoning is required to address the complex linguistic phenomenon (e.g., clause composition, irony, etc) in the input. To address this issue, this paper introduces a multi-LLM negotiation framework for sentiment analysis. The framework consists of a reasoning-infused generator to provide decision along with rationale, a explanation-deriving discriminator to evaluate the credibility of the generator. The generator and the discriminator iterate until a consensus is reached. The proposed framework naturally addressed the aforementioned challenge, as we are able to take the complementary abilities of two LLMs, have them use rationale to persuade each other for correction. Experiments on a wide range of sentiment analysis benchmarks (SST-2, Movie Review, Twitter, yelp, amazon, IMDB) demonstrate the effectiveness of proposed approach: it consistently yields better performances than the ICL baseline across all benchmarks, and even superior performances to supervised baselines on the Twitter and movie review datasets.
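The negotiation loop is easy to sketch abstractly. Below, `generate` and `discriminate` stand in for calls to two different LLMs with role-specific prompts (not given in the abstract); only the control flow follows the paper's description.

```python
def negotiate(text, generate, discriminate, max_rounds=4):
    """Generator proposes a sentiment label plus a rationale; the
    discriminator judges the rationale's credibility and, if unconvinced,
    returns a critique that is fed back to the generator. Iterate until
    consensus or a round limit is reached."""
    feedback = None
    label = None
    for _ in range(max_rounds):
        label, rationale = generate(text, feedback)       # reasoning-infused generator
        verdict, critique = discriminate(text, label, rationale)
        if verdict == "agree":                            # consensus reached
            return label
        feedback = critique                               # persuade the generator to revise
    return label                                          # fall back to the last proposal
```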

Efficient Black-Box Adversarial Attacks on Neural Text Detectors

  • paper_url: http://arxiv.org/abs/2311.01873
  • repo_url: None
  • paper_authors: Vitalii Fishchuk, Daniel Braun
  • for: Investigates the effectiveness of three simple and resource-efficient strategies for altering texts generated by GPT-3.5 so that neural text detectors misclassify them.
  • methods: Parameter tweaking, prompt engineering, and character-level mutations (a sketch of a character-level mutation follows below).
  • results: Parameter tweaking and character-level mutations in particular are effective strategies.
    Abstract Neural text detectors are models trained to detect whether a given text was generated by a language model or written by a human. In this paper, we investigate three simple and resource-efficient strategies (parameter tweaking, prompt engineering, and character-level mutations) to alter texts generated by GPT-3.5 that are unsuspicious or unnoticeable for humans but cause misclassification by neural text detectors. The results show that especially parameter tweaking and character-level mutations are effective strategies.
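Character-level mutations are straightforward to illustrate. The homoglyph substitution below is one common instantiation of such mutations, unnoticeable to human readers but disruptive to a detector's token sequence; it is not necessarily the specific mutation set used in the paper.

```python
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}  # Cyrillic look-alikes

def mutate_characters(text, rate=0.05, seed=0):
    """Swap a small fraction of Latin letters for visually identical
    Cyrillic homoglyphs, shifting the input a neural text detector sees
    while leaving the text readable for humans."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return "".join(out)

print(mutate_characters("The quick brown fox jumps over the lazy dog.", rate=0.3))
```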

$R^3$-NL2GQL: A Hybrid Models Approach for Accuracy Enhancing and Hallucinations Mitigation

  • paper_url: http://arxiv.org/abs/2311.01862
  • repo_url: https://github.com/zhiqix/nl2gql
  • paper_authors: Yuhang Zhou, He Yu, Siyu Tian, Dan Chen, Liuzhi Zhou, Xinlin Yu, Chuanjun Ji, Sen Liu, Guangnan Ye, Hongfeng Chai
  • for: This paper addresses the task of translating natural language into graph query languages (NL2GQL) and the challenges foundation models face on it.
  • methods: The paper employs foundation models of different sizes in complementary roles, tuning and combining smaller and larger models within one pipeline.
  • results: Experiments show that larger foundation models exhibit superior cross-schema generalization on NL2GQL, while smaller foundation models, after fine-tuning, improve in intent comprehension and grammatical accuracy.
    Abstract While current NL2SQL tasks constructed using Foundation Models have achieved commendable results, their direct application to Natural Language to Graph Query Language (NL2GQL) tasks poses challenges due to the significant differences between GQL and SQL expressions, as well as the numerous types of GQL. Our extensive experiments reveal that in NL2GQL tasks, larger Foundation Models demonstrate superior cross-schema generalization abilities, while smaller Foundation Models struggle to improve their GQL generation capabilities through fine-tuning. However, after fine-tuning, smaller models exhibit better intent comprehension and higher grammatical accuracy. Diverging from rule-based and slot-filling techniques, we introduce R3-NL2GQL, which employs both smaller and larger Foundation Models as reranker, rewriter and refiner. The approach harnesses the comprehension ability of smaller models for information reranker and rewriter, and the exceptional generalization and generation capabilities of larger models to transform input natural language queries and code structure schema into any form of GQLs. Recognizing the lack of established datasets in this nascent domain, we have created a bilingual dataset derived from graph database documentation and some open-source Knowledge Graphs (KGs). We tested our approach on this dataset and the experimental results showed that delivers promising performance and robustness.Our code and dataset is available at https://github.com/zhiqix/NL2GQL

Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT

  • paper_url: http://arxiv.org/abs/2311.01825
  • repo_url: None
  • paper_authors: Mario Sänger, Ninon De Mecquenem, Katarzyna Ewa Lewińska, Vasilis Bountris, Fabian Lehmann, Ulf Leser, Thomas Kosch
  • for: This study evaluates the efficiency of large language models (LLMs) in scientific workflow development, supporting users facing the challenges of implementing workflows.
  • methods: The study uses ChatGPT as the LLM and conducts three user studies to evaluate its effectiveness at comprehending, adapting, and extending workflows.
  • results: LLMs interpret workflows efficiently but perform worse at exchanging components and purposeful workflow extensions; the paper characterizes these limitations and suggests future research directions.
    Abstract Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. Simultaneously, user-supporting tools are rare, and the number of available examples is much lower than in classical programming languages. To address these challenges, we investigate the efficiency of Large Language Models (LLMs), specifically ChatGPT, to support users when dealing with scientific workflows. We performed three user studies in two scientific domains to evaluate ChatGPT for comprehending, adapting, and extending workflows. Our results indicate that LLMs efficiently interpret workflows but achieve lower performance for exchanging components or purposeful workflow extensions. We characterize their limitations in these challenging scenarios and suggest future research directions.

Minimalist Grammar: Construction without Overgeneration

  • paper_url: http://arxiv.org/abs/2311.01820
  • repo_url: None
  • paper_authors: Isidor Konrad Maier, Johannes Kuhn, Jesse Beisegel, Markus Huber-Liebl, Matthias Wolff
  • for: This paper gives instructions on how to construct a minimalist grammar (MG).
  • methods: Uses a variant of context-free grammars (CFG) as the input format (requiring the CFG to be recursion-free; a sketch of that check follows below) and uses licensors/licensees as a special form of exception handling.
  • results: The constructed MGs avoid overgeneration, and $\epsilon$-items called adapters solve the problem of triggering licensees/licensors that most derivations do not need.
    Abstract In this paper we give instructions on how to write a minimalist grammar (MG). In order to present the instructions as an algorithm, we use a variant of context free grammars (CFG) as an input format. We can exclude overgeneration, if the CFG has no recursion, i.e. no non-terminal can (indirectly) derive to a right-hand side containing itself. The constructed MGs utilize licensors/-ees as a special way of exception handling. A CFG format for a derivation $A\_eats\_B\mapsto^* peter\_eats\_apples$, where $A$ and $B$ generate noun phrases, normally leads to overgeneration, e.\,g., $i\_eats\_apples$. In order to avoid overgeneration, a CFG would need many non-terminal symbols and rules, that mainly produce the same word, just to handle exceptions. In our MGs however, we can summarize CFG rules that produce the same word in one item and handle exceptions by a proper distribution of licensees/-ors. The difficulty with this technique is that in most generations the majority of licensees/-ors is not needed, but still has to be triggered somehow. We solve this problem with $\epsilon$-items called \emph{adapters}.
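The paper's precondition on the input format, that no non-terminal can (indirectly) derive to a right-hand side containing itself, is mechanically checkable. A minimal sketch:

```python
def has_recursion(rules):
    """Check whether any non-terminal can (indirectly) derive to a
    right-hand side containing itself. `rules` maps each non-terminal to
    a list of right-hand sides (lists of symbols); symbols not in `rules`
    are treated as terminals."""
    # reach[A] = non-terminals that can appear in some derivation from A
    reach = {A: set() for A in rules}
    for A, rhss in rules.items():
        for rhs in rhss:
            reach[A].update(s for s in rhs if s in rules)
    changed = True
    while changed:                      # transitive closure
        changed = False
        for A in rules:
            new = set().union(*(reach[B] for B in reach[A])) if reach[A] else set()
            if not new <= reach[A]:
                reach[A] |= new
                changed = True
    return any(A in reach[A] for A in rules)

# S -> A eats B, A -> peter | i, B -> apples : recursion-free, so usable input.
rules = {"S": [["A", "eats", "B"]], "A": [["peter"], ["i"]], "B": [["apples"]]}
print(has_recursion(rules))  # False
```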

Mitigating Framing Bias with Polarity Minimization Loss

  • paper_url: http://arxiv.org/abs/2311.01817
  • repo_url: None
  • paper_authors: Yejin Bang, Nayeon Lee, Pascale Fung
  • for: Mitigating framing bias in news reporting.
  • methods: Proposes a new loss function that minimizes the polarity difference between polarized input articles covering the same event (a sketch follows below).
  • results: Experiments show that adding the proposed polarity minimization loss to a BART-based multi-document summarization model substantially reduces framing bias, with the effect most pronounced when the model is trained to minimize the polarity loss associated with informational framing bias (skewed selection of information to report).
    Abstract Framing bias plays a significant role in exacerbating political polarization by distorting the perception of actual events. Media outlets with divergent political stances often use polarized language in their reporting of the same event. We propose a new loss function that encourages the model to minimize the polarity difference between the polarized input articles to reduce framing bias. Specifically, our loss is designed to jointly optimize the model to map polarity ends bidirectionally. Our experimental results demonstrate that incorporating the proposed polarity minimization loss leads to a substantial reduction in framing bias when compared to a BART-based multi-document summarization model. Notably, we find that the effectiveness of this approach is most pronounced when the model is trained to minimize the polarity loss associated with informational framing bias (i.e., skewed selection of information to report).
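A sketch of the idea behind a polarity minimization loss: penalize the gap between the polarity distributions assigned to the model's outputs for left- and right-leaning articles on the same event. The symmetric-KL formulation and the auxiliary-classifier setup below are assumptions made for illustration; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def polarity_minimization_loss(logits_left, logits_right):
    """Symmetric KL between the polarity distributions predicted for the
    left- and right-leaning inputs; minimizing it maps the two polarity
    ends toward each other."""
    kl_lr = F.kl_div(F.log_softmax(logits_left, dim=-1),
                     F.softmax(logits_right, dim=-1),
                     reduction="batchmean")
    kl_rl = F.kl_div(F.log_softmax(logits_right, dim=-1),
                     F.softmax(logits_left, dim=-1),
                     reduction="batchmean")
    return 0.5 * (kl_lr + kl_rl)

# Used as an auxiliary term next to the usual summarization loss:
# total_loss = summarization_ce + beta * polarity_minimization_loss(zl, zr)
logits_left = torch.randn(8, 3)    # toy batch: polarity over {left, neutral, right}
logits_right = torch.randn(8, 3)
loss = polarity_minimization_loss(logits_left, logits_right)
```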

UP4LS: User Profile Constructed by Multiple Attributes for Enhancing Linguistic Steganalysis

  • paper_url: http://arxiv.org/abs/2311.01775
  • repo_url: None
  • paper_authors: Yihao Wang, Ruiqi Song, Ru Zhang, Jianyi Liu
  • for: Improving the performance of linguistic steganalysis (LS), especially on social media.
  • methods: Builds a user profile from post content, capturing attributes such as writing habits, psychological states, and focal areas, and integrates the resulting user features with content features extracted by language models from existing methods.
  • results: UP4LS significantly enhances existing methods, with an overall accuracy improvement of nearly 25%; the improvement is especially pronounced with fewer stego samples.
    Abstract Linguistic steganalysis (LS) tasks aim to effectively detect stegos generated by linguistic steganography. Existing LS methods overlook the distinctive user characteristics, leading to weak performance in social networks. The limited occurrence of stegos further complicates detection. In this paper, we propose the UP4LS, a novel framework with the User Profile for enhancing LS performance. Specifically, by delving into post content, we explore user attributes like writing habits, psychological states, and focal areas, thereby building the user profile for LS. For each attribute, we design the identified feature extraction module. The extracted features are mapped to high-dimensional user features via deep-learning networks from existing methods. Then the language model is employed to extract content features. The user and content features are integrated to optimize feature representation. During the training phase, we prioritize the distribution of stegos. Experiments demonstrate that UP4LS can significantly enhance the performance of existing methods, and an overall accuracy improvement of nearly 25%. In particular, the improvement is especially pronounced with fewer stego samples. Additionally, UP4LS also sets the stage for studies on related tasks, encouraging extensive applications on LS tasks.

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

  • paper_url: http://arxiv.org/abs/2311.01767
  • repo_url: https://github.com/gydpku/pptc
  • paper_authors: Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, Duan Nan
  • for: This work assesses the ability of large language models (LLMs) to complete multi-turn, multi-modal instructions in a complex multimodal environment.
  • methods: The study introduces the PowerPoint Task Completion (PPTC) benchmark, which evaluates LLMs on creating and editing PPT files from user instructions, together with the PPTX-Match Evaluation System, which judges completion based on the prediction file rather than the label API sequence.
  • results: GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but struggles to complete entire sessions, reaching only 6% session accuracy. Three main error causes are identified: error accumulation across turns, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems.
    Abstract Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs' ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1\% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6\% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems. We release the data, code, and evaluation system of PPTC at \url{https://github.com/gydpku/PPTC}.

Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation

  • paper_url: http://arxiv.org/abs/2311.01766
  • repo_url: None
  • paper_authors: Xin Yuan, Jie Guo, Weidong Qiu, Zheng Huang, Shujun Li
  • for: Combating the spread of out-of-context mis- and disinformation online.
  • methods: Proposes a stance extraction network (SEN) that extracts the stances of different pieces of multi-modal evidence in a unified framework, and introduces into the textual SEN a support-refutation score calculated from the co-occurrence relations of named entities.
  • results: Extensive experiments on a large-scale public dataset show the proposed method outperforms state-of-the-art baselines, with the best model gaining 3.2% in accuracy.
    Abstract Mis- and disinformation online have become a major societal problem as major sources of online harms of different kinds. One common form of mis- and disinformation is out-of-context (OOC) information, where different pieces of information are falsely associated, e.g., a real image combined with a false textual caption or a misleading textual description. Although some past studies have attempted to defend against OOC mis- and disinformation through external evidence, they tend to disregard the role of different pieces of evidence with different stances. Motivated by the intuition that the stance of evidence represents a bias towards different detection results, we propose a stance extraction network (SEN) that can extract the stances of different pieces of multi-modal evidence in a unified framework. Moreover, we introduce a support-refutation score calculated based on the co-occurrence relations of named entities into the textual SEN. Extensive experiments on a public large-scale dataset demonstrated that our proposed method outperformed the state-of-the-art baselines, with the best model achieving a performance gain of 3.2% in accuracy.

EmojiLM: Modeling the New Emoji Language

  • paper_url: http://arxiv.org/abs/2311.01751
  • repo_url: https://github.com/komeijiforce/emojilm
  • paper_authors: Letian Peng, Zilong Wang, Hang Liu, Zihan Wang, Jingbo Shang
  • for: Studying the growing use of emoji on online social media and its applications.
  • methods: Synthesizes a large text-emoji parallel corpus, Text2Emoji, from a large language model, and distills from it EmojiLM, a sequence-to-sequence model specialized in bidirectional text-emoji translation.
  • results: The proposed model outperforms strong baselines on public benchmarks and in human evaluation, and the parallel corpus benefits emoji-related downstream tasks.
    Abstract With the rapid development of the internet, online social media welcomes people with different backgrounds through its diverse content. The increasing usage of emoji becomes a noticeable trend thanks to emoji's rich information beyond cultural or linguistic borders. However, the current study on emojis is limited to single emoji prediction and there are limited data resources available for further study of the interesting linguistic phenomenon. To this end, we synthesize a large text-emoji parallel corpus, Text2Emoji, from a large language model. Based on the parallel corpus, we distill a sequence-to-sequence model, EmojiLM, which is specialized in the text-emoji bidirectional translation. Extensive experiments on public benchmarks and human evaluation demonstrate that our proposed model outperforms strong baselines and the parallel corpus benefits emoji-related downstream tasks.

SAC$^3$: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency

  • paper_url: http://arxiv.org/abs/2311.01740
  • repo_url: None
  • paper_authors: Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, Sricharan Kumar
  • for: Hallucination detection is a critical step toward understanding the trustworthiness of modern language models.
  • methods: Re-examining detection based on the self-consistency of language models, the authors identify two types of hallucinations, arising at the question level and the model level, that self-consistency checking alone cannot detect. They propose SAC$^3$, a sampling-based method that extends self-consistency with semantically equivalent question perturbation and cross-model response consistency checking (a sketch follows below).
  • results: Extensive and systematic experiments show that SAC$^3$ outperforms the state of the art at detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks.
    Abstract Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building upon this discovery, we propose a novel sampling-based method, i.e., semantic-aware cross-check consistency (SAC$^3$) that expands on the principle of self-consistency checking. Our SAC$^3$ approach incorporates additional mechanisms to detect both question-level and model-level hallucinations by leveraging advances including semantically equivalent question perturbation and cross-model response consistency checking. Through extensive and systematic empirical analysis, we demonstrate that SAC$^3$ outperforms the state of the art in detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks.
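A high-level sketch of the cross-check structure the abstract describes: sample answers to the original question and to a semantically equivalent perturbation, from both the target model and a second verifier model, and score agreement with the original answer. The aggregation (a plain average) and the string-equality `same_answer` check are simplifying assumptions.

```python
def same_answer(a, b):
    """Crude placeholder for a semantic-equivalence check (a real system
    would use something like an NLI model or answer normalization)."""
    return a.strip().lower() == b.strip().lower()

def sac3_score(question, paraphrase, answer, target_lm, verifier_lm, n=5):
    """Cross-check consistency: sample n answers per (model, question) pair
    and measure agreement with the original answer. A low score flags a
    likely hallucination. `target_lm` and `verifier_lm` stand in for calls
    to two different language models."""
    def consistency(model, query):
        return sum(same_answer(answer, model(query)) for _ in range(n)) / n

    checks = [
        consistency(target_lm, question),     # self-consistency
        consistency(target_lm, paraphrase),   # question-level cross-check
        consistency(verifier_lm, question),   # model-level cross-check
        consistency(verifier_lm, paraphrase),
    ]
    return sum(checks) / len(checks)
```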

Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.01732
  • repo_url: None
  • paper_authors: Sean Xie, Soroush Vosoughi, Saeed Hassanpour
  • for: This paper aims to improve the interpretability of Large Language Models (LLMs) by developing a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings during the fine-tuning stage while maintaining competitive performance.
  • methods: The proposed method, called proto-lm, uses a prototypical network to learn interpretable embeddings that can be used to understand how the LLM is making predictions. The method is based on a white-box framework, which allows for transparency and interpretability of the model’s inner workings.
  • results: Experiments on a wide range of NLP tasks demonstrate the method’s applicability and interpretability: proto-lm learns immediately interpretable embeddings that reveal how the LLM arrives at its predictions while maintaining competitive performance, paving the way for more interpretable models without sacrificing performance.
    Abstract Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), but their lack of interpretability has been a major concern. Current methods for interpreting LLMs are post hoc, applied after inference time, and have limitations such as their focus on low-level features and lack of explainability at higher level text units. In this work, we introduce proto-lm, a prototypical network-based white-box framework that allows LLMs to learn immediately interpretable embeddings during the fine-tuning stage while maintaining competitive performance. Our method's applicability and interpretability are demonstrated through experiments on a wide range of NLP tasks, and our results indicate a new possibility of creating interpretable models without sacrificing performance. This novel approach to interpretability in LLMs can pave the way for more interpretable models without the need to sacrifice performance.
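A minimal sketch of the kind of prototypical layer such a white-box framework builds on: classification via similarity to learned prototype vectors, so every prediction can be explained by its nearest prototypes. Dimensions, the similarity function, and the head design here are illustrative assumptions, not proto-lm's exact architecture.

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    """Prototype-based classification head on top of LM embeddings:
    the similarity of each example to each learned prototype is the
    immediately interpretable intermediate representation."""
    def __init__(self, hidden_dim=768, n_prototypes=20, n_classes=2):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, hidden_dim))
        self.classify = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, embeddings):                        # (batch, hidden_dim)
        sims = -torch.cdist(embeddings, self.prototypes)  # closer prototype = higher score
        return self.classify(sims), sims

layer = PrototypeLayer()
logits, sims = layer(torch.randn(4, 768))
top_prototypes = sims.topk(3, dim=-1).indices  # explanation: nearest prototypes
```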

A New Korean Text Classification Benchmark for Recognizing the Political Intents in Online Newspapers

  • paper_url: http://arxiv.org/abs/2311.01712
  • repo_url: https://github.com/kdavid2355/kopolitic-benchmark-dataset
  • paper_authors: Beomjune Kim, Eunsun Lee, Dongbin Na
  • for: This paper focuses on automatically recognizing the political intents of articles published by South Korean news media.
  • methods: Deep-learning language models based on transformer architectures are trained on a large-scale Korean news dataset of 12,000 articles from the politics sections of six of the most representative newspaper organizations in South Korea, each labeled for level of political orientation and level of pro-government stance.
  • results: The trained models show decent multi-task text classification performance; the large-scale Korean news dataset is released for future research.
    Abstract Many users reading online articles in various magazines may suffer considerable difficulty in distinguishing the implicit intents in texts. In this work, we focus on automatically recognizing the political intents of a given online newspaper by understanding the context of the text. To solve this task, we present a novel Korean text classification dataset that contains various articles. We also provide deep-learning-based text classification baseline models trained on the proposed dataset. Our dataset contains 12,000 news articles that may contain political intentions, from the politics section of six of the most representative newspaper organizations in South Korea. All the text samples are labeled simultaneously in two aspects (1) the level of political orientation and (2) the level of pro-government. To the best of our knowledge, our paper is the most large-scale Korean news dataset that contains long text and addresses multi-task classification problems. We also train recent state-of-the-art (SOTA) language models that are based on transformer architectures and demonstrate that the trained models show decent text classification performance. All the codes, datasets, and trained models are available at https://github.com/Kdavid2355/KoPolitic-Benchmark-Dataset.

CASE: Commonsense-Augmented Score with an Expanded Answer Space

  • paper_url: http://arxiv.org/abs/2311.01684
  • repo_url: https://github.com/wk-chen/commonsense-augmented-score-with-an-expanded-answer-space
  • paper_authors: Wenkai Chen, Sahithya Ravi, Vered Shwartz
  • for: Improving language model (LM) performance on multiple-choice QA tasks, in particular addressing the limitation that the basic LM score treats all words as equally important.
  • methods: Proposes CASE, a Commonsense-Augmented Score with an Expanded Answer Space: importance weights are assigned to individual words based on their semantic relations to other words in the input (a sketch follows below), and the answer space is expanded by generating lexically divergent answers that are conceptually similar to the choices.
  • results: On five commonsense benchmarks, the method outperforms strong baselines, especially when combined with answer-space expansion and when using smaller LMs.
    Abstract LLMs have demonstrated impressive zero-shot performance on NLP tasks thanks to the knowledge they acquired in their training. In multiple-choice QA tasks, the LM probabilities are used as an imperfect measure of the plausibility of each answer choice. One of the major limitations of the basic score is that it treats all words as equally important. We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space. CASE addresses this limitation by assigning importance weights for individual words based on their semantic relations to other words in the input. The dynamic weighting approach outperforms basic LM scores, not only because it reduces noise from unimportant words, but also because it informs the model of implicit commonsense knowledge that may be useful for answering the question. We then also follow prior work in expanding the answer space by generating lexically-divergent answers that are conceptually-similar to the choices. When combined with answer space expansion, our method outperforms strong baselines on 5 commonsense benchmarks. We further show these two approaches are complementary and may be especially beneficial when using smaller LMs.
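The scoring change is easy to sketch: instead of summing token log-probabilities uniformly, weight each token by an importance score (which the paper derives from semantic relations to other words in the input; here the weights are simply given). A minimal sketch:

```python
def case_score(token_logprobs, importance):
    """Commonsense-augmented answer score: a weighted average of token
    log-probabilities, with weights normalized so that scores remain
    comparable across answer choices of different lengths."""
    z = sum(importance)
    return sum(w / z * lp for w, lp in zip(importance, token_logprobs))

# Toy comparison of two answer choices ("warm coat" vs. "cold coat"),
# with the content word weighted more heavily than in a plain LM score.
choice_a = case_score([-1.2, -0.8], importance=[2.0, 1.0])
choice_b = case_score([-2.5, -0.9], importance=[2.0, 1.0])
print("pick A" if choice_a > choice_b else "pick B")  # pick A
```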

Plot Retrieval as an Assessment of Abstract Semantic Association

  • paper_url: http://arxiv.org/abs/2311.01666
  • repo_url: None
  • paper_authors: Shicheng Xu, Liang Pang, Jiangnan Li, Mo Yu, Fandong Meng, Huawei Shen, Xueqi Cheng, Jie Zhou
  • for: Improving readers' experience and efficiency by retrieving plots relevant to a query.
  • methods: Proposes Plot Retrieval, a labeled dataset for training and evaluating the ability of information retrieval models to estimate abstract semantic associations between queries and candidate plots.
  • results: Existing information retrieval models still struggle to capture abstract semantic associations, indicating that further research on abstract semantic association modeling is needed.
    Abstract Retrieving relevant plots from the book for a query is a critical task, which can improve the reading experience and efficiency of readers. Readers usually only give an abstract and vague description as the query based on their own understanding, summaries, or speculations of the plot, which requires the retrieval model to have a strong ability to estimate the abstract semantic associations between the query and candidate plots. However, existing information retrieval (IR) datasets cannot reflect this ability well. In this paper, we propose Plot Retrieval, a labeled dataset to train and evaluate the performance of IR models on the novel task Plot Retrieval. Text pairs in Plot Retrieval have less word overlap and more abstract semantic association, which can reflect the ability of the IR models to estimate the abstract semantic association, rather than just traditional lexical or semantic matching. Extensive experiments across various lexical retrieval, sparse retrieval, dense retrieval, and cross-encoder methods compared with human studies on Plot Retrieval show current IR models still struggle in capturing abstract semantic association between texts. Plot Retrieval can be the benchmark for further research on the semantic association modeling ability of IR models.