cs.CL - 2023-10-08

Visual Storytelling with Question-Answer Plans

  • paper_url: http://arxiv.org/abs/2310.05295
  • repo_url: None
  • paper_authors: Danyang Liu, Mirella Lapata, Frank Keller
  • for: To generate compelling narratives from image sequences while capturing their salient visual content.
  • methods: The model translates the image sequence into a visual prefix, a sequence of continuous embeddings that a pretrained language model can interpret, and uses a sequence of question-answer pairs as a blueprint plan to select salient visual concepts and decide how to assemble them into a narrative (see the sketch after this entry).
  • results: Automatic and human evaluation show that blueprint-based models generate stories that are more coherent, interesting, and natural than competitive baselines and state-of-the-art systems.
    Abstract Visual storytelling aims to generate compelling narratives from image sequences. Existing models often focus on enhancing the representation of the image sequence, e.g., with external knowledge sources or advanced graph structures. Despite recent progress, the stories are often repetitive, illogical, and lacking in detail. To mitigate these issues, we present a novel framework which integrates visual representations with pretrained language models and planning. Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret. It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative. Automatic and human evaluation on the VIST benchmark (Huang et al., 2016) demonstrates that blueprint-based models generate stories that are more coherent, interesting, and natural compared to competitive baselines and state-of-the-art systems.
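
A minimal sketch of the visual-prefix idea described above, assuming a linear projection from fixed-size image features (e.g., CLIP vectors) into the language model's embedding space; the feature dimension, prefix length per image, and the projection itself are illustrative choices rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    """Project per-image features into a language model's embedding space."""

    def __init__(self, image_feat_dim: int = 512, lm_hidden_dim: int = 768, prefix_len_per_image: int = 4):
        super().__init__()
        self.prefix_len = prefix_len_per_image
        self.lm_hidden_dim = lm_hidden_dim
        # A single linear projection is an illustrative choice, not the paper's exact mapper.
        self.proj = nn.Linear(image_feat_dim, prefix_len_per_image * lm_hidden_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (num_images, image_feat_dim), e.g. CLIP features for a 5-image story
        prefix = self.proj(image_feats)
        return prefix.view(-1, self.lm_hidden_dim)    # (num_images * prefix_len, lm_hidden_dim)

# The resulting continuous "visual prefix" would be prepended to the embedded
# question-answer blueprint tokens before decoding the story with a pretrained LM.
mapper = VisualPrefixMapper()
visual_prefix = mapper(torch.randn(5, 512))           # -> shape (20, 768)
```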

Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus

  • paper_url: http://arxiv.org/abs/2310.05294
  • repo_url: https://github.com/hlt-mt/fbk-neutr-eval
  • paper_authors: Andrea Piergentili, Beatrice Savoldi, Dennis Fucci, Matteo Negri, Luisa Bentivogli
  • for: Addressing the lack of inclusive language in machine translation, particularly in grammatical gender languages.
  • methods: Proposing a dedicated benchmark and exploring automated evaluation methods for gender-neutral translation from English to Italian, including a natural, bilingual test set (GeNTE) and a reference-free evaluation approach.
  • results: A new, more inclusive approach to machine translation that challenges traditional binary gender assumptions and provides a more accurate assessment of gender-neutral translation.
    Abstract Gender inequality is embedded in our communication practices and perpetuated in translation technologies. This becomes particularly apparent when translating into grammatical gender languages, where machine translation (MT) often defaults to masculine and stereotypical representations by making undue binary gender assumptions. Our work addresses the rising demand for inclusive language by focusing head-on on gender-neutral translation from English to Italian. We start from the essentials: proposing a dedicated benchmark and exploring automated evaluation methods. First, we introduce GeNTE, a natural, bilingual test set for gender-neutral translation, whose creation was informed by a survey on the perception and use of neutral language. Based on GeNTE, we then overview existing reference-based evaluation approaches, highlight their limits, and propose a reference-free method more suitable to assess gender-neutral translation.

  • paper_url: http://arxiv.org/abs/2310.05276
  • repo_url: None
  • paper_authors: Anas Belfathi, Nicolas Hernandez, Laura Monceaux
  • For: To propose a novel model architecture for automatically predicting the rhetorical roles of sentences in legal opinions, based on pre-trained language models (PLMs) enhanced with sentence position information.
  • Methods: A simple model architecture trained on the annotated corpus from the LegalEval@SemEval2023 competition, with dedicated preprocessing techniques to improve performance (a sketch of the position-embedding idea follows this entry).
  • Results: The approach requires fewer parameters and has lower computational cost than complex hierarchical architectures operating on global context, yet achieves strong performance; adding more attention to a local-context hierarchical model based on BERT, together with sentence position information, further improves results.
    Abstract The legal domain is a vast and complex field that involves a considerable amount of text analysis, including laws, legal arguments, and legal opinions. Legal practitioners must analyze these texts to understand legal cases, research legal precedents, and prepare legal documents. The size of legal opinions continues to grow, making it increasingly challenging to develop a model that can accurately predict the rhetorical roles of legal opinions given their complexity and diversity. In this research paper, we propose a novel model architecture for automatically predicting rhetorical roles using pre-trained language models (PLMs) enhanced with knowledge of sentence position information within a document. Based on an annotated corpus from the LegalEval@SemEval2023 competition, we demonstrate that our approach requires fewer parameters, resulting in lower computational costs when compared to complex architectures employing a hierarchical model in a global-context, yet it achieves great performance. Moreover, we show that adding more attention to a hierarchical model based only on BERT in the local-context, along with incorporating sentence position information, enhances the results.
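
A minimal sketch of combining PLM sentence representations with sentence-position embeddings, as the abstract describes; the hidden size, maximum sentence count, number of roles, and additive fusion are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RhetoricalRoleClassifier(nn.Module):
    """Combine PLM sentence embeddings with learned sentence-position embeddings."""

    def __init__(self, hidden: int = 768, max_sentences: int = 512, num_roles: int = 13):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sentences, hidden)   # where the sentence sits in the opinion
        self.classifier = nn.Linear(hidden, num_roles)        # number of roles is illustrative

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (num_sentences, hidden), e.g. per-sentence [CLS] vectors from a PLM
        positions = torch.arange(sent_embs.size(0), device=sent_embs.device)
        fused = sent_embs + self.pos_emb(positions)           # additive fusion of position information
        return self.classifier(fused)                         # per-sentence rhetorical-role logits

logits = RhetoricalRoleClassifier()(torch.randn(30, 768))     # 30 sentences -> (30, 13) logits
```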

XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

  • paper_url: http://arxiv.org/abs/2310.05235
  • repo_url: None
  • paper_authors: Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux
  • for: To improve unsupervised segmentation of speech into words without any text supervision.
  • methods: The self-supervised XLS-R model, which adapts quickly to new tasks through fine-tuning even in low-resource conditions, is fine-tuned to predict noisy word boundaries produced by existing speech segmentation systems; in a semi-supervised spirit, the fine-tuned model then infers new boundary labels that are used for a further round of fine-tuning (see the sketch after this entry).
  • results: The method consistently improves every system it is applied to and sets a new state of the art, with an F1 score on correctly discovered word tokens that is on average 130% higher than the previous best across five corpora in different languages; the system can also segment speech from languages unseen during fine-tuning in a zero-shot fashion.
    Abstract Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models that have proved to quickly adapt to new tasks through fine-tuning, even in low resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels that are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state-of-the-art that is, on average 130% higher than the previous one as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.
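
A sketch of the iterative pseudo-labelling loop from the abstract: fine-tune XLS-R on noisy boundaries produced by an existing segmenter, re-infer boundaries with the fine-tuned model, and repeat. `fine_tune` and `infer_boundaries` are hypothetical callables standing in for the actual frame-level training and inference code, and the checkpoint name is an assumption.

```python
def bootstrap_segmentation(speech_corpus, initial_boundaries, fine_tune, infer_boundaries, num_rounds: int = 2):
    """initial_boundaries: noisy word boundaries from DPDP, VG-HuBERT, GradSeg or DP-Parse."""
    model = "facebook/wav2vec2-xls-r-300m"                # assumed XLS-R checkpoint to start from
    labels = initial_boundaries
    for _ in range(num_rounds):
        model = fine_tune(model, speech_corpus, labels)   # predict boundary / non-boundary per frame
        labels = infer_boundaries(model, speech_corpus)   # re-label the corpus with the improved model
    return model, labels
```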

Generative Spoken Language Model based on continuous word-sized audio tokens

  • paper_url: http://arxiv.org/abs/2310.05224
  • repo_url: None
  • paper_authors: Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux
  • for: To propose a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output.
  • methods: The lookup table for lexical types is replaced by a Lexical Embedding function, the cross-entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling (see the sketch after this entry).
  • results: Generation quality is on par with discrete-unit GSLMs according to both automatic metrics and subjective human judgements, while the model is five times more memory-efficient thanks to its large 200ms units; in addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
    Abstract In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
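
A sketch of the k-NN sampling step that replaces multinomial sampling over a discrete vocabulary: the next word-sized unit is drawn from the k lexicon embeddings nearest to the LM's predicted continuous embedding. The value of k and the softmax-over-negative-distance weighting are illustrative assumptions.

```python
import torch

def knn_sample(predicted_emb: torch.Tensor, lexicon_embs: torch.Tensor, k: int = 10, temperature: float = 1.0) -> int:
    """Sample the next word-sized unit among the k nearest lexicon embeddings.

    predicted_emb: (d,) continuous embedding predicted by the LM for the next unit.
    lexicon_embs:  (V, d) embeddings of the discovered lexical units.
    """
    dists = torch.cdist(predicted_emb.unsqueeze(0), lexicon_embs).squeeze(0)   # (V,) distances
    topk = torch.topk(-dists, k)                                               # k nearest neighbours
    weights = torch.softmax(topk.values / temperature, dim=0)                  # closer units get higher mass
    choice = torch.multinomial(weights, 1).item()
    return topk.indices[choice].item()                                         # index of the sampled unit

unit = knn_sample(torch.randn(256), torch.randn(5000, 256))
```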

Probing Language Models from A Human Behavioral Perspective

  • paper_url: http://arxiv.org/abs/2310.05216
  • repo_url: None
  • paper_authors: Xintong Wang, Xiaoyu Li, Xingshan Li, Chris Biemann
  • for: This paper aims to provide a better understanding of how large language models (LLMs) work and how they make predictions.
  • methods: The authors use eye-tracking measures to correlate with the values produced by LLMs and compare them to those of recurrent neural network-based language models (RNN-LMs). They also analyze the functions of self-attention and gate mechanisms in LLMs.
  • results: The study finds that LLMs exhibit a distinct prediction pattern compared to RNN-LMs, with a peak in memorization and linguistic knowledge encoding as the number of feed-forward network (FFN) layers increases, followed by a pivot to comprehension capacity. The self-attention mechanisms are found to be distributed across multiple heads, and the gate mechanisms control the flow of information, with some gates promoting and others eliminating information.
    Abstract Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP. However, the understanding of their prediction process and internal mechanisms, such as feed-forward networks and multi-head self-attention, remains largely unexplored. In this study, we probe LLMs from a human behavioral perspective, correlating values from LLMs with eye-tracking measures, which are widely recognized as meaningful indicators of reading patterns. Our findings reveal that LLMs exhibit a prediction pattern distinct from that of RNN-based LMs. Moreover, with the escalation of FFN layers, the capacity for memorization and linguistic knowledge encoding also surges until it peaks, subsequently pivoting to focus on comprehension capacity. The functions of self-attention are distributed across multiple heads. Lastly, we scrutinize the gate mechanisms, finding that they control the flow of information, with some gates promoting, while others eliminating information.
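
A minimal sketch of the kind of behavioral probing described above: compute per-token surprisal with a causal LM and correlate it with human reading-time measures. GPT-2 is a stand-in scoring model and the gaze durations are placeholders; the paper's actual models, eye-tracking data, and probed internal quantities differ.

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

sentence = "The lawyer presented the evidence to the jury."
ids = tok(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits                                          # (1, T, V)
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
surprisal = -log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]]    # one value per predicted token

gaze_durations = torch.rand(surprisal.size(0))                       # placeholder human reading times
rho, p = spearmanr(surprisal.numpy(), gaze_durations.numpy())
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```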

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

  • paper_url: http://arxiv.org/abs/2310.05203
  • repo_url: None
  • paper_authors: Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda
  • for: This paper targets the Singing Voice Conversion Challenge (SVCC) 2023 with a recognition-synthesis approach based on self-supervised learning representations.
  • methods: A diffusion-based any-to-any voice conversion model is first trained on publicly available large-scale data comprising 750 hours of speech and singing, and then fine-tuned for each target singer/speaker.
  • results: Large-scale listening tests conducted by SVCC 2023 show that the T13 system achieves competitive naturalness and speaker similarity on the harder cross-domain SVC task, indicating the generalization ability of the approach; objective evaluation further shows that large datasets are particularly beneficial for cross-domain SVC.
    Abstract This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker's data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model using publicly available large-scale 750 hours of speech and singing data. Then, we finetune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC (Task 2), which implies the generalization ability of our proposed method. Our objective evaluation results show that using large datasets is particularly beneficial for cross-domain SVC.

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback

  • paper_url: http://arxiv.org/abs/2310.05199
  • repo_url: None
  • paper_authors: Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
  • for: To align large language models with human and societal values via reinforcement learning from human feedback while mitigating the length bias of the learned reward model.
  • methods: The Product-of-Experts (PoE) technique separates reward modeling from the influence of sequence length: a main expert focuses on understanding human intent, while a bias expert targets the identification and capture of length bias; perturbations are injected into the bias-focused expert to disrupt the flow of semantic information (see the sketch after this entry).
  • results: Experiments show that the method improves language model performance irrespective of sequence length.
    Abstract Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. The emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
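
A minimal sketch of a Product-of-Experts reward model in the spirit of the abstract: a main head scores human intent while a second head, fed a perturbed representation, is meant to absorb length bias. The shared encoder output, the dropout-style perturbation, and summing the two experts' logits are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PoEReward(nn.Module):
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.main_head = nn.Linear(hidden, 1)   # focuses on human intent
        self.bias_head = nn.Linear(hidden, 1)   # intended to absorb length bias
        self.perturb = nn.Dropout(p=0.5)        # disrupt semantic flow into the bias expert

    def forward(self, response_repr: torch.Tensor):
        r_main = self.main_head(response_repr).squeeze(-1)
        r_bias = self.bias_head(self.perturb(response_repr)).squeeze(-1)
        # Product of experts in probability space corresponds to summing the experts' logits;
        # the combined score is used for reward-model training, while r_main alone would be
        # the debiased reward for RL fine-tuning.
        return r_main + r_bias, r_main

combined, debiased = PoEReward()(torch.randn(4, 768))   # batch of 4 response representations
```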

FABRIC: Automated Scoring and Feedback Generation for Essays

  • paper_url: http://arxiv.org/abs/2310.05191
  • repo_url: None
  • paper_authors: Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Hyunseung Lim, Yoonsu Kim, Tak Yeon Lee, Hwajung Hong, Juho Kim, So-Yeon Ahn, Alice Oh
  • for: To provide a tool that automatically scores English essays and generates feedback, helping students and instructors in writing classes.
  • methods: A pipeline with three components: DREsS, a real-world dataset for rubric-based essay scoring; CASE, a corruption-based augmentation strategy that improves model accuracy; and EssayCoT, an essay chain-of-thought prompting strategy that uses the scores predicted by the AES model to generate better feedback (an illustrative prompt sketch follows this entry).
  • results: The new dataset DREsS and the augmentation strategy CASE yield significant quantitative improvements over models trained on existing datasets, and EssayCoT produces feedback that English education experts judge significantly more helpful across all rubrics; students in a college English writing class rated the generated scores and feedback at an average of 6 on a 7-point Likert scale.
    Abstract Automated essay scoring (AES) provides a useful tool for students and instructors in writing classes by generating essay scores in real-time. However, previous AES models do not provide more specific rubric-based scores nor feedback on how to improve the essays, which can be even more important than the overall scores for learning. We present FABRIC, a pipeline to help students and instructors in English writing classes by automatically generating 1) the overall scores, 2) specific rubric-based scores, and 3) detailed feedback on how to improve the essays. Under the guidance of English education experts, we chose the rubrics for the specific scores as content, organization, and language. The first component of the FABRIC pipeline is DREsS, a real-world Dataset for Rubric-based Essay Scoring (DREsS). The second component is CASE, a Corruption-based Augmentation Strategy for Essays, with which we can improve the accuracy of the baseline model by 45.44%. The third component is EssayCoT, the Essay Chain-of-Thought prompting strategy which uses scores predicted from the AES model to generate better feedback. We evaluate the effectiveness of the new dataset DREsS and the augmentation strategy CASE quantitatively and show significant improvements over the models trained with existing datasets. We evaluate the feedback generated by EssayCoT with English education experts to show significant improvements in the helpfulness of the feedback across all rubrics. Lastly, we evaluate the FABRIC pipeline with students in a college English writing class who rated the generated scores and feedback with an average of 6 on the Likert scale from 1 to 7.
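
An illustrative sketch of an EssayCoT-style prompt that conditions feedback generation on the rubric scores predicted by the AES model; the prompt wording and the 10-point scale are assumptions, not the paper's actual template.

```python
def essay_cot_prompt(essay: str, scores: dict) -> str:
    """Build a feedback prompt conditioned on AES-predicted rubric scores (illustrative wording)."""
    rubric_lines = "\n".join(f"- {rubric}: {score}/10" for rubric, score in scores.items())
    return (
        "You are an English writing tutor. The essay below received these rubric scores:\n"
        f"{rubric_lines}\n\n"
        f"Essay:\n{essay}\n\n"
        "For each rubric, explain step by step why the essay earned its score, "
        "then give concrete suggestions for improvement."
    )

prompt = essay_cot_prompt("My summer vacation was ...", {"content": 7, "organization": 6, "language": 8})
```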

Do Large Language Models Know about Facts?

  • paper_url: http://arxiv.org/abs/2310.05177
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Xuming Hu, Junzhe Chen, Xiaochuan Li, Yufei Guo, Lijie Wen, Philip S. Yu, Zhijiang Guo
  • for: To comprehensively evaluate the extent and scope of factual knowledge in large language models (LLMs), including whether they hold reliable facts and resist adversarial examples.
  • methods: A benchmark named Pinocchio containing 20K diverse factual questions spanning different sources, timelines, domains, regions, and languages.
  • results: Extensive experiments show that existing LLMs still lack factual knowledge and suffer from various spurious correlations.
    Abstract Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks. The factual knowledge acquired during pretraining and instruction tuning can be useful in various downstream tasks, such as question answering, and language generation. Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, LLMs implicitly store facts in their parameters. Content generated by the LLMs can often exhibit inaccuracies or deviations from the truth, due to facts that can be incorrectly induced or become obsolete over time. To this end, we aim to comprehensively evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio. Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages. Furthermore, we investigate whether LLMs are able to compose multiple facts, update factual knowledge temporally, reason over multiple pieces of facts, identify subtle factual differences, and resist adversarial examples. Extensive experiments on different sizes and types of LLMs show that existing LLMs still lack factual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing trustworthy artificial intelligence. The dataset Pinocchio and our codes will be publicly available.

On the Zero-Shot Generalization of Machine-Generated Text Detectors

  • paper_url: http://arxiv.org/abs/2310.05165
  • repo_url: None
  • paper_authors: Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, Tianxing He
  • for: To study how detectors of machine-generated text perform on the outputs of new generators that they were not trained on.
  • methods: Generation data is collected from a wide range of large language models; neural detectors are trained on data from each generator and tested on held-out generators.
  • results: Detectors trained on data from a medium-sized LLM can zero-shot generalize to its larger version, suggesting that robust machine-generated text detectors can be built on an ensemble of training data from medium-sized models.
    Abstract The rampant proliferation of large language models, fluent enough to generate text indistinguishable from human-written language, gives unprecedented importance to the detection of machine-generated text. This work is motivated by an important research question: How will the detectors of machine-generated text perform on outputs of a new generator, that the detectors were not trained on? We begin by collecting generation data from a wide range of LLMs, and train neural detectors on data from each generator and test its performance on held-out generators. While none of the detectors can generalize to all generators, we observe a consistent and interesting pattern that the detectors trained on data from a medium-size LLM can zero-shot generalize to the larger version. As a concrete application, we demonstrate that robust detectors can be built on an ensemble of training data from medium-sized models.

An Investigation of LLMs’ Inefficacy in Understanding Converse Relations

  • paper_url: http://arxiv.org/abs/2310.05163
  • repo_url: https://github.com/3b-group/convre
  • paper_authors: Chengwen Qi, Bowen Li, Binyuan Hui, Bailin Wang, Jinyang Li, Jinwang Wu, Yuanjun Laili
  • for: To investigate whether LLMs truly understand the structured semantics of formal languages, using converse binary relations as a special case.
  • methods: A new benchmark, ConvRE, containing 17 relations and 1240 triples extracted from popular knowledge graph completion datasets; it features two tasks, Re2Text and Text2Re, formulated as multiple-choice question answering to evaluate LLMs' ability to match relations with associated text.
  • results: Experiments show that LLMs often resort to shortcut learning and still face challenges on the proposed benchmark.
    Abstract Large Language Models (LLMs) have achieved remarkable success in many formal language oriented tasks, such as structural data-to-text and semantic parsing. However current benchmarks mostly follow the data distribution of the pre-training data of LLMs. Therefore, a natural question rises that do LLMs really understand the structured semantics of formal languages. In this paper, we investigate this problem on a special case, converse binary relation. We introduce a new benchmark ConvRe focusing on converse relations, which contains 17 relations and 1240 triples extracted from popular knowledge graph completion datasets. Our ConvRE features two tasks, Re2Text and Text2Re, which are formulated as multi-choice question answering to evaluate LLMs' ability to determine the matching between relations and associated text. For the evaluation protocol, apart from different prompting methods, we further introduce variants to the test text and few-shot example text. We conduct experiments on three popular LLM families and have observed various scaling trends. The results suggest that LLMs often resort to shortcut learning and still face challenges on our proposed benchmark.

Recurrent Neural Language Models as Probabilistic Finite-state Automata

  • paper_url: http://arxiv.org/abs/2310.05161
  • repo_url: None
  • paper_authors: Anej Svete, Ryan Cotterell
  • for: To precisely characterize the representational capacity and limitations of language models (LMs) by describing them in terms of well-understood formalisms.
  • methods: The study analyzes which classes of probability distributions over strings recurrent neural network (RNN) LMs can represent.
  • results: Simple RNNs are shown to be equivalent to a subclass of probabilistic finite-state automata, and can thus model only a strict subset of the distributions expressible by finite-state models; representing an arbitrary deterministic finite-state LM with $N$ states over an alphabet $\Sigma$ requires $\Omega\left(N |\Sigma|\right)$ neurons.
    Abstract Studying language models (LMs) in terms of well-understood formalisms allows us to precisely characterize their abilities and limitations. Previous work has investigated the representational capacity of recurrent neural network (RNN) LMs in terms of their capacity to recognize unweighted formal languages. However, LMs do not describe unweighted formal languages -- rather, they define probability distributions over strings. In this work, we study what classes of such probability distributions RNN LMs can represent, which allows us to make more direct statements about their capabilities. We show that simple RNNs are equivalent to a subclass of probabilistic finite-state automata, and can thus model a strict subset of probability distributions expressible by finite-state models. Furthermore, we study the space complexity of representing finite-state LMs with RNNs. We show that, to represent an arbitrary deterministic finite-state LM with $N$ states over an alphabet $\Sigma$, an RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a first step towards characterizing the classes of distributions RNN LMs can represent and thus help us understand their capabilities and limitations.

  • paper_url: http://arxiv.org/abs/2310.05150
  • repo_url: https://github.com/sebischair/kg-conv-exploratory-search
  • paper_authors: Phillip Schneider, Nils Rehtanz, Kristiina Jokinen, Florian Matthes
  • for: To support exploratory search over news articles by combining conversational interfaces with knowledge graphs, bridging structured and unstructured information retrieval.
  • methods: A knowledge-driven dialogue system that answers natural language questions about news articles and uses the graph structure to navigate between related topics.
  • results: A user study with 54 participants empirically demonstrates the effectiveness of graph-based exploratory search and yields design implications for developing such systems.
    Abstract Exploratory search is an open-ended information retrieval process that aims at discovering knowledge about a topic or domain rather than searching for a specific answer or piece of information. Conversational interfaces are particularly suitable for supporting exploratory search, allowing users to refine queries and examine search results through interactive dialogues. In addition to conversational search interfaces, knowledge graphs are also useful in supporting information exploration due to their rich semantic representation of data items. In this study, we demonstrate the synergistic effects of combining knowledge graphs and conversational interfaces for exploratory search, bridging the gap between structured and unstructured information retrieval. To this end, we propose a knowledge-driven dialogue system for exploring news articles by asking natural language questions and using the graph structure to navigate between related topics. Based on a user study with 54 participants, we empirically evaluate the effectiveness of the graph-based exploratory search and discuss design implications for developing such systems.

Retrieval-Generation Synergy Augmented Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05149
  • repo_url: None
  • paper_authors: Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, Bing Qin
  • for: To improve the reasoning ability of large language models, especially on tasks requiring multi-step reasoning.
  • methods: An iterative retrieval-generation collaborative framework that combines task-relevant documents with large language models, leveraging both parametric and non-parametric knowledge and using retrieval-generation interactions to find the correct reasoning path (see the sketch after this entry).
  • results: Experiments on four question answering datasets, covering single-hop and multi-hop QA, show that the method significantly improves the reasoning ability of large language models and outperforms previous baselines.
    Abstract Large language models augmented with task-relevant documents have demonstrated impressive performance on knowledge-intensive tasks. However, regarding how to obtain effective documents, the existing methods are mainly divided into two categories. One is to retrieve from an external knowledge base, and the other is to utilize large language models to generate documents. We propose an iterative retrieval-generation collaborative framework. It is not only able to leverage both parametric and non-parametric knowledge, but also helps to find the correct reasoning path through retrieval-generation interactions, which is very important for tasks that require multi-step reasoning. We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks. Empirical results show that our method significantly improves the reasoning ability of large language models and outperforms previous baselines.
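
A sketch of the iterative retrieval-generation loop: each generated answer is folded back into the query for the next retrieval round. `retrieve` and `generate` are hypothetical callables standing in for a retriever and an LLM call, and the prompt format and iteration count are assumptions.

```python
def retrieval_generation_synergy(question: str, retrieve, generate, num_iterations: int = 2) -> str:
    query = question
    answer = ""
    for _ in range(num_iterations):
        docs = retrieve(query)                                     # retrieval informed by the latest generation
        context = "\n".join(docs)
        answer = generate(
            f"Documents:\n{context}\n\nQuestion: {question}\nAnswer with step-by-step reasoning:"
        )
        query = f"{question} {answer}"                             # generation informs the next retrieval
    return answer
```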

Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature

  • paper_url: http://arxiv.org/abs/2310.05130
  • repo_url: https://github.com/baoguangsheng/fast-detect-gpt
  • paper_authors: Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, Yue Zhang
  • for: To distinguish machine-generated from human-written content, a prerequisite for building trustworthy AI systems.
  • methods: Conditional probability curvature is used to expose discrepancies in word choice between LLMs and humans within a given context; Fast-DetectGPT replaces DetectGPT's perturbation step with a more efficient sampling step (see the sketch after this entry).
  • results: Fast-DetectGPT outperforms DetectGPT in both white-box and black-box settings across various datasets, source models, and test conditions, while accelerating detection by a factor of 340.
    Abstract Large language models (LLMs) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. To build trustworthy AI systems, it is imperative to distinguish between machine-generated and human-authored content. The leading zero-shot detector, DetectGPT, showcases commendable performance but is marred by its intensive computational costs. In this paper, we introduce the concept of conditional probability curvature to elucidate discrepancies in word choices between LLMs and humans within a given context. Utilizing this curvature as a foundational metric, we present Fast-DetectGPT, an optimized zero-shot detector, which substitutes DetectGPT's perturbation step with a more efficient sampling step. Our evaluations on various datasets, source models, and test conditions indicate that Fast-DetectGPT not only outperforms DetectGPT in both the white-box and black-box settings but also accelerates the detection process by a factor of 340, as detailed in Table 1.
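
A sketch of a sampling-based estimate of conditional probability curvature: compare the log-likelihood of the observed tokens against alternatives sampled from the model's own conditional distributions at each position. GPT-2 is a stand-in scoring model, and the exact normalization and sampling setup may differ from the paper's formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def conditional_probability_curvature(text: str, model_name: str = "gpt2", num_samples: int = 100) -> float:
    """Higher values suggest the observed tokens are unusually likely under the model's own conditionals."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, :-1]                        # (T-1, V): distribution over the next token
    log_probs = torch.log_softmax(logits, dim=-1)
    observed = log_probs[torch.arange(ids.size(1) - 1), ids[0, 1:]].sum()

    # Sample alternative tokens position-by-position from the same conditional distributions.
    samples = torch.multinomial(log_probs.exp(), num_samples, replacement=True)   # (T-1, num_samples)
    sampled_ll = log_probs.gather(1, samples).sum(dim=0)       # total log-likelihood of each sampled sequence
    return ((observed - sampled_ll.mean()) / sampled_ll.std()).item()

score = conditional_probability_curvature("The quick brown fox jumps over the lazy dog.")
```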

Enhancing Document-level Event Argument Extraction with Contextual Clues and Role Relevance

  • paper_url: http://arxiv.org/abs/2310.05991
  • repo_url: https://github.com/LWL-cpu/SCPRG-master
  • paper_authors: Wanlong Liu, Shaohuan Cheng, Dingyi Zeng, Hong Qu
  • for: To address the new challenges of document-level event argument extraction, namely long inputs and cross-sentence inference.
  • methods: The SCPRG model, built on two novel and effective modules: Span-Trigger-based Contextual Pooling (STCP), which adaptively selects and aggregates non-argument clue words using the context attention weights of specific argument-trigger pairs from a pre-trained model, and Role-based Latent Information Guidance (RLIG), which constructs latent role representations and lets them interact to capture semantic relevance among argument roles (a pooling sketch follows this entry).
  • results: SCPRG outperforms previous state-of-the-art methods on two public datasets, improving F1 by 1.13 on RAMS and 2.64 on WikiEvents.
    Abstract Document-level event argument extraction poses new challenges of long input and cross-sentence inference compared to its sentence-level counterpart. However, most prior works focus on capturing the relations between candidate arguments and the event trigger in each event, ignoring two crucial points: a) non-argument contextual clue information; b) the relevance among argument roles. In this paper, we propose a SCPRG (Span-trigger-based Contextual Pooling and latent Role Guidance) model, which contains two novel and effective modules for the above problem. The Span-Trigger-based Contextual Pooling(STCP) adaptively selects and aggregates the information of non-argument clue words based on the context attention weights of specific argument-trigger pairs from pre-trained model. The Role-based Latent Information Guidance (RLIG) module constructs latent role representations, makes them interact through role-interactive encoding to capture semantic relevance, and merges them into candidate arguments. Both STCP and RLIG introduce no more than 1% new parameters compared with the base model and can be easily applied to other event extraction models, which are compact and transplantable. Experiments on two public datasets show that our SCPRG outperforms previous state-of-the-art methods, with 1.13 F1 and 2.64 F1 improvements on RAMS and WikiEvents respectively. Further analyses illustrate the interpretability of our model.
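
A sketch of span-trigger-based contextual pooling: non-argument clue tokens are aggregated with weights taken from how strongly the candidate span and the trigger attend to them. Averaging the two attention rows and the simple weighted sum are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch

def span_trigger_contextual_pooling(token_embs, attn_weights, span_idx, trigger_idx, context_idx):
    """token_embs:   (T, d) contextual embeddings from a pretrained encoder
    attn_weights: (T, T) attention matrix, assumed already averaged over heads/layers
    context_idx:  indices of non-argument clue tokens to pool over."""
    clue_weights = (attn_weights[span_idx, context_idx] + attn_weights[trigger_idx, context_idx]) / 2
    clue_weights = clue_weights / clue_weights.sum()
    return (clue_weights.unsqueeze(-1) * token_embs[context_idx]).sum(dim=0)   # (d,) pooled clue vector

T, d = 20, 768
pooled = span_trigger_contextual_pooling(
    torch.randn(T, d), torch.softmax(torch.randn(T, T), dim=-1),
    span_idx=5, trigger_idx=2, context_idx=torch.tensor([0, 1, 3, 4, 10, 11]),
)
```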

CARLG: Leveraging Contextual Clues and Role Correlations for Improving Document-level Event Argument Extraction

  • paper_url: http://arxiv.org/abs/2310.05116
  • repo_url: None
  • paper_authors: Wanlong Liu, Wenyu Chen, Dingyi Zeng, Li Zhou, Hong Qu
  • for: To improve the accuracy of document-level event argument extraction.
  • methods: The CARLG model, which leverages contextual clues and role correlations through a Contextual Clues Aggregation (CCA) module and a Role-based Latent Information Guidance (RLIG) module, using context attention weights and role-interactive encoding.
  • results: Extensive experiments on the RAMS, WikiEvents, and MLEE datasets show that CARLG outperforms previous state-of-the-art methods by 1.26, 1.22, and 1.98 F1 points respectively, while reducing inference time by 31%.
    Abstract Document-level event argument extraction (EAE) is a crucial but challenging subtask in information extraction. Most existing approaches focus on the interaction between arguments and event triggers, ignoring two critical points: the information of contextual clues and the semantic correlations among argument roles. In this paper, we propose the CARLG model, which consists of two modules: Contextual Clues Aggregation (CCA) and Role-based Latent Information Guidance (RLIG), effectively leveraging contextual clues and role correlations for improving document-level EAE. The CCA module adaptively captures and integrates contextual clues by utilizing context attention weights from a pre-trained encoder. The RLIG module captures semantic correlations through role-interactive encoding and provides valuable information guidance with latent role representation. Notably, our CCA and RLIG modules are compact, transplantable and efficient, which introduce no more than 1% new parameters and can be easily equipped on other span-base methods with significant performance boost. Extensive experiments on the RAMS, WikiEvents, and MLEE datasets demonstrate the superiority of the proposed CARLG model. It outperforms previous state-of-the-art approaches by 1.26 F1, 1.22 F1, and 1.98 F1, respectively, while reducing the inference time by 31%. Furthermore, we provide detailed experimental analyses based on the performance gains and illustrate the interpretability of our model.

Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction

  • paper_url: http://arxiv.org/abs/2310.05092
  • repo_url: None
  • paper_authors: Jun Gao, Huan Zhao, Yice Zhang, Wei Wang, Changlong Yu, Ruifeng Xu
  • for: To explore how large language models (LLMs) can be applied to information extraction tasks in natural language processing.
  • methods: A fine-grained information extraction benchmark dataset tailored for LLMs, with augmented instructions for each information type, including task descriptions, extraction rules, output formats, and examples.
  • results: Encoder-decoder models, particularly T5 and FLAN-T5, generalize well to unseen information types, while ChatGPT adapts better to new task forms; performance is not dictated solely by model scale, with architecture, data diversity, and learning techniques also playing important roles. This work paves the way for a more refined and versatile use of LLMs in information extraction.
    Abstract Information Extraction (IE) is an essential task in Natural Language Processing. Traditional methods have relied on coarse-grained extraction with simple instructions. However, with the emergence of Large Language Models (LLMs), there is a need to adapt IE techniques to leverage the capabilities of these models. This paper introduces a fine-grained IE benchmark dataset tailored for LLMs, employing augmented instructions for each information type, which includes task descriptions, extraction rules, output formats, and examples. Through extensive evaluations, we observe that encoder-decoder models, particularly T5 and FLAN-T5, perform well in generalizing to unseen information types, while ChatGPT exhibits greater adaptability to new task forms. Our results also indicate that performance is not solely dictated by model scale, and highlight the significance of architecture, data diversity, and learning techniques. This work paves the way for a more refined and versatile utilization of LLMs in Information Extraction.

Enhancing Argument Structure Extraction with Efficient Leverage of Contextual Information

  • paper_url: http://arxiv.org/abs/2310.05073
  • repo_url: https://github.com/luoxiaoheics/ecase
  • paper_authors: Yun Luo, Zhen Yang, Fandong Meng, Yingjie Li, Jie Zhou, Yue Zhang
  • for: To improve the performance of argument structure extraction (ASE), i.e., identifying the discourse structure of arguments within documents.
  • methods: An Efficient Context-aware ASE model (ECASE) that fully exploits contextual information by enhancing modeling capacity and augmenting training data: a sequence-attention module and a distance-weighted similarity loss aggregate contextual and argumentative information, while discourse markers and sentences are randomly masked during training to reduce reliance on specific words or less informative sentences.
  • results: Experiments on five datasets from various domains show state-of-the-art performance, and ablation studies confirm the effectiveness of each module.
    Abstract Argument structure extraction (ASE) aims to identify the discourse structure of arguments within documents. Previous research has demonstrated that contextual information is crucial for developing an effective ASE model. However, we observe that merely concatenating sentences in a contextual window does not fully utilize contextual information and can sometimes lead to excessive attention on less informative sentences. To tackle this challenge, we propose an Efficient Context-aware ASE model (ECASE) that fully exploits contextual information by enhancing modeling capacity and augmenting training data. Specifically, we introduce a sequence-attention module and distance-weighted similarity loss to aggregate contextual information and argumentative information. Additionally, we augment the training data by randomly masking discourse markers and sentences, which reduces the model's reliance on specific words or less informative sentences. Our experiments on five datasets from various domains demonstrate that our model achieves state-of-the-art performance. Furthermore, ablation studies confirm the effectiveness of each module in our model.

Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration

  • paper_url: http://arxiv.org/abs/2310.05069
  • repo_url: https://github.com/ercong21/calibration
  • paper_authors: Ercong Nie, Helmut Schmid, Hinrich Schütze
  • for: To boost the zero-shot and few-shot performance of pretrained multilingual encoder models on multilingual tasks and linguistic probing.
  • methods: Input examples are reformulated into cloze-style prompts so the encoder can predict label words at the masked position without any parameter updates; because models are biased toward label words that occurred frequently during pretraining, a simple probability calibration method is proposed and compared with existing calibration techniques (see the sketch after this entry).
  • results: Combining calibration techniques with pretrained multilingual encoders yields substantial performance improvements across a wide range of tasks.
    Abstract Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model's bias toward predicting label words which frequently occurred during the pretraining. These words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.
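
One simple calibration scheme in the spirit described above: divide each label word's probability at the mask position by the model's prior preference for that word, estimated from a content-free input, then renormalize. This contextual-calibration-style correction is illustrative; the paper's proposed method and the techniques it compares against may differ in detail.

```python
import numpy as np

def calibrate_label_probs(label_probs: np.ndarray, content_free_probs: np.ndarray) -> np.ndarray:
    """Rescale label-word probabilities by the model's prior for those words and renormalize."""
    calibrated = label_probs / content_free_probs
    return calibrated / calibrated.sum()

# Example: the raw prediction favours "good" partly because it is a frequent pretraining token.
raw = np.array([0.70, 0.30])      # P(label word | cloze prompt) for ["good", "terrible"]
prior = np.array([0.80, 0.20])    # P(label word | content-free prompt, e.g. just "[MASK]")
print(calibrate_label_probs(raw, prior))   # -> roughly [0.37, 0.63]
```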

Guideline Learning for In-context Information Extraction

  • paper_url: http://arxiv.org/abs/2310.05066
  • repo_url: None
  • paper_authors: Chaoxu Pang, Yixuan Cao, Qiang Ding, Ping Luo
  • for: To improve the performance of in-context information extraction (IE) with large language models.
  • methods: A Guideline Learning (GL) framework that automatically synthesizes guidelines from a few error cases during a learning phase and retrieves helpful guidelines during inference to improve in-context learning; a self-consistency-based active learning method further improves the efficiency of GL (see the sketch after this entry).
  • results: Experiments on event extraction and relation extraction show that GL significantly improves in-context IE.
    Abstract Large language models (LLMs) can perform a new task by merely conditioning on task instructions and a few input-output examples, without optimizing any parameters. This is called In-Context Learning (ICL). In-context Information Extraction (IE) has recently garnered attention in the research community. However, the performance of In-context IE generally lags behind the state-of-the-art supervised expert models. We highlight a key reason for this shortfall: underspecified task description. The limited-length context struggles to thoroughly express the intricate IE task instructions and various edge cases, leading to misalignment in task comprehension with humans. In this paper, we propose a Guideline Learning (GL) framework for In-context IE which reflectively learns and follows guidelines. During the learning phrase, GL automatically synthesizes a set of guidelines based on a few error cases, and during inference, GL retrieves helpful guidelines for better ICL. Moreover, we propose a self-consistency-based active learning method to enhance the efficiency of GL. Experiments on event extraction and relation extraction show that GL can significantly improve the performance of in-context IE.
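
A sketch of the Guideline Learning loop: synthesize guidelines from error cases during learning, then retrieve helpful guidelines at inference time and prepend them to the extraction prompt. `llm` and `retrieve_guidelines` are hypothetical callables, and the prompt wording is an assumption.

```python
def learn_guidelines(llm, error_cases):
    """Turn each error case into one general guideline via a reflection prompt."""
    guidelines = []
    for case in error_cases:
        rule = llm(
            "The following in-context extraction was wrong.\n"
            f"Input: {case['input']}\nPrediction: {case['prediction']}\nGold: {case['gold']}\n"
            "Write one general guideline that would prevent this mistake:"
        )
        guidelines.append(rule)
    return guidelines

def extract_with_guidelines(llm, retrieve_guidelines, guidelines, task_instruction, text):
    relevant = retrieve_guidelines(guidelines, text)   # e.g. by embedding similarity to the input
    prompt = task_instruction + "\nGuidelines:\n" + "\n".join(relevant) + f"\n\nText: {text}\nExtraction:"
    return llm(prompt)
```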

sign.mt: Real-Time Multilingual Sign Language Translation Application

  • paper_url: http://arxiv.org/abs/2310.05064
  • repo_url: None
  • paper_authors: Amit Moryossef
  • for: To bridge the communication divide between the hearing and the deaf by enabling translation between spoken and signed languages.
  • methods: An open-source application built on state-of-the-art open-source models that provides real-time multilingual bi-directional translation between spoken and signed languages, with offline functionality for areas with limited connectivity.
  • results: The application delivers real-time multilingual translation and offers customizable photo-realistic sign language avatars, encouraging a more personalized and authentic user experience.
    Abstract This demo paper presents sign.mt, an open-source application pioneering real-time multilingual bi-directional translation between spoken and signed languages. Harnessing state-of-the-art open-source models, this tool aims to address the communication divide between the hearing and the deaf, facilitating seamless translation in both spoken-to-signed and signed-to-spoken translation directions. Promising reliable and unrestricted communication, sign.mt offers offline functionality, crucial in areas with limited internet connectivity. It further enhances user engagement by offering customizable photo-realistic sign language avatars, thereby encouraging a more personalized and authentic user experience. Licensed under CC BY-NC-SA 4.0, sign.mt signifies an important stride towards open, inclusive communication. The app can be used, and modified for personal and academic uses, and even supports a translation API, fostering integration into a wider range of applications. However, it is by no means a finished product. We invite the NLP community to contribute towards the evolution of sign.mt. Whether it be the integration of more refined models, the development of innovative pipelines, or user experience improvements, your contributions can propel this project to new heights. Available at https://sign.mt, it stands as a testament to what we can achieve together, as we strive to make communication accessible to all.

BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05057
  • repo_url: None
  • paper_authors: Yifan Jiang, Filip Ilievski, Kaixin Ma, Zhivar Sourati
  • for: To test whether language models possess lateral thinking ability and can defy default commonsense associations.
  • methods: BRAINTEASER, a multiple-choice question answering task created through a three-step procedure of data collection, distractor generation, and adversarial example generation, yielding 1,100 puzzles with high-quality annotations; the benchmark is enriched with semantic and contextual reconstructions of its questions to assess the consistency of lateral reasoning.
  • results: Experiments with state-of-the-art instruction-tuned and commonsense language models reveal a significant gap between human and model performance, which widens further when consistency across adversarial formats is considered.
    Abstract The success of language models has inspired the NLP community to attend to tasks that require implicit and complex reasoning, relying on human-like commonsense mechanisms. While such vertical thinking tasks have been relatively popular, lateral thinking puzzles have received little attention. To bridge this gap, we devise BRAINTEASER: a multiple-choice Question Answering task designed to test the model's ability to exhibit lateral thinking and defy default commonsense associations. We design a three-step procedure for creating the first lateral thinking benchmark, consisting of data collection, distractor generation, and generation of adversarial examples, leading to 1,100 puzzles with high-quality annotations. To assess the consistency of lateral reasoning by models, we enrich BRAINTEASER based on a semantic and contextual reconstruction of its questions. Our experiments with state-of-the-art instruction- and commonsense language models reveal a significant gap between human and model performance, which is further widened when consistency across adversarial formats is considered. We make all of our code and data available to stimulate work on developing and evaluating lateral thinking models.

Harnessing the Power of ChatGPT in Fake News: An In-Depth Exploration in Generation, Detection and Explanation

  • paper_url: http://arxiv.org/abs/2310.05046
  • repo_url: None
  • paper_authors: Yue Huang, Lichao Sun
  • For: To explore ChatGPT's proficiency in generating, explaining, and detecting fake news.
  • Methods: Four prompt methods are used to generate fake news samples, whose quality is verified through self-assessment and human evaluation; nine features characterizing fake news are derived from ChatGPT's explanations and their distribution is analyzed across multiple public datasets; ChatGPT's capacity to identify fake news is examined and a reason-aware prompt method is proposed to improve its performance.
  • Results: ChatGPT shows commendable performance in detecting fake news, but there is still room for improvement; the paper further explores extra information that could bolster its effectiveness in detecting fake news.
    Abstract The rampant spread of fake news has adversely affected society, resulting in extensive research on curbing its spread. As a notable milestone in large language models (LLMs), ChatGPT has gained significant attention due to its exceptional natural language processing capabilities. In this study, we present a thorough exploration of ChatGPT's proficiency in generating, explaining, and detecting fake news as follows. Generation -- We employ four prompt methods to generate fake news samples and prove the high quality of these samples through both self-assessment and human evaluation. Explanation -- We obtain nine features to characterize fake news based on ChatGPT's explanations and analyze the distribution of these factors across multiple public datasets. Detection -- We examine ChatGPT's capacity to identify fake news. We explore its detection consistency and then propose a reason-aware prompt method to improve its performance. Although our experiments demonstrate that ChatGPT shows commendable performance in detecting fake news, there is still room for its improvement. Consequently, we further probe into the potential extra information that could bolster its effectiveness in detecting fake news.

Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading

  • paper_url: http://arxiv.org/abs/2310.05029
  • repo_url: None
  • paper_authors: Howard Chen, Ramakanth Pasunuru, Jason Weston, Asli Celikyilmaz
  • for: To propose a new approach to long-text understanding that overcomes the limited context window of the self-attention mechanism.
  • methods: MemWalker treats the LLM as an interactive agent: the long text is first processed into a tree of summary nodes, and upon receiving a query the model navigates this tree via iterative prompting to gather relevant information before answering (see the sketch after this entry).
  • results: The method outperforms baselines that use long context windows, recurrence, and retrieval on long-text question answering tasks, and enhances explainability by highlighting the reasoning steps and the text segments relevant to the query as it reads interactively.
    Abstract Large language models (LLMs) have advanced in large strides due to the effectiveness of the self-attention mechanism that processes and compares all tokens at once. However, this mechanism comes with a fundamental issue -- the predetermined context window is bound to be limited. Despite attempts to extend the context window through methods like extrapolating the positional embedding, using recurrence, or selectively retrieving essential parts of the long sequence, long-text understanding continues to be a challenge. We propose an alternative approach which instead treats the LLM as an interactive agent, allowing it to decide how to read the text via iterative prompting. We introduce MemWalker, a method that first processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information. On long-text question answering tasks our method outperforms baseline approaches that use long context windows, recurrence, and retrieval. We show that, beyond effective reading, MemWalker enhances explainability by highlighting the reasoning steps as it interactively reads the text; pinpointing the relevant text segments related to the query.
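
A minimal sketch of the tree-building and navigation loop described in the abstract follows. The `llm` function, the summarization and navigation prompts, and the node structure are placeholders, not MemWalker's actual implementation.

```python
# Sketch of MemWalker-style interactive reading: build a tree of summaries,
# then let the model walk down the tree toward the segment that answers the query.
# `llm` is a stand-in for an actual language-model call.
from dataclasses import dataclass, field
from typing import List, Optional

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real language-model call here")

@dataclass
class Node:
    summary: str
    text: Optional[str] = None            # leaf nodes keep the raw segment
    children: List["Node"] = field(default_factory=list)

def build_tree(segments: List[str], fanout: int = 4) -> Node:
    nodes = [Node(summary=llm(f"Summarize:\n{s}"), text=s) for s in segments]
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            group = nodes[i:i + fanout]
            joined = "\n".join(c.summary for c in group)
            parents.append(Node(summary=llm(f"Summarize:\n{joined}"), children=group))
        nodes = parents
    return nodes[0]

def answer(root: Node, query: str) -> str:
    node = root
    while node.children:                   # iteratively choose the most relevant child
        options = "\n".join(f"{i}: {c.summary}" for i, c in enumerate(node.children))
        choice = llm(f"Query: {query}\nWhich option is most relevant?\n{options}\nAnswer with a number.")
        node = node.children[int(choice.strip())]
    return llm(f"Answer the query using this text.\nQuery: {query}\nText: {node.text}")
```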

Synslator: An Interactive Machine Translation Tool with Online Learning

  • paper_url: http://arxiv.org/abs/2310.05025
  • repo_url: None
  • paper_authors: Jiayi Wang, Ke Wang, Fengming Zhou, Chengyu Wang, Zhiyong Fu, Zeyu Feng, Yu Zhao, Yuqi Zhang
  • for: This paper describes Synslator, a computer-aided translation (CAT) tool that supports interactive machine translation (IMT) and performs online learning with real-time translation memories.
  • methods: The tool integrates two different neural translation models to handle translation memories across deployment environments, and additionally uses a language model to enhance translation fluency in the interactive mode.
  • results: Evaluation confirms the effectiveness of online learning through the translation models and shows a 13% increase in post-editing efficiency with Synslator's interactive functionalities. A tutorial video is available at https://youtu.be/K0vRsb2lTt8.
    Abstract Interactive machine translation (IMT) has emerged as a progression of the computer-aided translation paradigm, where the machine translation system and the human translator collaborate to produce high-quality translations. This paper introduces Synslator, a user-friendly computer-aided translation (CAT) tool that not only supports IMT, but is adept at online learning with real-time translation memories. To accommodate various deployment environments for CAT services, Synslator integrates two different neural translation models to handle translation memories for online learning. Additionally, the system employs a language model to enhance the fluency of translations in an interactive mode. In evaluation, we have confirmed the effectiveness of online learning through the translation models, and have observed a 13% increase in post-editing efficiency with the interactive functionalities of Synslator. A tutorial video is available at: https://youtu.be/K0vRsb2lTt8.
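
A minimal sketch of the real-time translation-memory lookup that an IMT tool of this kind relies on, using difflib for fuzzy matching. The memory format and the 0.6 similarity threshold are assumptions, and Synslator's neural online-learning step is not shown.

```python
# Illustrative translation-memory lookup: retrieve the most similar previously
# translated sentence so it can condition the next translation.
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

TMEntry = Tuple[str, str]   # (source sentence, target sentence)

def tm_lookup(source: str, memory: List[TMEntry], threshold: float = 0.6) -> Optional[TMEntry]:
    best, best_score = None, 0.0
    for src, tgt in memory:
        score = SequenceMatcher(None, source.lower(), src.lower()).ratio()
        if score > best_score:
            best, best_score = (src, tgt), score
    return best if best_score >= threshold else None

memory: List[TMEntry] = [("The cat sat on the mat.", "Le chat s'est assis sur le tapis.")]
print(tm_lookup("The cat sits on the mat.", memory))
```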

Hybrid Quantum-Classical Machine Learning for Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2310.10672
  • repo_url: None
  • paper_authors: Abu Kaisar Mohammad Masum, Anshul Maurya, Dhruthi Sridhar Murthy, Pratibha, Naveed Mahmud
  • for: This work explores the potential of combining quantum computing with classical machine learning for natural language processing, particularly sentiment analysis of human emotions and opinions expressed in large-scale datasets.
  • methods: The proposed methodology uses hybrid quantum-classical machine learning algorithms, including quantum kernel approaches and variational quantum circuit-based classifiers, integrated with classical dimension reduction techniques such as PCA and the Haar wavelet transform.
  • results: Experiments on English and Bengali datasets show that, after dimensionality reduction, the quantum-based hybrid algorithms perform consistently and better than classical methods.
    Abstract The collaboration between quantum computing and classical machine learning offers potential advantages in natural language processing, particularly in the sentiment analysis of human emotions and opinions expressed in large-scale datasets. In this work, we propose a methodology for sentiment analysis using hybrid quantum-classical machine learning algorithms. We investigate quantum kernel approaches and variational quantum circuit-based classifiers and integrate them with classical dimension reduction techniques such as PCA and Haar wavelet transform. The proposed methodology is evaluated using two distinct datasets, based on English and Bengali languages. Experimental results show that after dimensionality reduction of the data, performance of the quantum-based hybrid algorithms were consistent and better than classical methods.
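
The following is a fully classical sketch of the pipeline shape described in the abstract: text features, dimensionality reduction, then a kernel classifier. The RBF kernel is only a classical stand-in for the paper's quantum kernel, and the TF-IDF features, densification step, and component count are assumptions.

```python
# Classical stand-in for the hybrid pipeline: TF-IDF -> PCA -> kernel classifier.
# In the paper the kernel would be a quantum kernel; an RBF kernel is used here
# purely for illustration.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

texts = ["I love this film", "Terrible, boring movie", "What a great story", "I hated every minute"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(
    TfidfVectorizer(),
    FunctionTransformer(lambda X: X.toarray(), accept_sparse=True),  # PCA needs dense input
    PCA(n_components=2),          # dimension reduction before the (quantum) kernel
    SVC(kernel="rbf"),            # classical stand-in for a quantum kernel classifier
)
pipeline.fit(texts, labels)
print(pipeline.predict(["great movie", "boring film"]))
```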

WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural Language Instruction

  • paper_url: http://arxiv.org/abs/2310.05009
  • repo_url: https://github.com/casparswift/wikiins
  • paper_authors: Xiang Chen, Zheng Li, Xiaojun Wan
  • for: This work targets the problem of controlled text editing by natural language instruction.
  • methods: The authors preprocess the Wikipedia edit history database to extract raw data, crowdsource high-quality validation, test, and small-scale training sets, and propose automatic approaches to generate a large-scale "silver" training set.
  • results: Analysis and experiments on the WikiIns dataset yield evaluation results and an edit-intention analysis that may assist ongoing research on text editing.
    Abstract Text editing, i.e., the process of modifying or manipulating text, is a crucial step in human writing process. In this paper, we study the problem of controlled text editing by natural language instruction. According to a given instruction that conveys the edit intention and necessary information, an original draft text is required to be revised into a target text. Existing automatically constructed datasets for this task are limited because they do not have informative natural language instruction. The informativeness requires the information contained in the instruction to be enough to produce the revised text. To address this limitation, we build and release WikiIns, a high-quality controlled text editing dataset with improved informativeness. We first preprocess the Wikipedia edit history database to extract the raw data (WikiIns-Raw). Then we crowdsource high-quality validation and test sets, as well as a small-scale training set (WikiIns-Gold). With the high-quality annotated dataset, we further propose automatic approaches to generate a large-scale ``silver'' training set (WikiIns-Silver). Finally, we provide some insightful analysis on our WikiIns dataset, including the evaluation results and the edit intention analysis. Our analysis and the experiment results on WikiIns may assist the ongoing research on text editing. The dataset, source code and annotation guideline are available at https://github.com/CasparSwift/WikiIns.
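
A small sketch of what an instruction-guided editing record plausibly looks like and how it could be flattened into a seq2seq training pair. The field names are assumptions for illustration, not the released WikiIns schema; see the linked repository for the actual format.

```python
# Hypothetical record layout for instruction-guided text editing; the real
# WikiIns schema may differ (see the repository linked above).
from typing import Dict, Tuple

record: Dict[str, str] = {
    "instruction": "Replace the vague date with the exact year the bridge opened.",
    "draft": "The bridge opened to traffic a few years ago.",
    "target": "The bridge opened to traffic in 2015.",
}

def to_seq2seq_example(rec: Dict[str, str]) -> Tuple[str, str]:
    source = f"Instruction: {rec['instruction']}\nDraft: {rec['draft']}"
    return source, rec["target"]

src, tgt = to_seq2seq_example(record)
print(src)
print("->", tgt)
```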

MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering

  • paper_url: http://arxiv.org/abs/2310.05007
  • repo_url: None
  • paper_authors: Xiusi Chen, Jyun-Yu Jiang, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Wei Wang
  • for: Achieve satisfactory few-shot question answering results when only a few training samples are available.
  • methods: A minimal data augmentation framework based on an approximate graph algorithm and unsupervised question generation, which selects the most informative sentences for fine-tuning on open-domain QA.
  • results: Empirical results show that MinPrompt achieves comparable or better accuracy than baselines with high efficiency, improving F-1 scores by up to 27.5% across benchmark datasets.
    Abstract Few-shot question answering (QA) aims at achieving satisfactory results on machine question answering when only a few training samples are available. Recent advances mostly rely on the power of pre-trained large language models (LLMs) and fine-tuning in specific settings. Although the pre-training stage has already equipped LLMs with powerful reasoning capabilities, LLMs still need to be fine-tuned to adapt to specific domains to achieve the best results. In this paper, we propose to select the most informative data for fine-tuning, thereby improving the efficiency of the fine-tuning process with comparative or even better accuracy on the open-domain QA task. We present MinPrompt, a minimal data augmentation framework for open-domain QA based on an approximate graph algorithm and unsupervised question generation. We transform the raw text into a graph structure to build connections between different factual sentences, then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text. We then generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model. Empirical results on several benchmark datasets and theoretical analysis show that MinPrompt is able to achieve comparable or better results than baselines with a high degree of efficiency, bringing improvements in F-1 scores by up to 27.5%.
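
A minimal sketch of the "minimal set of sentences that covers the most information" idea, implemented as greedy set cover over shared entities. The paper's actual graph algorithm and entity extraction are not reproduced here; the capitalized-word heuristic and overlap-based coverage criterion are assumptions.

```python
# Greedy approximation of a minimal covering sentence set: repeatedly add the
# sentence that covers the most not-yet-covered entities. The entity extractor
# is a naive capitalized-word heuristic, used only for illustration.
import re
from typing import List, Set

def entities(sentence: str) -> Set[str]:
    return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", sentence))

def minimal_cover(sentences: List[str]) -> List[str]:
    universe: Set[str] = set().union(*(entities(s) for s in sentences))
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = list(sentences)
    while covered != universe and remaining:
        best = max(remaining, key=lambda s: len(entities(s) - covered))
        if not entities(best) - covered:
            break
        chosen.append(best)
        covered |= entities(best)
        remaining.remove(best)
    return chosen

docs = [
    "Marie Curie won the Nobel Prize in Physics.",
    "Marie Curie was born in Warsaw.",
    "The Nobel Prize is awarded in Stockholm.",
]
print(minimal_cover(docs))
```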

Self-Knowledge Guided Retrieval Augmentation for Large Language Models

  • paper_url: http://arxiv.org/abs/2310.05002
  • repo_url: https://github.com/THUNLP-MT/SKR
  • paper_authors: Yile Wang, Peng Li, Maosong Sun, Yang Liu
  • for: Improve the performance of large language models (LLMs) without task-specific fine-tuning.
  • methods: Self-Knowledge guided Retrieval augmentation (SKR), which lets LLMs recognize what they know and do not know, refer to previously encountered questions, and adaptively call external resources when handling new questions.
  • results: SKR outperforms chain-of-thought based and fully retrieval-based methods on multiple datasets, evaluated with either InstructGPT or ChatGPT.
    Abstract Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Despite the success, the knowledge stored in the parameters of LLMs could still be incomplete and difficult to update due to the computational costs. As complementary, retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. However, we find that the retrieved knowledge does not always help and even has a negative impact on original responses occasionally. To better make use of both internal knowledge and external world knowledge, we investigate eliciting the model's ability to recognize what they know and do not know (which is also called self-knowledge) and propose Self-Knowledge guided Retrieval augmentation (SKR), a simple yet effective method which can let LLMs refer to the questions they have previously encountered and adaptively call for external resources when dealing with new questions. We evaluate SKR on multiple datasets and demonstrate that it outperforms chain-of-thought based and fully retrieval-based methods by using either InstructGPT or ChatGPT.
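
A minimal sketch of the self-knowledge routing idea: first elicit whether the model believes it can answer directly, and only call the retriever when it says it cannot. The `llm` and `retrieve` callables and the yes/no elicitation prompt are placeholders, not the paper's implementation.

```python
# Sketch of self-knowledge guided retrieval: retrieval is only triggered when
# the model judges the question to be outside its own knowledge.
from typing import Callable, List

def skr_answer(question: str,
               llm: Callable[[str], str],
               retrieve: Callable[[str], List[str]]) -> str:
    judgment = llm(
        f"Question: {question}\n"
        "Can you answer this confidently from your own knowledge? Reply Yes or No."
    )
    if judgment.strip().lower().startswith("yes"):
        return llm(f"Answer the question.\nQuestion: {question}")
    passages = "\n".join(retrieve(question))
    return llm(f"Answer using the passages below.\nPassages:\n{passages}\nQuestion: {question}")
```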

TopicAdapt- An Inter-Corpora Topics Adaptation Approach

  • paper_url: http://arxiv.org/abs/2310.04978
  • repo_url: None
  • paper_authors: Pritom Saha Akash, Trisha Das, Kevin Chen-Chuan Chang
  • for: This work proposes a neural topic model that improves topic modeling performance in practical scenarios.
  • methods: The model, TopicAdapt, adapts relevant topics from a related source corpus and also discovers new topics in the target corpus that are absent from the source.
  • results: Experiments on multiple datasets from diverse domains show that the proposed model outperforms state-of-the-art topic models.
    Abstract Topic models are popular statistical tools for detecting latent semantic topics in a text corpus. They have been utilized in various applications across different fields. However, traditional topic models have some limitations, including insensitivity to user guidance, sensitivity to the amount and quality of data, and the inability to adapt learned topics from one corpus to another. To address these challenges, this paper proposes a neural topic model, TopicAdapt, that can adapt relevant topics from a related source corpus and also discover new topics in a target corpus that are absent in the source corpus. The proposed model offers a promising approach to improve topic modeling performance in practical scenarios. Experiments over multiple datasets from diverse domains show the superiority of the proposed model against the state-of-the-art topic models.

Exploring the Usage of Chinese Pinyin in Pretraining

  • paper_url: http://arxiv.org/abs/2310.04960
  • repo_url: None
  • paper_authors: Baojun Wang, Kun Xu, Lifeng Shang
  • for: Improve the robustness of Chinese language models to ASR-introduced errors, especially those caused by words with the same or similar pronunciation (SSP errors).
  • methods: The paper explores several ways of using pinyin in pretraining and proposes PmBERT, a pretraining method that uses characters and pinyin in parallel so that their representations are fused through carefully designed pretraining tasks.
  • results: Experiments show that the new pretraining method makes the model more robust than state-of-the-art models on both a constructed noise-added dataset and a public error-correction dataset.
    Abstract Unlike alphabetic languages, Chinese spelling and pronunciation are different. Both characters and pinyin take an important role in Chinese language understanding. In Chinese NLP tasks, we almost adopt characters or words as model input, and few works study how to use pinyin. However, pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors. Most of these errors are caused by the same or similar pronunciation words, and we refer to this type of error as SSP(the same or similar pronunciation) errors for short. In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT. Our method uses characters and pinyin in parallel for pretraining. Through delicate pretraining tasks, the characters and pinyin representation are fused, which can enhance the error tolerance for SSP errors. We do comprehensive experiments and ablation tests to explore what makes a robust phonetic enhanced Chinese language model. The experimental results on both the constructed noise-added dataset and the public error-correction dataset demonstrate that our model is more robust compared to SOTA models.
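
A minimal sketch of pairing each character with its pinyin to form a parallel character/pinyin input, using the pypinyin package. How the two streams are fused during pretraining is the paper's contribution and is not reproduced here.

```python
# Pair each Chinese character with its pinyin so a model can consume both
# streams in parallel. Requires `pip install pypinyin`; assumes pure-Chinese input
# so that characters and pinyin align one-to-one.
from pypinyin import lazy_pinyin

def char_pinyin_pairs(text: str):
    chars = list(text)
    pinyins = lazy_pinyin(text)     # one toneless pinyin string per character
    return list(zip(chars, pinyins))

print(char_pinyin_pairs("语言模型"))
# e.g. [('语', 'yu'), ('言', 'yan'), ('模', 'mo'), ('型', 'xing')]
```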

Towards Better Chain-of-Thought Prompting Strategies: A Survey

  • paper_url: http://arxiv.org/abs/2310.04959
  • repo_url: None
  • paper_authors: Zihan Yu, Liang He, Zhen Wu, Xinyu Dai, Jiajun Chen
  • for: This survey examines the effectiveness of Chain-of-Thought (CoT) prompting, systematically analyzing its key factors and how to better apply it in different scenarios.
  • methods: The paper reviews a wide range of current research and presents a systematic, comprehensive analysis of the factors that may influence the effect of CoT prompting and of how to apply it across applications.
  • results: The survey identifies open challenges and proposes future directions for CoT prompting, serving as an overall reference for related research.
    Abstract Chain-of-Thought (CoT), a step-wise and coherent reasoning chain, shows its impressive strength when used as a prompting strategy for large language models (LLM). Recent years, the prominent effect of CoT prompting has attracted emerging research. However, there still lacks of a systematic summary about key factors of CoT prompting and comprehensive guide for prompts utilizing. For a deeper understanding about CoT prompting, we survey on a wide range of current research, presenting a systematic and comprehensive analysis on several factors that may influence the effect of CoT prompting, and introduce how to better apply it in different applications under these discussions. We further analyze the challenges and propose some future directions about CoT prompting. This survey could provide an overall reference on related research.
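
For readers unfamiliar with the strategy the survey analyzes, a minimal few-shot chain-of-thought prompt looks roughly like the following; the arithmetic demonstration follows the style popularized in the CoT literature and is illustrative only.

```python
# A minimal few-shot chain-of-thought prompt: the demonstration includes the
# intermediate reasoning, nudging the model to reason step by step as well.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A library had 120 books and received 3 boxes of 25 books each. How many books does it have now?
A:"""
print(cot_prompt)
```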

Domain Knowledge Graph Construction Via A Simple Checker

  • paper_url: http://arxiv.org/abs/2310.04949
  • repo_url: None
  • paper_authors: Yueling Zeng, Li-C. Wang
  • for: Provide semiconductor chip design companies with a language-model-based knowledge graph construction method that satisfies two key deployment considerations: confidentiality and scalability.
  • methods: An oracle-checker scheme that leverages GPT-3.5; the paper shows that the essence of the problem lies in distilling the domain expert's background knowledge.
  • results: Using the RISC-V unprivileged ISA specification as an example, the paper explains the key ideas and discusses the practicality of the proposed oracle-checker approach.
    Abstract With the availability of large language models, there is a growing interest for semiconductor chip design companies to leverage the technologies. For those companies, deployment of a new methodology must include two important considerations: confidentiality and scalability. In this context, this work tackles the problem of knowledge graph construction from hardware-design domain texts. We propose an oracle-checker scheme to leverage the power of GPT3.5 and demonstrate that the essence of the problem is in distillation of domain expert's background knowledge. Using RISC-V unprivileged ISA specification as an example, we explain key ideas and discuss practicality of our proposed oracle-checker approach.
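
A minimal sketch of an oracle-checker loop in the spirit described by the abstract: an LLM (the oracle) proposes knowledge-graph triples from a specification paragraph and a lightweight checker accepts or rejects them. The triple format, the checker rule, and the `llm` call are all illustrative assumptions, not the paper's actual scheme.

```python
# Oracle-checker sketch: the oracle proposes (subject, relation, object) triples;
# the checker keeps only triples whose subject and object actually appear in the
# source text. Formats and rules here are assumptions for illustration.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def parse_triples(raw: str) -> List[Triple]:
    # Expect one "subject | relation | object" triple per line.
    triples = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def check(triple: Triple, source_text: str) -> bool:
    subj, _, obj = triple
    text = source_text.lower()
    return subj.lower() in text and obj.lower() in text

def extract_kg(paragraph: str, llm: Callable[[str], str]) -> List[Triple]:
    raw = llm(
        "Extract knowledge-graph triples from the text, one per line, "
        f"formatted as 'subject | relation | object'.\nText: {paragraph}"
    )
    return [t for t in parse_triples(raw) if check(t, paragraph)]
```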

TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting

  • paper_url: http://arxiv.org/abs/2310.04948
  • repo_url: None
  • paper_authors: Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, Yan Liu
  • for: Develop a new representation learning framework for time series that improves forecasting accuracy.
  • methods: The framework builds on two key inductive biases: (i) decomposing the complex interaction among trend, seasonal, and residual components; and (ii) introducing selection-based prompts to facilitate distribution adaptation in non-stationary time series.
  • results: Experiments on multiple time series benchmark datasets show significant performance gains over existing methods, not only in standard supervised settings but also on previously unseen datasets and with multi-modal inputs, highlighting TEMPO's potential as a foundational model-building framework.
    Abstract The past decade has witnessed significant advances in time series modeling with deep learning. While achieving state-of-the-art results, the best-performing architectures vary highly across applications and domains. Meanwhile, for natural language processing, the Generative Pre-trained Transformer (GPT) has demonstrated impressive performance via training one general-purpose model across various textual datasets. It is intriguing to explore whether GPT-type architectures can be effective for time series, capturing the intrinsic dynamic attributes and leading to significant accuracy improvements. In this paper, we propose a novel framework, TEMPO, that can effectively learn time series representations. We focus on utilizing two essential inductive biases of the time series task for pre-trained models: (i) decomposition of the complex interaction between trend, seasonal and residual components; and (ii) introducing the selection-based prompts to facilitate distribution adaptation in non-stationary time series. TEMPO expands the capability for dynamically modeling real-world temporal phenomena from data within diverse domains. Our experiments demonstrate the superior performance of TEMPO over state-of-the-art methods on a number of time series benchmark datasets. This performance gain is observed not only in standard supervised learning settings but also in scenarios involving previously unseen datasets as well as in scenarios with multi-modal inputs. This compelling finding highlights TEMPO's potential to constitute a foundational model-building framework.
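
A minimal sketch of the trend/seasonal/residual decomposition that the first inductive bias refers to, using STL from statsmodels on a synthetic weekly-seasonal series. The prompt-selection component and how the components feed the GPT backbone are not shown and would follow the paper.

```python
# Decompose a series into trend, seasonal, and residual components with STL;
# in a TEMPO-style model each component would then be embedded and passed to a
# transformer backbone together with selected prompts (not shown here).
import numpy as np
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
t = np.arange(365, dtype=float)
series = 0.05 * t + 3.0 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=t.size)

result = STL(series, period=7).fit()
trend, seasonal, residual = result.trend, result.seasonal, result.resid
print(trend[:3], seasonal[:3], residual[:3])
```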