cs.CL - 2023-07-07

Testing the Predictions of Surprisal Theory in 11 Languages

paper_url: http://arxiv.org/abs/2307.03667
repo_url: None
paper_authors: Ethan Gotlieb Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, Roger P. Levy
for: investigate the relationship between surprisal and reading times in eleven different languages, distributed across five language families.
methods: derive estimates from language models trained on monolingual and multilingual corpora, and test three predictions associated with surprisal theory.
results: all three predictions are borne out crosslinguistically, offering the most robust link to date between information theory and incremental language processing across languages.

Abstract
A fundamental result in psycholinguistics is that less predictable words take a longer time to process. One theoretical explanation for this finding is Surprisal Theory (Hale, 2001; Levy, 2008), which quantifies a word's predictability as its surprisal, i.e. its negative log-probability given a context. While evidence supporting the predictions of Surprisal Theory have been replicated widely, most have focused on a very narrow slice of data: native English speakers reading English texts. Indeed, no comprehensive multilingual analysis exists. We address this gap in the current literature by investigating the relationship between surprisal and reading times in eleven different languages, distributed across five language families. Deriving estimates from language models trained on monolingual and multilingual corpora, we test three predictions associated with surprisal theory: (i) whether surprisal is predictive of reading times; (ii) whether expected surprisal, i.e. contextual entropy, is predictive of reading times; (iii) and whether the linking function between surprisal and reading times is linear. We find that all three predictions are borne out crosslinguistically. By focusing on a more diverse set of languages, we argue that these results offer the most robust link to-date between information theory and incremental language processing across languages.
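
The two quantities the abstract defines can be illustrated with a minimal sketch. The next-word distribution below is made up for illustration; in the paper the probabilities come from trained language models, which are not reproduced here.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: the negative log-probability of a word in context."""
    return -math.log2(prob)

def contextual_entropy(next_word_probs: dict[str, float]) -> float:
    """Expected surprisal (in bits) over the next-word distribution."""
    return sum(p * surprisal(p) for p in next_word_probs.values() if p > 0)

# Toy next-word distribution for some context (assumed numbers, not real data).
toy_distribution = {"dog": 0.5, "cat": 0.25, "ferret": 0.25}

word_surprisal = surprisal(toy_distribution["cat"])  # 2.0 bits
entropy = contextual_entropy(toy_distribution)       # 1.5 bits
```

Surprisal is a property of the word that actually occurs, whereas contextual entropy depends only on the distribution before the word is seen, which is why the two can be tested as separate predictors of reading time.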

The distribution of discourse relations within and across turns in spontaneous conversation

paper_url: http://arxiv.org/abs/2307.03645
repo_url: None
paper_authors: S. Magalí López Cortez, Cassandra L. Jacobs
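for: This paper studies how discourse relations (DRs) are used in spontaneous conversation.
methods: It adapts a system of DRs designed for written language to spontaneous dialogue using crowdsourced annotations from novice annotators, and compares DR annotation patterns within and across speakers and within and across turns.
results: Different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. The annotations are also of sufficient quality to be predicted from embeddings of discourse units.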

Abstract
Time pressure and topic negotiation may impose constraints on how people leverage discourse relations (DRs) in spontaneous conversational contexts. In this work, we adapt a system of DRs for written language to spontaneous dialogue using crowdsourced annotations from novice annotators. We then test whether discourse relations are used differently across several types of multi-utterance contexts. We compare the patterns of DR annotation within and across speakers and within and across turns. Ultimately, we find that different discourse contexts produce distinct distributions of discourse relations, with single-turn annotations creating the most uncertainty for annotators. Additionally, we find that the discourse relation annotations are of sufficient quality to predict from embeddings of discourse units.

Text Simplification of Scientific Texts for Non-Expert Readers

paper_url: http://arxiv.org/abs/2307.03569
repo_url: None
paper_authors: Björn Engelmann, Fabian Haak, Christin Katharina Kreutz, Narjes Nikzad Khasmakhi, Philipp Schaer
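for: This work aims to help non-expert readers access the core information in scientific abstracts.
methods: It contributes three runs using out-of-the-box summarization models (two based on T5, one based on PEGASUS) and one run using ChatGPT with complex phrase identification, submitted to the SimpleText lab (Task 3).
results: The contribution is the four submitted simplification runs; the abstract does not report quantitative results.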

Abstract
Reading levels are highly individual and can depend on a text's language, a person's cognitive abilities, or knowledge on a topic. Text simplification is the task of rephrasing a text to better cater to the abilities of a specific target reader group. Simplification of scientific abstracts helps non-experts to access the core information by bypassing formulations that require domain or expert knowledge. This is especially relevant for, e.g., cancer patients reading about novel treatment options. The SimpleText lab hosts the simplification of scientific abstracts for non-experts (Task 3) to advance this field. We contribute three runs employing out-of-the-box summarization models (two based on T5, one based on PEGASUS) and one run using ChatGPT with complex phrase identification.

DWReCO at CheckThat! 2023: Enhancing Subjectivity Detection through Style-based Data Sampling

paper_url: http://arxiv.org/abs/2307.03550
repo_url: None
paper_authors: Ipek Baris Schlicht, Lynn Khellaf, Defne Altiok
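for: This paper describes the authors' submission to the subjectivity detection task at the CheckThat! Lab.
methods: To tackle class imbalance, GPT-3 models generate additional training material from prompts in different styles, drawn from a subjectivity checklist based on a journalistic perspective. The extended training set is used to fine-tune language-specific transformer models.
results: Experiments in English, German, and Turkish show that different subjective styles are effective across all languages, and that style-based oversampling works better than paraphrasing in Turkish and English. The GPT-3 models sometimes produce lacklustre results when generating style-based text in non-English languages.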

Abstract
This paper describes our submission for the subjectivity detection task at the CheckThat! Lab. To tackle class imbalances in the task, we have generated additional training materials with GPT-3 models using prompts of different styles from a subjectivity checklist based on journalistic perspective. We used the extended training set to fine-tune language-specific transformer models. Our experiments in English, German and Turkish demonstrate that different subjective styles are effective across all languages. In addition, we observe that the style-based oversampling is better than paraphrasing in Turkish and English. Lastly, the GPT-3 models sometimes produce lacklustre results when generating style-based texts in non-English languages.

Quantifying the perceptual value of lexical and non-lexical channels in speech

paper_url: http://arxiv.org/abs/2307.03534
repo_url: None
paper_authors: Sarenne Wallbridge, Peter Bell, Catherine Lai
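for: This paper studies the value of non-lexical information in dialogue.
methods: It introduces a generalised paradigm that quantifies the perceptual value of the non-lexical channel, using both accuracy and entropy reduction, across unconstrained lexical content.
results: Non-lexical information has a consistent effect on expectations of upcoming dialogue: even when it leads to poorer discriminative turn judgements than lexical content alone, it yields higher consensus among participants.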

Abstract
Speech is a fundamental means of communication that can be seen to provide two channels for transmitting information: the lexical channel of which words are said, and the non-lexical channel of how they are spoken. Both channels shape listener expectations of upcoming communication; however, directly quantifying their relative effect on expectations is challenging. Previous attempts require spoken variations of lexically-equivalent dialogue turns or conspicuous acoustic manipulations. This paper introduces a generalised paradigm to study the value of non-lexical information in dialogue across unconstrained lexical content. By quantifying the perceptual value of the non-lexical channel with both accuracy and entropy reduction, we show that non-lexical information produces a consistent effect on expectations of upcoming dialogue: even when it leads to poorer discriminative turn judgements than lexical content alone, it yields higher consensus among participants.

AI-UPV at EXIST 2023 – Sexism Characterization Using Large Language Models Under The Learning with Disagreements Regime

paper_url: http://arxiv.org/abs/2307.03385
repo_url: https://github.com/angelfelipemp/sexism-llm-learning-with-disagreement
paper_authors: Angel Felipe Magnossão de Paula, Giulia Rizzi, Elisabetta Fersini, Damiano Spina
for: The paper aims to develop an automated system for detecting sexism and other hateful behaviors on social media to promote a more inclusive and respectful online environment.
methods: The proposed approach uses large language models (mBERT and XLM-RoBERTa) and ensemble strategies to identify and classify sexism in English and Spanish, without relying on aggregated labels.
results: The system achieved fourth place in Task 2 at EXIST and first place in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79, outperforming the individual large language models.

Abstract
With the increasing influence of social media platforms, it has become crucial to develop automated systems capable of detecting instances of sexism and other disrespectful and hateful behaviors to promote a more inclusive and respectful online environment. Nevertheless, these tasks are considerably challenging considering different hate categories and the author's intentions, especially under the learning with disagreements regime. This paper describes AI-UPV team's participation in the EXIST (sEXism Identification in Social neTworks) Lab at CLEF 2023. The proposed approach aims at addressing the task of sexism identification and characterization under the learning with disagreements paradigm by training directly from the data with disagreements, without using any aggregated label. Yet, performances considering both soft and hard evaluations are reported. The proposed system uses large language models (i.e., mBERT and XLM-RoBERTa) and ensemble strategies for sexism identification and classification in English and Spanish. In particular, our system is articulated in three different pipelines. The ensemble approach outperformed the individual large language models obtaining the best performances both adopting a soft and a hard label evaluation. This work describes the participation in all the three EXIST tasks, considering a soft evaluation, it obtained fourth place in Task 2 at EXIST and first place in Task 3, with the highest ICM-Soft of -2.32 and a normalized ICM-Soft of 0.79. The source code of our approaches is publicly available at https://github.com/AngelFelipeMP/Sexism-LLM-Learning-With-Disagreement.
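
Training "directly from the data with disagreements" is often done by turning annotator votes into a soft label and scoring predictions against it. The sketch below illustrates that general idea only; the abstract does not specify the loss the AI-UPV system uses, and the class names and votes are assumptions made for illustration.

```python
from collections import Counter
import math

def soft_label(votes: list[str], classes: list[str]) -> list[float]:
    """Turn raw annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def cross_entropy(soft_target: list[float], predicted: list[float]) -> float:
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(soft_target, predicted) if t > 0)

# Hypothetical example: six annotators disagree on one post.
classes = ["sexist", "not_sexist"]
votes = ["sexist", "sexist", "not_sexist", "sexist", "not_sexist", "sexist"]

target = soft_label(votes, classes)                 # [4/6, 2/6], no majority vote taken
loss_matched = cross_entropy(target, target)        # prediction equal to the vote split
loss_uniform = cross_entropy(target, [0.5, 0.5])    # prediction ignoring the vote split
```

Keeping the full vote distribution, rather than aggregating to a single label, lets a soft evaluation reward a model for reproducing the degree of disagreement.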

A Side-by-side Comparison of Transformers for English Implicit Discourse Relation Classification

paper_url: http://arxiv.org/abs/2307.03378
repo_url: None
paper_authors: Bruce W. Lee, BongSeok Yang, Jason Hyung-Jong Lee
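for: This paper compares pre-trained language models on implicit discourse relation classification, so that researchers can make full use of publicly available models in discourse analysis.
methods: Seven pre-trained language models are fine-tuned and compared on PDTB-3. The comparison covers models with sentence-level pre-training objectives (NSP, SBO, SOP) as well as similar-sized models with MLM and full attention.
results: Contrary to earlier reports (Shi and Demberg, 2019b), sentence-level pre-training objectives (NSP, SBO, SOP) generally fail to produce the best-performing model for implicit discourse relation classification. Similar-sized PLMs with MLM and full attention perform better, raising the state of the art to 0.671 accuracy.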

Abstract
Though discourse parsing can help multiple NLP fields, there has been no wide language model search done on implicit discourse relation classification. This hinders researchers from fully utilizing public-available models in discourse analysis. This work is a straightforward, fine-tuned discourse performance comparison of seven pre-trained language models. We use PDTB-3, a popular discourse relation annotated dataset. Through our model search, we raise SOTA to 0.671 ACC and obtain novel observations. Some are contrary to what has been reported before (Shi and Demberg, 2019b), that sentence-level pre-training objectives (NSP, SBO, SOP) generally fail to produce the best performing model for implicit discourse relation classification. Counterintuitively, similar-sized PLMs with MLM and full attention led to better performance.

Mitigating Negative Transfer with Task Awareness for Sexism, Hate Speech, and Toxic Language Detection

paper_url: http://arxiv.org/abs/2307.03377
repo_url: https://github.com/angelfelipemp/mitigating-negative-transfer-with-ta
paper_authors: Angel Felipe Magnossão de Paula, Paolo Rosso, Damiano Spina
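for: This paper addresses how to mitigate the negative transfer problem in Multi-Task Learning (MTL).
methods: It proposes a new approach based on the task awareness concept, implemented in two unified architectures that detect sexism, hate speech, and toxic language in text comments.
results: The approach reduces negative transfer and improves performance over the classic MTL solution. The proposed architectures set a new state of the art on the EXIST-2021 and HatEval-2019 benchmarks.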

Abstract
This paper proposes a novel approach to mitigate the negative transfer problem. In the field of machine learning, the common strategy is to apply the Single-Task Learning approach in order to train a supervised model to solve a specific task. Training a robust model requires a lot of data and a significant amount of computational resources, making this solution unfeasible in cases where data are unavailable or expensive to gather. Therefore another solution, based on the sharing of information between tasks, has been developed: Multi-Task Learning (MTL). Despite the recent developments regarding MTL, the problem of negative transfer has still to be solved. Negative transfer is a phenomenon that occurs when noisy information is shared between tasks, resulting in a drop in performance. This paper proposes a new approach to mitigate the negative transfer problem based on the task awareness concept. The proposed approach results in diminishing the negative transfer together with an improvement of performance over classic MTL solution. Moreover, the proposed approach has been implemented in two unified architectures to detect Sexism, Hate Speech, and Toxic Language in text comments. The proposed architectures set a new state-of-the-art both in EXIST-2021 and HatEval-2019 benchmarks.

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

paper_url: http://arxiv.org/abs/2307.03354
repo_url: None
paper_authors: Sara Papi, Peidong Wan, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur
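for: This paper targets joint streaming speech recognition and speech translation, aiming for high output quality at low latency.
methods: It introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs with a single decoder. The model is trained with a joint token-level serialized output method that interleaves source and target words using an off-the-shelf textual aligner.
results: Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings show that the approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, the model matches or improves on the output quality of separate ASR and ST models, with an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.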

Abstract
In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings demonstrate that our approach achieves the best quality-latency balance. With an average ASR latency of 1s and ST latency of 1.3s, our model shows no degradation or even improves output quality compared to separate ASR and ST models, yielding an average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
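
The interleaving of source and target words can be sketched as follows. This is one plausible reading of the idea, not the authors' exact procedure: each target word is emitted as soon as every source word it is aligned to has been emitted. The `<asr>`/`<st>` tags, the example sentence, and the alignment pairs are all assumptions made for illustration.

```python
def interleave(source, target, alignment, asr_tag="<asr>", st_tag="<st>"):
    """Build one serialized sequence mixing source (ASR) and target (ST) words.

    `alignment` is a list of (source_index, target_index) pairs, as a textual
    aligner might produce. Unaligned target words are emitted as early as
    target order allows.
    """
    # Latest source position each target word depends on.
    needed = {}
    for s_idx, t_idx in alignment:
        needed[t_idx] = max(needed.get(t_idx, -1), s_idx)

    output, next_target = [], 0
    for s_idx, s_word in enumerate(source):
        output.append((asr_tag, s_word))
        # Emit target words, in order, whose aligned source words are all out.
        while next_target < len(target) and needed.get(next_target, -1) <= s_idx:
            output.append((st_tag, target[next_target]))
            next_target += 1
    # Flush anything left (not reached when all alignment indices are valid).
    output.extend((st_tag, word) for word in target[next_target:])
    return output

source = ["la", "casa", "rossa"]           # Italian transcript
target = ["the", "red", "house"]           # English translation
alignment = [(0, 0), (1, 2), (2, 1)]       # la-the, casa-house, rossa-red

serialized = interleave(source, target, alignment)
words_only = [word for _, word in serialized]
# ['la', 'the', 'casa', 'rossa', 'red', 'house']
```

Because "red" must wait for "rossa", the reordering between the two languages directly determines how long translation tokens are delayed, which is the latency trade-off the abstract refers to.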

BiPhone: Modeling Inter Language Phonetic Influences in Text

paper_url: http://arxiv.org/abs/2307.03322
repo_url: None
paper_authors: Abhirut Gupta, Ananya B. Sai, Richard Sproat, Yuri Vasilevski, James S. Ren, Ambarish Jash, Sukhdeep S. Sodhi, Aravindan Raghuveer
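for: This paper studies errors in second-language (L2) text written by people who, because of technology asymmetries, are forced to use the Web in a language in which they have low literacy. These errors are influenced by the writer's native language (L1).
methods: It mines phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2, and plugs them into a generative model (Bi-Phone) that synthetically produces corrupted L2 text.
results: Human evaluations show that Bi-Phone generates plausible corruptions that differ across L1s and have widespread coverage on the Web. Corrupting SuperGLUE with this technique (FunGLUE) shows that SoTA language understanding models perform poorly. A new phoneme prediction pre-training task helps byte models recover performance close to SuperGLUE. The FunGLUE benchmark is released to promote further research on phonetically robust language models.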

Abstract
A large number of people are forced to use the Web in a language they have low literacy in due to technology asymmetries. Written text in the second language (L2) from such users often contains a large number of errors that are influenced by their native language (L1). We propose a method to mine phoneme confusions (sounds in L2 that an L1 speaker is likely to conflate) for pairs of L1 and L2. These confusions are then plugged into a generative model (Bi-Phone) for synthetically producing corrupted L2 text. Through human evaluations, we show that Bi-Phone generates plausible corruptions that differ across L1s and also have widespread coverage on the Web. We also corrupt the popular language understanding benchmark SuperGLUE with our technique (FunGLUE for Phonetically Noised GLUE) and show that SoTA language understanding models perform poorly. We also introduce a new phoneme prediction pre-training task which helps byte models to recover performance close to SuperGLUE. Finally, we also release the FunGLUE benchmark to promote further research in phonetically robust language models. To the best of our knowledge, FunGLUE is the first benchmark to introduce L1-L2 interactions in text.
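
As a rough intuition for how a confusion table can drive synthetic corruption, here is a toy, spelling-level stand-in. It is not the Bi-Phone model: the real method mines phoneme confusions per L1-L2 pair, whereas the table, the substitution rate, and the example sentence below are invented for illustration.

```python
import random

# Hypothetical confusion table (illustrative only, not mined from data).
CONFUSIONS = {"th": "d", "v": "w"}

def corrupt_word(word, confusions, rate, rng):
    """Replace confusable spans with probability `rate`, scanning left to right."""
    sources = sorted(confusions, key=len, reverse=True)  # prefer longer spans
    out, i = [], 0
    while i < len(word):
        for src in sources:
            if word.startswith(src, i) and rng.random() < rate:
                out.append(confusions[src])
                i += len(src)
                break
        else:
            # No substitution applied at this position: copy the character.
            out.append(word[i])
            i += 1
    return "".join(out)

def corrupt_text(text, confusions, rate, seed=0):
    """Corrupt every whitespace-separated word with a seeded random generator."""
    rng = random.Random(seed)
    return " ".join(corrupt_word(w, confusions, rate, rng) for w in text.split())

always = corrupt_text("this is very thin", CONFUSIONS, rate=1.0)  # "dis is wery din"
never = corrupt_text("this is very thin", CONFUSIONS, rate=0.0)   # unchanged
```

Applying such a corruption function to every example of an existing benchmark is, in spirit, how a noised variant of that benchmark can be produced.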

Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

paper_url: http://arxiv.org/abs/2307.03319
repo_url: None
paper_authors: Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson
for: The paper is written for generating gap-focused questions (GFQs) in educational dialogues to create a rich and interactive learning experience.
methods: The paper proposes a model that uses natural language processing techniques to generate GFQs automatically, with a focus on key desired aspects such as relevance, specificity, and engagement.
results: The paper provides an evaluation of the generated questions against human-generated questions, demonstrating competitive performance and the effectiveness of the proposed model in generating GFQs.

Abstract
Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue, a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational experience. We focus on the problem of generating such gap-focused questions (GFQs) automatically. We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these. Finally, we provide an evaluation by human annotators of our generated questions compared against human generated ones, demonstrating competitive performance.

InfoSync: Information Synchronization across Multilingual Semi-structured Tables

paper_url: http://arxiv.org/abs/2307.03313
repo_url: https://github.com/Info-Sync/InfoSync
paper_authors: Siddharth Khincha, Chelsi Jain, Vivek Gupta, Tushar Kataria, Shuo Zhang
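for: This work addresses information synchronization of semi-structured data across languages, for example keeping Wikipedia tables in different languages in sync.
methods: It introduces a new dataset, InfoSync, and a two-step method for tabular synchronization: information alignment, which maps rows, and information update, which fills in missing or outdated information in aligned tables. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (3.5K pairs) is manually annotated.
results: On InfoSync, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information update, human-assisted Wikipedia edits were made to Infoboxes for 603 table pairs; the approach obtains an acceptance rate of 77.28% on Wikipedia, showing its effectiveness.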

Abstract
Information Synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized across languages. To address this problem, we introduce a new dataset InfoSyncC and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (3.5K pairs) are manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing/outdated information for aligned tables across multilingual tables. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en <-> non-en). To evaluate information updation, we perform human-assisted Wikipedia edits on Infoboxes for 603 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method.
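
For readers unfamiliar with how an alignment F1 is computed, a minimal sketch follows. It assumes that alignments are scored as sets of (row in one language, row in the other) pairs, which the abstract does not spell out; the row names are hypothetical.

```python
def alignment_f1(predicted: set, gold: set) -> float:
    """F1 over aligned row pairs: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical English-French Infobox rows.
gold = {("born", "naissance"), ("spouse", "conjoint"), ("occupation", "profession")}
predicted = {("born", "naissance"), ("spouse", "profession")}

score = alignment_f1(predicted, gold)  # precision 1/2, recall 1/3
```

Precision penalizes spurious row matches and recall penalizes missed ones, so a single F1 figure such as the reported 87.91 summarizes both kinds of alignment error.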

Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

paper_url: http://arxiv.org/abs/2307.03296
repo_url: https://github.com/areffarhadi/gammatonegram_cnn_dysarthric_speech
paper_authors: Aref Farhadipour, Hadi Veisi
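for: This study develops convolutional neural network (CNN) based systems for dysarthric speech processing (speech recognition, speaker identification, and intelligibility assessment), motivated by voice-command control in smart homes.
methods: It uses the gammatonegram to convert each speech file into an image and classifies the images with a CNN built by transfer learning on the pre-trained Alexnet.
results: On the UA dataset, the speech recognition system achieves 91.29% accuracy in speaker-dependent mode, the speaker identification system achieves 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieves 96.47% accuracy in two-class mode. A fully automatic multi-network speech recognition system, cascaded with the two-class intelligibility assessment system, reaches 92.3% WRR.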

Abstract
Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems cannot work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.

Performance Comparison of Pre-trained Models for Speech-to-Text in Turkish: Whisper-Small and Wav2Vec2-XLS-R-300M

paper_url: http://arxiv.org/abs/2307.04765
repo_url: None
paper_authors: Oyku Berfin Mercan, Sercan Cepni, Davut Emre Tasar, Sukru Ozan
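for: This study evaluates two pre-trained multilingual speech-to-text models, Whisper-Small and Wav2Vec2-XLS-R-300M, on Turkish.
methods: Both models are fine-tuned on the Turkish portion of Mozilla Common Voice version 11.0, an open-source dataset that contains only a small amount of data.
results: The WER is 0.28 for Wav2Vec2-XLS-R-300M and 0.16 for Whisper-Small. The models were also evaluated on test data prepared from call center recordings that were not included in the training and validation sets.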

Abstract
In this study, the performances of the Whisper-Small and Wav2Vec2-XLS-R-300M models which are two pre-trained multilingual models for speech to text were examined for the Turkish language. Mozilla Common Voice version 11.0 which is prepared in Turkish language and is an open-source data set, was used in the study. The multilingual models, Whisper-Small and Wav2Vec2-XLS-R-300M were fine-tuned with this data set which contains a small amount of data. The speech to text performance of the two models was compared. WER values are calculated as 0.28 and 0.16 for the Wav2Vec2-XLS-R-300M and the Whisper-Small models respectively. In addition, the performances of the models were examined with the test data prepared with call center records that were not included in the training and validation dataset.
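
The WER values above follow the standard definition: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch is shown below; the Turkish example sentence is made up, and the study's own evaluation script is not shown in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One of four reference words is dropped by the (hypothetical) recognizer.
wer = word_error_rate("bugün hava çok güzel", "bugün hava güzel")  # 0.25
```

On this scale, the reported 0.16 for Whisper-Small corresponds to roughly one word error per six reference words, against roughly one per four for Wav2Vec2-XLS-R-300M at 0.28.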

Lost in the Middle: How Language Models Use Long Contexts

paper_url: http://arxiv.org/abs/2307.03172
repo_url: https://github.com/nelson-liu/lost-in-the-middle
paper_authors: Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
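for: This work examines how well language models use long input contexts.
methods: It analyzes model performance on two tasks that require identifying relevant information within the input context: multi-document question answering and key-value retrieval.
results: Performance is often highest when the relevant information occurs at the beginning or end of the input context, and degrades significantly when models must access relevant information in the middle of long contexts. Performance also decreases substantially as the input context grows longer, even for explicitly long-context models. The analysis gives a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.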

Abstract
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze language model performance on two tasks that require identifying relevant information within their input contexts: multi-document question answering and key-value retrieval. We find that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context models.
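
The position-controlled setup behind the key-value retrieval task can be sketched as below. This is an illustrative reconstruction, not the benchmark's code: the key and value formats, the prompt wording, and the function names are assumptions.

```python
import json

def order_pairs(pairs, query_key, position):
    """Return the pairs with the queried pair moved to index `position`."""
    target = next(pair for pair in pairs if pair[0] == query_key)
    distractors = [pair for pair in pairs if pair[0] != query_key]
    return distractors[:position] + [target] + distractors[position:]

def build_prompt(pairs, query_key, position):
    """Serialize the reordered pairs as JSON and append the retrieval question."""
    ordered = order_pairs(pairs, query_key, position)
    context = json.dumps(dict(ordered), indent=1)
    return f'{context}\n\nWhat is the value associated with "{query_key}"?'

# Five toy pairs; real experiments would use many more, with random identifiers.
pairs = [(f"key-{i}", f"value-{i}") for i in range(5)]

middle = order_pairs(pairs, "key-0", position=2)
prompt = build_prompt(pairs, "key-0", position=2)
```

Sweeping `position` from the first to the last index while holding the pairs fixed isolates the effect of where the relevant information sits, which is how a drop in accuracy for middle positions can be measured.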

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

paper_url: http://arxiv.org/abs/2307.03132
repo_url: https://github.com/locuslab/t-mars
paper_authors: Pratyush Maini, Sachin Goyal, Zachary C. Lipton, J. Zico Kolter, Aditi Raghunathan
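for: This paper proposes a new data filtering approach that improves the learning of visual representations in computer vision.
methods: The approach, T-MARS (Text Masking and Re-Scoring), first masks out the text in each image and then filters image-caption pairs by the CLIP similarity score of the masked image.
results: On the "medium scale" of the DataComp data filtering benchmark, T-MARS outperforms the top-ranked method by 6.5% on ImageNet and 4.7% on VTAB. Across data pool sizes from 2M to 64M, the accuracy gains of T-MARS increase linearly as data and compute are scaled exponentially.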

Abstract
Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS.
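
The mask-then-rescore filter described above can be sketched as below. This is a toy illustration under stated assumptions, not the T-MARS implementation: `filter_pairs`, `toy_mask_text`, and `toy_clip_score` are hypothetical names, the "images" are plain dictionaries, and the word-overlap score merely stands in for a real CLIP similarity. In practice the two callables would wrap a text detector with inpainting and a CLIP model.

```python
def filter_pairs(pairs, mask_text, clip_score, threshold):
    """Keep (image, caption) pairs whose text-masked image still matches the caption.

    `mask_text` and `clip_score` are injected callables (hypothetical
    stand-ins for a text-masking step and a CLIP similarity function).
    """
    kept = []
    for image, caption in pairs:
        masked = mask_text(image)
        # Re-score on the masked image: a low score means the caption
        # matched mainly because of text rendered inside the image.
        if clip_score(masked, caption) >= threshold:
            kept.append((image, caption))
    return kept


def toy_mask_text(image):
    """Toy masking: drop the rendered text, keep the visual concepts."""
    return {"visual": image["visual"], "text": set()}


def toy_clip_score(image, caption):
    """Toy similarity: fraction of caption words present in the image content."""
    words = set(caption.lower().split())
    content = image["visual"] | image["text"]
    return len(words & content) / len(words) if words else 0.0
```

With these stand-ins, an image that contains only caption text is dropped after masking, while an image with both overlapping text and matching visual content is kept, which is the distinction the abstract draws against naively removing every image that contains text.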

BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training

paper_url: http://arxiv.org/abs/2307.03131
repo_url: https://github.com/powerpuffpomelo/fairseq_mrt
paper_authors: Yiming Yan, Tao Wang, Chengqi Zhao, Shujian Huang, Jiajun Chen, Mingxuan Wang
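for: This study systematically analyzes and compares mainstream and cutting-edge automatic evaluation metrics from the perspective of their guidance for training machine translation systems.
methods: Using Minimum Risk Training (MRT), the study finds that certain metrics exhibit robustness defects, such as universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes: distribution biases in the training datasets and the tendency of the metric paradigm. Token-level constraints are incorporated to make the evaluation metrics more robust.
results: Improving the robustness of the evaluation metrics in turn improves the performance of machine translation systems. Code is available at https://github.com/powerpuffpomelo/fairseq_mrt.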

Abstract
Automatic metrics play a crucial role in machine translation. Despite the widespread use of n-gram-based metrics, there has been a recent surge in the development of pre-trained model-based metrics that focus on measuring sentence semantics. However, these neural metrics, while achieving higher correlations with human evaluations, are often considered to be black boxes with potential biases that are difficult to detect. In this study, we systematically analyze and compare various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems. Through Minimum Risk Training (MRT), we find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore. In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm. By incorporating token-level constraints, we enhance the robustness of evaluation metrics, which in turn leads to an improvement in the performance of machine translation systems. Codes are available at https://github.com/powerpuffpomelo/fairseq_mrt.
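
For readers unfamiliar with MRT, the objective can be sketched as an expected cost over a set of sampled candidate translations. The code below follows the commonly used formulation (a distribution over candidates renormalized with a sharpness factor, and cost taken as one minus the metric score); the abstract does not state the exact configuration used in this paper, so the function name, the `alpha` default, and the cost definition are assumptions.

```python
import math


def mrt_expected_risk(log_probs, metric_scores, alpha=1.0):
    """Expected risk over candidate translations of one source sentence.

    log_probs:     model log-probability of each candidate.
    metric_scores: metric score of each candidate in [0, 1] (higher is better).
    alpha:         sharpness of the renormalized candidate distribution
                   (default chosen arbitrarily for this sketch).
    """
    scaled = [alpha * lp for lp in log_probs]
    peak = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    q = [w / total for w in weights]
    # Cost is assumed to be 1 - metric score; minimizing it raises the score.
    return sum(qi * (1.0 - score) for qi, score in zip(q, metric_scores))
```

Because the training signal comes entirely from `metric_scores`, a candidate that the metric scores highly regardless of the source lowers the risk for every sentence, which is how a universal adversarial translation can emerge under this objective.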

VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering

paper_url: http://arxiv.org/abs/2307.03130
repo_url: None
paper_authors: Zijun Yao, Yuanyong Chen, Xin Lv, Shulin Cao, Amy Xin, Jifan Yu, Hailong Jin, Jianjun Xu, Peng Zhang, Lei Hou, Juanzi Li
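for: This paper presents the Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that converts natural language questions into knowledge oriented program language (KoPL) and maps KoPL programs to graphical elements, so that knowledge base (KB) queries can be edited and debugged through graphical operations.
methods: A neural program induction module converts natural language questions into KoPL programs, and a highly efficient KoPL execution engine supports practical KBQA on a large-scale knowledge base.
results: Experiments show that VisKoP is highly efficient and that user interaction can fix a large portion of wrong KoPL programs to obtain the correct answer.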

Abstract
We present Visual Knowledge oriented Programming platform (VisKoP), a knowledge base question answering (KBQA) system that integrates human into the loop to edit and debug the knowledge base (KB) queries. VisKoP not only provides a neural program induction module, which converts natural language questions into knowledge oriented program language (KoPL), but also maps KoPL programs into graphical elements. KoPL programs can be edited with simple graphical operators, such as dragging to add knowledge operators and slot filling to designate operator arguments. Moreover, VisKoP provides auto-completion for its knowledge base schema and users can easily debug the KoPL program by checking its intermediate results. To facilitate the practical KBQA on a million-entity-level KB, we design a highly efficient KoPL execution engine for the back-end. Experiment results show that VisKoP is highly efficient and user interaction can fix a large portion of wrong KoPL programs to acquire the correct answer. The VisKoP online demo https://demoviskop.xlore.cn (Stable release of this paper) and https://viskop.xlore.cn (Beta release with new features), highly efficient KoPL engine https://pypi.org/project/kopl-engine, and screencast video https://youtu.be/zAbJtxFPTXo are now publicly available.

PREADD: Prefix-Adaptive Decoding for Controlled Text Generation

paper_url: http://arxiv.org/abs/2307.03214
repo_url: https://github.com/jonnypei/acl23-preadd
paper_authors: Jonathan Pei, Kevin Yang, Dan Klein
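for: controlled text generation
methods: Prefix-Adaptive Decoding (PREADD)
results: On three tasks (toxic output mitigation, gender bias reduction, and sentiment control), PREADD outperforms both prompting baselines and an auxiliary-expert control method by 12% or more in relative gain.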

Abstract
We propose Prefix-Adaptive Decoding (PREADD), a flexible method for controlled text generation. Unlike existing methods that use auxiliary expert models to control for attributes, PREADD does not require an external model, instead relying on linearly combining output logits from multiple prompts. Specifically, PREADD contrasts the output logits generated using a raw prompt against those generated using a prefix-prepended prompt, enabling both positive and negative control with respect to any attribute encapsulated by the prefix. We evaluate PREADD on three tasks -- toxic output mitigation, gender bias reduction, and sentiment control -- and find that PREADD outperforms not only prompting baselines, but also an auxiliary-expert control method, by 12% or more in relative gain on our main metrics for each task.
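
The logit contrast described above can be sketched in a few lines. The abstract says only that the logits from a raw prompt and a prefix-prepended prompt are linearly combined, so the specific rule below (raw logits plus a scaled difference) and the names `combine_logits`, `softmax`, and `alpha` are assumptions made for illustration, not the paper's stated formula.

```python
import math


def combine_logits(raw_logits, prefix_logits, alpha):
    """Linearly combine next-token logits from two prompts (assumed rule).

    alpha > 0 moves the distribution toward what the prefix encourages;
    alpha < 0 moves it away; alpha = 0 recovers the raw prompt.
    """
    return [r + alpha * (p - r) for r, p in zip(raw_logits, prefix_logits)]


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this assumed rule, a prefix that encourages an undesired attribute combined with a negative `alpha` lowers the probability of tokens the prefix favors, which corresponds to the negative control the abstract mentions; no external expert model is involved, only two forward passes of the same model.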

Extracting Multi-valued Relations from Language Models

paper_url: http://arxiv.org/abs/2307.03122
repo_url: https://github.com/snehasinghania/multi_valued_slot_filling
paper_authors: Sneha Singhania, Simon Razniewski, Gerhard Weikum
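for: This paper investigates whether multi-object relational knowledge can be extracted from latent language representations.
methods: To rank candidate objects, the paper evaluates existing prompting techniques and proposes new ones that incorporate domain knowledge.
results: Selecting objects whose likelihood is above a learned relation-specific threshold gives a 49.5% F1 score. These results show that using LMs for the multi-valued slot-filling task is challenging and motivate further research on extracting relational knowledge from latent language representations.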

Abstract
The widespread usage of latent language representations via pre-trained language models (LMs) suggests that they are a promising source of structured knowledge. However, existing methods focus only on a single object per subject-relation pair, even though often multiple objects are correct. To overcome this limitation, we analyze these representations for their potential to yield materialized multi-object relational knowledge. We formulate the problem as a rank-then-select task. For ranking candidate objects, we evaluate existing prompting techniques and propose new ones incorporating domain knowledge. Among the selection methods, we find that choosing objects with a likelihood above a learned relation-specific threshold gives a 49.5% F1 score. Our results highlight the difficulty of employing LMs for the multi-valued slot-filling task and pave the way for further research on extracting relational knowledge from latent language representations.
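
The rank-then-select formulation above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function names, the grid search used to "learn" the threshold, and the example likelihoods are all assumptions.

```python
def select_objects(candidates, threshold):
    """Rank candidate objects by likelihood, then keep those above the threshold.

    candidates: dict mapping object name -> likelihood from the language model.
    """
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [obj for obj, score in ranked if score > threshold]


def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold objects."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(examples, grid):
    """Pick the grid value with the best mean F1 for one relation.

    examples: list of (candidates, gold_objects) pairs for that relation.
    A simple grid search is assumed here; the abstract does not say how
    the threshold is learned.
    """
    def mean_f1(threshold):
        scores = [f1_score(select_objects(c, threshold), g) for c, g in examples]
        return sum(scores) / len(scores)

    return max(grid, key=mean_f1)
```

Running `learn_threshold` separately for each relation yields one cut-off per relation, so a relation with many correct objects (for example, the languages spoken in a country) can keep more candidates than a relation that is usually single-valued.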

KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding

paper_url: http://arxiv.org/abs/2307.03115
repo_url: https://github.com/thu-keg/korc
paper_authors: Zijun Yao, Yantao Liu, Xin Lv, Shulin Cao, Jifan Yu, Lei Hou, Juanzi Li
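for: This paper proposes a new benchmark for assessing deep text understanding.
methods: Massive knowledge bases are used to guide annotators or large language models (LLMs) in constructing knowledgeable questions, and labels in the knowledge bases serve as the final answers.
results: The strongest baseline achieves only 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, respectively, indicating that deep text understanding is still an unsolved challenge.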

Abstract
Deep text understanding, which requires the connections between a given document and prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have encountered two major limitations. On the one hand, most of them require human annotation of knowledge, which leads to limited knowledge coverage. On the other hand, they usually use choices or spans in the texts as the answers, which results in narrow answer space. To overcome these limitations, we build a new challenging benchmark named KoRC in this paper. Compared with previous benchmarks, KoRC has two advantages, i.e., broad knowledge coverage and flexible answer format. Specifically, we utilize massive knowledge bases to guide annotators or large language models (LLMs) to construct knowledgeable questions. Moreover, we use labels in knowledge bases rather than spans or choices as the final answers. We test state-of-the-art models on KoRC and the experimental results show that the strongest baseline only achieves 68.3% and 30.0% F1 measure in the in-distribution and out-of-distribution test set, respectively. These results indicate that deep text understanding is still an unsolved challenge. The benchmark dataset, leaderboard, and baseline methods are released in https://github.com/THU-KEG/KoRC.
