results: The study finds that further self-supervised training on classical-text corpora improves the models' performance on downstream tasks, and that font choice, corpus scale, and initial model selection all significantly influence the experimental outcomes.
Abstract
In the context of the rapid development of large language models, we have meticulously trained and introduced the GujiBERT and GujiGPT language models, which are foundational models specifically designed for intelligent information processing of ancient texts. These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters, allowing them to effectively handle various natural language processing tasks related to ancient books, including but not limited to automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation. Notably, these models have exhibited exceptional performance across a range of validation tasks using publicly available datasets. Our research findings highlight the efficacy of employing self-supervised methods to further train the models using classical text corpora, thus enhancing their capability to tackle downstream tasks. Moreover, it is worth emphasizing that the choice of font, the scale of the corpus, and the initial model selection all exert significant influence over the ultimate experimental outcomes. To cater to the diverse text processing preferences of researchers in digital humanities and linguistics, we have developed three distinct categories comprising a total of nine model variations. We believe that by sharing these foundational language models specialized in the domain of ancient texts, we can facilitate the intelligent processing and scholarly exploration of ancient literary works and, consequently, contribute to the global dissemination of China's rich and esteemed traditional culture in this new era.
Explaining Competitive-Level Programming Solutions using LLMs
results: The explanation generation method produces a structured explanation for each problem, containing descriptions and analysis. Experiments on the CodeContests dataset show that while GPT-3.5 and GPT-4 are comparable at describing solutions, GPT-4 better understands the key idea behind a solution.
Abstract
In this paper, we approach competitive-level programming problem-solving as a composite task of reasoning and code generation. We propose a novel method to automatically annotate natural language explanations to \textit{<problem, solution>} pairs. We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining solutions. Our explanation generation methodology can generate a structured solution explanation for the problem containing descriptions and analysis. To evaluate the quality of the annotated explanations, we examine their effectiveness in two aspects: 1) satisfying the human programming expert who authored the oracle solution, and 2) aiding LLMs in solving problems more effectively. The experimental results on the CodeContests dataset demonstrate that while GPT-3.5's and GPT-4's abilities in describing the solution are comparable, GPT-4 shows a better understanding of the key idea behind the solution.
results: Experiments on popular open-domain dialogue datasets, evaluated with both automated metrics and human evaluation, show that the method outperforms prompting baselines and is comparable to fine-tuning while using only 5%-6% of the total parameters.
Abstract
Prompt-tuning has become an increasingly popular parameter-efficient method for adapting large pretrained language models to downstream tasks. However, both discrete prompting and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in some tasks such as open-domain dialogue generation. In this paper, we present a novel, instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control code, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated on both automated metrics and human evaluation, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning with only 5%-6% of total parameters.
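The instance-specific idea can be sketched in a few lines: instead of one fixed soft prompt per task, a small generator maps an instance-level control code to a sequence of prompt vectors that are prepended to the input embeddings. The numpy sketch below uses made-up dimensions and random weights as stand-ins; it is not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the paper's actual dimensions are not given here.
NUM_CODES, PROMPT_LEN, D_MODEL = 4, 3, 8

# A tiny "prompt generator": an embedding table for control codes plus a
# linear projection that maps one code vector to PROMPT_LEN prompt vectors.
code_table = rng.normal(size=(NUM_CODES, D_MODEL))
proj = rng.normal(size=(D_MODEL, PROMPT_LEN * D_MODEL))

def generate_prompt(code_id: int) -> np.ndarray:
    """Map an instance-level control code to a sequence of soft prompt vectors."""
    code_vec = code_table[code_id]                 # (D_MODEL,)
    return (code_vec @ proj).reshape(PROMPT_LEN, D_MODEL)

def prepend_prompt(code_id: int, token_embs: np.ndarray) -> np.ndarray:
    """Prepend the instance-specific prompt to the input token embeddings."""
    return np.concatenate([generate_prompt(code_id), token_embs], axis=0)

tokens = rng.normal(size=(5, D_MODEL))             # a 5-token input
augmented = prepend_prompt(code_id=2, token_embs=tokens)
print(augmented.shape)                             # (8, 8): 3 prompt vectors + 5 tokens
```

Because the prompt depends on the control code, two instances with different codes receive different soft prompts while sharing the same frozen backbone.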
Better Handling Coreference Resolution in Aspect Level Sentiment Classification by Fine-Tuning Language Models
results: The study finds that the approach improves the LLM's coreference resolution ability, and releases a new dataset focused on coreference resolution in ALSC.
Abstract
Customer feedback is invaluable to companies as they refine their products. Monitoring customer feedback can be automated with Aspect Level Sentiment Classification (ALSC) which allows us to analyse specific aspects of the products in reviews. Large Language Models (LLMs) are the heart of many state-of-the-art ALSC solutions, but they perform poorly in some scenarios requiring Coreference Resolution (CR). In this work, we propose a framework to improve an LLM's performance on CR-containing reviews by fine-tuning on highly inferential tasks. We show that the performance improvement is likely attributed to the improved model CR ability. We also release a new dataset that focuses on CR in ALSC.
Writer adaptation for offline text recognition: An exploration of neural network-based methods
paper_authors: Tobias van der Werff, Maruf A. Dhali, Lambert Schomaker
for: The paper aims to improve the adaptivity of handwritten text recognition (HTR) models so that they can better handle new writing styles.
methods: Two approaches are used to make HTR models writer-adaptive: 1) model-agnostic meta-learning (MAML), an algorithm commonly used for tasks such as few-shot classification, and 2) writer codes, an idea originating from automatic speech recognition.
results: An HTR-specific version of MAML, MetaHTR, improves performance over the baseline by 1.4 to 2.0 word error rate (WER). The gain from writer adaptation is between 0.2 and 0.7 WER, and deeper models appear better suited to adaptation with MetaHTR. However, applying MetaHTR to larger HTR models or sentence-level HTR may become prohibitively expensive in computation and memory. Finally, writer codes based on learned features or Hinge statistical features did not improve recognition performance.
Abstract
Handwriting recognition has seen significant success with the use of deep learning. However, a persistent shortcoming of neural networks is that they are not well-equipped to deal with shifting data distributions. In the field of handwritten text recognition (HTR), this shows itself in poor recognition accuracy for writers that are not similar to those seen during training. An ideal HTR model should be adaptive to new writing styles in order to handle the vast amount of possible writing styles. In this paper, we explore how HTR models can be made writer adaptive by using only a handful of examples from a new writer (e.g., 16 examples) for adaptation. Two HTR architectures are used as base models, using a ResNet backbone along with either an LSTM or Transformer sequence decoder. Using these base models, two methods are considered to make them writer adaptive: 1) model-agnostic meta-learning (MAML), an algorithm commonly used for tasks such as few-shot classification, and 2) writer codes, an idea originating from automatic speech recognition. Results show that an HTR-specific version of MAML known as MetaHTR improves performance compared to the baseline with a 1.4 to 2.0 improvement in word error rate (WER). The improvement due to writer adaptation is between 0.2 and 0.7 WER, where a deeper model seems to lend itself better to adaptation using MetaHTR than a shallower model. However, applying MetaHTR to larger HTR models or sentence-level HTR may become prohibitive due to its high computational and memory requirements. Lastly, writer codes based on learned features or Hinge statistical features did not lead to improved recognition performance.
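The MAML-style adaptation described above can be illustrated on a toy problem: treat each writer as a task, run a few inner gradient steps on a small support set (e.g. 16 examples), and meta-update the shared initialization from the post-adaptation query gradients. The linear-regression setup below is a hedged first-order sketch, not the MetaHTR recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, X, y):
    """Gradient of mean squared error for the linear model y ≈ X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

def adapt(w, X_sup, y_sup, inner_lr=0.05, inner_steps=3):
    """Inner loop: a few gradient steps on the new writer's support set."""
    w = w.copy()
    for _ in range(inner_steps):
        w -= inner_lr * loss_grad(w, X_sup, y_sup)
    return w

def maml_step(w, tasks, outer_lr=0.01):
    """Outer loop (first-order MAML): average post-adaptation query gradients."""
    meta_grad = np.zeros_like(w)
    for X_sup, y_sup, X_qry, y_qry in tasks:
        w_adapt = adapt(w, X_sup, y_sup)
        meta_grad += loss_grad(w_adapt, X_qry, y_qry)   # first-order approximation
    return w - outer_lr * meta_grad / len(tasks)

def make_writer_task(n=16):
    """Each 'writer' is a linear task with its own parameters; n support examples."""
    w_true = rng.normal(size=3)
    X = rng.normal(size=(2 * n, 3))
    y = X @ w_true
    return X[:n], y[:n], X[n:], y[n:]

w = np.zeros(3)
for _ in range(100):
    w = maml_step(w, [make_writer_task() for _ in range(4)])

# A few inner steps on an unseen writer's 16 support examples reduce its loss.
X_sup, y_sup, X_qry, y_qry = make_writer_task()
w_adapted = adapt(w, X_sup, y_sup)
loss_before = np.mean((X_sup @ w - y_sup) ** 2)
loss_after = np.mean((X_sup @ w_adapted - y_sup) ** 2)
```

Full MAML differentiates through the inner loop; the first-order variant above drops those second-order terms, which is also what makes MetaHTR's memory cost a concern at larger model sizes.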
Mao-Zedong At SemEval-2023 Task 4: Label Representation Multi-Head Attention Model With Contrastive Learning-Enhanced Nearest Neighbor Mechanism For Multi-Label Text Classification
results: The method achieved an F1 score of 0.533 on the test set and ranked fourth on the leaderboard.
Abstract
The study of human values is essential in both practical and theoretical domains. With the development of computational linguistics, the creation of large-scale datasets has made it possible to automatically recognize human values accurately. SemEval 2023 Task 4\cite{kiesel:2023} provides a set of arguments and 20 types of human values that are implicitly expressed in each argument. In this paper, we present our team's solution. We use the Roberta\cite{liu_roberta_2019} model to obtain the word vector encoding of the document and propose a multi-head attention mechanism to establish connections between specific labels and semantic components. Furthermore, we use a contrastive learning-enhanced K-nearest neighbor mechanism\cite{su_contrastive_2022} to leverage existing instance information for prediction. Our approach achieved an F1 score of 0.533 on the test set and ranked fourth on the leaderboard.
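The label-wise attention idea can be sketched as follows: each of the 20 value labels owns a query vector that attends over the RoBERTa-style token encodings, yielding one representation per label that is scored with an independent sigmoid for multi-label prediction. This is a single-head numpy sketch with toy dimensions; all weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LABELS, SEQ_LEN, D = 20, 10, 16   # 20 human-value labels; toy dimensions

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each label gets its own query vector; token encodings (e.g. from RoBERTa)
# serve as keys and values, so each label attends to its own evidence.
label_queries = rng.normal(size=(NUM_LABELS, D))
W_k = rng.normal(size=(D, D))
W_v = rng.normal(size=(D, D))
w_out = rng.normal(size=D)

def label_attention_scores(token_encodings: np.ndarray) -> np.ndarray:
    K = token_encodings @ W_k                          # (SEQ_LEN, D)
    V = token_encodings @ W_v
    attn = softmax(label_queries @ K.T / np.sqrt(D))   # (NUM_LABELS, SEQ_LEN)
    label_repr = attn @ V                              # one vector per label
    return 1 / (1 + np.exp(-(label_repr @ w_out)))     # independent sigmoid per label

tokens = rng.normal(size=(SEQ_LEN, D))
probs = label_attention_scores(tokens)
```

Because the sigmoids are independent, several labels can fire at once, which is what multi-label classification requires; the kNN mechanism in the paper then refines these scores with retrieved training instances.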
Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
paper_authors: Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima López, Eulália Farré-Maduell, Luis Gasco, Martin Krallinger, Georgios Paliouras
results: 28 teams submitted results from more than 150 distinct systems. Most of the systems achieved competitive performance, indicating continued advancement of the state of the art in the field.
Abstract
This is an overview of the eleventh edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2023. BioASQ is a series of international challenges promoting advances in large-scale biomedical semantic indexing and question answering. This year, BioASQ consisted of new editions of the two established tasks b and Synergy, and a new task (MedProcNER) on semantic annotation of clinical content in Spanish with medical procedures, which have a critical role in medical practice. In this edition of BioASQ, 28 competing teams submitted the results of more than 150 distinct systems in total for the three different shared tasks of the challenge. Similarly to previous editions, most of the participating systems achieved competitive performance, suggesting the continuous advancement of the state-of-the-art in the field.
Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark
results: Experimental results show that human performance greatly outperforms state-of-the-art language models on the Detective Reasoning Benchmark, and the Self-Question prompt framework proves to be the most effective prompt engineering for improving language models' informal reasoning ability.
Abstract
Informal reasoning ability is the ability to reason based on common sense, experience, and intuition. Humans use informal reasoning every day to extract the most influential elements for their decision-making from a large amount of life-like information. With the rapid development of language models, the realization of general artificial intelligence has emerged with hope. Given the outstanding informal reasoning ability of humans, how much informal reasoning ability language models possess has not been well studied. To explore the gap between humans and language models in informal reasoning ability, this paper constructs a Detective Reasoning Benchmark, an assembly of 1,200 questions gathered from accessible online resources that aims to evaluate a model's informal reasoning ability in real-life contexts. Since improvement of a model's informal reasoning ability has been restricted by the lack of such a benchmark, we further propose a Self-Question Prompt Framework that mimics human thinking to enhance the model's informal reasoning ability. The goals of self-question are to find key elements, deeply investigate the connections between these elements, relate each element to the problem, and finally require the model to reasonably answer the problem. The experimental results show that human performance greatly outperforms SoTA language models on the Detective Reasoning Benchmark. Besides, Self-Question is proven to be the most effective prompt engineering for improving GPT-4's informal reasoning ability, yet it still does not surpass even the lowest score achieved by human participants. Upon acceptance of the paper, the source code for the benchmark will be made publicly accessible.
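The four stated goals of self-question (find key elements, investigate their connections, relate each element to the problem, then answer) map naturally onto a staged prompt. The wording below is illustrative, not the paper's actual template:

```python
# Hedged sketch of the Self-Question prompting idea; the four stages follow
# the goals stated in the abstract, but the phrasing is our own.
SELF_QUESTION_STEPS = [
    "1. What are the key elements in this puzzle?",
    "2. How are these elements connected to each other?",
    "3. How does each element relate to the question being asked?",
    "4. Given the above, what is the most reasonable answer?",
]

def build_self_question_prompt(puzzle: str) -> str:
    return "\n".join([puzzle, "", "Answer step by step:"] + SELF_QUESTION_STEPS)

prompt = build_self_question_prompt(
    "A locked room, a wet floor, and a broken window: what happened?"
)
print(prompt)
```

The staged questions force the model to surface evidence before committing to a conclusion, which is the informal-reasoning behavior the benchmark probes.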
for: The paper aims to address the lack of a diverse and high-quality corpus for Bangla NLP tasks, which has hindered the development of state-of-the-art NLP models for the language.
methods: The authors build a corpus of Bangla literature, called Vacaspati, by collecting literary works from various websites and leveraging the public availability of these works without copyright violations or restrictions. They also build a word embedding model, Vac-FT, using FastText, and an Electra model, Vac-BERT, using the corpus.
results: The authors show that Vac-FT outperforms other FastText-based models on multiple downstream tasks, and Vac-BERT performs either better or similar to other state-of-the-art transformer models, despite having far fewer parameters and requiring fewer resources. They also demonstrate the efficacy of Vacaspati as a corpus by showing that models built from other corpora are not as effective.
Abstract
Bangla (or Bengali) is the fifth most spoken language globally; yet, the state-of-the-art NLP in Bangla is lagging for even simple tasks such as lemmatization, POS tagging, etc. This is partly due to lack of a varied quality corpus. To alleviate this need, we build Vacaspati, a diverse corpus of Bangla literature. The literary works are collected from various websites; only those works that are publicly available without copyright violations or restrictions are collected. We believe that published literature captures the features of a language much better than newspapers, blogs or social media posts which tend to follow only a certain literary pattern and, therefore, miss out on language variety. Our corpus Vacaspati is varied from multiple aspects, including type of composition, topic, author, time, space, etc. It contains more than 11 million sentences and 115 million words. We also built a word embedding model, Vac-FT, using FastText from Vacaspati as well as trained an Electra model, Vac-BERT, using the corpus. Vac-BERT has far fewer parameters and requires only a fraction of resources compared to other state-of-the-art transformer models and yet performs either better or similar on various downstream tasks. On multiple downstream tasks, Vac-FT outperforms other FastText-based models. We also demonstrate the efficacy of Vacaspati as a corpus by showing that similar models built from other corpora are not as effective. The models are available at https://bangla.iitk.ac.in/.
Argumentative Segmentation Enhancement for Legal Summarization
results: The study shows that this technique generates higher-quality argumentative summaries while leaving out less relevant background information.
Abstract
We use the combination of argumentative zoning [1] and a legal argumentative scheme to create legal argumentative segments. Based on the argumentative segmentation, we propose a novel task of classifying argumentative segments of legal case decisions. GPT-3.5 is used to generate summaries based on argumentative segments. In terms of automatic evaluation metrics, our method generates higher quality argumentative summaries while leaving out less relevant context as compared to GPT-4 and non-GPT models.
Separate-and-Aggregate: A Transformer-based Patch Refinement Model for Knowledge Graph Completion
results: Experiments on four popular KGC benchmarks show that PatReFormer achieves significant performance gains over existing KGC methods on standard evaluation metrics such as MRR and H@n. The analysis confirms the effectiveness of the design choices, shows that PatReFormer better captures KG information at large relation-embedding dimensions, and demonstrates its strength on complex relation types compared with other KGC models.
Abstract
Knowledge graph completion (KGC) is the task of inferring missing facts from any given knowledge graph (KG). Previous KGC methods typically represent knowledge graph entities and relations as trainable continuous embeddings and fuse the embeddings of the entity $h$ (or $t$) and relation $r$ into hidden representations of query $(h, r, ?)$ (or $(?, r, t)$) to approximate the missing entities. To achieve this, they either use shallow linear transformations or deep convolutional modules. However, the linear transformations suffer from the expressiveness issue while the deep convolutional modules introduce unnecessary inductive bias, which could potentially degrade the model performance. Thus, we propose a novel Transformer-based Patch Refinement Model (PatReFormer) for KGC. PatReFormer first segments the embedding into a sequence of patches and then employs cross-attention modules to allow bi-directional embedding feature interaction between the entities and relations, leading to a better understanding of the underlying KG. We conduct experiments on four popular KGC benchmarks, WN18RR, FB15k-237, YAGO37 and DB100K. The experimental results show significant performance improvement from existing KGC methods on standard KGC evaluation metrics, e.g., MRR and H@n. Our analysis first verifies the effectiveness of our model design choices in PatReFormer. We then find that PatReFormer can better capture KG information from a large relation embedding dimension. Finally, we demonstrate that the strength of PatReFormer is at complex relation types, compared to other KGC models.
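The separate-and-aggregate idea can be sketched in a few lines: each embedding is segmented into a sequence of patches, and cross-attention lets entity patches and relation patches interact in both directions before being flattened back. Projections and the multi-head machinery are omitted here, and the dimensions are toy values, so this is only an illustration of the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

D, P = 12, 4                       # embedding dim, patch size -> 3 patches each

def to_patches(emb: np.ndarray) -> np.ndarray:
    """Segment a flat embedding into a sequence of patches."""
    return emb.reshape(-1, P)      # (D // P, P)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries: np.ndarray, keys_values: np.ndarray) -> np.ndarray:
    """Single-head cross-attention; learned projections omitted for brevity."""
    attn = softmax(queries @ keys_values.T / np.sqrt(P))
    return attn @ keys_values

entity = rng.normal(size=D)        # embedding of h (or t)
relation = rng.normal(size=D)      # embedding of r

e_patch, r_patch = to_patches(entity), to_patches(relation)
# Bidirectional interaction: entity patches attend to relation patches and
# vice versa; the refined patches are then flattened back (the "aggregate" step).
e_refined = cross_attend(e_patch, r_patch).reshape(-1)
r_refined = cross_attend(r_patch, e_patch).reshape(-1)
```

Attending patch-to-patch rather than fusing whole vectors is what gives the model fine-grained feature interaction without a convolutional inductive bias.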
Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
results: In the zero-shot setting, NLI models perform poorly, especially on modified sentences with negation and existential quantifiers. Even after fine-tuning, the models continue to perform poorly on negation, existential, and universal modifiers.
Abstract
We introduce a synthetic dataset called Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates the performance of Natural Language Inference (NLI) models to understand compositionality in logic. We produce 1,304 sentence pairs by modifying 15 examples from the SICK dataset (Marelli et al., 2014). To this end, we modify the original texts using a set of phrases - modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We use these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotate these modified texts with the corresponding entailment labels following NL rules. We conduct a preliminary verification of how well the change in the structural and semantic composition is captured by neural NLI models, in both zero-shot and fine-tuned scenarios. We found that the performance of NLI models under the zero-shot setting is poor, especially for modified sentences with negation and existential quantifiers. After fine-tuning this dataset, we observe that models continue to perform poorly over negation, existential and universal modifiers.
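The construction can be illustrated with a minimal sketch: a sentence is split into subject, verb, and object parts, and Natural Logic-style modifiers (universal and existential quantifiers, negation) are applied to individual parts. The modifier inventory below is illustrative, not the paper's exact set of phrases.

```python
# Hedged sketch of the SICCK-style construction: modify the subject's
# determiner (quantifiers) or the verb phrase (negation) of a base sentence.

def render(subj_det: str, parts: tuple, negate: bool = False) -> str:
    subj, verb, obj = parts
    if negate:                                     # negation modifier on the verb phrase
        verb = verb.replace("is ", "is not ", 1)
    return f"{subj_det} {subj} {verb} {obj}."

base = ("dog", "is chasing", "the ball")

premise = render("The", base)                      # unmodified premise
hyp_universal = render("Every", base)              # universal quantifier on subject
hyp_existential = render("Some", base)             # existential quantifier on subject
hyp_negated = render("The", base, negate=True)     # negated verb phrase
```

Each modified pair is then labeled with its entailment relation following Natural Logic rules, e.g. the unmodified premise does not entail the negated hypothesis.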
results: The LookAhead technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
Abstract
RNN-Transducers (RNN-Ts) have gained widespread acceptance as an end-to-end model for speech to text conversion because of their high accuracy and streaming capabilities. A typical RNN-T independently encodes the input audio and the text context, and combines the two encodings by a thin joint network. While this architecture provides SOTA streaming accuracy, it also makes the model vulnerable to strong LM biasing which manifests as multi-step hallucination of text without acoustic evidence. In this paper we propose LookAhead that makes text representations more acoustically grounded by looking ahead into the future within the audio input. This technique yields a significant 5%-20% relative reduction in word error rate on both in-domain and out-of-domain evaluation sets.
DyCL: Dynamic Neural Network Compilation Via Program Rewriting and Graph Optimization
results: Experiments show that the approach successfully compiles all of the dynamic neural networks tested, and the generated executables exhibit significantly improved performance, running between 1.12x and 20.21x faster than the originals.
Abstract
DL compiler's primary function is to translate DNN programs written in high-level DL frameworks such as PyTorch and TensorFlow into portable executables. These executables can then be flexibly executed by the deployed host programs. However, existing DL compilers rely on a tracing mechanism, which involves feeding a runtime input to a neural network program and tracing the program execution paths to generate the computational graph necessary for compilation. Unfortunately, this mechanism falls short when dealing with modern dynamic neural networks (DyNNs) that possess varying computational graphs depending on the inputs. Consequently, conventional DL compilers struggle to accurately compile DyNNs into executable code. To address this limitation, we propose DyCL, a general approach that enables any existing DL compiler to successfully compile DyNNs. DyCL tackles the dynamic nature of DyNNs by introducing a compilation mechanism that redistributes the control and data flow of the original DNN programs during the compilation process. Specifically, DyCL develops program analysis and program transformation techniques to convert a dynamic neural network into multiple sub-neural networks. Each sub-neural network is devoid of conditional statements and is compiled independently. Furthermore, DyCL synthesizes a host module that models the control flow of the DyNNs and facilitates the invocation of the sub-neural networks. Our evaluation demonstrates the effectiveness of DyCL, achieving a 100\% success rate in compiling all dynamic neural networks. Moreover, the compiled executables generated by DyCL exhibit significantly improved performance, running between $1.12\times$ and $20.21\times$ faster than the original DyNNs executed on general-purpose DL frameworks.
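The rewriting strategy can be sketched in plain Python: a model whose graph depends on the input is split into branch-free sub-networks (each of which a conventional tracing compiler could handle) plus a synthesized host module that reproduces the original control flow and dispatches to them. The scalar "networks" below are a minimal stand-in for real tensor programs.

```python
def dynamic_model(x):
    """Original DyNN: the computation graph depends on the input."""
    if x > 0:
        return x * 2 + 1
    return -x * 3

# Step 1: program transformation yields branch-free sub-networks, each of
# which contains no conditional statements and can be compiled independently.
def sub_net_pos(x):
    return x * 2 + 1

def sub_net_neg(x):
    return -x * 3

# Step 2: a synthesized host module models the control flow and invokes
# the (compiled) sub-networks.
def host_module(x):
    return sub_net_pos(x) if x > 0 else sub_net_neg(x)

# The split program must agree with the original on every input.
for x in [-3, -1, 0, 2, 5]:
    assert host_module(x) == dynamic_model(x)
```

The key point is that all data-dependent branching lives in the host module, so each sub-network has a single static computational graph that tracing can capture.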
SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation
results: SimpleMTOD achieves a state-of-the-art BLEU score (0.327) on the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par on the other multimodal sub-tasks (Disambiguation, Coreference Resolution, and Dialog State Tracking). This is despite taking a minimalist approach to extracting visual (and non-visual) information and using no task-specific architectural changes such as classification heads.
Abstract
SimpleMTOD is a simple language model which recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is built on a large-scale transformer-based auto-regressive architecture, which has already proven to be successful in uni-modal task-oriented dialogues, and effectively leverages transfer learning from pre-trained GPT-2. In order to capture the semantics of visual scenes, we introduce both local and de-localized tokens for objects within a scene. De-localized tokens represent the type of an object rather than the specific object itself and so possess a consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par in other multimodal sub-tasks: Disambiguation, Coreference Resolution, and Dialog State Tracking. This is despite taking a minimalist approach for extracting visual (and non-visual) information. In addition the model does not rely on task-specific architectural changes such as classification heads.
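De-localized tokens can be illustrated with a tiny scene serializer: each object contributes a type token whose meaning is consistent across the whole dataset, alongside a local token naming that specific object in this scene. The token formats here are assumptions for illustration, not the paper's actual vocabulary.

```python
# Hedged sketch of symbolic scene representation with de-localized tokens.

def delocalize(scene_objects):
    """Serialize scene objects as (de-localized) type tokens plus local indices."""
    tokens = []
    for i, obj in enumerate(scene_objects):
        tokens.append(f"<{obj['type']}>")    # de-localized: consistent corpus-wide
        tokens.append(f"<OBJ_{i}>")          # local: this specific object instance
    return tokens

scene = [{"type": "jacket", "color": "red"}, {"type": "hat", "color": "blue"}]
print(delocalize(scene))   # ['<jacket>', '<OBJ_0>', '<hat>', '<OBJ_1>']
```

Because `<jacket>` means the same thing in every dialogue, the language model can learn its semantics like any ordinary word, while `<OBJ_0>` still lets responses refer to one concrete object.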
Entity Identifier: A Natural Text Parsing-based Framework For Entity Relation Extraction
results: The study shows that natural language processing techniques can efficiently extract structured information from requirements descriptions and generate high-quality CRUD class code.
Abstract
The field of programming has a diversity of paradigms that are used according to the working framework. While current neural code generation methods are able to learn and generate code directly from text, we believe that this approach is not optimal for certain code tasks, particularly the generation of classes in an object-oriented project. Specifically, we use natural language processing techniques to extract structured information from requirements descriptions, in order to automate the generation of CRUD (Create, Read, Update, Delete) class code. To facilitate this process, we introduce a pipeline for extracting entity and relation information, as well as a representation called an "Entity Tree" to model this information. We also create a dataset to evaluate the effectiveness of our approach.
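The end of such a pipeline, going from an extracted entity with its attributes to CRUD class code, can be sketched as a simple template renderer. The entity and field names are hypothetical examples; the paper's actual Entity Tree representation and generator are richer than this.

```python
# Illustrative sketch: render an extracted entity and its attributes into a
# CRUD-style Python class (Create via __init__, Read, Update; Delete omitted).

def generate_crud_class(entity: str, fields: list) -> str:
    lines = [f"class {entity}:"]
    args = ", ".join(fields)
    lines.append(f"    def __init__(self, {args}):")          # Create
    lines += [f"        self.{f} = {f}" for f in fields]
    lines.append("    def read(self):")                        # Read
    lines.append("        return self.__dict__")
    lines.append("    def update(self, **kwargs):")            # Update
    lines.append("        self.__dict__.update(kwargs)")
    return "\n".join(lines)

# Entity/fields as they might be extracted from a requirements sentence like
# "A customer has a name and an email."
code = generate_crud_class("Customer", ["name", "email"])
namespace = {}
exec(code, namespace)                  # the generated source is valid Python
c = namespace["Customer"]("Ada", "ada@example.com")
```

Keeping generation template-based, with the NLP step only supplying entity and attribute names, is what makes the output deterministic compared with free-form neural code generation.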
SITTA: A Semantic Image-Text Alignment for Image Captioning
results: With the proposed semantic mappings, LMs can perform image captioning without access to gradient information, achieving strong captioning performance on the MS-COCO and Flickr30k datasets. Even with limited data, the method partly exceeds the performance of other zero-shot and even fine-tuned competitors.
Abstract
Textual and semantic comprehension of images is essential for generating proper captions. The comprehension requires detection of objects, modeling of relations between them, an assessment of the semantics of the scene and, finally, representing the extracted knowledge in a language space. To achieve rich language capabilities while ensuring good image-language mappings, pretrained language models (LMs) were conditioned on pretrained multi-modal (image-text) models that allow for image inputs. This requires an alignment of the image representation of the multi-modal model with the language representations of a generative LM. However, it is not clear how to best transfer semantics detected by the vision encoder of the multi-modal model to the LM. We introduce two novel ways of constructing a linear mapping that successfully transfers semantics between the embedding spaces of the two pretrained models. The first aligns the embedding space of the multi-modal language encoder with the embedding space of the pretrained LM via token correspondences. The latter leverages additional data that consists of image-text pairs to construct the mapping directly from vision to language space. Using our semantic mappings, we unlock image captioning for LMs without access to gradient information. By using different sources of data we achieve strong captioning performance on MS-COCO and Flickr30k datasets. Even in the face of limited data, our method partly exceeds the performance of other zero-shot and even finetuned competitors. Our ablation studies show that even LMs at a scale of merely 250M parameters can generate decent captions employing our semantic mappings. Our approach makes image captioning more accessible for institutions with restricted computational resources.
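The second mapping-construction route, fitting a vision-to-language linear map from paired image-text data, reduces to ordinary least squares. Below is a synthetic-data sketch: the "vision" and "language" embeddings are random stand-ins for the real multimodal-encoder and LM features, so only the fitting procedure is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VIS, D_LM, N = 6, 8, 200   # toy dims; N paired (image, text) embeddings

# Simulate paired embeddings: language vectors are an unknown linear function
# of vision vectors plus noise (stand-ins for CLIP-style and GPT-2 features).
V = rng.normal(size=(N, D_VIS))
W_true = rng.normal(size=(D_VIS, D_LM))
L = V @ W_true + 0.01 * rng.normal(size=(N, D_LM))

# Fit the linear semantic mapping in closed form (ordinary least squares).
W, *_ = np.linalg.lstsq(V, L, rcond=None)

# A new vision embedding can now be projected into the LM's input space,
# with no gradients through either pretrained model.
v_new = rng.normal(size=D_VIS)
l_pred = v_new @ W
```

Because the map is fit in closed form on frozen features, neither pretrained model needs gradient access, which is exactly what makes the approach cheap for resource-constrained institutions.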
results: The network can retrieve only particular kinds of data, and its learning can depend critically on the samples used.
Abstract
A phoneme-retrieval technique is proposed that stems from the particular way the network is constructed. An initial set of neurons is given, whose number is approximately equal to the number of typical structures in the data. For example, if the network is built for voice retrieval, the number of neurons must equal the number of characteristic phonemes in the alphabet of the language spoken by the social group to which the particular person belongs. This task is usually very complicated, and the network can depend critically on the samples used for learning. If the network is built for image retrieval, it works only if the data to be retrieved belong to a particular set of images. If it is built for voice recognition, it works only for some particular set of words; a typical example is the vocabulary used for airplane flight. For instance, a command like "the airplane should make a turn of 120 degrees towards the east" can easily be recognized by the network if a suitable learning procedure is used.