cs.CL - 2023-08-23

The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation Detection

  • paper_url: http://arxiv.org/abs/2308.12215
  • repo_url: https://github.com/ramybaly/News-Media-Reliability
  • paper_authors: Madelyne Xiao, Jonathan Mayer
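  • for: This paper examines the application of machine learning to trust and safety problems, using misinformation detection as a case study.
  • methods: The authors systematize the literature on automated misinformation detection and analyze 270 well-cited papers.
  • results: The study finds significant shortcomings in the existing literature, including poor data and code availability, design missteps, and weak reproducibility and generalizability. These shortcomings call into question how well existing models perform in practice.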
    Abstract We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We systematize literature on automated detection of misinformation across a corpus of 270 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. We find significant shortcomings in the literature that call into question claimed performance and practicality. Detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. Data and code availability is poor. Models do not generalize well to out-of-domain data. Based on these results, we offer recommendations for evaluating machine learning applications to trust and safety problems. Our aim is for future work to avoid the pitfalls that we identify.
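    Summary We study the gap between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We systematize automated misinformation detection methods across 270 well-cited papers, then analyze these papers for data and code availability, design missteps, reproducibility, and generalizability. We find significant shortcomings that call claimed performance and practicality into question. Detection tasks often differ from real-world settings, datasets and model evaluation do not reflect real conditions, and evaluation is frequently not independent of model training. Models also generalize poorly to out-of-domain data. Based on these results, we offer recommendations for evaluating machine learning applications to trust and safety problems, in the hope that future research can avoid the pitfalls we identify.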
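
The abstract notes that evaluation is frequently not independent of model training. One minimal check, sketched here under the assumption that examples are plain text strings, is to look for test items that also occur in the training set after light normalization. The function names and sample sentences are illustrative and do not come from the paper.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def leaked_examples(train_texts, test_texts):
    """Return test items whose normalized form also appears in the training set."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]

# Made-up data: the first test item duplicates a training item up to case and spacing.
train = ["Vaccines cause X.", "The moon landing was staged."]
test = ["the moon landing  was staged.", "A new claim."]
print(leaked_examples(train, test))
```

Exact-match checks like this only catch the simplest form of leakage; near-duplicates and shared sources would need fuzzier matching.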

Curriculum Learning with Adam: The Devil Is in the Wrong Details

  • paper_url: http://arxiv.org/abs/2308.12202
  • repo_url: None
  • paper_authors: Lucas Weber, Jaap Jumelet, Paul Michel, Elia Bruni, Dieuwke Hupkes
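  • for: This paper investigates why curriculum learning (CL), which presents models with data matched to their current learning progress, has had only limited success, particularly in NLP.
  • methods: The authors replicate and extend a number of existing curriculum learning methods, both hand-crafted and automated, and evaluate them on natural language processing (NLP) tasks.
  • results: When curricula are combined with the popular Adam optimization algorithm, they often learn to adapt to suboptimally chosen optimization parameters. Across several case studies, no CL method outperforms optimization with Adam alone and well-chosen hyperparameters.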
    Abstract Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of recent curriculum methods, we find that their results are surprisingly brittle when applied to NLP. A deep dive into the (in)effectiveness of the curricula in some scenarios shows us why: when curricula are employed in combination with the popular Adam optimisation algorithm, they oftentimes learn to adapt to suboptimally chosen optimisation parameters for this algorithm. We present a number of different case studies with different common hand-crafted and automated CL approaches to illustrate this phenomenon, and we find that none of them outperforms optimisation with only Adam with well-chosen hyperparameters. As such, our results contribute to understanding why CL methods work, but at the same time urge caution when claiming positive results.
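
The interaction described in the abstract is easier to follow with Adam's update rule written out. The sketch below is the standard Adam step for a single scalar parameter; the default hyperparameter values are the commonly used ones, not settings taken from the paper, and the quadratic toy objective is an assumption for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(x) = x^2, whose gradient is 2x, starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
print(abs(x) < 0.1)
```

Because the step size is normalized by the running gradient magnitude, anything that changes the gradient scale over training, such as a curriculum, can interact with `lr`, `beta1`, and `beta2`.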

Instruction Position Matters in Sequence Generation with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12097
  • repo_url: https://github.com/adaxry/post-instruction
  • paper_authors: Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
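  • for: Improving the conditional sequence generation ability of large language models (LLMs) on tasks such as translation and summarization.
  • methods: Strengthening the instruction-following ability of LLMs by moving the task instruction to after the input sentence.
  • results: Experiments across model scales (1B / 7B / 13B) and sequence generation tasks (translation and summarization) show consistent gains, with significantly improved zero-shot performance on conditional sequence generation, e.g. up to 9.7 BLEU points on WMT zero-shot translation tasks.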
    Abstract Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization, through instruction fine-tuning. The fine-tuning data is generally sequentially concatenated from a specific task instruction, an input sentence, and the corresponding response. Considering the locality modeled by the self-attention mechanism of LLMs, these models face the risk of instruction forgetting when generating responses for long input sentences. To mitigate this issue, we propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences. Theoretical analysis suggests that our straightforward method can alter the model's learning focus, thereby emphasizing the training of instruction-following capabilities. Concurrently, experimental results demonstrate that our approach consistently outperforms traditional settings across various model scales (1B / 7B / 13B) and different sequence generation tasks (translation and summarization), without any additional data or annotation costs. Notably, our method significantly improves the zero-shot performance on conditional sequence generation, e.g., up to 9.7 BLEU points on WMT zero-shot translation tasks.
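
The difference between the traditional layout and the proposed one comes down to where the instruction sits in the concatenated training text. A minimal sketch, assuming a newline-separated template (the exact template and the function name are hypothetical, not taken from the paper or its repository):

```python
def build_prompt(instruction: str, source: str, post_instruction: bool = True) -> str:
    """Concatenate instruction and input; post_instruction places the instruction last."""
    parts = [source, instruction] if post_instruction else [instruction, source]
    return "\n".join(parts) + "\n"

instruction = "Translate German to English."
source = "Das Wetter ist heute schön."
print(build_prompt(instruction, source, post_instruction=False))  # traditional order
print(build_prompt(instruction, source, post_instruction=True))   # instruction after input
```

With the instruction placed last, it is adjacent to the tokens the model generates next, which is the locality argument the abstract makes.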

Hybrid Retrieval and Multi-stage Text Ranking Solution at TREC 2022 Deep Learning Track

  • paper_url: http://arxiv.org/abs/2308.12039
  • repo_url: None
  • paper_authors: Guangwei Xu, Yangzhao Zhang, Longhui Zhang, Dingkun Long, Pengjun Xie, Ruijie Guo
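  • for: This paper is the system description submitted to the TREC 2022 Deep Learning Track.
  • methods: The system uses hybrid text retrieval and multi-stage text ranking. The retrieval stage combines traditional sparse retrieval with neural dense retrieval. The ranking stage uses a full interaction-based ranking model built on a large pre-trained language model, plus a lightweight sub-ranking module that further improves ranking performance.
  • results: Evaluation results show that the proposed approach is effective. The models rank 1st and 4th on the test sets for passage ranking and document ranking, respectively.
    Abstract Large-scale text retrieval technology has been widely used in various practical business scenarios. This paper presents our systems for the TREC 2022 Deep Learning Track. We explain the hybrid text retrieval and multi-stage text ranking method adopted in our solution. The retrieval stage combined the two structures of traditional sparse retrieval and neural dense retrieval. In the ranking stage, in addition to the full interaction-based ranking model built on a large pre-trained language model, we also propose a lightweight sub-ranking module to further enhance the final text ranking performance. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 1st and 4th rank on the test set of passage ranking and document ranking respectively.
    Summary Large-scale text retrieval technology is widely used in practical business scenarios. This paper describes our systems for the TREC 2022 Deep Learning Track. We explain the hybrid text retrieval and multi-stage text ranking method we adopted. The retrieval stage combines traditional sparse retrieval with neural dense retrieval. In the ranking stage, in addition to a full interaction-based ranking model built on a large pre-trained language model, we propose a lightweight sub-ranking module to further improve ranking performance. Evaluation results show that the approach is effective: our models rank 1st and 4th on the test sets for passage ranking and document ranking, respectively.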
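
The abstract does not say how the sparse and dense result lists are merged. Reciprocal rank fusion (RRF) is one common choice and is sketched here as an assumption, not as the authors' method. The constant `k=60` is the conventional default, and the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse = ["d1", "d2", "d3"]   # e.g. from a term-matching retriever
dense = ["d3", "d1", "d4"]    # e.g. from an embedding-based retriever
print(reciprocal_rank_fusion([sparse, dense]))  # ['d1', 'd3', 'd2', 'd4']
```

The fused candidate list would then be passed to the ranking stage described in the abstract.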

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

  • paper_url: http://arxiv.org/abs/2308.12038
  • repo_url: https://github.com/openbmb/viscpm
  • paper_authors: Jinyi Hu, Yuan Yao, Chongyi Wang, Shan Wang, Yinxu Pan, Qianyu Chen, Tianyu Yu, Hanghao Wu, Yue Zhao, Haoye Zhang, Xu Han, Yankai Lin, Jiao Xue, Dahai Li, Zhiyuan Liu, Maosong Sun
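  • for: Proposing an effective paradigm for training large multimodal models in low-resource languages.
  • methods: Building on a strong multilingual large language model, multimodal models pretrained on English-only image-text data are transferred to other languages in a zero-shot manner.
  • results: Models pretrained on English-only image-text data generalize well to other languages in a zero-shot manner, and the approach achieves state-of-the-art open-source performance in Chinese.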
    Abstract Recently there has been a significant surge in multimodal learning in terms of both image-to-text and text-to-image generation. However, the success is typically limited to English, leaving other languages largely behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in low-resource languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data can well generalize to other languages in a zero-shot manner for both image-to-text and text-to-image generation, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice of MPM, we build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source codes and model weights at https://github.com/OpenBMB/VisCPM.git.
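    Summary Multimodal learning has recently surged in both image-to-text and text-to-image generation, but success is mostly limited to English, leaving other languages behind. Building competitive counterparts in other languages is hard because non-English multimodal data is low-resource (large-scale, high-quality image-text data is lacking). In this work we propose MPM, an effective paradigm for training large multimodal models in low-resource languages. MPM shows that multilingual language models can pivot zero-shot multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data generalize well to other languages in a zero-shot manner for both image-to-text and text-to-image generation. Taking Chinese as a practice of MPM, we build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art open-source performance in Chinese. To facilitate future research, we open-source the code and model weights at https://github.com/OpenBMB/VisCPM.git.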

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

  • paper_url: http://arxiv.org/abs/2308.12032
  • repo_url: None
  • paper_authors: Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao
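  • for: Improving the efficiency and resource use of instruction tuning for large language models.
  • methods: Automatically selecting "cherry" samples from open-source datasets with the Instruction-Following Difficulty (IFD) metric, which identifies discrepancies between a model's expected responses and its own generation ability.
  • results: Empirical validation on well-known datasets such as Alpaca and WizardLM shows that the strategy achieves improved results with only 10% of the conventional data input.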
    Abstract In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements.
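
The abstract does not define IFD formally. One formulation consistent with its description, assumed here, is the ratio between the average token loss of the answer given the instruction and the average token loss of the answer alone. A ratio close to 1 would mean the instruction gives the model little help. The log-probabilities below are made-up numbers standing in for model outputs.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood over the answer tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def ifd_score(logprobs_given_instruction, logprobs_answer_only):
    """Ratio of conditioned to unconditioned answer loss (assumed form of IFD)."""
    return mean_nll(logprobs_given_instruction) / mean_nll(logprobs_answer_only)

conditioned = [-0.5, -1.0, -1.5]    # log p(token | instruction, previous answer tokens)
unconditioned = [-1.0, -2.0, -3.0]  # log p(token | previous answer tokens)
print(ifd_score(conditioned, unconditioned))  # 0.5
```

Under this assumed definition, selecting "cherry" samples amounts to scoring every instruction-answer pair and keeping those with the highest scores.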

Knowledge-injected Prompt Learning for Chinese Biomedical Entity Normalization

  • paper_url: http://arxiv.org/abs/2308.12025
  • repo_url: None
  • paper_authors: Songhua Yang, Chenghao Zhang, Hongfei Xu, Yuxiang Jia
  • for: This paper aims to improve the consistency of biomedical data by normalizing raw medical entities to standard entities, to better support downstream medical applications.
  • methods: It proposes a Knowledge-injected Prompt Learning (PL-Knowledge) method with five stages: candidate entity matching, knowledge extraction, knowledge encoding, knowledge injection, and prediction output. Knowledge items contained in medical entities are encoded and incorporated into tailor-made knowledge-injected templates, which helps the model capture latent relationships between medical entities and match them to standard entities.
  • results: Evaluated on a benchmark dataset in few-shot and full-scale scenarios, the method outperforms existing baselines, with an average accuracy gain of 12.96% in the few-shot scenario and 0.94% in the full-data scenario.
    Abstract The Biomedical Entity Normalization (BEN) task aims to align raw, unstructured medical entities to standard entities, thus promoting data coherence and facilitating better downstream medical applications. Recently, prompt learning methods have shown promising results in this task. However, existing research falls short in tackling the more complex Chinese BEN task, especially in the few-shot scenario with limited medical data, and the vast potential of the external medical knowledge base has yet to be fully harnessed. To address these challenges, we propose a novel Knowledge-injected Prompt Learning (PL-Knowledge) method. Specifically, our approach consists of five stages: candidate entity matching, knowledge extraction, knowledge encoding, knowledge injection, and prediction output. By effectively encoding the knowledge items contained in medical entities and incorporating them into our tailor-made knowledge-injected templates, the additional knowledge enhances the model's ability to capture latent relationships between medical entities, thus achieving a better match with the standard entities. We extensively evaluate our model on a benchmark dataset in both few-shot and full-scale scenarios. Our method outperforms existing baselines, with an average accuracy boost of 12.96% in few-shot and 0.94% in full-data cases, showcasing its excellence in the BEN task.
    Summary The Biomedical Entity Normalization (BEN) task aims to align raw, unstructured medical entities with standard entities, improving data coherence and supporting downstream medical applications. Prompt learning methods have recently shown promising results on this task. However, existing research falls short on the more complex Chinese BEN task, especially in the few-shot scenario with limited medical data, and the potential of external medical knowledge bases has not been fully used. To address these challenges, we propose a Knowledge-injected Prompt Learning (PL-Knowledge) method with five stages: candidate entity matching, knowledge extraction, knowledge encoding, knowledge injection, and prediction output. By encoding the knowledge items contained in medical entities and injecting them into our tailor-made knowledge-injected templates, the model better captures latent relationships between medical entities and matches them to standard entities. We evaluate extensively on a benchmark dataset in both few-shot and full-data settings. Our method improves average accuracy by 12.96% in the few-shot setting and 0.94% in the full-data setting.

Reranking Passages with Coarse-to-Fine Neural Retriever using List-Context Information

  • paper_url: http://arxiv.org/abs/2308.12022
  • repo_url: None
  • paper_authors: Hongyin Zhu
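  • for: Improving the accuracy of answer passage selection over large-scale documents.
  • methods: A list-context attention mechanism augments each passage representation with information from the other candidates. The list-context modeling is split into two sub-processes to improve efficiency.
  • results: Experiments show that the proposed method is effective.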
    Abstract Passage reranking is a crucial task in many applications, particularly when dealing with large-scale documents. Traditional neural architectures are limited in retrieving the best passage for a question because they usually match the question to each passage separately, seldom considering contextual information in other passages that can provide comparison and reference information. This paper presents a list-context attention mechanism to augment the passage representation by incorporating the list-context information from other candidates. The proposed coarse-to-fine (C2F) neural retriever addresses the out-of-memory limitation of the passage attention mechanism by dividing the list-context modeling process into two sub-processes, allowing for efficient encoding of context information from a large number of candidate answers. This method can be generally used to encode context information from any number of candidate answers in one pass. Different from most multi-stage information retrieval architectures, this model integrates the coarse and fine rankers into the joint optimization process, allowing for feedback between the two layers to update the model simultaneously. Experiments demonstrate the effectiveness of the proposed approach.
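
The core idea, letting each candidate passage see the other candidates before it is scored, can be illustrated with plain dot-product attention over the candidate list. This is a simplified sketch under stated assumptions: it omits learned projections, the coarse-to-fine split, and joint training, and the residual combination is a choice made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def list_context(passages):
    """Augment each passage vector with an attention-weighted mix of all candidates."""
    dim = len(passages[0])
    out = []
    for q in passages:
        # Scaled dot-product score between this passage and every candidate.
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in passages]
        weights = softmax(scores)
        context = [sum(w * p[i] for w, p in zip(weights, passages)) for i in range(dim)]
        out.append([x + c for x, c in zip(q, context)])  # residual combination
    return out

candidates = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-dimensional passage embeddings
print(list_context(candidates))
```

Attending over all candidates at once costs memory quadratic in the list length, which is the limitation the abstract's two-stage split is meant to address.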

Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature

  • paper_url: http://arxiv.org/abs/2308.12008
  • repo_url: None
  • paper_authors: Frederick Riemenschneider, Anette Frank
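  • for: Developing a multilingual model for Classical Philology that automatically detects intertextual parallels between Ancient Greek and Latin texts.
  • methods: A trilingual Sentence-RoBERTa model (SPhilBERTa), with new training data generated by automatically translating English texts into Ancient Greek.
  • results: SPhilBERTa excels at cross-lingual semantic comprehension and at identifying identical sentences across Ancient Greek, Latin, and English, and a case study shows it can support automated detection of intertextual parallels.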
    Abstract Intertextual allusions hold a pivotal role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. Until now, the automatic identification of these intertextual references has been constrained to monolingual approaches, seeking parallels solely within Latin or Greek texts. In this study, we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and identification of identical sentences across Ancient Greek, Latin, and English. We generate new training data by automatically translating English texts into Ancient Greek. Further, we present a case study, demonstrating SPhilBERTa's capability to facilitate automated detection of intertextual parallels. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
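    Summary Intertextual allusions play an important role in Classical Philology, with Latin authors frequently referencing Ancient Greek texts. In this study we introduce SPhilBERTa, a trilingual Sentence-RoBERTa model tailored for Classical Philology, which excels at cross-lingual semantic comprehension and at identifying identical sentences across languages. We generate new training data by automatically translating English texts into Ancient Greek. We also present a case study showing that SPhilBERTa can support automated detection of intertextual parallels. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.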

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

  • paper_url: http://arxiv.org/abs/2308.11971
  • repo_url: None
  • paper_authors: Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
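  • for: Building a scalable vision-language model that learns from diverse, multimodal data.
  • methods: EVE is an efficient vision-language foundation model that encodes vision and language within one shared Transformer network, integrated with modality-aware sparse Mixture-of-Experts (MoE) modules that capture modality-specific information. It is pre-trained with a single masked signal modeling task.
  • results: The pre-training objective speeds up training by 3.5x, and EVE performs strongly on a range of vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.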
    Abstract Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
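    Summary Building scalable vision-language models that learn from diverse, multimodal data remains an open challenge. In this paper we introduce EVE, an efficient vision-language foundation model: a unified multimodal Transformer whose shared network is integrated with modality-aware sparse Mixture-of-Experts (MoE) modules. EVE captures modality-specific information by selectively switching between experts. To unify the pre-training tasks for vision and language, EVE performs masked signal modeling on image-text pairs, reconstructing masked image pixels and text tokens. This simple yet effective pre-training objective accelerates training by 3.5x compared with a model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Thanks to the combination of a unified architecture and a unified pre-training task, EVE scales up easily, reaching better downstream performance with fewer resources and faster training. Despite its simplicity, EVE achieves state-of-the-art performance on vision-language downstream tasks.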

Audio Generation with Multiple Conditional Diffusion Model

  • paper_url: http://arxiv.org/abs/2308.11940
  • repo_url: None
  • paper_authors: Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
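  • for: Improving the controllability of existing pre-trained text-to-audio models, giving finer control over the temporal order, pitch, and energy of generated audio.
  • methods: A new model adds content (timestamp) and style (pitch contour and energy contour) conditions as supplements to the text. A trainable control condition encoder and a Fusion-Net encode and fuse the additional conditions while the weights of the pre-trained text-to-audio model stay frozen.
  • results: Experimental results show that the model achieves fine-grained control, enabling controllable audio generation.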
    Abstract Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
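    Summary Text-based audio generation models are limited because text cannot capture all the information in audio, which restricts controllability when relying on text alone. To address this, we propose a new model that enhances the controllability of existing pre-trained text-to-audio models by adding content (timestamp) and style (pitch contour and energy contour) as supplementary conditions. This enables fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve generation diversity, we use a trainable control condition encoder, enhanced by a large language model, and a trainable Fusion-Net to encode and fuse the additional conditions, while keeping the weights of the pre-trained text-to-audio model frozen. Because suitable datasets and evaluation metrics are lacking, we consolidate existing datasets into a new dataset and use a series of evaluation metrics to assess controllability. Experimental results show that our model achieves fine-grained control for controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/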

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

  • paper_url: http://arxiv.org/abs/2308.11923
  • repo_url: None
  • paper_authors: Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino
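  • for: Describing the semantic differences between a pair of similar input audio clips, rather than only what they have in common.
  • methods: The paper proposes the Audio Difference Captioning (ADC) task, a cross-attention-concentrated transformer encoder that extracts differences by comparing the two audio clips, and a similarity-discrepancy disentanglement that emphasizes the difference in the latent space.
  • results: Experiments show that the proposed methods solve the ADC task effectively and make the attention weights in the transformer encoder focus more on extracting the difference.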
    Abstract We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that conventional audio captioning sometimes generates similar captions for similar audio clips, failing to describe the difference in content. We also propose a cross-attention-concentrated transformer encoder to extract differences by comparing a pair of audio clips and a similarity-discrepancy disentanglement to emphasize the difference in the latent space. To evaluate the proposed methods, we built an AudioDiffCaps dataset consisting of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. The experiment with the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and improve the attention weights to extract the difference by visualizing them in the transformer encoder.
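    Summary We propose Audio Difference Captioning (ADC) as a new extension task of audio captioning, for describing the semantic differences between input pairs of similar but slightly different audio clips. ADC addresses the problem that conventional audio captioning often generates similar captions for similar clips and fails to describe how their content differs. We also propose a cross-attention-concentrated transformer encoder that compares a pair of audio clips, and a similarity-discrepancy disentanglement that emphasizes the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which contains pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments show that our methods solve the ADC task effectively and better emphasize the differences in the transformer encoder.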

Diagnosing Infeasible Optimization Problems Using Large Language Models

  • paper_url: http://arxiv.org/abs/2308.12923
  • repo_url: None
  • paper_authors: Hao Chen, Gonzalo E. Constante-Flores, Can Li
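  • for: Helping practitioners understand and interpret infeasible optimization models through a natural-language conversational system.
  • methods: The system combines GPT-4 with an optimization solver to identify the sources of infeasibility in an optimization model and to suggest ways to make the model feasible.
  • results: Experiments show that OptiChat helps both expert and non-expert users better understand optimization models and quickly identify the sources of infeasibility.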
    Abstract Decision-making problems can be represented as mathematical optimization models, finding wide applications in fields such as economics, engineering and manufacturing, transportation, and health care. Optimization models are mathematical abstractions of the problem of making the best decision while satisfying a set of requirements or constraints. One of the primary barriers to deploying these models in practice is the challenge of helping practitioners understand and interpret such models, particularly when they are infeasible, meaning no decision satisfies all the constraints. Existing methods for diagnosing infeasible optimization models often rely on expert systems, necessitating significant background knowledge in optimization. In this paper, we introduce OptiChat, a first-of-its-kind natural language-based system equipped with a chatbot GUI for engaging in interactive conversations about infeasible optimization models. OptiChat can provide natural language descriptions of the optimization model itself, identify potential sources of infeasibility, and offer suggestions to make the model feasible. The implementation of OptiChat is built on GPT-4, which interfaces with an optimization solver to identify the minimal subset of constraints that render the entire optimization problem infeasible, also known as the Irreducible Infeasible Subset (IIS). We utilize few-shot learning, expert chain-of-thought, key-retrieve, and sentiment prompts to enhance OptiChat's reliability. Our experiments demonstrate that OptiChat assists both expert and non-expert users in improving their understanding of the optimization models, enabling them to quickly identify the sources of infeasibility.
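    Summary Decision-making problems can be represented as mathematical optimization models, with wide applications in economics, engineering and manufacturing, transportation, and health care. An optimization model is a mathematical abstraction of the problem of making the best decision while satisfying a set of requirements or constraints. Existing methods for diagnosing infeasible optimization models usually require substantial background knowledge. In this paper we introduce OptiChat, a new natural language-based system that describes infeasible optimization models through interactive conversation and offers potential causes and fixes. OptiChat is built on GPT-4, which interfaces with an optimization solver to identify the Irreducible Infeasible Subset (IIS), the minimal subset of constraints that makes the whole problem infeasible. We use few-shot learning, expert chain-of-thought, key-retrieve, and sentiment prompts to improve OptiChat's reliability. Our experiments show that OptiChat helps both expert and non-expert users better understand optimization models and quickly identify the sources of infeasibility.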
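
An Irreducible Infeasible Subset can be found with the classic deletion filter: tentatively drop each constraint, and drop it permanently if the rest is still infeasible. The sketch below is a toy stand-in for the solver call, assuming every constraint is a simple bound on a single variable so that feasibility can be checked directly. The constraint names and values are made up.

```python
def feasible(constraints):
    """Bounds lo <= x <= hi on one variable are feasible iff max(lo) <= min(hi)."""
    if not constraints:
        return True
    return max(c[1] for c in constraints) <= min(c[2] for c in constraints)

def deletion_filter(constraints):
    """Return an irreducible infeasible subset of an infeasible constraint list."""
    assert not feasible(constraints), "input must be infeasible"
    kept = list(constraints)
    for c in list(constraints):
        trial = [k for k in kept if k is not c]
        if not feasible(trial):   # still infeasible without c, so c is not needed
            kept = trial
    return kept

model = [
    ("demand", 10.0, float("inf")),    # x >= 10
    ("capacity", float("-inf"), 8.0),  # x <= 8
    ("budget", float("-inf"), 12.0),   # x <= 12
    ("nonneg", 0.0, float("inf")),     # x >= 0
]
print([name for name, _, _ in deletion_filter(model)])  # ['demand', 'capacity']
```

The two surviving constraints are exactly the conflicting pair a user would need explained: demand requires at least 10 units while capacity allows at most 8.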

Towards an On-device Agent for Text Rewriting

  • paper_url: http://arxiv.org/abs/2308.11807
  • repo_url: None
  • paper_authors: Yun Zhu, Yinxiao Liu, Felix Stahlberg, Shankar Kumar, Yu-hui Chen, Liangchen Luo, Lei Shu, Renjie Liu, Jindong Chen, Lei Meng
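  • for: Developing a compact language model for on-device text rewriting.
  • methods: The authors propose a new instruction tuning approach that generates high-quality training data without human labeling, and a heuristic reinforcement learning framework that substantially improves performance without requiring preference data.
  • results: Experiments show that the on-device model surpasses current state-of-the-art LLMs in text rewriting at a significantly reduced model size. A cascade that combines the on-device model with a server-side model further improves performance.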
    Abstract Large Language Models (LLMs) have demonstrated impressive capabilities for text rewriting. Nonetheless, the large sizes of these models make them impractical for on-device inference, which would otherwise allow for enhanced privacy and economical inference. Creating a smaller yet potent language model for text rewriting presents a formidable challenge because it requires balancing the need for a small size with the need to retain the emergent capabilities of the LLM, that requires costly data collection. To address the above challenge, we introduce a new instruction tuning approach for building a mobile-centric text rewriting model. Our strategies enable the generation of high quality training data without any human labeling. In addition, we propose a heuristic reinforcement learning framework which substantially enhances performance without requiring preference data. To further bridge the performance gap with the larger server-side model, we propose an effective approach that combines the mobile rewrite agent with the server model using a cascade. To tailor the text rewriting tasks to mobile scenarios, we introduce MessageRewriteEval, a benchmark that focuses on text rewriting for messages through natural language instructions. Through empirical experiments, we demonstrate that our on-device model surpasses the current state-of-the-art LLMs in text rewriting while maintaining a significantly reduced model size. Notably, we show that our proposed cascading approach improves model performance.
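    Summary Large language models (LLMs) have demonstrated impressive capabilities for text rewriting. However, their size makes on-device inference impractical, even though on-device inference would offer better privacy and cheaper inference. To address this challenge, we introduce a new instruction tuning approach for building a high-quality text rewriting model for mobile devices. Our strategies generate high-quality training data without any human labeling. We also propose a heuristic reinforcement learning framework that substantially improves performance without requiring preference data. To close the performance gap with the larger server-side model, we propose an effective cascade that combines the mobile rewrite agent with the server model. To tailor text rewriting tasks to mobile scenarios, we introduce MessageRewriteEval, a benchmark focused on rewriting messages through natural language instructions. Experiments show that our on-device model surpasses current state-of-the-art LLMs in text rewriting at a significantly reduced model size, and that the proposed cascading approach improves model performance.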
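
A cascade of this kind can be sketched as a simple routing rule: answer on device when the small model is confident, otherwise defer to the server. The abstract does not describe the deferral criterion, so the confidence threshold, the function names, and the toy models below are all assumptions made for illustration.

```python
def cascade_rewrite(text, on_device_model, server_model, threshold=0.7):
    """Use the on-device rewrite unless its confidence falls below the threshold."""
    rewrite, confidence = on_device_model(text)
    if confidence >= threshold:
        return rewrite, "device"
    return server_model(text), "server"

# Toy stand-ins for the two models (hypothetical).
def tiny_model(text):
    # Pretend short inputs are easy and long ones are uncertain.
    confidence = 0.9 if len(text) < 20 else 0.4
    return text.capitalize(), confidence

def big_model(text):
    return text.capitalize() + "."

print(cascade_rewrite("see you soon", tiny_model, big_model))
print(cascade_rewrite("can we move the meeting to friday afternoon", tiny_model, big_model))
```

The threshold trades privacy and cost against quality: a higher value sends more requests to the server.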

Few-shot Anomaly Detection in Text with Deviation Learning

  • paper_url: http://arxiv.org/abs/2308.11780
  • repo_url: None
  • paper_authors: Anindya Sundar Das, Aravind Ajay, Sriparna Saha, Monowar Bhuyan
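  • for: Proposing a deep few-shot learning framework that uses a limited number of anomaly examples to learn anomaly scores directly, end to end, with deviation learning.
  • methods: The approach combines deep few-shot learning, deviation learning, a multi-head self-attention layer, and multiple instance learning.
  • results: Experiments on several benchmark datasets show that the proposed method reaches a new state-of-the-art level of performance.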
    Abstract Most current methods for detecting anomalies in text concentrate on constructing models solely relying on unlabeled data. These models operate on the presumption that no labeled anomalous examples are available, which prevents them from utilizing prior knowledge of anomalies that are typically present in small numbers in many real-world applications. Furthermore, these models prioritize learning feature embeddings rather than optimizing anomaly scores directly, which could lead to suboptimal anomaly scoring and inefficient use of data during the learning process. In this paper, we introduce FATE, a deep few-shot learning-based framework that leverages limited anomaly examples and learns anomaly scores explicitly in an end-to-end method using deviation learning. In this approach, the anomaly scores of normal examples are adjusted to closely resemble reference scores obtained from a prior distribution. Conversely, anomaly samples are forced to have anomalous scores that considerably deviate from the reference score in the upper tail of the prior. Additionally, our model is optimized to learn the distinct behavior of anomalies by utilizing a multi-head self-attention layer and multiple instance learning approaches. Comprehensive experiments on several benchmark datasets demonstrate that our proposed approach attains a new level of state-of-the-art performance.
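    Summary Most current methods for detecting anomalies in text build models that rely solely on unlabeled data. These models assume that no labeled anomalous examples are available, which prevents them from using the prior knowledge of anomalies that usually exists in small quantities in real-world applications. In addition, these models focus on learning feature embeddings rather than optimizing anomaly scores directly, which can lead to suboptimal anomaly scoring and inefficient use of data during learning. In this paper we introduce FATE, a deep few-shot learning-based framework that uses limited anomaly examples to learn anomaly scores directly through deviation learning. In this approach, the anomaly scores of normal examples are adjusted to closely match reference scores drawn from a prior distribution. Conversely, the scores of anomalous examples are forced to deviate considerably from the reference score, in the upper tail of the prior. Our model also uses a multi-head self-attention layer and multiple instance learning to learn the distinct behavior of anomalies. Comprehensive experiments on several benchmark datasets show that the proposed approach reaches a new state-of-the-art level of performance.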
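
The deviation-learning objective described in the abstract can be sketched for a single example as follows. The standard normal prior, the sample size, and the margin of 5 standard deviations are assumptions borrowed from common deviation-network setups, not values confirmed by this abstract.

```python
import random
import statistics

def deviation_loss(score, label, ref_scores, margin=5.0):
    """Deviation loss for one example; label is 1 for anomaly, 0 for normal."""
    mu = statistics.fmean(ref_scores)
    sigma = statistics.pstdev(ref_scores)
    dev = (score - mu) / sigma       # deviation in units of reference std devs
    if label == 0:
        return abs(dev)              # pull normal scores toward the reference mean
    return max(0.0, margin - dev)    # push anomalies at least `margin` std devs above

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # samples from a N(0, 1) prior
print(deviation_loss(0.1, 0, reference))   # small: normal example near the mean
print(deviation_loss(0.1, 1, reference))   # large: anomaly scored like a normal example
print(deviation_loss(6.0, 1, reference))   # zero once the anomaly clears the margin
```

In training, `score` would come from the network and the loss would be averaged over a batch that mixes the few labeled anomalies with normal examples.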

  • paper_url: http://arxiv.org/abs/2308.11773
  • repo_url: None
  • paper_authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson, Nicholas Cummins, RADAR-CNS consortium
  • for: 这个研究是为了检测听力语言与抑郁的关系,并采用了大规模验证的方法。
  • methods: 这个研究使用了自然语言处理技术,特别是BERTopic模型,对3919个手机采集的语音记录进行分析,并从中提取了29个话题。
  • results: 研究发现,患有抑郁的人更容易提到“没有期望”、“睡眠”、“心理治疗”、“头发”、“学习”和“课程”等话题,这些话题可能是抑郁的指标。此外,研究还发现了语言使用和行为特征之间的相关性,以及语言使用的变化和抑郁程度之间的相关性。
    Abstract Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. So, natural language processing has been employed on social media to predict depression, but limitations remain-lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.
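    Summary Language use is known to correlate with depression, but large-scale validation is still needed, and traditional clinic studies are expensive. Natural language processing on social media has been used to predict depression, yet it suffers from a lack of validated labels, biased user samples, and missing context. Using the Whisper tool and the BERTopic model, the study identifies 29 topics in 3919 speech recordings from 265 participants. Six topics with a median PHQ-8 of 10 or higher are treated as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To explain how topics emerge and relate to depression, the authors compare behavioral and linguistic characteristics across topics. They also investigate the correlation between topic shifts and changes in depression severity over time, which points to the value of monitoring language use longitudinally. Testing the BERTopic model on a similar, smaller dataset gives some consistent results. Overall, specific speech topics may indicate depression severity, and the data-driven workflow offers a practical approach for digital health research.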
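The risk-topic rule described above (median PHQ-8 of 10 or higher within a topic) can be illustrated with a minimal sketch. It assumes each recording has already been assigned a topic label and paired with a PHQ-8 score. The function name, the sample scores, and the "Holiday" topic are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median

RISK_THRESHOLD = 10  # PHQ-8 cutoff stated in the abstract


def find_risk_topics(records, threshold=RISK_THRESHOLD):
    """Return {topic: median PHQ-8} for topics whose median is >= threshold.

    `records` is an iterable of (topic_label, phq8_score) pairs.
    """
    scores_by_topic = defaultdict(list)
    for topic, phq8 in records:
        scores_by_topic[topic].append(phq8)

    medians = {topic: median(scores) for topic, scores in scores_by_topic.items()}
    return {topic: m for topic, m in medians.items() if m >= threshold}


# Hypothetical (topic, PHQ-8) pairs, invented for illustration only.
records = [
    ("Sleep", 12), ("Sleep", 9), ("Sleep", 14),
    ("Holiday", 3), ("Holiday", 7),
    ("Coursework", 10), ("Coursework", 11),
]
risk = find_risk_topics(records)
print(risk)
```

With these invented scores, "Sleep" and "Coursework" pass the threshold while "Holiday" does not. How the paper aggregates scores per participant or per recording is not stated in the abstract, so this per-record grouping is only one possible reading.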

StoryBench: A Multifaceted Benchmark for Continuous Story Visualization

  • paper_url: http://arxiv.org/abs/2308.11606
  • repo_url: https://github.com/google/storybench
  • paper_authors: Emanuele Bugliarello, Hernan Moraldo, Ruben Villegas, Mohammad Babaeizadeh, Mohammad Taghi Saffar, Han Zhang, Dumitru Erhan, Vittorio Ferrari, Pieter-Jan Kindermans, Paul Voigtlaender
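  • for: The paper introduces StoryBench, a new multi-task benchmark for reliably evaluating forthcoming text-to-video models.
  • methods: The benchmark comprises three video generation tasks of increasing difficulty (action execution, story continuation, and story generation) and is built on comprehensive human annotations collected for three existing datasets.
  • results: The authors evaluate small yet strong text-to-video baselines and show the benefits of training on story-like data algorithmically generated from existing video captions. They also establish guidelines for human evaluation of video stories and reaffirm the need for better automatic metrics.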
    Abstract Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption used often in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need of better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area.
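    Summary Generating video stories from text prompts is a complex task. Beyond high visual quality, videos must faithfully follow a sequence of text prompts while staying consistent across frames. To fill the gap left by single-caption video datasets, the authors collect comprehensive human annotations and introduce StoryBench, a new and challenging multi-task benchmark for reliably evaluating forthcoming text-to-video models. The benchmark includes three video generation tasks: action execution, story continuation, and story generation. The authors evaluate small yet strong text-to-video baselines and show the benefits of training on algorithmically generated story-like data. Finally, they establish guidelines for human evaluation of video stories and reaffirm the need for better automatic metrics. StoryBench aims to encourage future research in this new area.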
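The difference between the three tasks comes down to what the model is conditioned on. The sketch below encodes that reading of the abstract as a small input validator. The `TaskSpec` class, field names, and `validate_inputs` function are hypothetical and are not part of the StoryBench repository. Treating action execution as a single-prompt task is also an assumption drawn from the phrase "the next action".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskSpec:
    name: str
    needs_conditioning_video: bool  # does generation start from a given video?
    single_action: bool             # one prompt (next action) vs. a sequence


# Hypothetical encoding of the three tasks as described in the abstract.
TASKS = {
    "action_execution": TaskSpec("action_execution", True, True),
    "story_continuation": TaskSpec("story_continuation", True, False),
    "story_generation": TaskSpec("story_generation", False, False),
}


def validate_inputs(task_name: str,
                    prompts: Sequence[str],
                    conditioning_video: Optional[object] = None) -> bool:
    """Check that the inputs match what the named task conditions on."""
    spec = TASKS[task_name]
    if not prompts:
        raise ValueError("at least one text prompt is required")
    if spec.single_action and len(prompts) != 1:
        raise ValueError(f"{spec.name} expects exactly one prompt")
    if spec.needs_conditioning_video and conditioning_video is None:
        raise ValueError(f"{spec.name} requires a conditioning video")
    if not spec.needs_conditioning_video and conditioning_video is not None:
        raise ValueError(f"{spec.name} is generated from text prompts only")
    return True
```

For example, `validate_inputs("story_generation", ["a dog runs", "the dog jumps"])` passes, while calling `validate_inputs("action_execution", ["a dog runs"])` without a video raises a `ValueError`.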

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

  • paper_url: http://arxiv.org/abs/2308.11596
  • repo_url: https://github.com/facebookresearch/seamless_communication
  • paper_authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang
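  • for: The work asks what it takes to build a "Babel Fish", a tool that helps individuals translate speech between any two languages.
  • methods: The authors use 1 million hours of open speech audio to learn self-supervised speech representations with w2v-BERT 2.0, then build a multimodal corpus of automatically aligned speech translations, which is filtered and combined with human-labeled and pseudo-labeled data.
  • results: SeamlessM4T is a single model that supports speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation as well as automatic speech recognition for up to 100 languages. On FLEURS it improves on the previous SOTA by 20% BLEU in direct speech-to-text translation. Compared to strong cascaded models, it improves into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. It is also more robust to background noise and speaker variation in speech-to-text tasks.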
    Abstract What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication
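The reported gains mix two kinds of figures: "20% BLEU" reads as a relative improvement, while "1.3 BLEU points" and "2.6 ASR-BLEU points" are absolute differences. The sketch below shows how the two are computed. The function names and the baseline and new scores are hypothetical, chosen only to make the arithmetic visible, and are not numbers from the paper.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in metric points, e.g. BLEU points."""
    return new - baseline


def relative_gain_pct(baseline: float, new: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (new - baseline) / baseline


# Hypothetical scores, not taken from the paper.
baseline_bleu, new_bleu = 20.0, 24.0
print(absolute_gain(baseline_bleu, new_bleu))      # gain in BLEU points
print(relative_gain_pct(baseline_bleu, new_bleu))  # gain as a percentage
```

The same 4-point gain is a 20% relative improvement on a baseline of 20 but only a 10% improvement on a baseline of 40, which is why the two kinds of figures should not be compared directly.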

Using ChatGPT as a CAT tool in Easy Language translation

  • paper_url: http://arxiv.org/abs/2308.11563
  • repo_url: https://github.com/katjakaterina/chatgpt4easylang
  • paper_authors: Silvana Deilen, Sergio Hernández Garrido, Ekaterina Lapshinova-Koltunski, Christiane Maaß
  • for: Investigates the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language.
  • methods: ChatGPT is used to translate selected texts from websites of German public authorities using two strategies, linguistic and holistic.
  • results: The generated texts are easier than the standard texts but still do not fully meet the established Easy Language standards, and the content is not always rendered correctly.
    Abstract This study sets out to investigate the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language, a simplified, controlled language variety that is adapted to the needs of people with reading impairments. We use ChatGPT to translate selected texts from websites of German public authorities using two strategies, i.e. linguistic and holistic. We analyse the quality of the generated texts based on different criteria, such as correctness, readability, and syntactic complexity. The results indicated that the generated texts are easier than the standard texts, but that they still do not fully meet the established Easy Language standards. Additionally, the content is not always rendered correctly.
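    Summary This study investigates the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language, a simplified, controlled language variety adapted to the needs of people with reading impairments. ChatGPT is used to translate selected texts from websites of German public authorities using two strategies, linguistic and holistic. The quality of the generated texts is analysed against several criteria, including correctness, readability, and syntactic complexity. The results indicate that the generated texts are easier to read than the standard texts but do not fully meet the established Easy Language standards. In addition, the content is not always rendered correctly.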

BELB: a Biomedical Entity Linking Benchmark

  • paper_url: http://arxiv.org/abs/2308.11537
  • repo_url: https://github.com/sg-wbi/belb-exp
  • paper_authors: Samuele Garda, Leon Weber-Genzel, Robert Martin, Ulf Leser
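  • for: The study provides a Biomedical Entity Linking (BEL) benchmark for testing different systems on multiple corpora.
  • methods: The evaluation covers rule-based entity-specific systems and neural approaches that leverage pre-trained language models.
  • results: Neural approaches based on pre-trained language models fail to perform consistently across entity types, highlighting the need for further studies towards entity-agnostic models.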
    Abstract Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage knowledge base UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models.
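    Summary Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base, and it plays a vital role in information extraction pipelines for the life sciences literature. Reviewing recent work, the authors find that the task is absent from existing biomedical text mining benchmarks and that studies adopt different experimental setups, which makes comparisons based on published numbers problematic. Neural systems are also tested mainly on instances linked to the broad-coverage knowledge base UMLS, leaving more specialized entity types such as genes or variants understudied. To address this, the authors develop BELB, a Biomedical Entity Linking Benchmark that provides unified-format access to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB reduces the preprocessing overhead of testing BEL systems and offers a standardized testbed for reproducible experiments. An extensive evaluation of rule-based systems and neural approaches shows that neural approaches perform inconsistently across entity types, so further work towards entity-agnostic models is needed.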
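The finding that systems perform inconsistently across entity types presupposes scoring each type separately rather than pooling all mentions. The sketch below shows that kind of per-type breakdown using plain accuracy. It is not BELB's evaluation code: the function name, the tuple layout, and the identifiers such as `G:1` are hypothetical placeholders, and the benchmark's actual metric may differ.

```python
from collections import defaultdict


def accuracy_by_entity_type(examples):
    """Compute linking accuracy separately for each entity type.

    `examples` is an iterable of (entity_type, gold_id, predicted_id) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for entity_type, gold_id, predicted_id in examples:
        total[entity_type] += 1
        if predicted_id == gold_id:
            correct[entity_type] += 1
    return {t: correct[t] / total[t] for t in total}


# Made-up mentions with placeholder knowledge-base identifiers.
examples = [
    ("gene", "G:1", "G:1"),
    ("gene", "G:2", "G:9"),
    ("disease", "D:1", "D:1"),
    ("disease", "D:2", "D:2"),
]
scores = accuracy_by_entity_type(examples)
print(scores)
```

A pooled accuracy over these four mentions would be 0.75 and would hide the gap between the two types, which is the kind of inconsistency the per-type view is meant to expose.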